Reinforcement Learning-Based Adaptive Position Control Scheme for Uncertain Robotic Manipulators with Constrained Angular Position and Angular Velocity

Xie, Zhihang; Lin, Qiquan

doi:10.3390/app13031275

Open AccessArticle

Reinforcement Learning-Based Adaptive Position Control Scheme for Uncertain Robotic Manipulators with Constrained Angular Position and Angular Velocity

by

Zhihang Xie

^* and

Qiquan Lin

School of Mechanical Engineering, Xiangtan University, Xiangtan 411105, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(3), 1275; https://doi.org/10.3390/app13031275

Submission received: 6 December 2022 / Revised: 11 January 2023 / Accepted: 16 January 2023 / Published: 18 January 2023

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

Aiming at robotic manipulators subject to system uncertainty and external disturbance, this paper presents a novel adaptive control scheme that uses the time delay estimation (TED) technique and reinforcement learning (RL) technique to achieve a good tracking performance for each joint of a manipulator. Compared to conventional controllers, the proposed control scheme can not only handle the system parametric uncertainty and external disturbance but also guarantee both the angular positions and angular velocities of each joint without exceeding their preset constraints. Moreover, it has been proved by using Lyapunov theory that the tracking errors are uniformly ultimately bounded (UUB) with a small bound related to the parameters of the controller. Additionally, an innovative RL-based auxiliary term in the proposed controller further minimizes the steady state tracking errors, and thereby the tracking accuracy is not compromised by the lack of asymptotic convergence of tracking errors. Finally, the simulation results validate the effectiveness of the proposed control scheme.

Keywords:

reinforcement learning; robotic manipulator; adaptive control

1. Introduction

Robotic manipulators have been extensively applied in assisting or even replacing humans to perform the various tasks such as assembling [1], machine operation [2], deburring [3], drilling [4], transportation [5] and manufacturing [6,7]. In order to successfully perform such industrial tasks, it is essential to well control the motion of robotic manipulators. However, the highly nonlinear and uncertain dynamics of robotic manipulators result in the inapplicability of many traditional controllers designed for linear systems such as linear quadratic control (LQC) [8,9] and linear H∞ control [10,11]. Hence, many researchers have been motivated to place efforts on designing advanced controllers for robotic manipulators.

State feedback linearization requiring a known dynamic model can transform the nonlinear system to a linear form [12]. However, the dynamic model of robotic manipulators is always uncertain because of unknown external disturbances and uncertain system parameters. To handle the issue of uncertainty, many efforts have been made such as sliding mode control (SMC) [13,14,15,16,17,18], fuzzy logic system (FLS)-based control [19,20,21], neural network (NN)-based control [22,23,24,25,26], and disturbance observer (DOB)-based control [27,28,29]. More precisely, Zhang et al. [13] propose a fixed time sliding mode control for uncertain robotic manipulators in which a conservative switching gain requiring the upper bound of lumped uncertainty is used. In [21], an adaptive controller-based T-S (Takagi-Sugeno) fuzzy system is designed, and the modified T-S fuzzy system can efficiently approximate the unknown model of robotic manipulators. Hu et al. [26] present a multiple-layer neural network-based controller that can achieve a high accuracy of motion control of robotic manipulators subject to unknown disturbances. In [28], a high-order-sliding-mode differentiator (HOSMD)-based estimator is designed to compensate the mismatched uncertainties without the need of assuming a bounded uncertainty.

Besides the uncertainty of the dynamic model, state constraint is also a common problem of robotic manipulators, which has attracted much attention. For example, Sun et al. [30] proposed an adaptive neural network control scheme considering full-state constraints of robotic manipulators. In [31], an adaptive fuzzy control scheme that can guarantee the constrained output is presented. Yu et al. [32] designed an adaptive fuzzy control scheme working with a disturbance observer for the manipulators with full-state constraints. Nevertheless, those controllers can only achieve either the constrained tracking error of joint angle or the constrained error between the actual angular velocity and the virtual control signal. In other words, controllers in [30,31,32] do not guarantee the angular position and angular velocity for each joint of the robotic manipulator, within the preset constraints. Recently, Yang et al. [33] designed a new adaptive control scheme that can guarantee the angular position for each joint of a robotic manipulator to never exceed the preset constraints. However, the constrained angular velocity of each joint is not guaranteed.

In the light of reviewing the existing literature, the following issues need to be further solved:

In order to safely perform robotic manipulators, both the angular position and angular velocity of each joint of robotic manipulators should be controlled to not exceed the preset constraints. More precisely, the angular position (rotation angle) of each joint should be always within a reasonable range to have no risk on physically breaking the joint. Similarly, the angular velocity of each joint should not exceed its maximum related to the maximum rotational speed of the driving motor;
For some existing controllers (e.g., [9,23,26,30,31,32]), the tracking accuracy could be compromised due to the bounded result of tracking errors and uncertainty-estimation errors. Therefore, it is needed to avoid the loss of tracking accuracy caused by the lack of asymptotic convergence of tracking errors.

To handle the above issues, this paper proposes a new adaptive control scheme that utilizes time delay estimation (TDE) and reinforcement learning (RL) for

n

-link robotic manipulators. In the proposed control scheme, the multiple-input-and-multiple-output (MIMO) robotic system is initially decomposed into

n

single-input-single-output (SISO) subsystems by TDE. Each subsystem is with an unknown bounded TDE error. After that, the novel virtual control law for each subsystem is designed, which can not only achieve the boundness of tracking errors for each joint in the presence of TDE error, but can also guarantee both the angular position and angular velocity for each joint to be not exceeding the preset constraints. To improve the tracking accuracy, an RL-based term in the virtual control law is designed, which automatically learns the optimal parameters of the controller in the different system states. RL is an artificial intelligence technique that gradually explores the optimal policy by interacting with the environment, which has attracted many interests in the control of robotic manipulators such as [34,35,36]. Particularly, the RL-based term in this paper is designed to avoid the violation of the boundness of tracking errors, even if a bad policy is tried by RL, which ensures a safe environment for implementing RL. The tracking errors are proven to be uniformly ultimately bounded (UUB) via Lyapunov theory. Simulation results indicate the proposed control scheme can achieve a high tracking accuracy in the presence of model uncertainty and unknown disturbances.

The major merits of this paper include the following points:

Compare with some existing research [30,31,32,33], in addition to the basic achievement of the uniformly ultimately bounded (UUB) tracking error for each joint in the presence of TDE error; the control scheme can guarantee both the angular position and angular velocity for each joint to be not exceeding the preset constraints;
The novel adaptive gain in (13) results in the smooth control torques to reduce the chattering effect caused by switching term in (10). Meanwhile, an RL-based term can effectively improve the tracking accuracy, which thereby reduces the possible steady-state tracking errors caused by the lack of asymptotic convergence;
The mathematical expression of the controller is simple, meanwhile, any prior knowledge of upper bounds caused by an imprecise model are unnecessary in our control scheme.

The rest of this paper is organized as follows: in Section 2, the dynamics model of n-link robotic manipulators is given, and the control objective is described. In Section 3, the RL-based adaptive control scheme is proposed and the proof of stability is given. In Section 4, the numerical simulation is conducted to verify the effectiveness of the proposed controller. The conclusion is given in Section 5.

2. Dynamical Model and Problem Statement

The dynamic model of n-link robot manipulators is shown as the following:

M (q (t)) \ddot{q} (t) + C (q (t), \dot{q} (t)) \dot{q} (t) + G (q (t)) + F (\dot{q} (t)) = τ (t) + τ_{d} (t)

(1)

where

M (q (t))

∈ R^n×n is the inertia matrix,

q (t) = {[q_{1} (t), q_{2} (t), .., q_{i} (t), .., q_{n} (t)]}^{T}

∈ Rⁿ is the vector of angular positions of joints of manipulator.

C (q (t), \dot{q} (t))

∈ R^n×n is the Coriolis and centrifugal matrix.

G (q (t))

∈ Rⁿ is the gravity vector.

F (\dot{q} (t))

∈ Rⁿ is the vector of friction.

τ

(t) ∈ Rⁿ is the vector of torques applied on the joints.

τ_{d} (t) \in R^{n}

is the external disturbance.

The model (1) can be further written as (2) to indicate the system uncertainty.

\ddot{q} (t) = M {(q (t))}^{- 1} [- C (q (t), \dot{q} (t)) \dot{q} (t) - G (q (t)) - F (\dot{q} (t)) + τ_{d} (t)] + [M {(q (t))}^{- 1} - \hat{M} {(q (t))}^{- 1}] τ (t) + \hat{M} {(q (t))}^{- 1} τ (t) = Γ (t) + \hat{M} {(q (t))}^{- 1} τ (t)

(2)

where

\hat{M} (q (t))

is the estimation of

M (q (t))

.

Γ (t) = M {(q (t))}^{- 1} [- C (q (t), \dot{q} (t)) \dot{q} (t) - G (q (t)) - F (\dot{q} (t)) + τ_{d} (t)] + [M {(q (t))}^{- 1} - \hat{M} {(q (t))}^{- 1}] τ (t)

is the system uncertainty.

The vector of error between the desired angular position and actual angular position is defined in (3):

e (t) = q (t) - q_{d}

(3)

where

q_{d} = {[q_{d 1}, q_{d 2}, \dots, q_{d n}]}^{T} \in R^{n}

is the vector of desired angular position of joints.

e = {[e_{1}, e_{2}, \dots, e_{n}]}^{T} \in R^{n}

is the tracking error vector.

The main control objective is to design the torque

τ (t)

that can drive all the joints of the robotic manipulator system (1) to approach their desired angular positions. Meanwhile, both the angular positions and angular velocities for each joint should be guaranteed to not exceed the given constraints. The control objective can be described by (4)–(6).

| | e (t) | | \leq σ, \forall t \geq 0

(4)

| q_{i} (t) | < ε_{i}, \forall t \geq 0, i = 1, 2, .., n

(5)

| {\dot{q}}_{i} (t) | < Λ_{i}, \forall t \geq 0, i = 1, 2, .., n

(6)

where

σ^{*} > σ \geq 0

, and

σ^{*} > 0

is a positive constant,

ε_{i} > 0

is a positive constant referring to the angular position constraint of the

i^{t h}

joint, and

Λ_{i} > 0

is a positive constant, meaning the angular velocity constraint of the

i^{t h}

joint.

Remark 1.

The positive constant

σ^{*}

is related to the initial state of the system and parameters of the controller.

σ

reflects the tracking accuracy. Notably,

σ (t \to \infty) = 0

means the asymptotic convergence of tracking errors.

Remark 2.

In this paper, we consider the angular position tracking problem for each joint of the manipulator. Hence,

q_{d i}

is a constant for

i = 1, 2, .., n .

, which means

{\dot{q}}_{d i} = 0

. The angular trajectory tracking problem will be considered in our future work.

3. Controller Design and Stability Analysis

In this part, the adaptive RL-based controller working with TDE is developed. After that, the stability is proven by using Lyapunov theory.

3.1. Controller Design

The TDE technique is applied to handle the system uncertainty in (2):

\hat{Γ} (t) \approx Γ (t - L) = \ddot{q} (t - L) - {\hat{M}}^{- 1} (q (t - L)) τ (t - L)

(7)

where

\hat{Γ} (t) = {[{\hat{Γ}}_{1}, {\hat{Γ}}_{2}, .., {\hat{Γ}}_{n}]}^{T} \in R^{n}

is the estimate of

Γ (t)

.

L > 0

is the sampling time of TDE.

Lemma 1.

[14] The TDE error of robotic manipulator (2) is bounded such that

| Γ_{i} (t) - {\hat{Γ}}_{i} (t) | \leq Γ_{i}^{*}

(for

i = 1, 2, \dots, n

) if the following condition is satisfied:

| | I - M^{- 1} (q (t)) \hat{M} {(q (t - L)) | |}_{2} < 1

(8)

where

Γ_{i}^{*}

is an unknown positive constant.

The control law working with TDE technique is designed as follows:

τ (t) = {\hat{M}}^{} (q (t)) [- \hat{Γ} (t) + u (t)]

(9)

where

u (t) = {[u_{1} (t), u_{2} (t), \dots, u_{n} (t)]}^{T} \in R^{n}

is the virtual control law.

The virtual control law in (9) is designed as follows:

u_{i} = - {\hat{d}}_{i} s g n ({\dot{q}}_{i}) - \frac{1}{1 + k_{y i} C_{i}} [k_{p i} e_{i} + k_{d i} {\dot{q}}_{i} + k_{s i} A_{i} q_{i} + k_{s i} B_{i} z_{i} + λ_{i} {\dot{q}}_{i} + λ_{i} Λ_{i} t a n h (\frac{e_{i}}{₵_{i}})] i = 1, 2, \dots, n

(10)

where

k_{p i}

,

k_{s i}

,

k_{d i}, ₵_{i}

and

k_{y i}

are the positive constants determined by users.

λ_{i} > 0

is a positive variable determined by the fuzzy reinforcement learning mechanism.

A_{i} = \frac{ε_{i}^{2} z_{i}^{2}}{{(ε_{i}^{2} - q_{i}^{2})}^{2}}, B_{i} = \frac{ε_{i}^{2}}{ε_{i}^{2} - q_{i}^{2}}, C_{i} = \frac{2 Λ_{i}^{2}}{{(Λ_{i}^{2} - {\dot{q}}_{i}^{2})}^{2}}

(11)

where

ε_{i} > 0

is the restricted upper bound of angular position of the

i^{t h}

joint.

Λ_{i} > 0

is the restricted upper bound of angular velocity of the

i^{t h}

joint.

And the variable

z_{i}

is defined as follows:

z_{i} = e_{i} + \int_{0}^{t} η_{i} (Θ) d Θ, η_{i} = - β_{i} z_{i}

(12)

where

β_{i}

is a positive constant.

The

{\hat{d}}_{i}

is used to handle the bounded TDE error and the update law of

{\hat{d}}_{i}

is designed as follows.

{\dot{\hat{d}}}_{i} = {\begin{matrix} ψ_{i} (1 + k_{y i} C_{i}) | {\dot{q}}_{i} |, i f ({\hat{d}}_{i} \leq 0) o r (Ω_{i} > {\bar{Ω}}_{i}) \\ - ψ_{i} \frac{δ_{i}}{(1 + k_{y i} C_{i}) | {\dot{q}}_{i} |}, i f ({\hat{d}}_{i} > 0) a n d (Ω_{i} \leq {\bar{Ω}}_{i}) \end{matrix}

(13)

where

ψ_{i}

,

δ_{i}

and

{\bar{Ω}}_{i}

are positive constants. The variable

Ω_{i}

is defined as (14).

Ω_{i} = \frac{1}{2} {\dot{q}}_{i}^{2} + \frac{1}{2} k_{p i} e_{i}^{2} + \frac{1}{2} k_{s i} \frac{ε_{i}^{2} z_{i}^{2}}{ε_{i}^{2} - q_{i}^{2}} + k_{y i} \frac{{\dot{q}}_{i}^{2}}{Λ_{i}^{2} - {\dot{q}}_{i}^{2}} + λ_{i} ₵_{i} \ln [\cosh (\frac{e_{i}}{₵_{i}})]

(14)

Remark 3.

{\hat{d}}_{i}

is to guarantee the stability in the presence of the bounded TDE errors. A conservative update law of

{\hat{d}}_{i}

is to monotonously increase the value of

{\hat{d}}_{i}

such that

{\dot{\hat{d}}}_{i} = ψ_{i} (1 + k_{y i} C_{i}) | {\dot{q}}_{i} |

. Although such adaptive law can achieve the bounded tracking errors and angular velocities, the great value of

{\hat{d}}_{i}

could result in the chattering effect on the calculated control torques. Therefore, a novel adaptive law for

{\hat{d}}_{i}

shown in (13) is proposed to mitigate the chattering effect by decreasing the value of

{\hat{d}}_{i}

without the breach of stability of system. The proof will be given later. Moreover,

{\hat{d}}_{i} > 0

holds because

{\hat{d}}_{i} \leq 0

leads to

{\dot{\hat{d}}}_{i} = ψ_{i} (1 + k_{y i} C_{i}) | {\dot{q}}_{i} | \geq 0

.

Remark 4.

Similar to [33], the terms

A_{i} q_{i}

and

B_{i} z_{i}

in (10) are to guarantee the angular position of each joint of manipulator to not exceed the preset constraint

\pm ε_{i}

. While the term

\frac{1}{1 + k_{y i} C_{i}}

in (10) is to guarantee the angular velocity of each joint to not exceed the preset constraint

\pm Λ_{i}

, which was not achieved in [33]. The proof will be detailed later.

3.2. Stability Analysis

Theorem 1.

If the initial angular position and velocity of all joints are within their preset constraints such that

| q_{i} (0) | < ε_{i}

and

| {\dot{q}}_{i} (0) | < Λ_{i}

, and (8) in lemma 1 holds, the control law consisting of (7) and (9)–(14) can achieve the uniformly ultimately bounded tracking errors of robotic manipulator system (1). Meanwhile, the angular velocity and angular position of each joint of manipulator are within the preset constraints such that

| q_{i} (t) | < ε_{i}

and

| {\dot{q}}_{i} (t) | < Λ_{i}

,

\forall t > 0

. Namely, the control target (4)–(6) is achieved.

Proof.

By inserting (9) into (2) and using the fact of

{\hat{M}}^{- 1} \hat{M} = I

with

I = d i a g ([1, 1, .., 1]) \in R^{n \times n}

, the MIMO robotic manipulator system can be decoupled into

n

uncertain SISO subsystems (15).

{\ddot{q}}_{i} (t) = u_{i} (t) + d_{i} (t) i = 1, 2, \dots, n

(15)

where

q_{i} \in q

is the angular position of the

i^{t h}

joint.

u_{i}

is designed in (10).

d_{i} (t) = Γ_{i} (t) - {\hat{Γ}}_{i} (t)

is the TDE error.

The following Lyapunov candidate is designed:

V = \sum_{i = 1}^{n} V_{i} = \sum_{i = 1}^{n} [\frac{1}{2} {\dot{q}}_{i}^{2} + \frac{1}{2} k_{p i} e_{i}^{2} + \frac{1}{2 ψ_{i}} {\tilde{d}}_{i}^{2} + k_{s i} \frac{ε_{i}^{2} z_{i}^{2}}{2 (ε_{i}^{2} - q_{i}^{2})} + k_{y i} \frac{{\dot{q}}_{i}^{2}}{Λ_{i}^{2} - {\dot{q}}_{i}^{2}} + λ_{i} Λ_{i} ₵_{i} \ln (\cos h (\frac{e_{i}}{₵_{i}}))]

(16)

where

{\tilde{d}}_{i} = Γ_{i}^{*} - {\hat{d}}_{i}

.

Γ_{i}^{*}

is the upper bound TDE error defined in Lemma 1.

Remark 5.

Clearly,

k_{s i} \frac{ε_{i}^{2} z_{i}^{2}}{2 (ε_{i}^{2} - q_{i}^{2})} > 0

holds as long as

| q_{i} | < ε_{i}

.

k_{y i} \frac{{\dot{q}}_{i}^{2}}{Λ_{i}^{2} - {\dot{q}}_{i}^{2}} > 0

as long as

| {\dot{q}}_{i} | < Λ_{i}

. Furthermore,

Λ_{i} ₵_{i} l n (c o s h (\frac{e_{i}}{₵_{i}})) \geq 0

holds because of the fact of

c o s h (\cdot) \geq 1

and the fact of

l n (x) \geq 0

with

x \geq 1

. Therefore, the Lyapunov candidate (16) is positive defined as long as

| q_{i} | < ε_{i}

and

| {\dot{q}}_{i} | < Λ_{i}

.

Taking the derivative of the Lyapunov function (16) with respect to the time t and using (11), (12) and (15), we have:

\begin{array}{l} \dot{V} = \sum_{i = 1}^{n} [{\dot{q}}_{i} {\ddot{q}}_{i} + k_{p i} e_{i} {\dot{q}}_{i} - \frac{1}{ψ_{i}} {\tilde{d}}_{i} {\dot{\hat{d}}}_{i} + k_{s i} A_{i} q_{i} {\dot{q}}_{i} + k_{s i} B_{i} z_{i} {\dot{z}}_{i} + k_{y i} C_{i} {\dot{q}}_{i} {\ddot{q}}_{i} + λ_{i} Λ_{i} t a n h (\frac{e_{i}}{₵_{i}}) {\dot{q}}_{i}] \\ = \sum_{i = 1}^{n} {\dot{q}}_{i} [(1 + k_{y i} C_{i}) (u_{i} + d_{i}) + k_{p i} e_{i} + k_{s i} A_{i} q_{i} + k_{s i} B_{i} z_{i} + λ_{i} Λ_{i} t a n h (\frac{e_{i}}{₵_{i}})] \\ - \sum_{i = 1}^{n} (\frac{1}{ψ_{i}} {\tilde{d}}_{i} {\dot{\hat{d}}}_{i} + k_{s i} B_{i} z_{i} η_{i}) \end{array}

(17)

Substituting (10) and into (17) with the fact of

C_{i} > 0

as long as

| {\dot{q}}_{i} | < Λ_{i}

:

\dot{V} = \sum_{i = 1}^{n} {\dot{q}}_{i} (1 + k_{y i} C_{i}) [- {\hat{d}}_{i} s g n ({\dot{q}}_{i}) + d_{i}] - \sum_{i = 1}^{n} \frac{1}{ψ_{i}} {\tilde{d}}_{i} {\dot{\hat{d}}}_{i} - \sum_{i = 1}^{n} k_{s i} B_{i} β_{i} z_{i}^{2} - \sum_{i = 1}^{n} (λ_{i} + k_{d i}) {\dot{q}}_{i}^{2} \leq \sum_{i = 1}^{n} | {\dot{q}}_{i} | (1 + k_{y i} C_{i}) (Γ_{i}^{*} - {\hat{d}}_{i}) - \sum_{i = 1}^{n} \frac{1}{ψ_{i}} {\tilde{d}}_{i} {\dot{\hat{d}}}_{i} = \sum_{i = 1}^{n} [| {\dot{q}}_{i} | (1 + k_{y i} C_{i}) - \frac{1}{ψ_{i}} {\dot{\hat{d}}}_{i}] (Γ_{i}^{*} - {\hat{d}}_{i})

(18)

Therefore, for each

V_{i}

, we can derive (19) by combining (13) and (18):

{\dot{V}}_{i} \leq {\begin{matrix} 0, i f ({\hat{d}}_{i} \leq 0) o r (Ω_{i} \geq {\bar{Ω}}_{i}) \\ x_{i} (Γ_{i}^{*} - {\hat{d}}_{i}), i f ({\hat{d}}_{i} > 0) a n d (Ω_{i} < {\bar{Ω}}_{i}) \end{matrix}

(19)

where

x_{i} = | {\dot{q}}_{i} | (1 + k_{y i} C_{i}) + \frac{δ_{i}}{| {\dot{q}}_{i} | (1 + k_{y i} C_{i})}

. Clearly,

x_{i} \geq 2 \sqrt{δ_{i}} > 0

holds.

In addition, the

V_{i}

defined in (16) can be written as:

V_{i} = Ω_{i} + \frac{1}{2 ψ_{i}} {\tilde{d}}_{i}^{2}

(20)

where

Ω_{i}

is defined in (14).

To prove the Lyapunov function,

V_{i}

is bounded by a positive constant and we assume a sufficiently large constant

V_{i}^{*}

. Clearly, the sufficiently large

V_{i}^{*}

requires at least one of the terms (

Ω_{i}

or

{\tilde{d}}_{i}^{2}

) to be sufficiently large. If

Ω_{i}

is sufficiently large such that

Ω_{i} > {\bar{Ω}}_{i}

, then

{\dot{V}}_{i} \leq 0

holds according to (19). If

{\tilde{d}}_{i}^{2} = {(Γ_{i}^{*} - {\hat{d}}_{i})}^{2}

is sufficiently large such that

{\tilde{d}}_{i}^{2} > {(2 Γ_{i}^{*})}^{2}

, then

{\hat{d}}_{i} > 3 Γ_{i}^{*}

will hold because of the facts of

Γ_{i}^{*} > 0

(Lemma 1) and

{\hat{d}}_{i} \geq 0

(remark 3), which further means

Γ_{i}^{*} - {\hat{d}}_{i} < 0

holds. As a result, according to (19) and the fact of

x_{i} \geq 2 \sqrt{δ_{i}} > 0

,

{\dot{V}}_{i} \leq 0

can hold by

Ω_{i} > {\bar{Ω}}_{i}

or

{\tilde{d}}_{i}^{2} > {(2 Γ_{i}^{*})}^{2}

, which means

{\dot{V}}_{i} \leq 0

holds if

V_{i} \geq {\bar{Ω}}_{i} + \frac{1}{2 ψ_{i}} {(2 Γ_{i}^{*})}^{2}

holds. It is thereby easy to conclude that

V_{i}

is bounded by

V_{i}^{*}

such that

V_{i} \leq V_{i}^{*} = \max {V_{i} (0), {\bar{Ω}}_{i} + \frac{1}{2 ψ_{i}} {(2 Γ_{i}^{*})}^{2}}

.

Therefore, the tracking error of the

i^{t h}

subsystem is bounded because of the fact of

\frac{1}{2} k_{p i} e_{i}^{2} \leq V_{i}

.

| e_{i} | \leq \sqrt{\frac{2}{k_{p i}} \max {V_{i} (0), ({\bar{Ω}}_{i} + \frac{1}{2 ψ_{i}} (2 Γ_{i}^{*})^{2})} i = 1, 2, .., n

(21)

Then, the norm of the vector of the tracking error is thereby bounded.

| | e (t) | | \leq \sqrt{\sum_{i = 1}^{n} \frac{2}{k_{p i}} \max {V_{i} (0), [{\bar{Ω}}_{i} + \frac{1}{2 ψ_{i}} (2 Γ_{i}^{*})^{2}]}

(22)

Moreover, the bounded

V_{i}

implies all terms in

V_{i}

are bounded. The (23) therefore holds, which means

| q_{i} | < ε_{i}

and

| {\dot{q}}_{i} | < Λ_{i}

hold.

\frac{ε_{i}^{2} z_{i}^{2}}{2 (ε_{i}^{2} - q_{i}^{2})}, \frac{{\dot{q}}_{i}^{2}}{Λ_{i}^{2} - {\dot{q}}_{i}^{2}} \in L_{\infty} i = 1, 2, \dots, n

(23)

□

3.3. Fuzzy Q Reinforcement Learning Mechanism Determining Parameters of Controller

In this section, a fuzzy Q reinforcement learning mechanism is designed to tune the parameter

λ_{i}

to improve the tracking accuracy. The motivations of RL are detailed in remark 7.

Lemma 2.

If (24) holds, the tracking error

e_{i}

defined in (3) will asymptotically converge to zero with the converging rate satisfying

| {\dot{q}}_{i} | < Λ_{i}

.

0 = {\dot{q}}_{i} + Λ_{i} t a n h (\frac{e_{i}}{₵_{i}}), i = 1, 2, .. n

(24)

Proof.

A simple Lyapunov function is given in (25).

V_{i} = \frac{1}{2} e_{i}^{2}

(25)

Combining (24) and derivative of (25), we can obtain (26).

{\dot{V}}_{i} = - Λ_{i} e_{i} t a n h (\frac{e_{i}}{₵_{i}}) \leq - Λ_{i} ₵_{i} t a n h^{2} (\frac{e_{i}}{₵_{i}}) \leq 0

(26)

{\dot{V}}_{i} \leq 0

implies

V_{i} \in L_{\infty}

. And then, (27) can be obtained by integration on both sides of (26).

V_{i} (\infty) - V_{i} (0) \leq - Λ_{i} ₵_{i} \int_{0}^{\infty} t a n h^{2} (\frac{e_{i}}{₵_{i}})

(27)

In the light of Barbalet’s Lemma, (27) implies

t a n h (\frac{e_{i} (\infty)}{₵_{i}}) = 0

, which means

e_{i} (\infty) = 0

. Moreover, it is clear that

| {\dot{q}}_{i} | = | Λ_{i} t a n h (\frac{e_{i}}{₵_{i}}) | < Λ_{i}

due to the fact of

\tanh (\cdot) < 1

.

□

Remark 6.

The asymptotic convergence of tracking errors in Lemma 2 is stronger than the boundness of tracking errors that we achieved in Theorem 1, which means a better tracking accuracy. It is because asymptotic convergence means

e_{i}

is eventually going to zero, while boundedness only implies

| e_{i} |

bounded by a positive constant. Therefore, the proposed controller can be improved with a better tracking accuracy by finding the optimal parameters of the proposed controller that are able to minimize

| {\dot{q}}_{i} + Λ_{i} t a n h (\frac{e_{i}}{₵_{i}}) |

.

The converging behaviour of tracking errors of (24) is visualized in Figure 1.

Lemma 3.

For a moment

t^{*}

, if the system states

(q_{i} (t^{*}), {\dot{q}}_{i} (t^{*}))

satisfy (5) and (6) as well as

{\dot{q}}_{i} (t^{*}) + Λ_{i} t a n h (\frac{e_{i} (t^{*})}{₵_{i}}) \neq 0

holds, there always exists the optimal parameter

λ_{i}^{*} > 0

to decrease

{({\dot{q}}_{i} + Λ_{i} t a n h (\frac{e_{i}}{₵_{i}}))}^{2}

. Namely, (28) holds.

\frac{d [({\dot{q}}_{i} + Λ_{i} t a n h (\frac{e_{i}}{₵_{i}}))^{2}]}{d t} |_{t = t^{*}} < 0

(28)

Proof.

Combining (10) and (15), we can obtain the 1st derivative of

{\dot{q}}_{i} + Λ_{i} t a n h (\frac{e_{i}}{₵_{i}})

with respect to time, shown as (29).

\frac{d ({\dot{q}}_{i} + Λ_{i} t a n h (\frac{e_{i}}{₵_{i}}))}{d t} = - {\hat{d}}_{i} s g n ({\dot{q}}_{i}) + d_{i} (t) + \frac{Λ_{i}}{₵_{i}} \frac{\partial t a n h (\frac{e_{i}}{₵_{i}})}{\partial \frac{e_{i}}{₵_{i}}} {\dot{q}}_{i} - \frac{1}{1 + k_{y i} C_{i}} [k_{p i} e_{i} + k_{d i} {\dot{q}}_{i} + k_{s i} A_{i} q_{i} + k_{s i} B_{i} z_{i}] - \frac{1}{1 + k_{y i} C_{i}} [λ_{i} {\dot{q}}_{i} + λ_{i} Λ_{i} t a n h (\frac{e_{i}}{₵_{i}})]

(29)

Then, we define a negative changing rate of

{({\dot{q}}_{i} + Λ_{i} t a n h (\frac{e_{i}}{₵_{i}}))}^{2}

at moment

t^{*}

in (30).

\frac{d}{d t} {[{\dot{q}}_{i} (t) + Λ_{i} t a n h (\frac{e_{i} (t)}{₵_{i}})]}^{2} |_{t = t^{*}} < 0

(30)

Using (5), (6), (29) and (30), a solution of

λ_{i}^{*}

satisfying (30) is obtained in (31).

λ_{i}^{*} > \frac{1}{| {\dot{q}}_{i} + Λ_{i} t a n h (\frac{e_{i}}{₵_{i}}) |} [(1 + k_{y i} C_{i}) ({\hat{d}}_{i} + Γ_{i}^{*} + \frac{Λ_{i}^{2}}{₵_{i}}) + k_{p i} | e_{i} | + k_{d i} Λ_{i} + k_{s i} | A_{i} | ε_{i} + k_{s i} | B_{i} | | z_{i} |]

(31)

Notably,

{\hat{d}}_{i}

,

| e_{i} |

,

| z_{i} |

,

| A_{i} |

and

| B_{i} |

are all bounded because of the bounded Lyapunov function (16). Therefore, a finite real solution of

λ_{i}^{*}

satisfying (31) exists as long as

{\dot{q}}_{i} + Λ_{i} t a n h (\frac{e_{i}}{₵_{i}}) \neq 0

.

□

Remark 7.

According to Lemma 3 and Remark 6, the optimal parameters

λ_{i}^{*}

leading to a decrease of

| {\dot{q}}_{i} + Λ_{i} t a n h (\frac{e_{i}}{₵_{i}}) |

at the moment

t^{*}

can improve the tracking accuracy. Although a large enough

λ_{i}

can be selected to satisfy (31), an inappropriately large

λ_{i}

could lead to a significant chattering of control torques and thereby compromise the tracking accuracy. Moreover, an optimal

λ_{i}

is hard to be deterministically found due to the complexity of the system and the unknown TDE errors. Therefore, a fuzzy Q RL mechanism is designed to automatically determine the optimal

λ_{i}^{*}

.

Fuzzy Q learning is a common version of RL applicable on continuous systems, which can explore the optimal policy by interacting with the environment [37]. In this paper, the

λ_{i}

is to be tuned by a fuzzy Q learning mechanism according to the tracking error

e_{i}

and the angular velocity

{\dot{q}}_{i}

. The linguistic rules to determine

λ_{i}

can be given as the following form:

IF e_{i} (k) is ℒ_{e, i} AND {\dot{q}}_{i} (k) is ℒ_{\dot{q, i}}, THEN λ_{i} (k) is ℒ_{λ, i}

(32)

ℒ_{e, i}

,

ℒ_{\dot{q, i}}

and

ℒ_{λ, i}

are the linguistic descriptions of tracking error

e_{i}

, angular velocity

{\dot{q}}_{i}

and parameter

λ_{i}

respectively.

k

is the current moment. The linguistic description could be “small”, “medium” and “big”. As a result, an example of linguistic rule could be: IF

e_{i} (k)

is

s m a l l

AND

{\dot{q}}_{i} (k)

is

s m a l l

, THEN

λ_{i} (k)

is

s m a l l

.

Some intuitive linguistic rules can be given as:

IF $e_{i} (k)$ is small, and ${\dot{q}}_{i} (k)$ is small THEN $λ_{i} (k)$ is small
IF $e_{i} (k)$ is large, and ${\dot{q}}_{i} (k)$ is large THEN $λ_{i} (k)$ is large
IF $e_{i} (k)$ is large, and ${\dot{q}}_{i} (k)$ is small THEN $λ_{i} (k)$ is large
IF $e_{i} (k)$ is small, and ${\dot{q}}_{i} (k)$ is large THEN $λ_{i} (k)$ is small

Remark 8.

To carry out the linguistic inference shown by (32) in a numerical form, fuzzy logic inference is required. In detail, initially, numerical variables

e_{i} (k)

and

{\dot{q}}_{i} (k)

at the moment

k

are fuzzified to the firing rates of the linguistic descriptions by the triangular membership function shown in Figure 2. After that, a group of firing rates of fuzzy rules are obtained by fuzzy reasoning. Then, the numerical values of

λ_{i} (k)

are calculated by the defuzzification according to the firing rates of all fuzzy rules and the numerical value of the action corresponding to each fuzzy rule.

The parameters of membership function shown in Figure 2 are defined as follows:

L i n (e_{i}) = {ζ_{1, 1}^{(i)}, \dots, ζ_{1, 𝒶}^{(i)}, \dots, ζ_{1, A}^{(i)}}, 𝒶 = 1, 2, \dots, A

(33)

L i n ({\dot{q}}_{i}) = {ζ_{2, 1}^{(i)}, \dots, ζ_{2, 𝒷}^{(i)}, \dots, ζ_{2, ℬ}^{(i)}}, 𝒷 = 1, 2, \dots, ℬ

(34)

where

A

is the number of fuzzy sets (

ζ_{1, 𝒶}

) for the fuzzy input

e_{i}

, and

ℬ

is the number of fuzzy sets (

ζ_{1, 𝒷}

) for the fuzzy input

{\dot{q}}_{i}

.

The

𝓃

-th fuzzy rule in fuzzy Q learning is defined as follows:

R_{𝓃, i} : IF 𝓈_{1, i}^{(k)} is L_{1, i}^{(k)} and 𝓈_{2, i}^{(k)} is L_{2, i}^{(𝓃)} and \dots \dots and 𝓈_{𝓂, i}^{(k)} is L_{𝓂, i}^{(𝓃)}, THEN u_{i}^{(𝓃)} \in U_{i, 𝓃} that u_{i}^{(𝓃)} = u_{i, 1}^{(𝓃)} with 𝓆_{i}^{(k)} (𝓃, 1) or u_{i}^{(𝓃)} = u_{i, 2}^{(𝓃)} with 𝓆_{i}^{(k)} (𝓃, 2) or u_{i}^{(𝓃)} = u_{i, 3}^{(𝓃)} with 𝓆_{i}^{(k)} (𝓃, 3), \dots ., u_{i}^{(𝓃)} = u_{i, p}^{(𝓃)} with 𝓆_{i}^{(k)} (𝓃, p), \dots ., u_{i}^{(𝓃)} = u_{i, P}^{(𝓃)} . with 𝓆_{i}^{(k)} (𝓃, P)

(35)

where

U_{i, 𝓃} = {u_{i, 1}^{(𝓃)}, .. u_{i, p}^{(𝓃)}, .., u_{i, P}^{(𝓃)}}

is the set of action candidates in the rule

R_{𝓃, i}

.

L_{i}^{(𝓃)} =

{

L_{1, i}^{(𝓃)}, \dots, L_{𝓂, i}^{(𝓃)}

} is the set of linguistic variables of fuzzy inputs.

𝓈_{i}^{(k)} = {𝓈_{1, i}^{(k)}, \dots, 𝓈_{𝓂, i}^{(k)}}

is the set of fuzzy inputs at the

k

moment. In this paper, the fuzzy inputs are

e_{i}

and

{\dot{q}}_{i}

such that

𝓈_{i}^{(k)} = {e_{i} (k), {\dot{q}}_{i} (k)}

.

The set of fuzzy inputs

𝓈_{i}^{(k)} = {𝓈_{1, i}^{(k)}, \dots, 𝓈_{𝓂, i}^{(k)}}

is fuzzified by the membership function shown in Figure 2 and then matched with the rule antecedents (19), providing the firing rate vector

φ (𝓈_{i}^{(k)}) = [φ_{1} (𝓈_{i}^{(k)}), φ_{2} (𝓈_{i}^{(k)}), .., φ_{N} (𝓈_{i}^{(k)})]

.

N

is the amount of fuzzy rules (There are

N

fuzzy rules).

For the

𝓃^{t h}

rule in the

i^{t h}

subsystem (

R_{𝓃, i}

), the optimal action at the

k

moment is defined as the action with the maximum

𝓆_{i}^{(k)} (𝓃, p)

,

p \in {1, 2, .., P}

among

P

action candidates.

u^{*}_{i}^{(𝓃)} = a r g \max_{u_{i}^{(𝓃)} \in U_{i, 𝓃}} 𝓆_{i}^{(k)} (𝓃, u_{i, p}^{(𝓃)})

(36)

To prevent the selection of

u

from the local optimum in the learning process, we introduce a greed mechanism:

{\hat{u}}_{i}^{(𝓃)} = {\begin{matrix} u^{+}_{i}^{(𝓃)}, w i t h p r o b a b l i t y ρ \\ u^{*}_{i}^{(𝓃)} w i t h p r o b a b l i t y 1 - ρ \end{matrix}

(37)

where

u^{+}_{i}^{(𝓃)}

is an action randomly selected from

U_{i, 𝓃}

.

0 < ρ < 1

is the probability to explore random actions.

The numerical value of

λ_{i}

is calculated by firing rates and the selected actions:

λ_{i} (𝓈_{i}^{(k)}) = \frac{\sum_{𝓃 = 1}^{N} φ_{𝓃} (𝓈_{i}^{(k)}) {\hat{u}}_{i}^{(𝓃)}}{\sum_{𝓃 = 1}^{N} φ_{𝓃} (𝓈_{i}^{(k)})}

(38)

The update principle of

𝓆

is the most important part in the whole learning process. The

𝓆

-values are updated according to the rewards of the selected actions; the optimal action can achieve higher rewards, therefore, we can finally learn the optimal action.

The Q value at the

k

moment can be designed as follows:

Q (𝓈_{i}^{(k)}) = \frac{\sum_{𝓃 = 1}^{N} φ_{𝓃} (𝓈_{i}^{(k)}) 𝓆_{i}^{(k)} (𝓃, {\hat{u}}_{i}^{(𝓃)})}{\sum_{𝓃 = 1}^{N} φ_{𝓃} (𝓈_{i}^{(k)})}

(39)

The target value at the state

𝓈_{i}^{(k)}

is calculated as:

V (𝓈_{i}^{(k)}) = \frac{\sum_{𝓃 = 1}^{N} φ_{𝓃} (𝓈_{i}^{(k)}) 𝓆_{i}^{(k)} (𝓃, u^{*}_{i}^{(𝓃)})}{\sum_{𝓃 = 1}^{N} φ_{𝓃} (𝓈_{i}^{(k)})}

(40)

When the system state

𝓈_{i}^{(k)}

is driven to the next state

𝓈_{i}^{(k + 1)}

, temporal difference (TD) is calculated according to the reward obtained at the

k

moment

𝓇_{i}^{(k)}

:

∆ Q_{i}^{(k)} = 𝓇_{i}^{(k)} + α_{i} V (𝓈_{i}^{(k + 1)}) - Q (𝓈_{i}^{(k)})

(41)

where

α_{i} \in [0, 1]

is the discount factor reflecting the contribution of the future reward. The reward

𝓇_{i}^{k}

at the

k

moment of the

i^{t h}

subsystem is designed in (42) and the meaning of (42) is explained in Remark 9.

𝓇_{i}^{(k)} = {\begin{matrix} 0, i f | 𝓍_{i} (k) | > σ_{i} \\ \cos (\frac{π}{2} \frac{| 𝓍_{i} (k) |}{σ_{i}}), i f | 𝓍_{i} (k) | \leq σ_{i} \end{matrix}

(42)

where

𝓍_{i} (k) = {\dot{q}}_{i} (k) + Λ_{i} t a n h (\frac{e_{i} (k)}{₵_{i}})

and

σ_{i} > 0

is a positive constant.

Finally, the adaptive law of q-values is:

𝓆_{i}^{(k + 1)} (𝓃, {\hat{u}}_{i}^{(𝓃)}) = 𝓆_{i}^{(k)} (𝓃, {\hat{u}}_{i}^{(𝓃)}) + γ_{i} \cdot ∆ Q_{i}^{(k)} \cdot \frac{φ_{𝓃} (𝓈_{i}^{(k)})}{\sum_{𝓃 = 1}^{N} φ_{𝓃} (𝓈_{i}^{(k)})}

(43)

where

γ_{i} \in [0, 1]

is the learning rate.

Remark 9.

The reward function (42) indicates that higher values of reward

𝓇_{i}^{(k)}

can be obtained by the smaller

𝓍_{i} (k)

, which means the the action that satisfies (24) will be given with the highest reward (

𝓇_{i}^{(k)} = 1

).

The aforementioned process of fuzzy Q learning to tune

λ_{i}

can be also concluded in Figure 3. The proposed control scheme can be concluded in Figure 4.

Remark 10.

All the system states including the tracking errors are proven to be bounded with

λ_{i} > 0

. Therefore, the instability will not occur even if an inappropriate positive value of

λ_{i}

is calculated by the fuzzy Q learning mechanism trying some bad action candidates. Hence, the designed control law (7), (9)–(14) offers the fuzzy Q learning mechanism (32)–(43) a safe environment to learn a optimal

λ_{i}

. Notably, in order to make sure

λ_{i} > 0

always holds, all the action candidates

U_{i, 𝓃} = {u_{i, 1}^{(𝓃)}, .. u_{i, p}^{(𝓃)}, .., u_{i, P}^{(𝓃)}}

should be positive.

Remark 11.

The proof of Theorem 1 requires the parameter

λ_{i} > 0

to be a constant. However, the parameter

λ_{i}

is oneline tuned by a fuzzy Q learning mechanism, and therefore

λ_{i}

is varying in the implementation of the proposed controller. To handle this issue, we design that the time interval between two consecutive fuzzy inferences of

λ_{i}

in fuzzy Q learning is 20-times greater than the time interval between two consecutive control actions

τ

(eg. 0.1s between two consecutive

{\hat{u}}_{i}^{(𝓃)}

(equally, two consecutive

λ_{i}

) while 0.001s between two consecutive

τ

). Namely,

λ_{i} (t) = λ_{i} (t^{*}), \forall t \in [t^{*}, t^{*} + 20 L]

;

τ (t) =

τ (t^{*})

,

\forall t \in [t^{*}, t^{*} + L]

.

t^{*}

is any time moment when the algorithm works. Thereby, not only can the time derivative of

λ_{i}

be negligible in the proof of Theorem 1, but the computational load is also decreased.

4. Simulation Results and Analysis

In this section, similar to works [38,39,40,41,42], we use the simulation to verify the effectiveness of the proposed controller. A 2-rigid-link robotic manipulator shown in Figure 5 is carried in Matlab 2018a. The sampling time of simulating the real dynamics of robotic manipulators is set as

1 \times 10^{- 4}

s. The sampling time of TDE and controller is set as

1 \times 10^{- 3}

s (10 times to

1 \times 10^{- 4}

) to show the discrete nature of using controllers in practice. The sampling time of fuzzy Q learning is set as 0.01 s according to Remark 10. The dynamic model of a 2-rigid-link robotic manipulator is given as follows, which can be also found in [26].

M (q) = [\begin{matrix} m_{2} l_{2}^{2} + 2 l_{1} l_{2} m_{2} \cos (q_{2}) + (m_{1} + m_{2}) l_{1}^{2} & m_{2} l_{2}^{2} + l_{1} l_{2} m_{2} \cos (q_{2}) \\ m_{2} l_{2}^{2} + l_{1} l_{2} m_{2} \cos (q_{2}) & m_{2} l_{2}^{2} \end{matrix}]

C (q, \dot{q}) \dot{q} = [\begin{matrix} - m_{2} l_{1} l_{2} \sin (q_{2}) {\dot{q}}_{2}^{2} - 2 m_{2} l_{1} l_{2} \sin (q_{2}) {\dot{q}}_{1} {\dot{q}}_{2} \\ m_{2} l_{1} l_{2} \sin (q_{2}) {\dot{q}}_{1}^{2} \end{matrix}]

G (q) = [\begin{matrix} (m_{1} + m_{2}) l_{1} c o s (q_{2}) g + m_{2} l_{2} c o s (q_{1} + q_{2}) g \\ m_{2} l_{2} \cos (q_{1} + q_{2}) g \end{matrix}]

F (\dot{q}) = [\begin{matrix} F_{1} \\ F_{2} \end{matrix}]

τ_{d} = [\begin{matrix} τ_{d 1} \\ τ_{d 2} \end{matrix}]

The system parameters are given in Table 1, which are the same as [26].

The initial angular positions and velocities of two joints are set as:

q_{1} = 0^{o}

,

q_{2} = 0^{o}

,

{\dot{q}}_{1} = 0^{o} / s

and

{\dot{q}}_{2} = 0^{o} / s

. The desired angular positions of two joints are set as:

q_{r 1} = 30^{o}

and

q_{r 2} = 45^{o}

.

The upper bounds of the angular position and angular velocity for the

i^{t h}

joint of manipulator (

ε_{i}

and

Λ_{i}

) are selected by the user dependent on the specific mission requirement. To successfully implement the proposed controller, users can select any values satisfying

ε_{i} > | q_{r i} |

and

Λ_{i} > 0

. Therefore, in the simulation of this paper, we select

ε_{1} = 50^{o}

,

ε_{2} = 60^{o}, Λ_{1} = 10^{o} / s

and

Λ_{2} = 12^{o} / s

.

Remark 12.

k_{p i}

and

k_{d i}

are the proportional coefficient and differential coefficient, respectively. The great

k_{p i}

could decrease steady state error, but an excessively great could result in a significant overshoot. While the great

k_{d i}

could improve the robustness to disturbance/uncertainty, but an over-great

k_{d i}

could compromise the tracking accuracy. The selection of

k_{p i}

and

k_{d i}

can be based on the tuning rules of PID controllers.

Remark 13.

Great

k_{s i}

can amplify the effect of the terms

A_{i}

and

B_{i}

in (10) to keep the angular positions from hitting their constraints. An inappropriately small

k_{s i}

could lead to a

| q_{i} |

too close to

ε_{i}

and then result in an over-great

u_{i}

in (10). Therefore, the selection of

k_{s i}

could be started at a small value, and then users could gradually increase

k_{s i}

until the magnitude of

u_{i}

is acceptable. Great

k_{y i}

can amplify the effect of the term

C_{i}

in (10) to keep the angular velocities from hitting their constraints. An excessively great

k_{y i}

could lead to an over great (

1 + k_{y i} C_{i})

in (10) to decline the converging rate of the tracking error even if the angular velocity is not close to its constraint, which could increase the settling time. Therefore, the trial of selecting

k_{y i}

could start at a small value, and then users could decrease it until the converging rate of the tracking error is acceptable.

β_{i}

determines the converging rate of the auxiliary variable

z_{i}

that has a significant effect on steady state error in the controller [33]. However, in the proposed controller, the RL-based term is introduced to improve the tracking performance and thereby the effect of

β_{i}

is decreased. We suggest to offer

β_{i}

with a medium value ranging between 0.1–1.

Remark 14.

large values of

ψ_{i}

and

δ_{i}

can lead to a fast adaption of

{\hat{d}}_{i}

to handle system uncertainty and external disturbance. However, the inappropriately large

ψ_{i}

and

δ_{i}

could result in a sharp variation of

{\hat{d}}_{i}

and then a chattering effect on control torques. Therefore, the selection of

ψ_{i}

and

δ_{i}

could start at a big value, and then users could decrease them until no chattering effect occurs on the control torques. Over-small values of

{\bar{Ω}}_{i}

could lead to an insufficient decrease of

{\hat{d}}_{i}

that still brings up a chattering effect on control torques. Meanwhile, over-great values of

{\bar{Ω}}_{i}

could negatively influence the robustness to uncertainty and disturbance. Hence, the selection of

{\bar{Ω}}_{i}

could start at some small values, and then users could increase

{\bar{Ω}}_{i}

until

{\hat{d}}_{i}

is significantly decreased at the final stage of the control to have a satisfactory chattering attenuation.

Remark 15.

Small/great values of

₵_{i}

can amplify/reduce the effect of the RL term

λ_{i} Λ_{i} t a n h (\frac{e_{i}}{₵_{i}})

in (10). The over-small values of

₵_{i}

could result in the over-great magnitude of

u_{i}

, while the over-great values of

₵_{i}

could bring up an insufficient improvement on tracking performance. Therefore, the trial of selecting

₵_{i}

is suggested to start at a great value, and then users could decrease

₵_{i}

until a satisfying improvement on the tracking performance.

According to Remark 11~Remark 15, the parameters of the proposed controller are selected in Table 2.

Notably, it does not require an extensive trial to select parameters in Table 2 as the satisfying tracking performance is mainly obtained by the optimal

λ_{i}

in the RL term

λ_{i} Λ_{i} t a n h (\frac{e_{i}}{₵_{i}})

. In other words, users can select some medium values for parameters (not optimal) in Table 2 by an acceptable amount of parameters-selection trails. Then, the fuzzy Q learning can automatically explore the

λ_{i}

matching the selected parameters to have a satisfying tracking performance.

Remark 16.

Similar to [36], the parameters of membership function in (33) and (34),

L i n (e_{i}) = {ζ_{1, 1}^{(i)}, ζ_{1, 2}^{(i)}, \dots, ζ_{1, A}^{(i)}}

and

L i n ({\dot{q}}_{i}) = {ζ_{2, 1}^{(i)}, ζ_{2, 2}^{(i)}, \dots, ζ_{2, ℬ}^{(i)}}

, are used to fuzzify

e_{i}

and

{\dot{q}}_{i}

. To well present the

e_{i}

and

{\dot{q}}_{i}

in the form of firing rate, we suggest to offer

ζ_{1, 1}^{(i)}

and

ζ_{2, 1}^{(i)}

with some small values and provide some great values to

ζ_{1, A}^{(i)}

and

ζ_{2, ℬ}^{(i)}

. The

ζ_{1, 𝒶}^{(i)}

and

ζ_{2, 𝒷}^{(i)}

are suggested to be evenly distributed among (

ζ_{1, 1}^{(i)}

,

ζ_{1, A}^{(i)}

) and (

ζ_{2, 1}^{(i)}

,

ζ_{2, ℬ}^{(i)}

), respectively. The great

A

and

ℬ

increase the potential to well present

e_{i}

and

{\dot{q}}_{i}

at the cost of increasing the computational load. Therefore, the trial of selecting

A

and

ℬ

could start at some great values (e.g., 20), and then users could decrease them until a reasonable computational load.

Remark 17.

σ_{i}

is the threshold of obtaining rewards. The inappropriately small values of

σ_{i}

could result in a difficulty in obtaining high values of reward, while the over-great values of

σ_{i}

could result in the different action candidates to be offered with the similarly high rewards even if they lead to the different control performances. Therefore, both of the over-great and over-small values of

σ_{i}

will negatively influence the convergence of q-values in (43) and thereby compromise the performance of reinforcement learning. The selection of

σ_{i}

could start at a small value, and the users could increase

σ_{i}

until a satisfying convergence of q values (the convergence of q-values can be also reflected by the convergence of obtained rewards). The selection of mutation probability

ρ

, learning rate

γ_{i}

and discount factor

α_{2}

can be based on the strategy mentioned in [36].

Remark 18

Action candidates

U_{i, 𝓃} = {u_{i, 1}^{(𝓃)}, .. u_{i, p}^{(𝓃)}, .., u_{i, P}^{(𝓃)}}

are the most important parameters in the proposed control scheme because the optimal

λ_{i}

that brings up a satisfying control performance is calculated based on them. To make the optimal action included in the group of action candidates,

u_{i, 1}^{(𝓃)}

should be given a small value (e.g., 0) while

u_{i, P}^{(𝓃)}

should be given a great value. The rest candidates

u_{i, p}^{(𝓃)}

(

p = 1, 2, .., p

) are suggested to be evenly distributed between

u_{i, 1}^{(𝓃)}

and

u_{i, P}^{(𝓃)}

. Users could initially give

u_{i, P}^{(𝓃)}

with a small value (but greater than

u_{i, 1}^{(𝓃)}

), and an increase of

u_{i, P}^{(𝓃)}

until a sufficient improvement on the tracking performance is achieved.

p

is the amount of action candidates of each fuzzy rule for each subsystem. The selection of

p

could start at a great number (e.g., 20), and then users could decrease

p

until the computational load is acceptable.

According to Remark 16~Remark 18, the parameters of the fuzzy Q learning mechanism are given as follows. The amount of fuzzy sets in (17) are

A = ℬ = 8

. Therefore, the amount of fuzzy rules are

N = 64

. The parameters of membership function to do the fuzzification are:

L i n (e_{1}) = {ζ_{1, 1}^{(1)}, ζ_{1, 2}^{(1)}, \dots, ζ_{1, 8}^{(1)}} =

{−0.0017, −0.0014, −0.001, −0.0006, −0.0002, 0.0002, 0.0006, 0.001, 0.0014, 0.0017}.

L i n ({\dot{q}}_{1}) = {ζ_{2, 1}^{(1)}, ζ_{2, 2}^{(1)}, \dots, ζ_{2, 8}^{(1)}} =

{−0.175, −0.138, −0.097, −0.058, −0.019, 0.019, 0.058, 0.097, 0.138, 0.175}

\times 10^{- 3}

.

L i n (e_{2}) = {ζ_{1, 1}^{(2)}, ζ_{1, 2}^{(2)}, \dots, ζ_{1, 8}^{(2)}} =

{−0.0017, −0.0014, −0.001, −0.0006, −0.0002, 0.0002, 0.0006, 0.001, 0.0014, 0.0017}.

L i n ({\dot{q}}_{2}) = {ζ_{2, 1}^{(2)}, ζ_{2, 2}^{(2)}, \dots, ζ_{2, 8}^{(2)}} =

{−0.175, −0.138, −0.097, −0.058, −0.019, 0.019, 0.058, 0.097, 0.138, 0.175}

\times 10^{- 3}

. The action candidates for each fuzzy rule are:

U_{i, 𝓃} = {0, 22, 44, 66, 88, 111, 133, 155, 177, 200}

for all

i = 1, 2

and

𝓃 = 1, 2, \dots ., 64

. Therefore, the amount of action candidates for each fuzzy rule is

P = 10

. The discount factor is

α_{1} = α_{2} = 0.01

and the learning rate is

γ_{1} = γ_{2} = 0.2

. The threshold of obtaining rewards is

σ_{1} = σ_{2} = 0.01

.

To show the superiority, four existing controllers, Refs. [26,33,43,44] are compared with the proposed controller.

The controller from [26] is given as (44)–(52)

τ = τ_{1} + τ_{2}

(44)

τ_{1} = \hat{M} (q) ({\ddot{q}}_{r} - K_{V} \dot{e} - K_{P} e) + \hat{C} (q, \dot{q}) \dot{q} + \hat{G} (q)

(45)

τ_{2} = - \hat{M} (q) \hat{f} + \hat{M} (q) u_{r}

(46)

u_{r} = [\begin{matrix} - \hat{ξ} \tan h (a_{1} p / ρ_{1}) \\ - \hat{ξ} \tan h (a_{2} p / ρ_{1}) \end{matrix}]

(47)

\hat{f} = {\hat{W}}^{T} σ ({\hat{V}}^{T} X)

(48)

\dot{\hat{ξ}} = a_{1} p t a n h (\frac{a_{1} p}{ρ_{1}}) + a_{1} p t a n h (\frac{a_{2} p}{ρ_{1}}) - K \hat{ξ}

(49)

\dot{\hat{W}} = (σ - σ^{'} {\hat{V}}^{T} X) x^{T} P B - Υ_{W} \hat{W}

(50)

\dot{\hat{V}} = X x^{T} P B {\hat{W}}^{T} σ^{'} - Υ_{V} \hat{V}

(51)

p = 1 + | | X | | + | | \hat{V} | | \cdot | | X | | + | | \hat{W} | | \cdot | | X | |

(52)

where

σ = {[σ_{1}, σ_{2}, \dots, σ_{m}]}^{T}

is the vector of hidden neurons with the activation function

σ_{i} (s) = 1 / (1 + e^{- s})

.

σ^{'}

is the vector of partial derivative of

σ

such that

σ^{'} = \partial σ (s) / \partial s

.

x = {[e_{1}, e_{2}, {\dot{q}}_{1}, {\dot{q}}_{2}]}^{T}

and

X = {[q_{r 1}, q_{r 2}, q_{1}, q_{2}, {\dot{q}}_{1}, {\dot{q}}_{2}, {\ddot{q}}_{1}, {\ddot{q}}_{2}]}^{T}

.

P

satisfies

P A + A^{T} P = - Q

with

A = [\begin{matrix} 0 & I \\ - K_{P} & - K_{V} \end{matrix}]

.

[a_{1}, a_{2}] = x^{T} P B

.

In this simulation, we let the controller from [26] to fully know the system parameters such that

\hat{M} (q) = M (q)

,

\hat{C} (q, \dot{q}) = C (q, \dot{q})

and

\hat{G} (q) = G (q)

.

The parameters in (44)–(52) are given as:

K_{V} = [\begin{matrix} 300 & 0 \\ 0 & 300 \end{matrix}], K_{P} = [\begin{matrix} 200 & 0 \\ 0 & 200 \end{matrix}]

B = [\begin{matrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{matrix}], Q = [\begin{matrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{matrix}]

ρ_{1} = 0.01

,

K = 0.005

,

Υ_{W} = 0.15

,

Υ_{V} = 0.15

,

\hat{ξ} (0) = 0.01

\hat{W} (0) = [\begin{matrix} - 0.1 & - 0.1 \\ - 0.1 & - 0.1 \\ - 0.1 & - 0.1 \\ - 0.1 & - 0.1 \\ - 0.1 & - 0.1 \end{matrix}]

\hat{V} (0) = [\begin{matrix} 0.1 & 0.1 & 0.1 & 0.1 & 0.1 \\ 0.1 & 0.1 & 0.1 & 0.1 & 0.1 \\ 0.1 & 0.1 & 0.1 & 0.1 & 0.1 \\ 0.1 & 0.1 & 0.1 & 0.1 & 0.1 \\ 0.1 & 0.1 & 0.1 & 0.1 & 0.1 \\ 0.1 & 0.1 & 0.1 & 0.1 & 0.1 \end{matrix}]

The controller from [33] is given as (53)–(58).

τ = G (q) \hat{p} - K_{d} \dot{q} - K_{p} e - K_{s} d i a g {X_{1}, X_{2}} z

(53)

X_{i} = \frac{ζ_{i}^{2}}{ζ_{i}^{2} - q_{i}^{2}} + \frac{ζ_{i}^{2} q_{i} z_{i}}{{(ζ_{i}^{2} - q_{i}^{2})}^{2}}, i = 1, 2

(54)

z_{i} = e_{i} + \int_{0}^{t} η_{i} (Θ) d Θ, i = 1, 2

(55)

η_{i} = - β_{i} z_{i}, i = 1, 2

(56)

\dot{\hat{p}} = - Ψ G^{T} (q) \dot{q}

(57)

G (q) = [\begin{matrix} \cos (q_{1} + q_{2}) & \cos (q_{2}) \\ \cos (q_{1} + q_{2}) & 0 \end{matrix}]

(58)

where

\hat{p} = {[{\hat{p}}_{1}, {\hat{p}}_{2}]}^{T}

,

z_{1} (0) = e_{1} (0)

and

z_{2} (0) = e_{2} (0)

.

ζ_{1}

and

ζ_{2}

are the specific constraint (designed upper bound) of the 1st and 2nd joint respectively, therefore,

ζ_{1} = ε_{1} = 50^{o}

and

ζ_{2} = ε_{2} = 60^{o}

.

The parameters in (53)–(58) are given as follows:

K_{d} = [\begin{matrix} 60 & 0 \\ 0 & 20 \end{matrix}], K_{p} = [\begin{matrix} 24 & 0 \\ 0 & 24 \end{matrix}], K_{s} = [\begin{matrix} 20 & 0 \\ 0 & 20 \end{matrix}]

β_{1} = 30, β_{2} = 30, Ψ = [\begin{matrix} 0.1 & 0 \\ 0 & 0.1 \end{matrix}]

The controller from [43] is given as (59)–(61).

τ = K_{D}^{- 1} \hat{M} [K_{P} \dot{e} - K_{I} e + K_{D} {\ddot{q}}_{r} + β \dot{s} - K_{D} \hat{Γ} - K_{D} λ s - K_{D} k_{s} s g n (\dot{s})]

(59)

\dot{s} + β s = K_{P} e + K_{I} \int_{0}^{t} e + K_{D} \dot{e}

(60)

{\dot{k}}_{s} = γ | | \dot{s} | |

(61)

where

\dot{s} (0) = {[0, 0]}^{T}

,

k_{s} (0) = 0

,

γ = 0.1

,

\hat{Γ}

is the known part of lumped uncertainty

Γ

. In this simulation, we let

\hat{Γ} = 0.7 Γ

. The parameters

K_{P}

,

K_{D}

,

K_{I}

and

β

are given as follows.

K_{D} = [\begin{matrix} 20 & 0 \\ 0 & 20 \end{matrix}], K_{P} = [\begin{matrix} 10 & 0 \\ 0 & 10 \end{matrix}], K_{I} = [\begin{matrix} 0.01 & 0 \\ 0 & 0.01 \end{matrix}], β = [\begin{matrix} 0.1 & 0 \\ 0 & 0.1 \end{matrix}]

The controller from [44] is given as (62)–(65).

{\begin{matrix} \hat{D} = z + Y \dot{q} \\ \dot{z} = - Y {\hat{M}}^{- 1} z + Y {\hat{M}}^{- 1} (- τ - Y \dot{q} - \bar{D}) \end{matrix}

(62)

τ = \hat{M} ({\ddot{q}}_{r} - Λ \dot{q} + Λ {\dot{q}}_{r}) - \bar{D} - K_{D} s a t (s) - \hat{D}

(63)

s a t (s_{i}) = {\begin{matrix} s g n (s_{i}), | s_{i} | \geq σ \\ s_{i} / σ, | s_{i} | < σ \end{matrix}

(64)

where

σ = 0.1

,

s a t (s) = {[s_{1}, s_{2}]}^{T}

.

z (0) = {[0, 0]}^{T}

and

s = \dot{e} + Λ e

.

\bar{D}

is known part of

{\hat{M}}^{- 1} Γ

,

\bar{D} = 0.7 {\hat{M}}^{- 1} Γ

in the simulation. The parameters

Y

,

Λ

and

K_{D}

are given as follows.

K_{D} = [\begin{matrix} 40 & 0 \\ 0 & 16 \end{matrix}], Y = [\begin{matrix} \frac{1}{0.06} & 0 \\ 0 & \frac{1}{0.06} \end{matrix}], Λ = [\begin{matrix} 40 & 0 \\ 0 & 20 \end{matrix}]

Furthermore, two cases are considered to run the simulation. In the 1st case, the parameters of the dynamic model are fully known, and there is no external disturbance and no friction. In the 2nd case, the parametric uncertainty is considered and the unknown external disturbance and frictions are applied on the dynamics model.

Remark 19.

The selections of parameters for the proposed controller, the controller from [26] and the controller from [33] are all carried out in case 1. In other words, all the parameters for the three different controllers are fine-tuned to have a good performance in case 1. In case 2, all the selected parameters of the three controllers remain unchanged to test the robustness to the lumped uncertainty.

Case 1.

In the absence of system uncertainty, external disturbance and friction.

In this case, the system parameters used in the three controllers (proposed controller, controller from [26] and controller from [33]) are the same as the parameters in the dynamics model of manipulator, which means no parametric uncertainty. Meanwhile, the friction and external disturbance are null, shown as follows:

F (\dot{q}) = [\begin{matrix} F_{1} \\ F_{2} \end{matrix}] = [\begin{matrix} 0 \\ 0 \end{matrix}], τ_{d} = [\begin{matrix} τ_{d 1} \\ τ_{d 2} \end{matrix}] = [\begin{matrix} 0 \\ 0 \end{matrix}]

The comparisons of angular position tracking are shown in Figure 6 and Figure 7, while the comparisons of angular velocity are shown in Figure 8 and Figure 9. Cleanly, in the absence of unknown disturbance and system parametric uncertainty, the proposed controller shows an inferior performance in terms of a greater steady state error and a slower error converging rate, compared to the controllers from [26,33]. In comparison to [43,44], it provides a response faster than the sliding mode controller [44] but slower than the disturbance observer-based controller [44]. However, the steady state error of the proposed controller is smaller than that of [43,44].

In Figure 8 and Figure 9, the angular velocities of the proposed control scheme are within the preset constraint (red lines). While the controllers from [26,33,43,44] result in the angular velocities of two joints exceeding the constraints a

t = 0 ~ 2 s

.

case 2:

In the presence of system uncertainty, external disturbance and friction.

In this case, the system parameters used in all the five controllers are different to that in the dynamics model of the manipulator to indicate the parametric uncertainty. Namely,

∆ M

,

∆ C

and

∆ G

is taken as 20% of

M

,

C

and

G

. Moreover, the friction and external disturbance, which are not known by the controllers, are applied on the dynamics model.

The friction model is from [33].

F (\dot{q}) = [\begin{matrix} f_{s 1} [\tanh (f_{s 2} {\dot{q}}_{1}) - \tan h (f_{s 3} {\dot{q}}_{1})] + f_{c 1} \tanh (f_{c 2} {\dot{q}}_{1}) + f_{v} {\dot{q}}_{1} \\ f_{s 1} [\tanh (f_{s 2} {\dot{q}}_{2}) - \tan h (f_{s 3} {\dot{q}}_{2})] + f_{c 1} \tanh (f_{c 2} {\dot{q}}_{2}) + f_{v} {\dot{q}}_{2} \end{matrix}]

The parameters of the friction model are given as:

f_{s 1} = 20

,

f_{s 2} = 5

,

f_{s 3} = 3

,

f_{c 1} = 10

,

f_{c 2} = 2

and

f_{v} = 10

.

It is widely seen to use triangular functions as the unknown disturbance in the literature of the robotic system control, such as [13,36,45,46]. Therefore, in this paper, the external disturbance is in the form of the triangular functions.

τ_{d} = [\begin{matrix} 0.78 \sin (\frac{π}{3} t + \frac{π}{4}) + 0.065 \sin (\frac{π}{10} t + \frac{π}{4}) \\ 0.58 \cos (\frac{π}{3} t + \frac{π}{4}) + 0.091 \sin (\frac{π}{10} t + \frac{π}{4}) \end{matrix}]

The comparisons of tracking performance and tracking error in the presence of system parametric uncertainty, friction and external disturbance are shown in Figure 10, Figure 11, Figure 12 and Figure 13, respectively. Clearly, the proposed controller can achieve the smallest steady state errors, which means the robustness to the lumped uncertainty and unknown disturbance. The converging rate of tracking errors of the proposed controller is faster than the controller [33,43] but slower than the controller [26,44]. It is because the preset constraint of angular velocity (black lines in Figure 14 and Figure 15) limits the converging rate of tracking errors. Therefore, the converging rate of tracking errors could be increased by increasing the value of velocity constraint (

Λ_{i}

) (e.g., applying some better driving motors that have a greater maximum rotational speed to drive the joints of manipulator).

The computed control torques are shown in Figure 16 and Figure 17. The chattering effect of the proposed controller occurs at the initial stage because of the two following factors: 1. the increasing value of switching gain

{\hat{d}}_{i}

to handle disturbance at the initial stage, 2. the fuzzy Q learning mechanism tried some bad action candidates that lead to an undesirable consequence. After the initial stage (

t > 8 s

), it is observed that the proposed controller shows the smoothest control torque compared to [26,33,43,44]. Figure 18 shows the values of switching gain

{\hat{d}}_{1}

and

{\hat{d}}_{2}

, it is clear the

{\hat{d}}_{1}

and

{\hat{d}}_{2}

will decrease to a small value to avoid a chattering in steady state regardless of the disturbance and uncertainty.

Notably, in both case 1 and case 2, the proposed controller can make the angular positions and angular velocities of two joints to be within their constraints during the whole period of position tracking. More precisely, the angular position of each joint is always between the two black dash lines in Figure 6, Figure 7, Figure 10 and Figure 11. Meanwhile, the angular velocity of each joint is always between the two black dash lines in Figure 8, Figure 9, Figure 14 and Figure 15.

5. Conclusions

This paper proposed a novel adaptive control scheme utilizing TDE and RL for the angular position tracking control of robotic manipulators. The proposed control scheme can achieve a good tracking accuracy and a fast tracking performance even when subject to the system uncertainty and unknown disturbance. Moreover, the angular position and angular velocity of each joint of the manipulator are guaranteed to be within their preset constraints. The boundness of tracking errors and the stability of the robotic system controlled by the proposed controller are proven by Lyapunov theory. Notably, the stability will not be breached by the RL trying some bad action candidates, which ensures a safe environment for RL to explore the optimal policy. Simulation results validate the effectiveness of the proposed control scheme.

Author Contributions

Z.X.: Conceptualization, data curation, writing original draft. Q.L.: Formal analysis. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

The author agrees to publication.

Data Availability Statement

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare that there are no conflict of interest regarding the publication of this paper.

References

Chan, S.P.; Liaw, H.C. Generalized impedance control of robot for assembly tasks requiring compliant manipulation. IEEE Trans. Ind. Electron. 2002, 43, 453–461. [Google Scholar] [CrossRef]
Naito, J.; Obinata, G.; Nakayama, A.; Hase, K. Development of a Wearable Robot for Assisting Carpentry Workers. Int. J. Adv. Robot. Syst. 2007, 4, 48. [Google Scholar] [CrossRef]
Kazerooni, H.; Bausch, J.J.; Kramer, B.M. An Approach to Automated Deburring by Robot Manipulators. J. Dyn. Syst. Meas. Control 1986, 108, 354–359. [Google Scholar] [CrossRef]
Lee, W.Y.; Shih, C.L. Control and breakthrough detection of a three-axis robotic bone drilling system. Mechatronics 2006, 16, 73–84. [Google Scholar] [CrossRef]
Takei, T.; Imamura, R.; Yuta, S.I. Baggage Transportation and Navigation by a Wheeled Inverted Pendulum Mobile Robot. IEEE Trans. Ind. Electron. 2009, 56, 3985–3994. [Google Scholar] [CrossRef]
Datta, S.; Ray, R.; Banerji, D. Development of autonomous mobile robot with manipulator for manufacturing environment. Int. J. Adv. Manuf. Technol. 2008, 38, 536–542. [Google Scholar] [CrossRef]
Kim, Y.G.; Jeong, K.S.; Lee, J.W. Development of the composite third robot arm of the six-axis articulated robot manipulator. Compos. Struct. 1996, 35, 331–342. [Google Scholar] [CrossRef]
Gao, M.C.; Hou, J.C. Finite time linear quadratic control for weakly regular linear systems. IMA J. Math. Control Inf. 2001, 18, 405–425. [Google Scholar] [CrossRef]
Dabiri, A.; Chahrogh, L.K.; Machado, J.A.T. Closed-form Solution for The Finite-horizon Linear-quadratic Control Problem of Linear Fractional-order Systems. In Proceedings of the 2021 American Control Conference (ACC), New Orleans, LA, USA, 25–28 May 2021. [Google Scholar]
Shi, S. H_∞ output feedback stabilization for continuous-time switched linear systems. In Proceedings of the 2014 International Conference on Mechatronics and Control (ICMC), Jinzhou, China, 3–5 July 2014. [Google Scholar]
Chang, X.; Yang, G. New Results on Output Feedback H_∞ Control for Linear Discrete-Time Systems. IEEE Trans. Autom. Control 2014, 59, 1355–1359. [Google Scholar] [CrossRef]
Kim, E. Output feedback tracking control of robot manipulators with model uncertainty via adaptive fuzzy logic. IEEE Trans. Fuzzy Syst. 2004, 12, 368–378. [Google Scholar] [CrossRef]
Zhang, L.; Wang, Y.; Hou, Y.; Li, H. Fixed-time sliding mode control for uncertain robot manipulators. IEEE Access 2019, 7, 149750–149763. [Google Scholar] [CrossRef]
Baek, J.; Jin, M.; Han, S. A New Adaptive Sliding-Mode Control Scheme for Application to Robot Manipulators. IEEE Trans. Ind. Electron. 2016, 63, 3628–3637. [Google Scholar] [CrossRef]
Islam, S.; Liu, P.X. Robust sliding mode control for robot manipulators. IEEE Trans. Ind. Electron. 2011, 58, 2444–2453. [Google Scholar] [CrossRef]
Ahmed, S.; Wang, H.; Tian, Y. Adaptive fractional high-order terminal sliding mode control for nonlinear robotic manipulator under alternating loads. Asian J. Control 2021, 23, 1900–1910. [Google Scholar] [CrossRef]
Feng, Y.; Zhou, M.; Yu, X.; Han, F. Full-order sliding-mode control of rigid robotic manipulators. Asian J. Control 2019, 21, 1228–1236. [Google Scholar] [CrossRef]
Qi, W.; Zong, G.; Karimi, H.R. Sliding mode control for nonlinear stochastic semi-Markov switching systems with application to space robot manipulator model. IEEE Trans. Ind. Electron. 2020, 67, 3955–3966. [Google Scholar] [CrossRef]
Sun, F.; Li, L.; Li, H.; Liu, H. Neuro-fuzzy dynamic inversionbased adaptive control for robotic manipulators—Discrete time case. IEEE Trans. Ind. Electron. 2007, 54, 1342–1351. [Google Scholar] [CrossRef]
Fateh, S.; Fateh, M.M. Adaptive fuzzy control of robot manipulators with asymptotic tracking performance. J. Control Autom. Electr. Syst. 2020, 31, 52–61. [Google Scholar] [CrossRef]
Fan, Y.; An, Y.; Wang, W.; Yang, C. TS Fuzzy Adaptive Control Based on Small Gain Approach for an Uncertain Robot Manipulators. Int. J. Fuzzy Syst. 2020, 22, 930–942. [Google Scholar] [CrossRef]
He, W.; Dong, Y.; Sun, C. Adaptive neural impedance control of a robotic manipulator with input saturation. IEEE Trans. Syst. Man Cybern. Syst. 2016, 46, 334–344. [Google Scholar] [CrossRef]
Zhou, Q.; Zhao, S.; Li, H.; Lu, R.; Wu, C. Adaptive neural network tracking control for robotic manipulators with dead zone. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3611–3620. [Google Scholar] [CrossRef] [PubMed]
Nikdel, N.; Nikdel, P.; Badamchizadeh, M.A.; Hassanzadeh, I. Using Neural Network Model Predictive Control for Controlling Shape Memory Alloy-Based Manipulator. IEEE Trans. Ind. Electron. 2013, 61, 1394–1401. [Google Scholar] [CrossRef]
Yen, V.T.; Nan, W.Y.; Van Cuong, P.; Quynh, N.X.; Thich, V.H. Robust adaptive sliding mode control for industrial robot manipulator using fuzzy wavelet neural networks. Int. J. Control Autom. Syst. 2017, 15, 2930–2941. [Google Scholar] [CrossRef]
Hu, J.; Wang, P.; Xu, C.; Zhou, H.; Yao, J. High accuracy adaptive motion control for a robotic manipulator with model uncertainties based on multilayer neural network. Asian J. Control 2021, 24, 1503–1514. [Google Scholar] [CrossRef]
Liu, H.; Sun, J.; Nie, J.; Zou, L. Observer-based adaptive second-order non-singular fast terminal sliding mode controller for robotic manipulators. Asian J. Control 2021, 23, 1845–1854. [Google Scholar] [CrossRef]
Xiao, B.; Yang, X.; Karimi, H.R.; Qiu, J. Asymptotic tracking control for a more representative class of uncertain nonlinear systems with mismatched uncertainties. IEEE Trans. Ind. Electron. 2019, 66, 9417–9427. [Google Scholar] [CrossRef]
Song, T.; Fang, L.; Wang, H. Model-free finite-time terminal sliding mode control with a novel adaptive sliding mode observer of uncertain robot systems. Asian J. Control 2021, 24, 1437–1451. [Google Scholar] [CrossRef]
Sun, W.; Wu, Y.; Lv, X. Adaptive Neural Network Control for Full-State Constrained Robotic Manipulator With Actuator Saturation and Time-Varying Delays. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 3331–3342. [Google Scholar] [CrossRef]
Zhou, Q.; Wang, L.; Wu, C.; Li, H.; Du, H. Adaptive fuzzy control for nonstrict-feedback systems with input saturation and output constraint. IEEE Trans. Syst. Man Cybern. Syst. 2016, 47, 2209–2217. [Google Scholar] [CrossRef]
Yu, X.; Zhang, S.; Fu, Q.; Xue, C.; Sun, W. Fuzzy Logic Control of an Uncertain Manipulator with Full-State Constraints and Disturbance Observer. IEEE Access 2020, 8, 24284–24295. [Google Scholar] [CrossRef]
Yang, T.; Sun, N.; Fang, Y.; Xin, X.; Chen, H. New adaptive control methods for n-link robot manipulators with online gravity compensation: Design and experiments. IEEE Trans. Ind. Electron. 2021, 69, 539–548. [Google Scholar] [CrossRef]
Tang, L.; Liu, Y.-J.; Tong, S. Adaptive neural contro lusing reinforcement learning for a class of robot manipulator. Neural Comput. Appl. 2013, 25, 135–141. [Google Scholar] [CrossRef]
Li, Y.; Chen, L.; Tee, K.P.; Li, Q. Reinforcement learning control for coordinated manipulation of multi-robots. Neurocomputing 2015, 170, 168–175. [Google Scholar] [CrossRef] [Green Version]
Xie, Z.; Sun, T.; Kwan, T.H.; Mu, Z.; Wu, X. A New Reinforcement Learning Based Adaptive Sliding Mode Control Scheme for Free-Floating Space Robotic Manipulator. IEEE Access 2020, 8, 127048–127064. [Google Scholar] [CrossRef]
Kumar, A.; Sharma, R. Linguistic Lyapunov reinforcement learning control for robotic manipulators. Neurocomputing 2018, 272, 84–95. [Google Scholar] [CrossRef]
Yih, C.C.; Wu, S.J. Adaptive task-space manipulator control with parametric uncertainties in kinematics and dynamics. Appl. Sci. 2020, 10, 8806. [Google Scholar] [CrossRef]
Han, S.H.; Tran, M.S.; Tran, D.T. Adaptive sliding mode control for a robotic manipulator with unknown friction and unknown control direction. Appl. Sci. 2021, 11, 3919. [Google Scholar] [CrossRef]
Gao, M.; Ding, L.; Jin, X. ELM-Based Adaptive Faster Fixed-Time Control of Robotic Manipulator Systems. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 1–13. [Google Scholar] [CrossRef]
Liu, C.; Wen, G.; Zhao, Z.; Sedaghati, R. Neural-network-based sliding-mode control of an uncertain robot using dynamic model approximated switching gain. IEEE Trans. Cybern. 2020, 51, 2339–2346. [Google Scholar] [CrossRef]
Doan, Q.V.; Vo, A.T.; Le, T.D.; Kang, H.J.; Nguyen, N.H.A. A novel fast terminal sliding mode tracking control methodology for robot manipulators. Appl. Sci. 2020, 10, 3010. [Google Scholar] [CrossRef]
Mobayen, S.; Mofid, O.; Din, S.U.; Bartoszewicz, A. Finite-time tracking controller design of perturbed robotic manipulator based on adaptive second-order sliding mode control method. IEEE Access 2021, 9, 71159–71169. [Google Scholar] [CrossRef]
Yin, F.; Wen, C.; Ji, Q.; Zhang, H.; Shao, H. A compensation sliding mode control for machining robotic manipulators based on nonlinear disturbance observer. Trans. Inst. Meas. Control 2022, 44, 01423312221083771. [Google Scholar] [CrossRef]
Jia, S.; Shan, J. Continuous integral sliding mode control for space manipulator with actuator uncertainties. Aerosp. Sci. Technol. 2020, 106, 106192. [Google Scholar] [CrossRef]
Jia, S.; Shan, J. Finite-time trajectory tracking control of space manipulator under actuator saturation. IEEE Trans. Ind. Electron. 2019, 67, 2086–2096. [Google Scholar] [CrossRef]

Figure 1. Converging behaviour of tracking error satisfying (24).The red line is the velocity constraint, and the black line with arrow is the converging trajectory.

Figure 2. Membership function of fuzzy input of the

i^{t h}

subsystem.

Figure 2. Membership function of fuzzy input of the

i^{t h}

subsystem.

Figure 3. Flow chart of fuzzy Q learning to determine

λ_{i}

.

Figure 3. Flow chart of fuzzy Q learning to determine

λ_{i}

.

Figure 4. Proposed RL-based adaptive control scheme.

Figure 5. 2-rigid-link robotic manipulator.

Figure 6. Comparison of angular position to [26,33] in case 1: [26] scheme (yellow solid line), [33] scheme (red dash line), the proposed scheme (blue solid line), the reference angular position (green dashed line), the angular position constrains (black dashed line); (a) Joint 1. (b) Joint 2.

Figure 7. Comparison of the angular position to [43,44] in case 1: [43] scheme (yellow solid line), [44] scheme (red dash line), the proposed scheme (blue solid line), the reference angular position (green dashed line), the angular position constrains (black dashed line); (a) Joint 1. (b) Joint 2.

Figure 8. Comparison of angular velocity to [26,32] in case 1: [26] scheme (yellow solid line), [33] scheme (red dash line), the proposed scheme (blue solid line), the angular velocity constrains (black dashed line); (a) Joint 1. (b) Joint 2.

Figure 9. Comparison of angular velocity to [43,44] in case 1: [43] scheme (yellow solid line), [44] scheme (red dash line), the proposed scheme (blue solid line), the angular velocity constrains (black dashed line); (a) Joint 1. (b) Joint 2.

Figure 10. Comparison of angular position to [26,33] in case 2: the [26] scheme (yellow solid line), the [33] scheme (red dash line), the proposed scheme (blue solid line), the reference angular position (green dashed line), the angular position constrains (black dashed line); (a) Joint 1. (b) Joint 2.

Figure 11. Comparison of angular position error to [26,33] in case 2: the [26] scheme (yellow solid line), the [33] scheme (red dash line), the proposed scheme (blue solid line); (a) Joint 1. (b) Joint 2.

Figure 12. Comparison of angular position error to [43,44] in case 2: the [43] scheme (yellow solid line), the [44] scheme (red dash line), the proposed scheme (blue solid line); (a) Joint 1. (b) Joint 2.

Figure 13. Comparison of angular position to [43,44] in case 2: The [43] scheme (yellow solid line), the [44] scheme (red dash line), the proposed scheme (blue solid line), the reference angular position (green dashed line), the angular position constrains (black dashed line); (a) Joint 1. (b) Joint 2.

Figure 14. Comparison of angular velocity to [26,33] in case 2: the [26] scheme (yellow solid line), the [33] scheme (red dash line), the proposed scheme (blue solid line), the angular velocity constrains (black dashed line); (a) Joint 1. (b) Joint 2.

Figure 15. Comparison of angular velocity to [43,44] in case 2: the [43] scheme (yellow solid line), the [44] scheme (red dash line), the proposed scheme (blue solid line), the angular velocity constrains (black dashed line); (a) Joint 1. (b) Joint 2.

Figure 16. Comparison of torque to [26,33] in case 2: the [26] scheme (yellow solid line), the [33] scheme (red dash line), the proposed scheme (blue solid line); (a) Joint 1. (b) Joint 2.

Figure 17. Comparison of torque to [43,44] in case 2: the [43] scheme (yellow solid line), the [44] scheme (red dash line), the proposed scheme (blue solid line); (a) Joint 1. (b) Joint 2.

Figure 18. Parameter

{\hat{d}}_{i}

handling TDE errors in case 2: Joint 1 (blue solid line). Joint 2 (orange solid line).

Figure 18. Parameter

{\hat{d}}_{i}

handling TDE errors in case 2: Joint 1 (blue solid line). Joint 2 (orange solid line).

Table 1. Parameters of robotic manipulator.

Robotic manipulator	Parameter	Value
	$l_{1}$ ( $m$ )	0.5
	$l_{2}$ ( $m)$	0.5
	$m_{1}$ ( $k g)$	5
	$m_{2}$ ( $k g)$	2
	$g$ ( $m / s^{2}$ )	9.8

Table 2. Parameters of proposed controller.

Parameter	Value $(i = 1)$	Value $(i = 2)$
$k_{p i}$	10	10
$k_{d i}$	1	1
$k_{s i}$	0.1	0.1
$₵_{i}$	0.1	0.1
$k_{y i}$	0.1	0.1
$β_{i}$	0.1	0.1
$ψ_{i}$	0.01	0.01
${\bar{Ω}}_{i}$	0.05	0.05
$δ_{i}$	0.1	0.1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Xie, Z.; Lin, Q. Reinforcement Learning-Based Adaptive Position Control Scheme for Uncertain Robotic Manipulators with Constrained Angular Position and Angular Velocity. Appl. Sci. 2023, 13, 1275. https://doi.org/10.3390/app13031275

AMA Style

Xie Z, Lin Q. Reinforcement Learning-Based Adaptive Position Control Scheme for Uncertain Robotic Manipulators with Constrained Angular Position and Angular Velocity. Applied Sciences. 2023; 13(3):1275. https://doi.org/10.3390/app13031275

Chicago/Turabian Style

Xie, Zhihang, and Qiquan Lin. 2023. "Reinforcement Learning-Based Adaptive Position Control Scheme for Uncertain Robotic Manipulators with Constrained Angular Position and Angular Velocity" Applied Sciences 13, no. 3: 1275. https://doi.org/10.3390/app13031275

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Reinforcement Learning-Based Adaptive Position Control Scheme for Uncertain Robotic Manipulators with Constrained Angular Position and Angular Velocity

Abstract

1. Introduction

2. Dynamical Model and Problem Statement

3. Controller Design and Stability Analysis

3.1. Controller Design

3.2. Stability Analysis

3.3. Fuzzy Q Reinforcement Learning Mechanism Determining Parameters of Controller

4. Simulation Results and Analysis

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI