Article

Reinforcement-Learning-Based Tracking Control with Fixed-Time Prescribed Performance for Reusable Launch Vehicle under Input Constraints

Shihao Xu, Yingzi Guan, Changzhu Wei, Yulong Li and Lei Xu
1 School of Astronautics, Harbin Institute of Technology, Harbin 150001, China
2 Beijing Institute of Space Launch Technology, China Academy of Launch Vehicle Technology, Beijing 100076, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(15), 7436; https://doi.org/10.3390/app12157436
Submission received: 21 June 2022 / Revised: 18 July 2022 / Accepted: 21 July 2022 / Published: 24 July 2022
(This article belongs to the Special Issue AI Applications in the Industrial Technologies)

Abstract: This paper proposes a novel reinforcement learning (RL)-based tracking control scheme with fixed-time prescribed performance for a reusable launch vehicle subject to parametric uncertainties, external disturbances, and input constraints. First, a fixed-time prescribed performance function is employed to restrain the attitude tracking errors, and an equivalent unconstrained system is derived via an error transformation technique. Then, a hyperbolic tangent function is incorporated into the optimal performance index of the unconstrained system to tackle the input constraints. Subsequently, an actor-critic RL framework with super-twisting-like sliding mode control is constructed to establish a practical solution to the optimal control problem. Benefiting from the proposed scheme, the robustness of the RL-based controller against unknown dynamics is enhanced, and the control performance can be quantitatively prearranged by users. Theoretical analysis shows that the attitude tracking errors converge to a preset region within a preassigned fixed time, and that the weight estimation errors of the actor-critic networks are uniformly ultimately bounded. Finally, comparative numerical simulation results are provided to illustrate the effectiveness and improved performance of the proposed control scheme.

1. Introduction

Recent years have witnessed an increasing demand for reliable and economical access to space. Reusable launch vehicles (RLV), as a cost-effective means of undertaking space missions, are attracting increasing attention from researchers [1]. The dynamic model of an RLV exhibits strong non-linearity and coupling due to the complex flight environment of the re-entry phase. External disturbances, uncertain structural and aerodynamic parameters, and input constraints inevitably exist during real flight and have a significant impact on the attitude control system. In this context, attitude control for RLV is a challenging topic and has elicited widespread interest. Various control methodologies, such as adaptive control [2], dynamic inversion control [3], robust control [4], sliding mode control [5,6], and neural network (NN) control [7,8], have been applied over the past decades. Nevertheless, there is still scope to develop an optimal control approach for RLV suffering from complicated non-linear dynamics, parametric uncertainties and limited inputs.
From a mathematical point of view, the Hamilton–Jacobi–Bellman (HJB) equation and its solution must be established to solve optimal control problems. However, it is difficult to derive an analytical solution of the HJB equation for non-linear continuous-time systems. Given this, a reinforcement learning (RL) scheme with an actor-critic (AC) structure was initially created by Werbos [9], whereby a critic network was exploited to approximate the value function, and an actor network was deployed to obtain the optimal control policy. Informed by Werbos' contribution, Vamvoudakis et al. developed an online AC algorithm to solve the continuous-time infinite horizon optimal control problem [10]. He et al. proposed a novel online learning and optimization structure by incorporating a reference network into the AC structure [11]. Ma et al. devised a learning-based adaptive sliding mode control scheme for a tethered space robot with limited inputs [12]. Although the above control strategies have provided excellent results in terms of optimal control, they rely on accurate dynamic modeling [13]. Given the parametric uncertainties and unknown disturbances, it is, in practice, difficult to determine the system dynamics of RLV exactly, which limits the methods' applicability. Hence, the robustness of the aforementioned methods must be enhanced before they can be applied to practical systems with unknown dynamics. Fan et al. combined integral sliding mode control (ISMC) with a reinforcement learning control scheme for non-linear systems with partially unknown dynamics [14]. However, the input constraints were not considered, and the ISMC used may lead to unexpected oscillation [15]. Zhang et al. developed a learning-based H∞ tracking control scheme in which the uncertainties and input constraints were considered [16]. Nevertheless, the control design is conservative, and the iterative algorithm is rather complicated.
It is of note that previous RL-based control methods only establish asymptotic or finite-time convergence [17]. Therefore, the upper bound of the convergence time is uncontrollable, and the transient and steady-state performance, namely, the maximum overshoot and the steady accuracy, cannot be quantitatively prearranged by users. As a promising solution to this problem, the prescribed performance control (PPC) method created by Bechlioulis et al. has attracted widespread attention [18]. The salient feature of PPC is that users can quantitatively pre-arrange both the transient performance and the steady tracking error. In [19], a novel PPC scheme combined with a command filter was proposed for a quadrotor unmanned aerial vehicle subject to error constraints. In [20], an NN-based adaptive non-affine tracking controller was devised for an air-breathing hypersonic vehicle with guaranteed prescribed performance. In [21], a data-driven PPC scheme was developed for an unmanned surface vehicle with unknown dynamics. However, the conventional PPC approach only ensures that the system states converge to the preset region as time tends to infinity [22], leading to an unsatisfactory solution for time-limited problems, such as the re-entry mission. Moreover, the input constraints and the optimality are not comprehensively considered in the conventional PPC paradigm. Therefore, the expected performance index cannot be consistently guaranteed and optimized for RLV suffering from poor aerodynamic maneuverability and limited control torques.
Motivated by the foregoing considerations, a novel RL-based tracking controller, with fixed-time prescribed performance for RLV subject to parameter uncertainties, external disturbances and input constraints, is investigated. The main contributions and characteristics of the proposed method can be summarized as follows.
  • An online RL-based, nearly optimal, controller with limited inputs is developed by synthesizing the AC structure and the hyperbolic tangent performance index. In addition, the robustness of the learning-based controller is strengthened by incorporating a super-twisting-like sliding mode control.
  • Compared with previous learning-based controllers described in [10,11,12], in which the system dynamics are required to be known exactly, the proposed control scheme only requires the input-output data pairs of the RLV, so that the system dynamics can be completely unknown.
  • In contrast to existing RL-based control schemes with asymptotic or finite-time convergence [10,11,12,17,21], the proposed control scheme can ensure that the tracking errors converge to a preset region within a preassigned fixed time. Moreover, the prescribed transient and steady-state performance can be guaranteed.
  • Comparative numerical simulation investigations show that the proposed method provides improved performance in terms of transient response and steady-state accuracy, with less control effort.

2. Problem Statement and Preliminaries

2.1. Problem Statement

Following [2], the control-oriented model of the rigid-body RLV is given as follows:
\dot{\Theta} = R\omega, \qquad I\dot{\omega} = -\Omega I\omega + M + \Delta D \quad (1)
where Θ = [α, β, σ]^⊤ represents the attitude angle vector, ω = [p, q, r]^⊤ denotes the angular rate vector, M = [M_x, M_y, M_z]^⊤ is the control input vector with M_x, M_y, M_z limited to the interval [−M̄, M̄], ΔD = ΔD_a + ΔD_e is the unknown disturbance vector, ΔD_a is the aerodynamic torque vector, and ΔD_e is the external disturbance vector. ΔD_a can be formulated as:
\Delta D_a = \begin{bmatrix} \left(m_x^{p} L_r p/V + m_x^{\sigma}\sigma\right) q_V S_r L_r \\ \left(m_y^{r} L_r r/V + m_y^{\beta}\beta\right) q_V S_r L_r \\ \left(m_z^{q} L_r q/V + m_z^{\alpha}\alpha\right) q_V S_r L_r \end{bmatrix} \quad (2)
where m_x^p, m_y^r and m_z^q represent the damping moment coefficients, m_x^σ, m_y^β and m_z^α are the static stability moment coefficients, V is the velocity, q_V is the dynamic pressure, and S_r and L_r are the cross-sectional area and the reference length of the RLV, respectively.
The skew-symmetric matrix Ω , the inertia matrix I , and the coordinate transformation matrix R are defined by
\Omega = \begin{bmatrix} 0 & -r & q \\ r & 0 & -p \\ -q & p & 0 \end{bmatrix}, \quad I = \begin{bmatrix} I_{xx} & 0 & -I_{xz} \\ 0 & I_{yy} & 0 \\ -I_{zx} & 0 & I_{zz} \end{bmatrix}, \quad R = \begin{bmatrix} -\cos\alpha\tan\beta & 1 & -\sin\alpha\tan\beta \\ \sin\alpha & 0 & -\cos\alpha \\ -\cos\alpha\cos\beta & -\sin\beta & -\sin\alpha\cos\beta \end{bmatrix}. \quad (3)
Defining the guidance command vector Θ_d = [α_c, β_c, σ_c]^⊤, the attitude tracking error vector e_1 = Θ − Θ_d = [e_{1α}, e_{1β}, e_{1σ}]^⊤, and the angular rate tracking error vector e_2 = Rω − Θ̇_d, the tracking error dynamics are given as:
\dot{e}_1 = e_2, \qquad \dot{e}_2 = B_1 M + \Delta D_1 \quad (4)
where B_1 = RI^{-1} is the control matrix and \Delta D_1 = -RI^{-1}\Omega I\omega - \ddot{\Theta}_c + \dot{R}\omega + RI^{-1}\Delta D denotes the lumped disturbance vector.
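For readers who wish to prototype the model, a minimal numerical sketch of the tracking error dynamics (4) is given below; it assumes the matrices R and I and the lumped disturbance ΔD_1 are supplied from (1)-(3), and the function name and NumPy usage are illustrative rather than part of the original paper.

```python
import numpy as np

def error_dynamics(e1, e2, M, R, I_mat, Delta_D1):
    """One evaluation of the tracking error dynamics (4).

    e1, e2   : attitude / angular-rate tracking errors, shape (3,)
    M        : control torque vector, shape (3,)
    R, I_mat : coordinate transformation and inertia matrices, shape (3, 3)
    Delta_D1 : lumped disturbance vector, shape (3,), bounded per Assumption A1
    """
    B1 = R @ np.linalg.inv(I_mat)   # control matrix B_1 = R I^{-1}
    e1_dot = e2
    e2_dot = B1 @ M + Delta_D1
    return e1_dot, e2_dot
```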
Assumption A1
([2]). ΔD_1 is bounded, i.e., ||ΔD_1|| ≤ D_m.
Assumption A2
([2]). During the re-entry phase, β ≠ ±90 deg; thus, R is always invertible.
Control Objective: According to the tracking error dynamics (4), the control objective of this paper can be summarized as developing an RL-based optimal control scheme with guaranteed fixed-time prescribed performance such that the attitude tracking errors e 1 i ( i = α , β , σ ) can converge to a preset region within a preassigned fixed time.

2.2. Preliminaries

Lemma 1
([23]). Considering the non-linear function α(x) = ln[1 − tanh²(x)], the following equation
\alpha(x) = \ln 4 - 2x\,\mathrm{sign}(x) + \kappa_\alpha \quad (5)
always holds, where κ_α is bounded by a real positive constant.
Lemma 2
([24]). Considering the following fixed-time prescribed performance function (FTPPF)
\rho(t) = \begin{cases} (\rho_0 - \rho_\infty)\left[\dfrac{\sin(2\pi t/T)}{2\pi} - \dfrac{t}{T}\right] + \rho_0, & 0 \le t \le T \\ \rho_\infty, & t > T \end{cases} \quad (6)
where ρ_0 > ρ_∞ > 0 represent the initial and terminal values of the FTPPF, respectively, and T > 0 is the preassigned convergence time. It can be concluded that ρ(t) is a positive, non-increasing and 𝒞² continuous function with ρ(0) = ρ_0, ρ(T) = ρ_∞ and ρ̇(T) = ρ̈(T) = 0.
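As an illustration only, Lemma 2 can be implemented directly as below; the numerical values in the example mirror the FTPPF settings later adopted in Section 4 and are not prescribed by the lemma itself.

```python
import numpy as np

def ftppf(t, rho0, rho_inf, T):
    """Fixed-time prescribed performance function rho(t) of Lemma 2 / Eq. (6)."""
    if t > T:
        return rho_inf
    return (rho0 - rho_inf) * (np.sin(2.0 * np.pi * t / T) / (2.0 * np.pi) - t / T) + rho0

# Example with the Section 4 settings: rho(0) = pi/36 (5 deg), rho(t >= 2 s) = pi/360 (0.5 deg)
rho_samples = [ftppf(t, np.pi / 36, np.pi / 360, 2.0) for t in np.linspace(0.0, 3.0, 7)]
```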
Lemma 3
([25]). For any μ > 0 , the following inequality holds
0 \le |x| - x\tanh\!\left(\frac{x}{\mu}\right) \le k_p\mu \quad (7)
where k_p satisfies k_p = e^{-(k_p+1)} (i.e., k_p = 0.2785).

3. Controller Design

3.1. Prescribed Performance Constraint

In this subsection, the following constraint is formulated to restrain the attitude tracking errors e 1 i ( i = α , β , σ ) within the FTPPF
-\rho_i(t) < e_{1i}(t) < \rho_i(t) \quad (8)
where ρ i ( t ) is defined in Lemma 2. Subsequently, the equivalent unconstrained error variables η 1 i ( i = α , β , σ ) can be derived via the error transformation method [18]
\eta_{1i} = \frac{1}{2}\ln\frac{1 + z_i}{1 - z_i} \quad (9)
with z i = e 1 i / ρ i . Taking the first and second-order time derivatives of η 1 i yields
\dot{\eta}_{1i} = \xi_i\eta_{2i}, \qquad \dot{\eta}_{2i} = \dot{e}_{2i} - \Lambda_i \quad (10)
with \eta_{2i} = e_{2i} - \frac{\dot{\rho}_i}{\rho_i}e_{1i}, \xi_i = \frac{1}{2\rho_i}\left(\frac{1}{z_i + 1} - \frac{1}{z_i - 1}\right) and \Lambda_i = \frac{\rho_i\ddot{\rho}_i e_{1i} + \rho_i\dot{\rho}_i e_{2i} - \dot{\rho}_i^{2} e_{1i}}{\rho_i^{2}}. Substituting (4) into (10), one obtains
\dot{\eta}_1 = \xi\eta_2, \qquad \dot{\eta}_2 = B_1 M + \Delta D_1 - \Lambda \quad (11)
where ξ = D(ξ_i), η_2 = C(η_{2i}) and Λ = C(Λ_i), with C(·) and D(·) denoting the column vector and the diagonal matrix formed from the indexed elements, respectively.
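The error transformation can be sketched per channel as follows, assuming the FTPPF value ρ_i and its first two derivatives are available at the current time; the collapsed expression for ξ_i is algebraically equal to the bracketed form in (10), and the variable names are illustrative.

```python
import numpy as np

def transform_channel(e1, e2, rho, rho_dot, rho_ddot):
    """Unconstrained error variables for one channel, following (9)-(10).

    Returns eta1, eta2, xi, Lam such that eta1_dot = xi * eta2 and
    eta2_dot = e2_dot - Lam, cf. (10)-(11).
    """
    z = e1 / rho                                  # normalized error, |z| < 1 inside the funnel
    eta1 = 0.5 * np.log((1.0 + z) / (1.0 - z))    # Eq. (9)
    eta2 = e2 - (rho_dot / rho) * e1
    xi = 1.0 / (rho * (1.0 - z ** 2))             # collapsed form of xi_i
    Lam = (rho * rho_ddot * e1 + rho * rho_dot * e2 - rho_dot ** 2 * e1) / rho ** 2
    return eta1, eta2, xi, Lam
```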

3.2. Reinforcement Learning-Based Control Design

Firstly, the following sliding variable s is defined as:
s = \eta_2 + c\eta_1 \quad (12)
where c is a positive diagonal matrix. Taking the first time derivative of s along (11) yields
\dot{s} = \Xi + B_1 M + \Delta D_1 \quad (13)
where Ξ = cη̇_1 − Λ. In order to achieve satisfactory tracking performance, a traditional super-twisting controller based on (12) and (13) is developed as follows:
M = -B_1^{-1}\Gamma, \qquad \Gamma = \Xi + C\!\left(\lambda_{1i}\,\mathrm{sig}^{1/2}(s_i)\right) - M_d \quad (14)
where sig^{1/2}(s_i) = |s_i|^{1/2} sign(s_i), Ṁ_d = −D(λ_{2i}) sign(s), and λ_{1i} and λ_{2i} are positive constants. Nevertheless, the feasibility of (14) may not be guaranteed with limited control inputs, and the time-fuel performance index cannot be approximately optimized. To this end, an online RL-based, nearly optimal, controller is proposed by integrating the AC structure into the super-twisting controller, which provides a comprehensive solution to the issues mentioned above.
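For reference, a minimal sketch of one integration step of the baseline super-twisting law (14) is given below, before the AC structure is added; the explicit Euler discretization and the variable names are implementation choices, not part of the paper.

```python
import numpy as np

def super_twisting_step(s, Md, Xi, B1, lam1, lam2, dt):
    """One explicit Euler step of the baseline super-twisting law (14).

    s, Md : sliding variable and auxiliary state M_d, arrays of shape (3,)
    Xi    : drift term Xi = c @ eta1_dot - Lambda, shape (3,)
    B1    : control matrix R I^{-1}, shape (3, 3)
    lam1, lam2 : positive gain vectors, shape (3,)
    """
    sig_half = np.sqrt(np.abs(s)) * np.sign(s)   # sig^{1/2}(s_i), element-wise
    Gamma = Xi + lam1 * sig_half - Md
    M = -np.linalg.solve(B1, Gamma)              # M = -B1^{-1} Gamma
    Md_next = Md + dt * (-lam2 * np.sign(s))     # dot(M_d) = -lam2 * sign(s_i)
    return M, Md_next
```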
Before elaborating the detailed design procedure, it is assumed that there exists a group of admissible control strategies [10]
M = M_o - B_1^{-1}\left[C\!\left(\lambda_{1i}\,\mathrm{sig}^{1/2}(s_i)\right) - M_d\right] \in \Omega_u. \quad (15)
Moreover, M can achieve the control objective with the following time-fuel performance index being satisfied
V(s) = \int_t^{\infty}\left[V_s(\tau) + J\big(M_o(\tau)\big)\right]\mathrm{d}\tau \quad (16)
where V_s = s^⊤Qs, Q is a positive diagonal matrix, and J(M_o) is chosen as [26]:
J(M_o) = 2\int_0^{M_o}\lambda\left[\tanh^{-1}\!\left(\frac{v}{\lambda}\right)\right]^{\top}\varpi\,\mathrm{d}v \quad (17)
where ϖ is selected as a positive diagonal matrix, v represents the variable of integration, and the upper and lower bounds of v are M o and 0 , respectively. The Lyapunov equation of (16) can be calculated as:
V_s(t) + J\big(M_o(t)\big) + V_s^{\top}\dot{s} = 0 \quad (18)
where V s = V ( s ) / s . The optimal value function is defined as:
V^{*}(s) = \min_{M_o}\int_t^{\infty}\left[V_s(\tau) + J\big(M_o(\tau)\big)\right]\mathrm{d}\tau \quad (19)
and the corresponding HJB equation can be formulated as:
\min_{M_o}\left[V_s + J(M_o) + V_s^{*\top}\dot{s}\right] = 0 \quad (20)
where V_s^* = ∂V^*(s)/∂s.
Equation (20) is equivalent to
\frac{\partial}{\partial M_o^{*}}\left[V_s + J(M_o^{*}) + V_s^{*\top}\dot{s}\right] = 0. \quad (21)
Solving the partial derivative of (21) yields the following optimal control strategy
M_o^{*} = -\lambda\tanh(\bar{M}_o^{*}), \qquad \bar{M}_o^{*} = \frac{1}{2\lambda}\varpi^{-1}B_1^{\top}V_s^{*}. \quad (22)
Substituting M o * into (17), one obtains
J(M_o^{*}) = \lambda V_s^{*\top}B_1\tanh(\bar{M}_o^{*}) + \lambda^{2}\bar{\varpi}\ln\!\left[\mathbf{1}_c - \tanh^{2}(\bar{M}_o^{*})\right] = -V_s^{*\top}B_1 M_o^{*} + \lambda^{2}\bar{\varpi}\ln\!\left[\mathbf{1}_c - \tanh^{2}(\bar{M}_o^{*})\right], \quad (23)
where 1 c is a column vector with all elements being one, and ϖ ¯ is a row vector generated by the elements on the main diagonal of ϖ . Combining (20) and (23), it can be derived that
V s + V s * s ˙ + λ 2 ϖ ¯ ln [ 1 c tanh 2 ( M ¯ o * ) ] = 0 .
The nearly optimal control M o * can be obtained by solving (24) if V s * is available. Inspired by the NN-based control scheme, the following approximation of V * ( s ) can be established
V^{*}(s) = W^{*\top}\sigma\big(s(t)\big) + \epsilon\big(s(t)\big) \quad (25)
where W^* is the optimal weight vector, σ(s(t)) is the base vector, and ϵ(s(t)) is the approximation error of the NN. Subsequently, the gradient of V^*(s) is
V_s^{*} = \sigma_s^{\top}W^{*} + \epsilon_s \quad (26)
where σ s = σ / s , ϵ s = ϵ / s . In light of the universal approximation property of NN for smooth functions on prescribed compact sets, the approximation errors ϵ s ( t ) and ϵ s s ( t ) are bounded with a finite dimension of σ s ( t ) [27]. Moreover, it is assumed that | | W * | | , σ s ( t ) and σ s s ( t ) are bounded [13,14,16,17,21].
Recalling (22), the NN-based nearly optimal control law can be formulated as:
M = \hat{M}_o^{*} - B_1^{-1}\left[C\!\left(\lambda_{1i}\,\mathrm{sig}^{1/2}(s_i)\right) - M_d\right], \qquad \hat{M}_o^{*} = -\lambda\tanh\!\left(\frac{1}{2\lambda}\varpi^{-1}B_1^{\top}\left(\sigma_s^{\top}W^{*} + \epsilon_s\right)\right). \quad (27)
Defining the Bellman error as [14]:
B_\epsilon = \lambda^{2}\bar{\varpi}\left\{\ln\!\left[\mathbf{1}_c - \tanh^{2}(\bar{M}_o^{*})\right] - \ln\!\left[\mathbf{1}_c - \tanh^{2}(\hat{\bar{M}}_o^{*})\right]\right\} \quad (28)
with \hat{\bar{M}}_o^{*} = \frac{1}{2\lambda}\varpi^{-1}B_1^{\top}\sigma_s^{\top}W^{*}, (20) can be rewritten as:
V_s + W^{*\top}\sigma_s\Gamma + \lambda^{2}\bar{\varpi}\ln\!\left[\mathbf{1}_c - \tanh^{2}(\hat{\bar{M}}_o^{*})\right] + \epsilon_H = 0 \quad (29)
where \epsilon_H = \epsilon_s^{\top}\Gamma + B_\epsilon + W^{*\top}\sigma_s D_2 is the bounded HJB error [10].
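To make the role of the hyperbolic tangent in handling the input constraints concrete, the following sketch evaluates a saturated NN policy of the form used in (27) and (30); it assumes the polynomial basis later chosen in Section 4, treats λ as a scalar saturation level, and omits the approximation error ϵ_s.

```python
import numpy as np

def sigma_jacobian(s):
    """Jacobian sigma_s = d sigma / d s for the basis [s_i^4/4, s_i^3/3, s_i^2/2], i = alpha, beta, sigma."""
    J = np.zeros((9, 3))
    for i in range(3):
        J[3 * i:3 * i + 3, i] = [s[i] ** 3, s[i] ** 2, s[i]]
    return J

def saturated_policy(W, s, B1, lam, varpi_inv):
    """Bounded torque M = -lam * tanh((1/(2 lam)) varpi^{-1} B1^T sigma_s^T W).

    The tanh keeps each torque component inside (-lam, lam) by construction.
    """
    M_bar = (1.0 / (2.0 * lam)) * varpi_inv @ B1.T @ sigma_jacobian(s).T @ W
    return -lam * np.tanh(M_bar)
```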
In this paper, the optimal weight W^* is estimated online by the RL scheme with the AC structure. In this context, the nearly optimal control policy is formulated as:
M = \hat{M}_a^{*} - B_1^{-1}\left[C\!\left(\lambda_{1i}\,\mathrm{sig}^{1/2}(s_i)\right) - M_d\right], \qquad \hat{M}_a^{*} = -\lambda\tanh(\hat{\bar{M}}_a^{*}), \quad \hat{\bar{M}}_a^{*} = \frac{1}{2\lambda}\varpi^{-1}B_1^{\top}\sigma_s^{\top}\hat{W}_a \quad (30)
where W ^ a is the weight of the actor network. Moreover, the performance index (16) can be estimated as:
\hat{V}(s) = \hat{W}_c^{\top}\sigma\big(s(t)\big) \quad (31)
where W ^ c is the weight of the critic network. The adaptation laws for W ^ c and W ^ a are designed as:
\dot{\hat{W}}_c = -A_1\left[\bar{\theta}\left(\theta^{\top}\hat{W}_c + V_s + J(\hat{M}_a^{*})\right) + p_c\left(\hat{W}_c - \hat{W}_a\right)\right], \quad (32)
\dot{\hat{W}}_a = -A_2\left[p_a\left(\hat{W}_a - \hat{W}_c\right) - \lambda\frac{p_a}{p_c}\sigma_s B_1\Psi\bar{\theta}^{\top}\hat{W}_c\right], \quad (33)
where \theta = \sigma_s\left(\Gamma + B_1\hat{M}_a^{*}\right), \bar{\theta} = \theta/(\theta^{\top}\theta + 1)^{2}; A_1 and A_2 are positive diagonal matrices; \Psi = \tanh(\hat{\bar{M}}_a^{*}) - \tanh(\hat{\bar{M}}_a^{*}/\kappa); and p_c, p_a and κ are positive real constants. The projection operator Proj(·) is imposed on (33) to guarantee that Ŵ_a is bounded [28]. It is assumed that \theta_1 = \theta/(\theta^{\top}\theta + 1) is persistently exciting (PE) [29].
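A compact sketch of one update step of the actor-critic adaptation laws (32)-(33) is given below; the sign conventions follow the reconstruction above and should be checked against the original derivation, and the projection operator on Ŵ_a is omitted for brevity.

```python
import numpy as np

def ac_update(Wc, Wa, sigma_s, Gamma, B1, M_hat, V_s_cost, J_cost, Psi,
              A1, A2, p_c, p_a, lam, dt):
    """One Euler step of the critic/actor adaptation laws (32)-(33).

    Wc, Wa   : critic and actor weight estimates, shape (9,)
    sigma_s  : basis Jacobian d sigma / d s, shape (9, 3)
    Gamma    : drift part of the sliding-variable dynamics, shape (3,)
    M_hat    : current actor torque, shape (3,)
    V_s_cost : state cost s^T Q s (scalar); J_cost: input cost J(M_hat) (scalar)
    Psi      : tanh(M_bar) - tanh(M_bar / kappa), shape (3,)
    """
    theta = sigma_s @ (Gamma + B1 @ M_hat)                   # regressor theta
    theta_bar = theta / (theta @ theta + 1.0) ** 2           # normalized regressor
    Wc_dot = -A1 @ (theta_bar * (theta @ Wc + V_s_cost + J_cost) + p_c * (Wc - Wa))
    Wa_dot = -A2 @ (p_a * (Wa - Wc)
                    - lam * (p_a / p_c) * sigma_s @ B1 @ Psi * (theta_bar @ Wc))
    return Wc + dt * Wc_dot, Wa + dt * Wa_dot
```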
The proposed control scheme is illustrated by a block diagram in Figure 1.

3.3. Stability Analysis

Theorem 1.
Considering the control-oriented RLV model (1) with input constraints, if the initial conditions satisfy −ρ_i(0) < e_{1i}(0) < ρ_i(0) and the nearly optimal control policy is chosen as (30) with the weight update laws (32) and (33), then the following results can be obtained:
  • the sliding variable s and the weight estimation errors W̃_a and W̃_c are uniformly ultimately bounded (UUB);
  • the attitude tracking errors e_{1i} uniformly obey the fixed-time performance envelopes in (8).
Proof of Theorem 1
Consider the Lyapunov function candidate as follows:
V = V(s) + \frac{1}{2}\tilde{W}_c^{\top}A_1^{-1}\tilde{W}_c + \frac{p_c}{2p_a}\tilde{W}_a^{\top}A_2^{-1}\tilde{W}_a \quad (34)
where \tilde{W}_c = W^{*} - \hat{W}_c and \tilde{W}_a = W^{*} - \hat{W}_a. Taking the first time derivative of V yields
\dot{V} = \dot{V}_s + \dot{V}_c + \dot{V}_a \quad (35)
with \dot{V}_c = \tilde{W}_c^{\top}A_1^{-1}\dot{\tilde{W}}_c, \dot{V}_a = \frac{p_c}{p_a}\tilde{W}_a^{\top}A_2^{-1}\dot{\tilde{W}}_a, and
\dot{V}_s = V_s^{\top}\left(\Gamma + \Delta D_1 + B_1\hat{M}_a^{*}\right). \quad (36)
By invoking the HJB equation (29), V s Ξ can be rewritten as:
V s Ξ = V s λ 2 ϖ ¯ ln 1 c tanh 2 M ¯ ^ o * + V s C λ 1 i sig 1 / 2 ( s i ) M d ε e
where ε e = B ϵ + V s D 1 . Substituting (37) into (36) yields
V ˙ s = V s λ 2 ϖ ¯ ln 1 c tanh 2 M ¯ ^ o * λ V s * B 1 tanh M ¯ ^ a * B ϵ .
Furthermore, according to (17), it can be deduced that
\lambda^{2}\bar{\varpi}\ln\!\left[\mathbf{1}_c - \tanh^{2}(\hat{\bar{M}}_o^{*})\right] = J(\underline{M}_o^{*}) + V_s^{*\top}B_1\underline{M}_o^{*} \quad (39)
with \underline{M}_o^{*} = -\lambda\tanh(\hat{\bar{M}}_o^{*}). Equation (38) can be rewritten as:
\dot{V}_s = -V_s - J(\underline{M}_o^{*}) - V_s^{*\top}B_1\underline{M}_o^{*} - B_\epsilon - \lambda\left(W^{*\top}\sigma_s + \epsilon_s^{\top}\right)B_1\tanh(\hat{\bar{M}}_a^{*}) = -V_s - J(\underline{M}_o^{*}) + \varepsilon_v - \lambda\left(\hat{W}_a + \tilde{W}_a\right)^{\top}\sigma_s B_1\tanh(\hat{\bar{M}}_a^{*}) \quad (40)
where \varepsilon_v = -B_\epsilon - V_s^{*\top}B_1\underline{M}_o^{*} + \epsilon_s^{\top}B_1\hat{M}_a^{*}. Noting that tanh(·) is an odd function, it can be concluded that \lambda\hat{W}_a^{\top}\sigma_s B_1\tanh(\hat{\bar{M}}_a^{*}) > 0. Moreover, J(\underline{M}_o^{*}) > 0 follows from the definition in (17). Based on the above discussion, (40) can be simplified as:
\dot{V}_s \le -s^{\top}Qs - \lambda\tilde{W}_a^{\top}\sigma_s B_1\tanh(\hat{\bar{M}}_a^{*}) + \varepsilon_v. \quad (41)
Subsequently, incorporating (32) with V ˙ c yields
V ˙ c = W ˜ c Γ 1 W ^ ˙ c = W ˜ c θ ¯ θ W ^ c + V s + J M ^ a * + p c W ^ c W ^ a .
Recalling (29) and the definition of θ , (42) can be rearranged as:
V ˙ c = W ˜ c θ ¯ [ θ W ^ c θ W * + θ W * ϵ H λ 2 ϖ ¯ ln [ 1 c tanh 2 ( M ¯ ^ o * ) ] W * σ s Γ + J ( M ^ a * ) ] + p c W ˜ c W ^ c W ^ a = W ˜ c θ ¯ θ W ˜ c + σ s B 1 M ^ a * W * λ 2 ϖ ¯ ln 1 c tanh 2 M ¯ ^ o * ϵ H + λ W ^ a σ s B 1 tanh M ¯ ^ a * + λ 2 ϖ ¯ ln 1 c tanh 2 M ¯ ^ a * + p c W ˜ c W ^ c W ^ a = W ˜ c θ ¯ θ W ˜ c λ W ˜ a σ s B 1 tanh M ¯ ^ a * λ 2 ϖ ¯ ln 1 c tanh 2 M ¯ ^ o * ϵ H + λ 2 ϖ ¯ ln 1 c tanh 2 M ¯ ^ a * + p c W ˜ c W ^ c W ^ a .
With the aid of Lemma 1 and Lemma 3, (43) can be further simplified as:
V ˙ c = W ˜ c θ ¯ θ W ˜ c λ W ˜ a σ s B 1 tanh M ¯ ^ a * ϵ H λ B 1 σ s W ^ a sign M ¯ ^ a * + λ B 1 σ s W * sign M ¯ ^ o * + λ 2 ϖ ¯ ε a ε a * + p c W ˜ c W ^ c W ^ a = W ˜ c θ ¯ ϵ H + λ 2 ϖ ¯ ε a ε a * + λ B 1 σ s W * sign M ¯ ^ o * W ˜ c θ ¯ λ W ˜ a σ s B 1 tanh M ¯ ^ a * + λ W ^ a σ s B 1 tanh M ¯ ^ a * κ + κ 1 W ˜ c θ ¯ θ W ˜ c + p c W ˜ c W ^ c W ^ a
where ||ε_a||, ||ε_a^*|| and |κ_1| are bounded. For ease of notation, two bounded vectors ℓ_1 and ℓ_2 are defined as:
\ell_1 = \lambda\sigma_s B_1\left[\tanh(\hat{\bar{M}}_a^{*}/\kappa) - \tanh(\hat{\bar{M}}_a^{*})\right]\bar{\theta}^{\top}W^{*} = -\lambda\sigma_s B_1\Psi\bar{\theta}^{\top}W^{*}, \qquad \ell_2 = \lambda W^{*\top}\sigma_s B_1\,\mathrm{sign}(\hat{\bar{M}}_o^{*}) - \lambda W^{*\top}\sigma_s B_1\tanh(\hat{\bar{M}}_a^{*}/\kappa) + \kappa_1 - \epsilon_H + \lambda^{2}\bar{\varpi}\left(\varepsilon_a - \varepsilon_a^{*}\right), \quad (45)
then (44) can be further sorted into the following structure
V ˙ c = W ˜ c θ ¯ θ W ˜ c + p c W ˜ c W ^ c W ^ a + W ˜ c θ ¯ 2 + W ˜ a 1 + λ κ 1 W ˜ c θ ¯ W ˜ a σ s B 1 + λ W ^ c θ ¯ W ˜ a σ s B 1 Ψ .
Substituting (33), (41), and (46) into (35) yields
\begin{aligned}
\dot{V} &\le -V_s - \lambda\tilde{W}_a^{\top}\sigma_s B_1\tanh(\hat{\bar{M}}_a^{*}) + \varepsilon_v - \tilde{W}_c^{\top}\bar{\theta}\theta^{\top}\tilde{W}_c + p_c\tilde{W}_c^{\top}\left(\hat{W}_c - \hat{W}_a\right) + \tilde{W}_c^{\top}\bar{\theta}\ell_2 + \tilde{W}_a^{\top}\ell_1 + \lambda\kappa_1\tilde{W}_c^{\top}\bar{\theta}\tilde{W}_a^{\top}\sigma_s B_1 \\
&\quad + \lambda\hat{W}_c^{\top}\bar{\theta}\tilde{W}_a^{\top}\sigma_s B_1\Psi + \tilde{W}_a^{\top}\left[p_c\left(\hat{W}_a - \hat{W}_c\right) - \lambda\sigma_s B_1\Psi\bar{\theta}^{\top}\hat{W}_c\right] \\
&\le -V_s - \lambda\tilde{W}_a^{\top}\sigma_s B_1\tanh(\hat{\bar{M}}_a^{*}) + \varepsilon_v - \tilde{W}_c^{\top}\bar{\theta}\theta^{\top}\tilde{W}_c - p_c\left(\tilde{W}_a - \tilde{W}_c\right)^{\top}\left(\tilde{W}_a - \tilde{W}_c\right) + \tilde{W}_c^{\top}\bar{\theta}\ell_2 + \tilde{W}_a^{\top}\ell_1 + \lambda\kappa_1\tilde{W}_c^{\top}\bar{\theta}\tilde{W}_a^{\top}\sigma_s B_1 \\
&\le -V_s - \tilde{W}_c^{\top}\bar{\theta}\theta^{\top}\tilde{W}_c + \varepsilon_{v1} + \tilde{W}_c^{\top}\bar{\theta}\ell_3
\end{aligned} \quad (47)
where \varepsilon_{v1} = \varepsilon_v - \lambda\tilde{W}_a^{\top}\sigma_s B_1\tanh(\hat{\bar{M}}_a^{*}) + \tilde{W}_a^{\top}\ell_1 and \ell_3 = \ell_2 + \lambda\kappa_1\tilde{W}_a^{\top}\sigma_s B_1.
With the definition of the generalized vector G = \left[s^{\top}, \tilde{W}_c^{\top}\theta_1\right]^{\top}, the inequality (47) can be rewritten as:
\dot{V} \le -G^{\top}\Pi G + G^{\top}\Pi_\theta + \varepsilon_{v1} \quad (48)
where \Pi = \mathrm{diag}(Q, I) and \Pi_\theta = \left[0, \ell_3/(\theta^{\top}\theta + 1)\right]^{\top}. Since \hat{W}_a is bounded by the projection operator and W^{*} is bounded, \tilde{W}_a is a bounded vector, and hence ||\Pi_\theta|| \le \bar{\Pi}_\theta and |\varepsilon_{v1}| \le \bar{\varepsilon}_{v1}. Therefore, \dot{V} is negative if
||G|| > \frac{\bar{\Pi}_\theta + \sqrt{\bar{\Pi}_\theta^{2} + 4\lambda_{\min}(\Pi)\bar{\varepsilon}_{v1}}}{2\lambda_{\min}(\Pi)}, \quad (49)
which implies that the sliding variable s and \tilde{W}_c^{\top}\theta_1 are UUB. Furthermore, in view of the assumption that \theta_1 is PE, the weight estimation error \tilde{W}_c is also UUB [14]. Once the sliding variable converges to the vicinity of the origin, the gradient vector \sigma_s, and hence \hat{M}_a^{*}, also becomes small. Defining d = \Xi + \Delta D_1 + B_1\hat{M}_a^{*} and assuming that there exists a small positive constant \bar{d} satisfying ||d|| \le \bar{d}, the dynamics of the sliding variable can be represented as:
\begin{bmatrix}\dot{s} \\ \dot{M}_d\end{bmatrix} = \begin{bmatrix} d - C\!\left(\lambda_{1i}\,\mathrm{sig}^{1/2}(s_i)\right) + M_d \\ -D(\lambda_{2i})\,\mathrm{sign}(s) \end{bmatrix}. \quad (50)
According to the deduction lines in [30], the sliding variable can reach a small neighborhood of origin; thus, the equivalent unconstrained errors η 1 i are bounded.
Recalling the error transformation method in (9), it can be derived that
z_i = T(\eta_{1i}) = \frac{\exp(\eta_{1i}) - \exp(-\eta_{1i})}{\exp(\eta_{1i}) + \exp(-\eta_{1i})}. \quad (51)
Note that T(\eta_{1i}) is a smooth, strictly increasing function of \eta_{1i} satisfying \lim_{\eta_{1i}\to-\infty}T(\eta_{1i}) = -1 and \lim_{\eta_{1i}\to+\infty}T(\eta_{1i}) = 1; thus, -1 < z_i < 1 and -\rho_i < e_{1i} < \rho_i hold as long as \eta_{1i} is bounded. Following the above analysis, it can be deduced that e_{1i} never violates the fixed-time prescribed performance constraints in (8) for any t > 0, and |e_{1i}| < \rho_{\infty i} for t > T.
This completes the proof of Theorem 1. □

4. Numerical Simulations

In this section, numerical simulations for the re-entry phase of RLV are carried out to illustrate the effectiveness and improved performance of the proposed control scheme. The parameters of the RLV are based on [31]. The initial states of the re-entry phase are set as α 0 = 47 deg , β 0 = 2 deg , σ 0 = 2 deg , p 0 = 0 deg / s , q 0 = 0 deg / s and r 0 = 0 deg / s . The guidance commands are designed as α c = ( 45 0.5 t ) deg , β c = 0 deg and σ c = 0 deg . The uncertainties of the inertia parameters, the aerodynamic parameters and the air density are each set to +20% bias. The external disturbance Δ D e is given as [32]:
\Delta D_e = \begin{bmatrix} \left(0.5 + \cos(\pi t/2) + \sin(\pi t/3)\right) p \\ \left(0.5 + \cos(\pi t/2) + \sin(\pi t/4)\right) r \\ \left(0.5 + \cos(\pi t/3) + \sin(\pi t/2)\right) q \end{bmatrix} \times 10^{4}\ \mathrm{N\cdot m}. \quad (52)
The saturation bound of the control torques is set as M̄ = 1 × 10^5 N·m. The attitude tracking errors are required to be less than 0.5 deg for t ≥ 2 s. Therefore, the parameters of the FTPPF are selected as ρ_{0i} = π/36, ρ_{∞i} = π/360 and T = 2. To balance the transient and steady-state performance, the parameters of the proposed control scheme are chosen as c = 2I_{3×3}, ϖ = I_{3×3}, Q = I_{3×3}, λ_{1i} = 2 and λ_{2i} = 0.1. Inspired by the work of [12,14], suitable basis vectors can be selected as polynomial combinations of the state variables appearing in the performance index (16). The base vectors of the actor and critic networks, adjusted in repeated trials to balance the approximation error and the computational burden, are identically selected as:
\sigma\big(s(t)\big) = \left[\frac{s_\alpha^{4}}{4}, \frac{s_\alpha^{3}}{3}, \frac{s_\alpha^{2}}{2}, \frac{s_\beta^{4}}{4}, \frac{s_\beta^{3}}{3}, \frac{s_\beta^{2}}{2}, \frac{s_\sigma^{4}}{4}, \frac{s_\sigma^{3}}{3}, \frac{s_\sigma^{2}}{2}\right]^{\top}. \quad (53)
The initial values of Ŵ_a and Ŵ_c are randomly generated in [0, 1]. The user-defined parameters of the weight adaptation laws are designed as A_1 = I_{9×9}, A_2 = 0.1I_{9×9}, p_c = 0.001, p_a = 1 and κ = 0.001.
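For convenience, the simulation settings listed above can be collected as in the following sketch; it only mirrors the stated parameters and the polynomial basis (53), and the random-number handling is an implementation detail.

```python
import numpy as np

# FTPPF settings: 5 deg initial funnel, 0.5 deg terminal funnel, fixed time T = 2 s
rho_0, rho_inf, T = np.pi / 36, np.pi / 360, 2.0

# Controller gains from Section 4
c = 2.0 * np.eye(3)
varpi, Q = np.eye(3), np.eye(3)
lam1, lam2 = 2.0 * np.ones(3), 0.1 * np.ones(3)

# Actor-critic settings
A1, A2 = np.eye(9), 0.1 * np.eye(9)
p_c, p_a, kappa = 0.001, 1.0, 0.001
rng = np.random.default_rng()
Wa0 = rng.uniform(0.0, 1.0, 9)   # random initial actor weights in [0, 1]
Wc0 = rng.uniform(0.0, 1.0, 9)   # random initial critic weights in [0, 1]

def sigma_basis(s):
    """Polynomial basis (53): [s_i^4/4, s_i^3/3, s_i^2/2] stacked over the three channels."""
    return np.concatenate([[si ** 4 / 4.0, si ** 3 / 3.0, si ** 2 / 2.0] for si in s])
```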
Furthermore, in order to demonstrate the superiority of the proposed control scheme, the RL-based finite-time control (RLFTC) with input constraints in [17] and the robust adaptive backstepping control (RABC) method in [33] are also implemented. To provide a fair comparison, the time-fuel performance index of the RLFTC is identical to that of the proposed method, and the initial values of the actor and critic networks are generated in the same way. The other user-defined parameters of the RLFTC are selected as T_1 = 0.1, λ = 2/3, Γ_c = 0.2, Γ_a = 0.2 and k_a = 0.1. The parameters of the RABC remain the same as those in [33]. The simulation results are given in Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9 and Figure 10.
The tracking performances of the attitude angles under three controllers are shown in Figure 2, Figure 3, Figure 4 and Figure 5. It can clearly be seen that the proposed control scheme can provide faster convergence and smaller overshoot in the presence of parametric uncertainties and external disturbances. The angular rates of RLV and the sliding manifolds are demonstrated in Figure 6 and Figure 7, which provide further evidence for the improved performance of the proposed control scheme. Moreover, by comparing the tracking performance with the control inputs illustrated in Figure 8, it can be observed that the proposed control scheme exhibits better transient and steady-state performance with limited control inputs. The evolution trajectories of W ^ a and W ^ c are depicted in Figure 9 and Figure 10, respectively. It can be readily found that W ^ a and W ^ c are convergent to the same values, which indicates that the ideal weight vector W * can be effectively estimated via the proposed adaptation law.
To make a clear comparison, the maximum overshoot and the adjustment time are introduced to evaluate the transient performance of the controllers. Moreover, the integral absolute control effort (IACE) index \int_0^{30}|M_i|\,\mathrm{d}\tau (i = x, y, z) and the integral of time and absolute error (ITAE) index \int_0^{30}\tau|e_{1i}|\,\mathrm{d}\tau (i = α, β, σ) of the three control schemes are calculated to evaluate the tracking accuracy and control effort. The performance indices of the three channels are summarized in Table 1, Table 2 and Table 3.
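The two scalar indices can be computed from logged trajectories as sketched below; trapezoidal integration over the 30 s horizon is an implementation choice not specified in the paper.

```python
import numpy as np

def itae(t, e):
    """Integral of time times absolute error over the logged horizon."""
    return np.trapz(t * np.abs(e), t)

def iace(t, M):
    """Integral absolute control effort over the logged horizon."""
    return np.trapz(np.abs(M), t)
```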
From the foregoing simulation results, it can be concluded that the proposed control scheme outperforms RLFTC and RABC in terms of transient performance, tracking accuracy and control effort. Furthermore, by synthesizing the AC structure and the fixed-time PPC paradigm, the proposed control scheme offers an online RL-based model-free solution for controlling RLV and other complex industrial systems.

5. Conclusions

In this paper, an innovative RL-based tracking control scheme with fixed-time prescribed performance has been proposed for RLV under parametric uncertainties, external disturbances, and input constraints. By resorting to the FTPPF, fixed-time performance envelopes have been imposed on the attitude tracking errors. Combined with the AC-based online RL structure and the super-twisting-like sliding mode control, the optimal control policy and the performance index have been learned recursively, and the robustness of the learning process has been further enhanced. Moreover, theoretical analysis has demonstrated that the attitude tracking errors converge to a preset region within a preassigned fixed time, and that the sliding variable and the weight estimation errors of the actor and critic networks are UUB. Comparative simulation results have verified the effectiveness and improved performance of the proposed control scheme. The angular rate constraints will be considered in our future work, and the optimal control problem for the underactuated RLV will be specifically addressed. Experimental investigations, such as hardware-in-the-loop simulations, will also be undertaken.

Author Contributions

Conceptualization, S.X. and Y.G.; methodology, S.X.; software, S.X.; validation, Y.G., C.W. and Y.L.; formal analysis, S.X.; investigation, S.X.; resources, C.W.; data curation, Y.G.; writing—original draft preparation, S.X.; writing—review and editing, S.X., Y.G., C.W., Y.L. and L.X.; visualization, S.X., Y.G. and L.X.; supervision, C.W.; project administration, C.W.; funding acquisition, C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the SAST Foundation grant number SAST2021-028.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the editors and anonymous reviewers for their constructive comments and suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RLV    Reusable Launch Vehicles
NN     Neural Network
RL     Reinforcement Learning
AC     Actor-Critic
UUB    Uniformly Ultimately Bounded
HJB    Hamilton–Jacobi–Bellman
PPC    Prescribed Performance Control
FTPPF  Fixed-time Prescribed Performance Function
RLFTC  Reinforcement-Learning-Based Finite-Time Control
RABC   Robust Adaptive Backstepping Control
IACE   Integral Absolute Control Effort
ITAE   Integral of Time and Absolute Error

References

1. Stott, J.E.; Shtessel, Y.B. Launch vehicle attitude control using sliding mode control and observation techniques. J. Frankl. Inst. B 2012, 349, 397–412.
2. Tian, B.L.; Li, Z.Y.; Zhao, X.P.; Zong, Q. Adaptive Multivariable Reentry Attitude Control of RLV With Prescribed Performance. IEEE Trans. Syst. Man Cybern. Syst. 2022, 1–5.
3. Acquatella, P.; Briese, L.E.; Schnepper, K. Guidance command generation and nonlinear dynamic inversion control for reusable launch vehicles. Acta Astronaut. 2020, 174, 334–346.
4. Xu, B.; Wang, X.; Shi, Z.K. Robust adaptive neural control of nonminimum phase hypersonic vehicle model. IEEE Trans. Syst. Man Cybern. Syst. 2019, 51, 1107–1115.
5. Zhang, L.; Wei, C.Z.; Wu, R.; Cui, N.G. Fixed-time extended state observer based non-singular fast terminal sliding mode control for a VTVL reusable launch vehicle. Aerosp. Sci. Technol. 2018, 82, 70–79.
6. Ju, X.Z.; Wei, C.Z.; Xu, H.C.; Wang, F. Fractional-order sliding mode control with a predefined-time observer for VTVL reusable launch vehicles under actuator faults and saturation constraints. ISA Trans. 2022, in press.
7. Cheng, L.; Wang, Z.B.; Gong, S.P. Adaptive control of hypersonic vehicles with unknown dynamics based on dual network architecture. Acta Astronaut. 2022, 193, 197–208.
8. Xu, B.; Shou, Y.X.; Shi, Z.K.; Yan, T. Predefined-Time Hierarchical Coordinated Neural Control for Hypersonic Reentry Vehicle. IEEE Trans. Neural Netw. Learn. Syst. 2022.
9. Werbos, P. Approximate dynamic programming for realtime control and neural modelling. In Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches; Van Nostrand Reinhold: New York, NY, USA, 1992; pp. 493–525.
10. Vamvoudakis, K.G.; Lewis, F.L. Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica 2010, 46, 878–888.
11. He, H.B.; Ni, Z.; Fu, J. A three-network architecture for on-line learning and optimization based on adaptive dynamic programming. Neurocomputing 2012, 78, 3–13.
12. Ma, Z.Q.; Huang, P.F.; Lin, Y.X. Learning-based Sliding Mode Control for Underactuated Deployment of Tethered Space Robot with Limited Input. IEEE Trans. Aerosp. Electron. Syst. 2021, 58, 2026–2038.
13. Wang, N.; Gao, Y.; Zhao, H.; Ahn, C.K. Reinforcement learning-based optimal tracking control of an unknown unmanned surface vehicle. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 3034–3045.
14. Fan, Q.Y.; Yang, G.H. Adaptive actor–critic design-based integral sliding-mode control for partially unknown nonlinear systems with input disturbances. IEEE Trans. Neural Netw. Learn. Syst. 2015, 27, 165–177.
15. Kuang, Z.; Gao, H.; Tomizuka, M. Precise linear-motor synchronization control via cross-coupled second-order discrete-time fractional-order sliding mode. IEEE/ASME Trans. Mechatronics 2020, 26, 358–368.
16. Zhang, H.; Cui, X.; Luo, Y.; Jiang, H. Finite-horizon H∞ tracking control for unknown nonlinear systems with saturating actuators. IEEE Trans. Neural Netw. Learn. Syst. 2017, 29, 1200–1212.
17. Wang, N.; Gao, Y.; Yang, C.; Zhang, X.F. Reinforcement learning-based finite-time tracking control of an unknown unmanned surface vehicle with input constraints. Neurocomputing 2022, 484, 26–37.
18. Bechlioulis, C.P.; Rovithakis, G.A. Robust adaptive control of feedback linearizable MIMO nonlinear systems with prescribed performance. IEEE Trans. Autom. Control 2008, 53, 2090–2099.
19. Cui, G.Z.; Yang, W.; Yu, J.P.; Li, Z.; Tao, C.B. Fixed-time prescribed performance adaptive trajectory tracking control for a QUAV. IEEE Trans. Circuits Syst. II 2021, 69, 494–498.
20. Bu, X.W. Guaranteeing prescribed performance for air-breathing hypersonic vehicles via an adaptive non-affine tracking controller. Acta Astronaut. 2018, 151, 368–379.
21. Wang, N.; Gao, Y.; Zhang, X.F. Data-driven performance-prescribed reinforcement learning control of an unmanned surface vehicle. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 5456–5467.
22. Luo, S.B.; Wu, X.; Wei, C.S.; Zhang, Y.L.; Yang, Z. Adaptive finite-time prescribed performance attitude tracking control for reusable launch vehicle during reentry phase: An event-triggered case. Adv. Space Res. 2022, 69, 3814–3827.
23. Modares, H.; Sistani, M.B.N.; Lewis, F.L. A policy iteration approach to online optimal control of continuous-time constrained-input systems. ISA Trans. 2013, 52, 611–621.
24. Tan, J.; Guo, S.J. Backstepping control with fixed-time prescribed performance for fixed wing UAV under model uncertainties and external disturbances. Int. J. Control 2022, 95, 934–951.
25. Yuan, Y.; Wang, Z.; Guo, L.; Liu, H.P. Barrier Lyapunov functions-based adaptive fault tolerant control for flexible hypersonic flight vehicles with full state constraints. IEEE Trans. Syst. Man Cybern. Syst. 2018, 50, 3391–3400.
26. Lyshevski, S.E. Optimal control of nonlinear continuous-time systems: Design of bounded controllers via generalized nonquadratic functionals. In Proceedings of the 1998 American Control Conference (IEEE Cat. No. 98CH36207), Philadelphia, PA, USA, 26 June 1998; Volume 1, pp. 205–209.
27. Hornik, K.; Stinchcombe, M.; White, H. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Netw. 1990, 3, 551–560.
28. Kamalapurkar, R.; Walters, P.; Dixon, W.E. Model-based reinforcement learning for approximate optimal regulation. Automatica 2016, 64, 94–104.
29. Bhasin, S.; Kamalapurkar, R.; Johnson, M.; Vamvoudakis, K.G.; Lewis, F.L.; Dixon, W.E. A novel actor–critic–identifier architecture for approximate optimal control of uncertain nonlinear systems. Automatica 2013, 49, 82–92.
30. Moreno, J.A.; Osorio, M. Strict Lyapunov functions for the super-twisting algorithm. IEEE Trans. Autom. Control 2012, 57, 1035–1040.
31. Wei, C.Z.; Wang, M.Z.; Lu, B.G.; Pu, J.L. Accelerated Landweber iteration based control allocation for fault tolerant control of reusable launch vehicle. Chin. J. Aeronaut. 2022, 35, 175–184.
32. Zhang, C.F.; Zhang, G.S.; Dong, Q. Fixed-time disturbance observer-based nearly optimal control for reusable launch vehicle with input constraints. ISA Trans. 2022, 122, 182–197.
33. Wang, Z.; Wu, Z.; Du, Y. Robust adaptive backstepping control for reentry reusable launch vehicles. Acta Astronaut. 2016, 126, 258–264.
Figure 1. The block diagram of the proposed control scheme.
Figure 2. Time histories of the attitude angles.
Figure 3. Time histories of the attitude tracking error e_{1α}.
Figure 4. Time histories of the attitude tracking error e_{1β}.
Figure 5. Time histories of the attitude tracking error e_{1σ}.
Figure 6. Time histories of the angular rates.
Figure 7. Time histories of the sliding manifolds.
Figure 8. Time histories of the control torques.
Figure 9. Weights of the actor network.
Figure 10. Weights of the critic network.
Table 1. Performance indexes of the α-channel.
Performance Index | Proposed Method | RABC | RLFTC
Maximum Overshoot | / | 20.5% | 34.0%
Adjustment Time 1 (s) | 3.9 | 5.6 | 4.3
ITAE index (deg·s²) | 1.6407 | 43.2600 | 1.6962
IACE index (N·m·s) | 8.507 × 10⁵ | 8.893 × 10⁵ | 1.022 × 10⁶
1 The time when the attitude tracking error reaches ±0.05 deg.
Table 2. Performance indexes of the β-channel.
Performance Index | Proposed Method | RABC | RLFTC
Maximum Overshoot | 4.5% | 19.0% | 38.2%
Adjustment Time (s) | 3.1 | 4.1 | 4.6
ITAE index (deg·s²) | 0.7347 | 1.5192 | 1.1507
IACE index (N·m·s) | 1.8365 × 10⁵ | 2.0571 × 10⁵ | 4.155 × 10⁵
Table 3. Performance indexes of the σ-channel.
Performance Index | Proposed Method | RABC | RLFTC
Maximum Overshoot | 5.5% | 17.7% | 53.5%
Adjustment Time (s) | 4.1 | 4.8 | 4.7
ITAE index (deg·s²) | 0.8783 | 1.7853 | 1.7242
IACE index (N·m·s) | 4.6577 × 10³ | 5.0277 × 10³ | 1.8364 × 10⁴

