Article

Adaptive Sliding Mode Disturbance Observer and Deep Reinforcement Learning Based Motion Control for Micropositioners

Shiyun Liang, Ruidong Xi, Xiao Xiao and Zhixin Yang
1 State Key Laboratory of Internet of Things for Smart City and Department of Electromechanical Engineering, University of Macau, Macau 999078, China
2 Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Micromachines 2022, 13(3), 458; https://doi.org/10.3390/mi13030458
Submission received: 28 February 2022 / Revised: 13 March 2022 / Accepted: 15 March 2022 / Published: 17 March 2022
(This article belongs to the Special Issue Ocean MEMS and Related Technology)

Abstract

The motion control of high-precision electromechanical systems, such as micropositioners, is challenging because of the inherent high nonlinearity, the sensitivity to external interference, and the complexity of accurately identifying the model parameters. To cope with these problems, this work investigates a disturbance observer-based deep reinforcement learning control strategy to realize high robustness and precise tracking performance. Reinforcement learning has shown great potential as an optimal control scheme; however, its application in micropositioning systems is still rare. Therefore, the deep deterministic policy gradient (DDPG) algorithm, embedded with an integral differential compensator (ID), is utilized in this work to not only decrease the state error but also improve the transient response speed. In addition, an adaptive sliding mode disturbance observer (ASMDO) is proposed to further eliminate the collective effect caused by the lumped disturbances. The micropositioner controlled by the proposed algorithm can track the target path precisely, with less than 1 μm error in both simulations and actual experiments, which demonstrates the excellent performance and accuracy improvement of the controller.

1. Introduction

Micropositioning technologies based on smart materials have gained much attention in precision industries for numerous potential applications in optical steering, micro-assembly, nano-inscribing, cell manipulation, etc. [1,2,3,4,5,6,7]. One of the greatest challenges in this research field is the uncertainty produced by various factors such as the dynamic model, environmental temperature, sensor performance, and the actuators' nonlinear characteristics [8,9], which makes the control of micropositioning systems a demanding problem.
To address the uncertainty problem, different kinds of control approaches have been developed, such as the PID control method [10], sliding mode control [11,12], and adaptive control [13]. In addition, many researchers have integrated these control strategies to further improve the control performance. Victor et al. proposed a scalable field-programmable gate array-based motion control system with a parabolic velocity profile [14]. A new seven-segment profile algorithm was developed by Jose et al. to improve the performance of the motion controller [15]. Combined with the backstepping strategy, Fei et al. proposed an adaptive fuzzy sliding mode controller in [16]. Based on the radial basis function neural network (RBFNN) and sliding mode control (SMC), Ruan et al. developed an RBFNN-SMC for nonlinear electromechanical actuator systems [17]. Gharib et al. designed a PID controller with a feedback linearization technique for path tracking control of a micropositioner [18]. Nevertheless, the performance and robustness of such model-based control strategies are still limited by the precision of the dynamics model. On the other hand, a sophisticated system model frequently leads to a complex control strategy. Although many researchers have considered the factors of uncertainties and disturbances, it is still difficult to model the system precisely and comprehensively.
As the rapid development of artificial intelligence in recent years has profoundly impacted the traditional control field, learning-based and data-driven approaches, especially reinforcement learning (RL) and neural networks, have become a promising research topic. Different from traditional control strategies that need to make assumptions based on the dynamics model [19,20], reinforcement learning can directly learn the policy by interacting with the system. Back in 2005, Adda et al. presented a reinforcement learning algorithm for learning control of stochastic micromanipulation systems [21]. Li et al. designed a state–action–reward–state–action (SARSA) method using linear function approximation to generate an optimal path by controlling the choice of the micropositioner [22]; however, reinforcement learning algorithms such as Q-learning [23] and SARSA [24], as utilized in the aforementioned works, are unable to deal with complex dynamics problems, especially continuous state–action space problems. With the spectacular improvement enjoyed by deep reinforcement learning (DRL), primarily driven by deep neural networks (DNN) [25], DRL algorithms such as the deep Q network (DQN) [26], policy gradient (PG) [27], deterministic policy gradient (DPG) [28], and deep deterministic policy gradient (DDPG) [29], with the ability to approximate the value function, have played an important role in continuous control tasks.
Latifi et al. introduced a model-free neural fitted Q iteration control method for micromanipulation devices; in this work, a DNN is adopted to represent the Q-value function [30]. Leinen et al. introduced the concept of experience replay in DQN and the approximate value function of the neural network into the SARSA algorithm for the control of a scanning probe microscope [31]. Both simulation and real experimental results have shown that their proposed RL algorithms based on neural networks could, to some extent, achieve better performance than traditional control methods; however, due to the collective effects of disturbances generated by nonlinear systems and deviations in value functions [29,32,33], RL control methods could induce significant inaccuracies in tracking control tasks [34]. To improve the anti-disturbance capability and control accuracy, disturbance rejection control [35], time-delay estimation based control [36], and disturbance observer-based controllers [37,38] have been proposed successively. To deal with this issue, a deep reinforcement learning controller integrated with an adaptive sliding mode disturbance observer (ASMDO) is developed in this work. Previous research on trajectory tracking control with DRL has shown that apparent state errors have always existed [39,40,41,42]. One of the main reasons is the inaccurate estimation of the action value function in the DRL structure. As indicated in [43], even in elementary control tasks, accurate action values cannot be attained from the same action value function; therefore, in this work, the DDPG algorithm is developed with an added integral differential compensator (DDPG-ID) to cope with this situation. In addition, the comparison of the reinforcement learning control method with various common state-of-the-art control methods is listed in Table 1, which shows the pros and cons of these different methods.
In this study, deep reinforcement learning is leveraged to develop a novel optimal control scheme for complex systems. An anti-disturbance, stable, and precise control strategy is proposed for the trajectory tracking task of the micropositioner system. The contributions of this work are presented as follows:
(1) A DDPG-ID algorithm based on deep reinforcement learning is introduced as the basic motion controller of the micropositioner system, which avoids the dependence of traditional control strategies on an accurate and comprehensive dynamic model;
(2) To eliminate the collective effect caused by the lumped disturbances of the micropositioner system and the inaccurate estimation of the value function in deep reinforcement learning, an adaptive sliding mode disturbance observer (ASMDO) is proposed;
(3) An integral differential compensator is introduced in DDPG-ID to compensate the feedback state of the system, which improves the accuracy and response time of the controller and further improves its robustness against external disturbances.
The manuscript is structured as follows. Section 2 presents the system description of the micropositioner. In Section 3, we develop a deep reinforcement learning control method combined with the ASMDO and the compensator, and the parameters of the DNNs are illustrated. Then, the simulation parameters and tracking results are given in Section 4, together with tracking experiments that further evaluate the performance of the proposed control strategy on the micropositioner. Lastly, conclusions are given in Section 5.

2. System Description

The basic structure of the micropositioner is shown in Figure 1; it consists of a base, a platform, and a kinematic device. The kinematic device is composed of an armature, an electromagnetic actuator, and a chain mechanism driven by the electromagnetic actuator. As shown in Figure 1, mutually perpendicular compliant chains are actuated by the electromagnetic actuator (EMA) in the structure. The movement of the chain mechanism is in accordance with the working air gap y. The EMA generates the magnetic force $T_m$, which can be approximated as:

$$T_m = k \left( \frac{I_c}{y + p} \right)^2 \tag{1}$$

where k and p are constant parameters related to the electromagnetic actuator, $I_c$ is the excitation current, and y is the working air gap between the armature and the EMA. Then, the electrical model of the system can be given as:

$$V_i = R I_c + \frac{d}{dt}\left( H I_c \right) \tag{2}$$

where $V_i$ is the input voltage from the EMA, R is the resistance of the coil, and H denotes the coil inductance, which can be given as:

$$H = H_1 + \frac{p H_0}{y + p} \tag{3}$$

where $H_1$ is the coil inductance when the air gap is infinite, and $H_0$ is the incremental inductance when the gap is zero. The motion equation of the micropositioner can be expressed as:

$$m \frac{d^2 y}{dt^2} = \iota (\alpha_0 - y) - T_m \tag{4}$$

where $\iota$ is the stiffness along the motion direction of the system, and $\alpha_0$ is the initial air gap.
According to Equations (1)–(4), we define $x_1 = y$, $x_2 = \dot{y}$, and $x_3 = I_c$ as the state variables and $u = V_i$ as the control input. Then, the dynamics model of the electromagnetic actuator can be written as:

$$\begin{cases} \dot{x}_1 = x_2 \\ \dot{x}_2 = \dfrac{\iota}{m}(\alpha_0 - x_1) - \dfrac{k}{m}\left(\dfrac{x_3}{x_1 + p}\right)^2 \\ \dot{x}_3 = \dfrac{1}{H}\left(-R x_3 + \dfrac{H_0 p x_2 x_3}{(x_1 + p)^2} + u\right) \end{cases} \tag{5}$$

Define the variables $z_1 = x_1$, $z_2 = x_2$, and $z_3 = \dfrac{\iota}{m}(\alpha_0 - x_1) - \dfrac{k}{m}\left(\dfrac{x_3}{x_1 + p}\right)^2$; then we have

$$\begin{cases} \dot{z}_1 = z_2 \\ \dot{z}_2 = z_3 \\ \dot{z}_3 = f(x) + g(x) u \end{cases} \tag{6}$$

where $f(x) = -\dfrac{\iota x_2}{m} + \dfrac{2 k x_3^2}{m (x_1 + p)^2}\left(\dfrac{x_2}{x_1 + p} - \dfrac{p H_0 x_2}{H (x_1 + p)^2} + \dfrac{R}{H}\right)$, $g(x) = -\dfrac{2 k x_3}{H m (x_1 + p)^2}$, and $z_1$ is the system output.
In realistic engineering applications, there always exist some uncertainties in the system; thus, system Equation (6) can be rewritten as:

$$\begin{cases} \dot{z}_i = z_{i+1}, \quad i = 1, 2 \\ \dot{z}_3 = f_0(x) + g_0(x) u + \left(\Delta f(x) + \Delta g(x) u\right) + d \end{cases} \tag{7}$$

where $f_0(x)$ and $g_0(x)$ denote the nominal parts of the micropositioner system, $\Delta f(x)$ and $\Delta g(x)$ denote the uncertainties of the modeled system, and d denotes the external disturbances. Then, defining $D = \Delta f(x) + \Delta g(x) u + d$, we have

$$\begin{cases} \dot{z}_i = z_{i+1}, \quad i = 1, 2 \\ \dot{z}_3 = f_0(x) + g_0(x) u + D \end{cases} \tag{8}$$

where D is the lumped system disturbance. The following assumption is exploited [44]:
Assumption 1.
The lumped disturbance D is bounded, with an upper bound less than a fixed parameter $\beta_1$, and the derivative of D is unknown but bounded.
Remark 1.
Assumption 1 is reasonable since all micropositioner platforms are precisely designed and parameter-identified, and all disturbances remain within a controllable domain.
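To make the model concrete, the following Python sketch integrates the nominal dynamics of Equation (5) with a simple forward-Euler step. It is a minimal illustration only: the numerical parameter values, the mapping from Table 4 to the symbols of this section, and the initial air gap alpha0 are assumptions, not the exact values or solver used in the paper.

```python
import numpy as np

# Nominal micropositioner dynamics of Equation (5): state x = [y, dy/dt, I_c], input u = V_i.
# Parameter values are illustrative placeholders (loosely mapped from Table 4); alpha0 is assumed.
iota = 1.8e5          # stiffness along the motion direction, N/m (assumed)
m = 0.0272            # moving mass, kg
k = 8.83e-5           # EMA force constant, N*m^2/A^2 (assumed mapping)
p = 1.11e-5           # EMA geometric constant, m (assumed mapping)
H1, H0 = 13.21, 0.67  # coil inductances, H
R = 43.66             # coil resistance, Ohm
alpha0 = 1.0e-3       # initial air gap, m (assumed)

def dynamics(x, u):
    """Right-hand side of Equation (5)."""
    x1, x2, x3 = x
    H = H1 + p * H0 / (x1 + p)  # gap-dependent coil inductance, Equation (3)
    dx1 = x2
    dx2 = iota / m * (alpha0 - x1) - k / m * (x3 / (x1 + p)) ** 2
    dx3 = (-R * x3 + H0 * p * x2 * x3 / (x1 + p) ** 2 + u) / H
    return np.array([dx1, dx2, dx3])

def euler_step(x, u, dt=1e-4):
    """One explicit Euler integration step (a finer solver can be substituted)."""
    return x + dt * dynamics(x, u)
```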

3. Design of ASMDO and DDPG-ID Algorithm

In this section, the adaptive sliding mode disturbance observer (ASMDO) is introduced based on the dynamics of the micropositioner. Then, the DDPG-ID control method and pseudocode are given.

3.1. Design of Adaptive Sliding Mode Disturbance Observer

To develop the ASMDO, a virtual dynamic system is first designed as

$$\begin{cases} \dot{\eta}_i = \eta_{i+1}, \quad i = 1, 2 \\ \dot{\eta}_3 = f(z) + g(z) u + \hat{D} + \rho \end{cases} \tag{9}$$

where $\eta_i, i = 1, 2, 3$ are auxiliary variables, $\hat{D}$ is the estimate of the lumped disturbance, and $\rho$ denotes the sliding mode term, which is introduced afterwards.
Define a sliding variable $S = \sigma_3 + k_2 \sigma_2 + k_1 \sigma_1$, where $\sigma_i = x_i - \eta_i, i = 1, 2, 3$, and $k_1$ and $k_2$ are positive design parameters. Then the sliding mode term $\rho$ is designed as

$$\rho = \lambda_1 S + k_2 \sigma_3 + k_1 \sigma_2 + \lambda_2 \operatorname{sgn}(S) \tag{10}$$

where $\lambda_1$ and $\lambda_2$ are positive design parameters with $\lambda_2 \geq \beta_1$.
Choosing an unknown constant $\beta_2$ to represent the upper bound of $\dot{D}$, the ASMDO is proposed as:

$$\dot{\hat{D}} = k \left( \dot{x}_3 - f_0(z) - g_0(z) u - \hat{D} \right) + \left( \hat{\beta}_2 + \lambda_3 \right) \operatorname{sgn}(\rho) \tag{11}$$

where k and $\lambda_3$ are positive design parameters, and $\hat{\beta}_2$ is the estimate of $\beta_2$ given by the adaptive law $\dot{\hat{\beta}}_2 = -\delta_0 \hat{\beta}_2 + |\rho|$, where $\delta_0$ is a small positive number.
Then, the output $\hat{D}$ of the ASMDO is used as a compensation term in the control input to eliminate the uncertainties generated by the system and the external disturbances.
Remark 2.
Choosing $V_1 = \frac{1}{2} S^2$ and $V_2 = \frac{1}{2} ( \tilde{D}^2 + \tilde{\beta}_2^2 )$, where $\tilde{D} = D - \hat{D}$ and $\tilde{\beta}_2 = \beta_2 - \hat{\beta}_2$, as two Lyapunov functions and differentiating $V_1$ and $V_2$ with respect to time, it is easy to prove that both S and $\tilde{D}$ exponentially converge to the equilibrium point; the proof process is therefore not repeated.
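A minimal discrete-time sketch of the observer defined by Equations (9)–(11) is given below. The forward-Euler updates, the gain values, and the use of $|\rho|$ in the adaptive law are assumptions made for illustration; f0, g0, and the (estimated) derivative of the third state are assumed to be supplied by the caller.

```python
import numpy as np

# Discrete-time sketch of the ASMDO, Equations (9)-(11). All gains are illustrative assumptions.
class ASMDO:
    def __init__(self, dt=1e-3, k_obs=50.0, k1=100.0, k2=20.0,
                 lam1=20.0, lam2=0.5, lam3=0.1, delta0=0.01):
        self.dt = dt
        self.k_obs, self.k1, self.k2 = k_obs, k1, k2
        self.lam1, self.lam2, self.lam3, self.delta0 = lam1, lam2, lam3, delta0
        self.eta = np.zeros(3)   # auxiliary states eta_1..eta_3
        self.D_hat = 0.0         # lumped-disturbance estimate
        self.beta2_hat = 0.0     # adaptive estimate of the bound on dD/dt

    def update(self, z, z3_dot, f0, g0, u):
        """z: transformed state of Equation (7) (written x_i in the text); z3_dot: derivative of z_3."""
        sigma = z - self.eta                                    # sigma_i = x_i - eta_i
        S = sigma[2] + self.k2 * sigma[1] + self.k1 * sigma[0]  # sliding variable
        rho = (self.lam1 * S + self.k2 * sigma[2]
               + self.k1 * sigma[1] + self.lam2 * np.sign(S))   # sliding mode term, Eq. (10)
        # virtual dynamics, Equation (9)
        eta_dot = np.array([self.eta[1], self.eta[2], f0 + g0 * u + self.D_hat + rho])
        self.eta += self.dt * eta_dot
        # adaptive disturbance estimate, Equation (11)
        D_hat_dot = (self.k_obs * (z3_dot - f0 - g0 * u - self.D_hat)
                     + (self.beta2_hat + self.lam3) * np.sign(rho))
        self.D_hat += self.dt * D_hat_dot
        self.beta2_hat += self.dt * (-self.delta0 * self.beta2_hat + abs(rho))
        return self.D_hat
```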

3.2. Design of DDPG-ID Algorithm for Micropositioner

The goal of reinforcement learning is to obtain a policy for the agent that maximizes the cumulative reward through interactions with the environment. The environment is usually formalized as a Markov decision process (MDP) described by a four-tuple $(S, A, P, R)$, where S, A, P, and R represent the state space of the environment, the set of actions, the state transition probability function, and the reward function, respectively. At each time step t, the agent in the current state $s_t \in S$ takes an action $a_t \in A$ from the policy $\pi(a_t | s_t)$; the agent then acquires a reward $r_t \sim R(s_t, a_t)$ and enters the next state $s_{t+1}$ according to the state transition probability function $P(s_{t+1} | s_t, a_t)$. Based on the Markov property, the Bellman equation of the action–value function $Q^{\pi}(s_t, a_t)$, which is used for calculating the future expected reward, can be given as:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi} \left[ r_t + \gamma Q^{\pi}(s_{t+1}, a_{t+1}) \right] \tag{12}$$

where $\gamma \in [0, 1]$ denotes the discount factor.
In the trajectory tracking control task of the micropositioner, the state $s_t$ is a state vector related to the air gap y of the micropositioner at time t, and the action $a_t$ is the voltage u applied by the controller to the micropositioner. As shown in Figure 2, DDPG is one of the actor–critic algorithms, which has an actor and a critic. The actor is responsible for generating actions and interacting with the environment, while the critic evaluates the performance of the actor and guides the action in the next state.
The action–value function and the policy are parameterized by DNNs to solve the continuous state and action problem of the micropositioner, with $Q(s_t, a_t, w^Q) \approx Q^{\pi}(s_t, a_t)$ and $\pi_{w^{\mu}}(a_t | s_t) \approx \pi(a_t | s_t)$, where $w^Q$ and $w^{\mu}$ are the parameters of the neural networks of the action–value function and the policy function. With the policy function represented by this neural network approximation, the gradient update method of the neural network is used to seek the optimal policy $\pi$.
DDPG-ID uses a deterministic policy $\pi(s_t, w^{\mu})$ rather than a traditional stochastic policy $\pi_{w^{\mu}}(a_t | s_t)$, where the output of the policy is the action $a_t$ with the highest probability in the current state $s_t$, i.e., $\pi(s_t, w^{\mu}) = a_t$. The policy gradient is given as

$$\nabla_{w^{\mu}} J(\pi) = \mathbb{E}_{s \sim \rho^{\pi}} \left[ \nabla_{w^{\mu}} \pi(s, w^{\mu}) \, \nabla_{a} Q(s, a, w^{Q}) \right] \tag{13}$$

where $J(\pi) = \mathbb{E}_{\pi} \left[ \sum_{t=1}^{T} \gamma^{t-1} r_t \right]$ is the expectation of the discounted accumulative reward, T denotes the final time of a whole process, and $\rho^{\pi}$ is the distribution of states under the deterministic policy. The value function $Q(s_t, a_t, w^Q)$ is updated by calculating the temporal-difference error (TD-error), which can be defined as

$$e_{TD} = r_t + \gamma Q(s_{t+1}, \pi(s_{t+1})) - Q(s_t, a_t) \tag{14}$$

where $e_{TD}$ is the TD-error and $r_t + \gamma Q(s_{t+1}, \pi(s_{t+1}))$ represents the TD target value. By minimizing the TD-error, the parameters are updated backwards through the neural network gradient.
To avoid the convergence problem of a single network caused by the correlation between the TD target value and the current value [45,46], a target Q network $Q_T(s_{t+1}, a_{t+1}, w^{Q'})$ is introduced to calculate the network portion of the TD target value, and an online Q network $Q_O(s_t, a_t, w^Q)$ is used to calculate the current value in the critic. These two DNNs have the same structure. The actor also has an online policy network $\pi_O(s_t, w^{\mu})$ to generate the current action and a target policy network $\pi_T(s_t, w^{\mu'})$ to provide the target action $a_{t+1}$. $w^{\mu'}$ and $w^{Q'}$ represent the parameters of the target policy network and the target Q network, respectively.
In order to improve the stability and efficiency of RL training, experience replay is utilized in this work, which saves the transition experience $(s_t, a_t, r_t, s_{t+1})$ into the experience replay buffer $\Psi$ at each interaction with the environment for subsequent updates. At each training step t, a minibatch of M transitions $(s_j, a_j, r_j, s_{j+1})$ is sampled from the experience replay buffer to calculate the gradients and update the neural networks.
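A minimal sketch of such an experience replay buffer is shown below; the capacity and minibatch size follow Table 5, while the class interface itself is an illustrative assumption rather than the authors' implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience replay buffer storing (s_t, a_t, r_t, s_{t+1}) transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded first

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, M=64):
        # draw a random minibatch of M transitions for one gradient update
        return random.sample(self.buffer, M)
```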
An integral differential compensator is developed in the deep reinforcement learning structure to improve the accuracy and responsiveness of the tracking tasks in this work, as shown in Figure 2. The integral portion of the state is utilized to adjust the control input continuously, which eventually reduces the tracking error. The differential portion is integrated to reduce the system oscillation and accelerate stabilization. The proposed compensator is designed as follows:

$$s_{ID}^{t} = y_e^{t} + \alpha \sum_{n=1}^{t} y_e^{n} + \beta \left( y_e^{t} - y_e^{t-1} \right) \tag{15}$$
where $s_{ID}^{t}$ represents the compensator error at time t, $y_e^{t} = \left\| y_d^{t} - \hat{y}^{t} \right\|_2$ is the error between the desired trajectory $y_d^{t}$ and the measured air gap $\hat{y}^{t}$ at time t, $\alpha$ is the integral gain, and $\beta$ is the differential gain.
Then, the state $s_t$ at time t can be described as:

$$s_t = \begin{bmatrix} s_{ID}^{t} & \hat{y}^{t} & \dot{\hat{y}}^{t} & y_d^{t} & \dot{y}_d^{t} \end{bmatrix}^{T} \tag{16}$$

where $\dot{\hat{y}}^{t}$ and $\dot{y}_d^{t}$ represent the derivatives of $\hat{y}^{t}$ and $y_d^{t}$.
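The compensator of Equation (15) and the state vector of Equation (16) can be assembled at each control step as sketched below. The running sum and the one-step difference are the discrete integral and derivative, the error is treated as a magnitude per the norm in the definition of $y_e^t$, and the class interface itself is an assumption for illustration.

```python
import numpy as np

class IDCompensator:
    """Integral differential compensator of Equation (15); alpha, beta follow Table 5."""
    def __init__(self, alpha=0.01, beta=0.001):
        self.alpha, self.beta = alpha, beta
        self.integral = 0.0    # running sum of y_e over the episode
        self.prev_error = 0.0  # y_e at the previous step

    def build_state(self, y_d, y_hat, yd_dot, yhat_dot):
        y_e = abs(y_d - y_hat)                  # tracking error magnitude y_e^t
        self.integral += y_e
        s_id = y_e + self.alpha * self.integral + self.beta * (y_e - self.prev_error)
        self.prev_error = y_e
        # state vector of Equation (16): [s_ID, y_hat, y_hat_dot, y_d, y_d_dot]
        return np.array([s_id, y_hat, yhat_dot, y_d, yd_dot])
```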
The reward function $r_t$ is designed to measure the tracking error:

$$r_t = \begin{cases} -4, & y_e^{t} > 0.005 \\ +5, & 0.003 < y_e^{t} \leq 0.005 \\ +10, & 0.001 < y_e^{t} \leq 0.003 \\ +18, & y_e^{t} \leq 0.001 \end{cases} \tag{17}$$
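A direct transcription of the piecewise reward in Equation (17) is shown below, assuming the thresholds are expressed in the same unit (mm) as the simulated air gap error.

```python
def reward(y_e):
    """Piecewise reward of Equation (17) for a tracking error magnitude y_e (in mm)."""
    e = abs(y_e)
    if e > 0.005:
        return -4.0
    elif e > 0.003:
        return 5.0
    elif e > 0.001:
        return 10.0
    else:
        return 18.0
```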
As shown in Figure 3, the adaptive sliding mode disturbance observer (ASMDO) is embedded into the DDPG-ID structure between the actor and the micropositioner system environment. The action $a_t$ applied to the environment is expressed as

$$a_t = \pi_O(s_t, w^{\mu}) + \hat{D}_t + N_t \tag{18}$$

where $w^{\mu}$ denotes the parameters of the online policy network $\pi_O$, $\hat{D}_t$ is the disturbance estimate of the micropositioner system at time t, and $N_t$ is Gaussian noise for action exploration.
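Equation (18) amounts to a thin wrapper that adds the ASMDO output and the exploration noise to the online policy output. In the sketch below, the policy is any callable returning a scalar voltage, and the noise scale sigma is an assumed value.

```python
import numpy as np

def select_action(policy, state, D_hat, sigma=0.1):
    """Compose the executed action of Equation (18): pi_O(s_t) + D_hat_t + N_t."""
    a = float(policy(state))              # online policy output pi_O(s_t, w_mu)
    noise = np.random.normal(0.0, sigma)  # Gaussian exploration noise N_t
    return a + D_hat + noise
```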

3.2.1. Critic Update

After selecting M transition samples $(s_j, a_j, r_j, s_{j+1})$ from the experience replay buffer $\Psi$, the Q value is calculated. The online Q network is responsible for calculating the current Q value as follows:

$$Q_O(s_j, a_j, w^{Q}) = w^{Q} \phi(s_j, a_j) \tag{19}$$

where $\phi(s_j, a_j)$ represents the input of the online Q network, which is an eigenvector consisting of the state $s_j$ and the action $a_j$.
The target Q network $Q_T$ is defined as:

$$Q_T(s_{j+1}, \pi_T(s_{j+1}, w^{\mu'}), w^{Q'}) = w^{Q'} \phi(s_{j+1}, \pi_T(s_{j+1}, w^{\mu'})) \tag{20}$$

where $\phi(s_{j+1}, \pi_T(s_{j+1}, w^{\mu'}))$ is the input of the target Q network, which is an eigenvector consisting of the state $s_{j+1}$ and the target policy network output $\pi_T(s_{j+1}, w^{\mu'})$.
For the target policy network $\pi_T$, the equation is:

$$\pi_T(s_{j+1}, w^{\mu'}) = w^{\mu'} s_{j+1} \tag{21}$$
Then, we rewrite the target Q value $Q_T$ as:

$$Q_T = r_j + \gamma Q_T(s_{j+1}, \pi_T(s_{j+1}, w^{\mu'}), w^{Q'}) \tag{22}$$

where $r_j$ is the reward from the selected samples.
Since M transitions $(s_j, a_j, r_j, s_{j+1})$ are sampled from the experience buffer $\Psi$, the loss function for updating the critic is shown in Equation (23):

$$L(w^{Q}) = \frac{1}{M} \sum_{j=1}^{M} \left( Q_T - Q_O(s_j, a_j, w^{Q}) \right)^2 \tag{23}$$

where $L(w^{Q})$ is the loss value of the critic.
In order to smooth the target network update process, a soft update is applied instead of periodically copying the parameters:

$$w^{Q'} \leftarrow \tau w^{Q} + (1 - \tau) w^{Q'} \tag{24}$$
where τ is the update factor, usually a small constant.
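A PyTorch-style sketch of this critic update (Equations (22)–(24)) for one sampled minibatch is given below; the network objects, the optimizer, and the minibatch tensors are assumed to be created elsewhere, and the code is an illustration rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def update_critic(critic, critic_target, actor_target, optimizer,
                  s, a, r, s_next, gamma=0.99):
    """One critic update: TD target (Eq. (22)), MSE loss (Eq. (23)), gradient step."""
    with torch.no_grad():
        a_next = actor_target(s_next)                         # target policy action
        q_target = r + gamma * critic_target(s_next, a_next)  # TD target value
    q_online = critic(s, a)                                   # current Q value, Eq. (19)
    loss = F.mse_loss(q_online, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def soft_update(target_net, online_net, tau=0.05):
    """Soft update of Equation (24): w' <- tau * w + (1 - tau) * w'."""
    for p_t, p_o in zip(target_net.parameters(), online_net.parameters()):
        p_t.data.copy_(tau * p_o.data + (1.0 - tau) * p_t.data)
```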
The diagram of the Q network is shown in Figure 4; it is a parallel neural network. The Q network includes both state and action portions, and the output value of the Q network is based on the state and the action. The state portion of the neural network consists of a state input layer, three fully connected layers, and two ReLU layers interleaved between the three fully connected layers. The neural network of the action portion contains an action input layer and a fully connected layer. The outputs of the above two portions are combined and fed into the neural network of the common part, which contains a ReLU layer and one output layer.
The parameters of each layer in the Q network are shown in Table 2.
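For illustration, the parallel structure of Figure 4 with the layer widths of Table 2 can be written as the following PyTorch module; the exact weight initialization and training details of the original implementation may differ.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Parallel critic of Figure 4 / Table 2: state branch 5-120-60-60, action branch 1-60."""
    def __init__(self, state_dim=5, action_dim=1):
        super().__init__()
        self.state_branch = nn.Sequential(
            nn.Linear(state_dim, 120), nn.ReLU(),
            nn.Linear(120, 60), nn.ReLU(),
            nn.Linear(60, 60),
        )
        self.action_branch = nn.Linear(action_dim, 60)
        self.common = nn.Sequential(nn.ReLU(), nn.Linear(60, 1))  # ReLU + output layer

    def forward(self, state, action):
        merged = self.state_branch(state) + self.action_branch(action)  # the "addLayer"
        return self.common(merged)
```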

3.2.2. Actor Update

The output of the online policy network is

$$\pi_O = w^{\mu} s_j \tag{25}$$

Since a deterministic policy is used, the calculation of the policy gradient does not integrate over the action a; instead, in contrast to the stochastic policy case, it uses the derivative of the value function $Q_O$ with respect to the action a. The gradient formula can be rewritten as follows:

$$\nabla_{w^{\mu}} J \approx \frac{1}{M} \sum_{j}^{M} \left( \nabla_{a_j} Q_O(s_j, a_j, w^{Q}) \, \nabla_{w^{\mu}} \pi_O(s_j, w^{\mu}) \right) \tag{26}$$
where the weights $w^{\mu}$ are updated with the gradient back-propagation method. The target policy network is also updated with the soft update pattern as follows:

$$w^{\mu'} \leftarrow \tau w^{\mu} + (1 - \tau) w^{\mu'} \tag{27}$$
where τ is the update factor, usually a small constant.
Figure 5 shows the diagram of the policy network in this paper, which contains a state input layer, a fully connected layer, a tanh layer, and an output layer. The parameters of each layer in the policy network are shown in Table 3.
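Correspondingly, a sketch of the policy network with the layer sizes of Table 3, together with the deterministic policy gradient step of Equation (26), is given below. Maximizing Q by descending its negative mean is an equivalent formulation; the code is illustrative rather than the authors' implementation.

```python
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Actor of Figure 5 / Table 3: StateLayer(5) -> ActorFC1(30) -> tanh -> ActorOutput(1)."""
    def __init__(self, state_dim=5):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 30), nn.Tanh(), nn.Linear(30, 1))

    def forward(self, state):
        return self.net(state)

def update_actor(actor, critic, optimizer, s):
    """Deterministic policy gradient step of Equation (26): ascend Q(s, pi(s))."""
    loss = -critic(s, actor(s)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```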
The pseudocode of Algorithm 1 is given as follows:
Algorithm 1 DDPG-ID Algorithm.
1: Randomly initialize the online Q network with weights $w^Q$
2: Randomly initialize the online policy network with weights $w^{\mu}$
3: Initialize the target Q network by $w^{Q'} \leftarrow w^Q$
4: Initialize the target policy network by $w^{\mu'} \leftarrow w^{\mu}$
5: Initialize the experience replay buffer $\Psi$
6: Load the simplified micropositioner dynamic model
7: for episode = 1, MaxEpisode do
8:   Initialize a noise process N for exploration
9:   Initialize the ASMDO and the ID compensator
10:  Randomly initialize the micropositioner states
11:  Receive the initial observation state $s_1$
12:  for step = 1, T do
13:    Select action $a_t = \pi_O(s_t) + \hat{D}_t + N_t$
14:    Use $a_t$ to run the micropositioner system model
15:    Process the errors with the integral differential compensator
16:    Receive reward $r_t$ and new state $s_{t+1}$
17:    Store transition $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer $\Psi$
18:    Randomly sample a minibatch of M transitions $(s_j, a_j, r_j, s_{j+1})$ from $\Psi$
19:    Set $Q_T = r_j + \gamma Q_T(s_{j+1}, \pi_T(s_{j+1}, w^{\mu'}), w^{Q'})$
20:    Minimize the loss $L(w^Q) = \frac{1}{M} \sum_{j=1}^{M} (Q_T - Q_O(s_j, a_j, w^Q))^2$ to update the online Q network
21:    Update the online policy network with the sampled policy gradient: $\nabla_{w^{\mu}} J = \frac{1}{M} \sum_{j}^{M} (\nabla_{a_j} Q_O(s_j, a_j, w^Q) \nabla_{w^{\mu}} \pi_O(s_j, w^{\mu}))$
22:    Update the target networks: $w^{Q'} \leftarrow \tau w^Q + (1 - \tau) w^{Q'}$, $w^{\mu'} \leftarrow \tau w^{\mu} + (1 - \tau) w^{\mu'}$
23:  end for
24: end for

4. Simulation and Experimental Results

In this section, two kinds of periodic external disturbances were added to verify the practicability of the proposed ASMDO, and three distinct desired trajectories were utilized to evaluate the performance of the proposed deep reinforcement learning control strategy. A traditional DDPG algorithm and a well-tuned PID strategy were adopted for comparison. To further verify the spatial performance of the proposed algorithm, two different trajectories were introduced in the experiments.

4.1. Simulation Results

The parametric equations of the two kinds of periodic external disturbances are defined as $d_1 = 0.1 \sin(2\pi t) + 0.1 \sin(0.5\pi t + \frac{\pi}{3})$ and $d_2 = 0.1 + 0.1 \sin(0.5\pi t + \frac{\pi}{3})$. Based on the micropositioner model proposed in [44], the effectiveness of the observer is presented in Figure 6 and Figure 7.
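For reference, the two disturbance signals can be generated directly from the definitions above; the sampling of t is left to the caller.

```python
import numpy as np

def d1(t):
    return 0.1 * np.sin(2 * np.pi * t) + 0.1 * np.sin(0.5 * np.pi * t + np.pi / 3)

def d2(t):
    return 0.1 + 0.1 * np.sin(0.5 * np.pi * t + np.pi / 3)
```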
The disturbance estimation results of the proposed ASMDO are presented in Figure 6a and Figure 7a; it can be seen that the observer tracks the given disturbances rapidly. The estimation errors are less than 0.01 mm in Figure 6b and Figure 7b, which shows the effectiveness of the ASMDO as interference compensation.
The dynamics model of the micropositioner is given in Section 2, and its basic system model parameters are taken from our previous research [44,47], as shown in Table 4. The DDPG baseline uses the same neural network structure and training parameters as DDPG-ID in this paper. The training parameters of DDPG-ID and DDPG are shown in Table 5.
The first desired trajectory designed for tracking control simulation is a waved signal. According to the initial conditions, the parametric equation of the waved trajectory is defined as:
$$y_d(t) = 0.985 - 0.015 \sin\left( \frac{\pi t}{4} - \frac{\pi}{2} \right) \tag{28}$$
The training processes of both DDPG-ID and DDPG are run on the same model with stochastically initialized micropositioner states. During the training evaluation, a larger episode reward indicates a more accurate, lower-error control policy. As shown in Figure 8, DDPG-ID reaches the maximum reward score within fewer episodes than DDPG, which reveals that the DDPG-ID algorithm converges faster than the DDPG algorithm. Comparing Figure 8a with Figure 8b, the average reward of the DDPG-ID training process in the stable stage is larger than that of DDPG, which further indicates that the policy learned by the DDPG-ID algorithm has better performance. The trained algorithms are then employed in the tracking control simulation experiments of the micropositioner system.
The tracking results of the waved trajectory are shown in Figure 9. The RMSE, MAX, and mean values of the tracking errors for the three control methods are provided in Table 6. In terms of tracking accuracy, the trained DDPG-ID controller performs better than DDPG and PID, with a smaller state error and a smoother tracking trajectory. The tracking error of the DDPG-ID algorithm ranges from $-8 \times 10^{-4}$ to $9 \times 10^{-4}$ mm, which is almost half of that of the DDPG policy. Meanwhile, the DDPG controller has a smaller tracking error than PID. A huge oscillation is induced by the PID controller, which would affect the hardware to a certain extent in actual operation. This oscillating input signal is much larger than a normal control input signal, which typically ranges from 0 to 11 V. Owing to the characteristics of reinforcement learning, it is hard for a well-trained policy to generate such a shock signal.
The tracking results of another periodic trajectory are illustrated in Figure 10, and the tracking errors of the three control methods are compared in Table 7. The parametric equation of the periodic trajectory is defined as

$$y_d(t) = 0.981 - 0.015 \sin\left( \frac{\pi t}{4} - \frac{\pi}{2} \right) + 0.008 \sin\left( \frac{\pi t}{2} - \frac{\pi}{16} \right) \tag{29}$$

As can be seen in these figures, the tracking error of DDPG-ID in the periodic trajectory is still less than that of the other methods, ranging from $-1.6 \times 10^{-4}$ to $9 \times 10^{-4}$ mm. Similar to the previous waved trajectory, the control input based on DDPG shows better performance in terms of oscillations.
To further demonstrate the universality of the DDPG-ID policy, a periodic step trajectory is also utilized for comparison. A step signal with a period of 8 s is designed as the desired trajectory, as shown in Figure 11a. The well-tuned PID controller was also tested in this step trajectory simulation; since intense oscillations emerge and the PID results show extremely poor performance, they are not presented in this paper.
According to Figure 11, the tracking result of the DDPG-ID algorithm remains stable, with the tracking error bounded between $-2 \times 10^{-4}$ and $9 \times 10^{-4}$ mm, which is still about half of the DDPG error. Due to the characteristics of the step signal, the state error becomes large during the step transitions, and the errors of DDPG-ID and DDPG are observed to drop quickly after each transition. It can be seen from Table 8 that the errors of the DDPG-ID algorithm are substantially smaller than those of the DDPG algorithm. As for the control inputs, the value of DDPG still fluctuates considerably after the state has converged.
According to the above simulation results, it can be concluded that the DDPG-ID control policy successfully deals with the collective effect caused by the disturbances and the inaccurate estimation in deep reinforcement learning, compared to DDPG. The comparison results also demonstrate the excellent control performance of the policy learned by the DDPG-ID algorithm.

4.2. Experimental Results

The speed, acceleration, and direction of the designed trajectories vary with time, which makes the experimental results more trustworthy. In each test, the EMA in the micropositioner is regulated to track the desired path of the working air gap.
As shown in Figure 12, a laser displacement sensor is utilized to detect the motion states. The DDPG-ID algorithm was then deployed on a SimLab board through Matlab-Simulink. The EMA controls the movement of the chain mechanism by executing the control signal from the analog output port of the SimLab board, while the analog input port of the SimLab board is connected to the signal output of the laser displacement sensor.
Figure 13 shows the tracking experiment results of the waved trajectory. The platform reaches the starting point along a straight track at a speed of 5.6 μm/s. At time 5 s, it begins to track the desired waved trajectory for three periods, where the waved trajectory can be described as $y_d(t) = 28 + 25 \sin\left( \frac{\pi t}{10} + \frac{\pi}{2} \right)$. The tracking error fluctuates within ±1.5 μm, as demonstrated in Figure 13b. Except for several particular points in time, the tracking errors remain within ±1 μm.
Another periodic trajectory tracking experiment was also executed. As shown in Figure 14, the desired periodic trajectory starts at time 5 s and is defined as $y_d(t) = 35 - 25 \sin\left( \frac{\pi t}{7.5} - \frac{2\pi}{3} \right) - 5 \sin\left( \frac{\pi t}{15} + \frac{\pi}{6} \right)$. The tracking error of the periodic trajectory still remains within ±1.5 μm.
The experimental results show that the proposed DDPG-ID algorithm is able to closely track the above two trajectories. Compared with the simulation results, the tracking error does not increase significantly and can be maintained between −1 μm and +1 μm.

5. Conclusions and Future Works

In this paper, a composite controller is developed based on an adaptive sliding mode disturbance observer and a deep reinforcement learning control scheme. A deep deterministic policy gradient is utilized to obtain the optimal control performance. To improve the tracking accuracy and transient response time, an integral differential compensator is applied during the learning process in the actor–critic framework. An adaptive sliding mode disturbance observer is developed to further reduce the influence of modeling uncertainty, external disturbances, and the effect of the inaccurate value function. In comparison with the existing DDPG and the most commonly used PID controller, the trajectory tracking results in the simulation have successfully indicated the satisfactory performance and precision of the control policy based on the DDPG-ID algorithm. The tracking errors are less than 1 μm, which shows the significant tracking efficiency of the proposed method. The experimental results also indicate the high accuracy and strong anti-interference capability of the proposed deep reinforcement learning control scheme. To further improve the tracking performance and realize micro-manipulation tasks in future work, specific operation experiments will be performed, such as cell manipulation and micro-assembly.

Author Contributions

Writing—original draft preparation, S.L., R.X., X.X. and Z.Y.; writing—review and editing, S.L. and R.X.; data collection, S.L. and R.X.; visualization, S.L., R.X., X.X. and Z.Y.; supervision, Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded in part by the Science and Technology Development Fund, Macau SAR (Grant No. 0018/2019/AKP and SKL-IOTSC(UM)-2021-2023), in part by the Ministry of Science and Technology of China (Grant No. 2019YFB1600700), in part by the Guangdong Science and Technology Department (Grant No. 2018B030324002 and 2020B1515130001), in part by the Zhuhai Science and Technology Innovation Bureau (Grant no. ZH22017002200001PWC), Jiangsu Science and Technology Department (Grant No. BZ2021061), and in part by the University of Macau (Grant No. MYRG2020-00253-FST).

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

PID: Proportional–integral–derivative control
RBFNN: Radial basis function neural network
RL: Reinforcement learning
SARSA: State–action–reward–state–action
Q: The value of an action in reinforcement learning
DRL: Deep reinforcement learning
DNN: Deep neural network
DQN: Deep Q network
PG: Policy gradient
DDPG: Deep deterministic policy gradient
ID: Integral differential compensator
T_m: The magnetic force
y: The working air gap in the micropositioner
I_c: The excitation current in the micropositioner
EMA: The electromagnetic actuator
V_i: The input voltage of the electromagnetic actuator
R: The resistance of the coil in the micropositioner
H: The coil inductance in the micropositioner
u: The control input
D: The lumped system disturbance
ASMDO: Adaptive sliding mode disturbance observer
s_t: The state at time t in reinforcement learning
a_t: The action at time t in reinforcement learning
r_t: The reward at time t in reinforcement learning
ReLU: Rectified linear unit activation function
tanh: Hyperbolic tangent activation function

References

1. Català-Castro, F.; Martín-Badosa, E. Positioning Accuracy in Holographic Optical Traps. Micromachines 2021, 12, 559.
2. Bettahar, H.; Clévy, C.; Courjal, N.; Lutz, P. Force-Position Photo-Robotic Approach for the High-Accurate Micro-Assembly of Photonic Devices. IEEE Robot. Autom. Lett. 2020, 5, 6396–6402.
3. Cox, L.M.; Martinez, A.M.; Blevins, A.K.; Sowan, N.; Ding, Y.; Bowman, C.N. Nanoimprint lithography: Emergent materials and methods of actuation. Nano Today 2020, 31, 100838.
4. Dai, C.; Zhang, Z.; Lu, Y.; Shan, G.; Wang, X.; Zhao, Q.; Ru, C.; Sun, Y. Robotic manipulation of deformable cells for orientation control. IEEE Trans. Robot. 2019, 36, 271–283.
5. Zhang, P.; Yang, Z. A robust AdaBoost.RT based ensemble extreme learning machine. Math. Probl. Eng. 2015, 2015, 260970.
6. Yang, Z.; Wong, P.; Vong, C.; Zong, J.; Liang, J. Simultaneous-fault diagnosis of gas turbine generator systems using a pairwise-coupled probabilistic classifier. Math. Probl. Eng. 2013, 2013, 827128.
7. Wang, D.; Zhou, L.; Yang, Z.; Cui, Y.; Wang, L.; Jiang, J.; Guo, L. A new testing method for the dielectric response of oil-immersed transformer. IEEE Trans. Ind. Electron. 2019, 67, 10833–10843.
8. Roshandel, N.; Soleymanzadeh, D.; Ghafarirad, H.; Koupaei, A.S. A modified sensorless position estimation approach for piezoelectric bending actuators. Mech. Syst. Signal Process. 2021, 149, 107231.
9. Ding, B.; Yang, Z.X.; Xiao, X.; Zhang, G. Design of reconfigurable planar micro-positioning stages based on function modules. IEEE Access 2019, 7, 15102–15112.
10. García-Martínez, J.R.; Cruz-Miguel, E.E.; Carrillo-Serrano, R.V.; Mendoza-Mondragón, F.; Toledano-Ayala, M.; Rodríguez-Reséndiz, J. A PID-type fuzzy logic controller-based approach for motion control applications. Sensors 2020, 20, 5323.
11. Salehi Kolahi, M.R.; Gharib, M.R.; Heydari, A. Design of a non-singular fast terminal sliding mode control for second-order nonlinear systems with compound disturbance. Proc. Inst. Mech. Eng. Part C J. Mech. Eng. Sci. 2021, 235, 7343–7352.
12. Nguyen, M.H.; Dao, H.V.; Ahn, K.K. Adaptive Robust Position Control of Electro-Hydraulic Servo Systems with Large Uncertainties and Disturbances. Appl. Sci. 2022, 12, 794.
13. Cruz-Miguel, E.E.; García-Martínez, J.R.; Rodríguez-Reséndiz, J.; Carrillo-Serrano, R.V. A new methodology for a retrofitted self-tuned controller with open-source FPGA. Sensors 2020, 20, 6155.
14. Montalvo, V.; Estévez-Bén, A.A.; Rodríguez-Reséndiz, J.; Macias-Bobadilla, G.; Mendiola-Santíbañez, J.D.; Camarillo-Gómez, K.A. FPGA-Based Architecture for Sensing Power Consumption on Parabolic and Trapezoidal Motion Profiles. Electronics 2020, 9, 1301.
15. García-Martínez, J.R.; Rodríguez-Reséndiz, J.; Cruz-Miguel, E.E. A new seven-segment profile algorithm for an open source architecture in a hybrid electronic platform. Electronics 2019, 8, 652.
16. Fei, J.; Fang, Y.; Yuan, Z. Adaptive Fuzzy Sliding Mode Control for a Micro Gyroscope with Backstepping Controller. Micromachines 2020, 11, 968.
17. Ruan, W.; Dong, Q.; Zhang, X.; Li, Z. Friction Compensation Control of Electromechanical Actuator Based on Neural Network Adaptive Sliding Mode. Sensors 2021, 21, 1508.
18. Gharib, M.R.; Koochi, A.; Ghorbani, M. Path tracking control of electromechanical micro-positioner by considering control effort of the system. Proc. Inst. Mech. Eng. Part I J. Syst. Control Eng. 2021, 235, 984–991.
19. Han, M.; Tian, Y.; Zhang, L.; Wang, J.; Pan, W. Reinforcement learning control of constrained dynamic systems with uniformly ultimate boundedness stability guarantee. Automatica 2021, 129, 109689.
20. de Orio, R.L.; Ender, J.; Fiorentini, S.; Goes, W.; Selberherr, S.; Sverdlov, V. Optimization of a spin-orbit torque switching scheme based on micromagnetic simulations and reinforcement learning. Micromachines 2021, 12, 443.
21. Adda, C.; Laurent, G.J.; Le Fort-Piat, N. Learning to control a real micropositioning system in the STM-Q framework. In Proceedings of the 2005 IEEE International Conference on Robotics and Automation, Barcelona, Spain, 18–22 April 2005; pp. 4569–4574.
22. Li, J.; Li, Z.; Chen, J. Reinforcement learning based precise positioning method for a millimeters-sized omnidirectional mobile microrobot. In Proceedings of the International Conference on Intelligent Robotics and Applications, Wuhan, China, 15–17 October 2008; pp. 943–952.
23. Shi, H.; Shi, L.; Sun, G.; Hwang, K.S. Adaptive Image-Based Visual Servoing for Hovering Control of Quad-Rotor. IEEE Trans. Cogn. Dev. Syst. 2019, 12, 417–426.
24. Zheng, N.; Ma, Q.; Jin, M.; Zhang, S.; Guan, N.; Yang, Q.; Dai, J. Abdominal-waving control of tethered bumblebees based on sarsa with transformed reward. IEEE Trans. Cybern. 2018, 49, 3064–3073.
25. Tang, L.; Yang, Z.X.; Jia, K. Canonical correlation analysis regularization: An effective deep multiview learning baseline for RGB-D object recognition. IEEE Trans. Cogn. Dev. Syst. 2018, 11, 107–118.
26. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602.
27. Sutton, R.S.; McAllester, D.A.; Singh, S.P.; Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 27–30 November 2000; pp. 1057–1063.
28. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning (PMLR), Beijing, China, 22–24 June 2014; pp. 387–395.
29. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
30. Latifi, K.; Kopitca, A.; Zhou, Q. Model-free control for dynamic-field acoustic manipulation using reinforcement learning. IEEE Access 2020, 8, 20597–20606.
31. Leinen, P.; Esders, M.; Schütt, K.T.; Wagner, C.; Müller, K.R.; Tautz, F.S. Autonomous robotic nanofabrication with reinforcement learning. Sci. Adv. 2020, 6, eabb6987.
32. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
33. Zeng, Y.; Wang, G.; Xu, B. A basal ganglia network centric reinforcement learning model and its application in unmanned aerial vehicle. IEEE Trans. Cogn. Dev. Syst. 2017, 10, 290–303.
34. Guo, X.; Yan, W.; Cui, R. Event-triggered reinforcement learning-based adaptive tracking control for completely unknown continuous-time nonlinear systems. IEEE Trans. Cybern. 2019, 50, 3231–3242.
35. Zhang, J.; Shi, P.; Xia, Y.; Yang, H.; Wang, S. Composite disturbance rejection control for Markovian Jump systems with external disturbances. Automatica 2020, 118, 109019.
36. Ahmed, S.; Wang, H.; Tian, Y. Adaptive high-order terminal sliding mode control based on time delay estimation for the robotic manipulators with backlash hysteresis. IEEE Trans. Syst. Man Cybern. Syst. 2019, 51, 1128–1137.
37. Chen, M.; Xiong, S.; Wu, Q. Tracking flight control of quadrotor based on disturbance observer. IEEE Trans. Syst. Man Cybern. Syst. 2019, 51, 1414–1423.
38. Zhao, Z.; He, X.; Ahn, C.K. Boundary disturbance observer-based control of a vibrating single-link flexible manipulator. IEEE Trans. Syst. Man Cybern. Syst. 2019, 51, 2382–2390.
39. Alibekov, E.; Kubalík, J.; Babuška, R. Policy derivation methods for critic-only reinforcement learning in continuous spaces. Eng. Appl. Artif. Intell. 2018, 69, 178–187.
40. Hasselt, H. Double Q-learning. Adv. Neural Inf. Process. Syst. 2010, 23, 2613–2621.
41. Zhang, S.; Sun, C.; Feng, Z.; Hu, G. Trajectory-Tracking Control of Robotic Systems via Deep Reinforcement Learning. In Proceedings of the 2019 IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE Conference on Robotics, Automation and Mechatronics (RAM), Bangkok, Thailand, 18–20 November 2019; pp. 386–391.
42. Kiumarsi, B.; Vamvoudakis, K.G.; Modares, H.; Lewis, F.L. Optimal and autonomous control using reinforcement learning: A survey. IEEE Trans. Neural Netw. Learn. Syst. 2017, 29, 2042–2062.
43. Yang, X.; Zhang, H.; Wang, Z. Policy Gradient Reinforcement Learning for Parameterized Continuous-Time Optimal Control. In Proceedings of the 2021 33rd Chinese Control and Decision Conference (CCDC), Kunming, China, 22–24 May 2021; pp. 59–64.
44. Xiao, X.; Xi, R.; Li, Y.; Tang, Y.; Ding, B.; Ren, H.; Meng, M.Q.H. Design and control of a novel electromagnetic actuated 3-DoFs micropositioner. Microsyst. Technol. 2021, 27, 1–10.
45. Tommasino, P.; Caligiore, D.; Mirolli, M.; Baldassarre, G. A reinforcement learning architecture that transfers knowledge between skills when solving multiple tasks. IEEE Trans. Cogn. Dev. Syst. 2016, 11, 292–317.
46. Srikant, R.; Ying, L. Finite-time error bounds for linear stochastic approximation and TD learning. In Proceedings of the Conference on Learning Theory (PMLR), Phoenix, AZ, USA, 25–28 June 2019; pp. 2803–2830.
47. Feng, Z.; Ming, M.; Ling, J.; Xiao, X.; Yang, Z.X.; Wan, F. Fractional delay filter based repetitive control for precision tracking: Design and application to a piezoelectric nanopositioning stage. Mech. Syst. Signal Process. 2022, 164, 108249.
Figure 1. The diagrammatic model of the EMA-actuated micropositioner. (a) The front view of the micropositioner. (b) The end view of the micropositioner. (c) The vertical view of the micropositioner.
Figure 2. The structure diagram of the DDPG-ID algorithm.
Figure 3. System signal flow chart.
Figure 4. The diagram of the Q network.
Figure 5. The diagram of the policy network.
Figure 6. Observation results of the ASMDO with d_2. (a) Observation result based on the ASMDO. (b) Observation error based on the ASMDO.
Figure 7. Observation results of the ASMDO with d_1. (a) Observation result based on the ASMDO. (b) Observation error based on the ASMDO.
Figure 8. The training rewards of the two RL schemes. (a) The training rewards generated by DDPG-ID. (b) The training rewards generated by DDPG.
Figure 9. Tracking results comparison of the waved trajectory. (a) Tracking results comparison based on the three control schemes. (b) Tracking error comparison based on the three control schemes. (c) Control input comparison based on the three control schemes.
Figure 10. Tracking results comparison of the periodic trajectory. (a) Tracking results comparison based on the three control schemes. (b) Tracking error comparison based on the three control schemes. (c) Control input comparison based on the three control schemes.
Figure 11. Tracking results comparison of the step trajectory. (a) Tracking results comparison based on the two control schemes. (b) Tracking error comparison based on the two control schemes. (c) Control input comparison based on the two control schemes.
Figure 12. The schematic diagram of the experiment system.
Figure 13. Tracking results of the waved trajectory. (a) Tracking result of the desired trajectory. (b) Tracking error of the desired trajectory.
Figure 14. Tracking results of the periodic trajectory in the experiment. (a) Tracking result of the desired trajectory. (b) Tracking error of the desired trajectory.
Table 1. Comparison of different control algorithms.

Method | Advantages | Disadvantages
PID control | Simple design structure; easy to implement | Mainly used in linear systems; requires full-state feedback; lacks adaptivity
SMC control | Simple design structure; easy to implement; high robustness | Excessive chattering effect; lacks adaptivity
Adaptive control | Lower initial cost; lower cost of redundancy; high reliability and performance | Stability is not treated rigorously; high-gain observers needed; slow convergence
Backstepping control | Global stability; simple design structure; easy to be integrated | Low anti-interference ability; sensitive to system models; lacks adaptivity
RL control | No need for an accurate model; improved control performance; high adaptivity | Poor anti-interference ability; prone to state errors
Table 2. Q network parameters.

Network Layer Name | Number of Nodes
StateLayer | 5
CriticStateFC1 | 120
CriticStateFC2 | 60
CriticStateFC3 | 60
ActionInput | 1
CriticActionFC1 | 60
addLayer | 2
CriticOutput | 1
Table 3. Policy network parameters.

Network Layer Name | Number of Nodes
StateLayer | 5
ActorFC1 | 30
ActorOutput | 1
Table 4. Parameters of the micropositioner model.

Notation | Value | Unit
L_1 | 13.21 | H
L_0 | 0.67 | H
a | 1.11 × 10^-5 | m
R | 43.66 | Ω
c | 8.83 × 10^-5 | N·m²·A⁻²
k | 1.803 × 10^5 | N·m⁻¹
m | 0.0272 | kg
Table 5. Training parameters of DDPG-ID and DDPG.

Hyperparameter | Value
Learning rate for actor φ_1 | 0.001
Learning rate for critic φ_2 | 0.001
Discount factor γ | 0.99
Initial exploration ε | 1
Experience replay buffer size ψ | 100,000
Minibatch size M | 64
Max episode ϖ | 1500
Soft update factor τ | 0.05
Max exploration steps T | 250 (25 s)
Time step T_s | 0.01 s
Integral gain α | 0.01
Differential gain β | 0.001
Table 6. Tracking errors comparison of different controllers in the waved trajectory.

Controller | RMSE | MAX | MEAN
DDPG-ID | 3.658 × 10^-4 | 4.758 × 10^-4 | 1.003 × 10^-4
DDPG | 1.093 × 10^-3 | 2.615 × 10^-3 | 4.414 × 10^-4
PID | 1.654 × 10^-3 | 3.144 × 10^-4 | 3.104 × 10^-4
Table 7. Tracking errors comparison of different controllers in the periodic trajectory.

Controller | RMSE | MAX | MEAN
DDPG-ID | 4.272 × 10^-4 | 8.471 × 10^-4 | 5.404 × 10^-5
DDPG | 1.545 × 10^-3 | 3.102 × 10^-3 | 1.610 × 10^-4
PID | 1.923 × 10^-3 | 3.376 × 10^-3 | 3.311 × 10^-4
Table 8. Tracking errors comparison of different controllers in the step trajectory.

Controller | RMSE | MAX | MEAN
DDPG-ID | 4.612 × 10^-3 | 0.02953 | 6.938 × 10^-4
DDPG | 5.279 × 10^-3 | 0.02986 | 1.437 × 10^-3
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
