# Robust Control Strategy for Quadrotor Drone Using Reference Model-Based Deep Deterministic Policy Gradient


## Abstract


## 1. Introduction

#### 1.1. Related Work

#### 1.2. Contributions

- (1) A method was proposed to track the states of a reference model designed from the real system model; the reference model serves as a baseline of actions that better directs the agent in controlling the quadrotor.
- (2) A reward-calculation scheme based on a model of a real quadrotor drone was designed to train the RL agent. Because the system state variables differ in physical meaning and units, several hyperparameters were introduced to adjust the weights of the state errors in the reward.
- (3) An RL-based quadrotor control algorithm, RM–DDPG, was proposed to improve implementation performance: it eliminates steady-state error, reduces controller saturation, and guarantees robustness. To the best of our knowledge, this is the first time an RL method has been applied to a system with fast dynamics, such as quadrotor attitude control, and the first experiment to demonstrate practical performance on an actual quadrotor drone.
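The weighted reward described in contribution (2) can be sketched as follows. This is a minimal illustration, not the paper's actual reward: the state layout (angle, angular rate), the weight values, and the action-penalty term are all assumptions introduced here.

```python
import numpy as np

def reference_tracking_reward(state, ref_state, action,
                              weights=(1.0, 0.1), action_weight=0.01):
    """Negative weighted squared error between the quadrotor state and the
    reference-model state, plus a small penalty on control effort.

    state, ref_state: (angle, angular_rate) in rad and rad/s.
    weights: per-state weights compensating for the differing units and
    magnitudes of the state variables (hypothetical values).
    """
    errors = np.asarray(state, dtype=float) - np.asarray(ref_state, dtype=float)
    w = np.asarray(weights, dtype=float)
    return -float(w @ errors**2) - action_weight * float(action) ** 2

# Perfect tracking with zero control input yields the maximum reward of 0.
```

Weighting each error term separately is what lets one hyperparameter trade off angle error (rad) against angular-rate error (rad/s) despite their different scales.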

## 2. Problem Formulation

#### 2.1. Dynamic Model of the Quadrotor Drone

#### 2.2. Policy Gradient Method of Reinforcement Learning

#### 2.3. RM–DDPG Algorithm

#### 2.4. Neural Network Structure

**Algorithm 1** Offline training algorithm of RM–DDPG.

**Initialize:**
- Randomly initialize the weights of the actor network ${\pi}^{\mu}$ and critic network ${Q}^{w}$
- Copy parameters from the actor network ${\pi}^{\mu}$ and critic network ${Q}^{w}$ to the target actor network ${\pi}^{{\mu}^{\prime}}$ and target critic network ${Q}^{{w}^{\prime}}$, respectively
- Create an empty replay buffer D with length M
- Load the quadrotor drone model and the reference model as the environment
- Create a noise distribution $N(0,{\sigma}^{2})$ for exploration

**For** episode = 1, M **do**
- Randomly reset the quadrotor states and target states
- Initialize the reference model states by copying the quadrotor states
- Observe the initial states ${s}_{1}$
- **For** t = 1, T **do**
  - **If** the length of replay buffer D exceeds the mini-batch size, **then** choose action ${a}_{t}={\pi}^{\mu}\left({s}_{t}\right)+{n}_{t}$ based on state ${s}_{t}$ and noise ${n}_{t}\sim N$; **else** choose an arbitrary action from the action space
  - Perform control command ${a}_{t}$ in the environment
  - Calculate reward ${r}_{t}$ and new state ${s}_{t+1}$
  - Store the transition tuple (${s}_{t}$, ${a}_{t}$, ${r}_{t}$, ${s}_{t+1}$) in replay buffer D
  - **If** the length of replay buffer D exceeds the mini-batch size, **then**
    - Randomly sample a data batch from D
    - Calculate the gradient and update the critic network following (20), (21)
    - According to the output of the critic network, update the actor network following (22), (23)
    - Soft update the target network parameters following (18)
  - **End if**
  - **If** ${s}_{t+1}$ exceeds the safe range, **then** break
- **End for**

**End for**
Save the model or evaluate
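Two of the bookkeeping pieces in Algorithm 1, the replay buffer and the soft (Polyak) update of the target networks, can be sketched compactly. This is a generic DDPG-style sketch under assumed interfaces, not the paper's implementation; the parameter lists here stand in for the actual network weights.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Fixed-capacity buffer of (s, a, r, s_next) transitions; oldest
    transitions are evicted automatically once capacity is reached."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement, as in standard DDPG.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

def soft_update(target_params, source_params, eta=0.002):
    """Polyak averaging of target-network parameters:
    theta_target <- (1 - eta) * theta_target + eta * theta_source."""
    return [(1 - eta) * t + eta * s for t, s in zip(target_params, source_params)]
```

With the small soft-update rate from the hyperparameter table, the target networks trail the learned networks slowly, which is what stabilizes the TD targets during training.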

## 3. Experiments and Analysis

#### 3.1. Drone Model and Simulator

#### 3.2. Performance Test

#### 3.3. Robustness Test

#### 3.4. Real Flight Experiment

## 4. Discussion and Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

**Figure 5.** (**a**) The angular rate model of the quadrotor was identified from sectional data; (**b**) the fitting result on the whole flight dataset. For experimental safety, the recorded data do not start at 0 s.

**Figure 6.** (**a**) Step response of the angular velocity model, and (**b**) the corresponding attitude angle. The SI units on the x- and y-axes are s and rad/s, respectively. Because this is the step response of the angular velocity, the angle in (**b**) takes the shape of a ramp.

**Figure 7.** (**a**) Step response of the designed reference model. (**b**) The x-axis represents time (s), and the y-axis represents the target angle, angular velocity, and angular acceleration in rad, rad/s, and rad/s$^{2}$, respectively.
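A reference model of the kind whose step response Figure 7 shows can be illustrated with a generic second-order system. The natural frequency and damping ratio below are assumed values for illustration, not the parameters used in the paper.

```python
def simulate_reference_model(target, wn=8.0, zeta=1.0, dt=0.02, steps=500):
    """Step response of a second-order reference model
        x_ddot = wn^2 * (target - x) - 2 * zeta * wn * x_dot,
    integrated with forward Euler at the simulation timestep dt.
    wn (rad/s) and zeta are assumed, hypothetical values."""
    x, x_dot = 0.0, 0.0
    trajectory = []
    for _ in range(steps):
        x_ddot = wn**2 * (target - x) - 2.0 * zeta * wn * x_dot
        x_dot += x_ddot * dt
        x += x_dot * dt
        trajectory.append(x)
    return trajectory
```

With critical damping (zeta = 1), the modeled angle settles at the target without overshoot, giving the agent a smooth, feasible trajectory of states to track.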

**Figure 9.** The transition of the system states (angle, angular rate) and control input during the step response (**a**) and sine-wave response (**b**). The maximum control input was 3.0. Classical DDPG tended to output the maximum control input, which is unacceptable on a real quadrotor; by contrast, RM–DDPG was smoother and largely eliminated steady-state errors.

**Figure 10.** Performance of the controller on drones with different diagonal lengths. The controller applied the control policy corresponding to the size of the drone to maintain a consistent attitude-control performance. (**a**) Attitude angle during the step response of drones with different diagonal lengths; (**b**) control input during the step response of drones with different diagonal lengths.

**Figure 11.** The RM–DDPG method drove the quadrotor back to a stable attitude from different initial angles.

**Figure 12.** Experimental results on a real quadrotor. (**a**) Flight experiment of classical DDPG on a real quadrotor drone; (**b**) flight experiment of our RM–DDPG on the same drone.

| Parameter | Description | Value |
|---|---|---|
| L | Diagonal length | 1.1 m |
| m | Take-off weight | 6.8 kg |
| g | Acceleration due to gravity | 9.81 m/s$^{2}$ |
| K | Thrust gain | 9.01 |
| ${I}_{x},{I}_{y},{I}_{z}$ | Moments of inertia of the frame | 0.04, 0.04, 0.05 kg·m$^{2}$ |
| ${J}_{p}$ | Moment of inertia of the propeller | 0.00007 kg·m$^{2}$ |
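As a quick sanity check on the mass and gravity entries in the table, the total thrust needed to hover, and its per-rotor share under the usual assumption of four rotors (the rotor count is not stated in the table), is a one-line calculation:

```python
m = 6.8    # take-off weight (kg), from the parameter table
g = 9.81   # gravitational acceleration (m/s^2), from the parameter table

total_hover_thrust = m * g               # N, thrust balancing gravity at hover
per_rotor_thrust = total_hover_thrust / 4  # N, assuming a four-rotor layout

print(round(total_hover_thrust, 2))  # 66.71
```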

| Parameter | Value |
|---|---|
| Learning rate of critic network ${\alpha}_{w}$ | 0.001 |
| Learning rate of actor network ${\alpha}_{\mu}$ | 0.003 |
| Batch size N | 256 |
| Replay buffer size M | 100,000 |
| Discount factor $\gamma$ | 0.99 |
| Soft update rate $\eta$ | 0.002 |
| Noise variance $\sigma$ | 0.1 |
| Simulation timestep | 0.02 s |
| Maximum steps in an episode | 500 |
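Note that the last two table entries together fix the simulated episode horizon: 500 steps at 0.02 s is 10 s of flight per episode. A small sketch gathering the table into a config dictionary (the key names are hypothetical) makes the check explicit:

```python
# Hyperparameters from the training table; dictionary keys are assumed names.
config = {
    "critic_lr": 0.001,
    "actor_lr": 0.003,
    "batch_size": 256,
    "buffer_size": 100_000,
    "gamma": 0.99,
    "soft_update_rate": 0.002,
    "noise_sigma": 0.1,
    "dt": 0.02,        # simulation timestep (s)
    "max_steps": 500,  # maximum steps in an episode
}

# Episode horizon in simulated seconds.
episode_length_s = config["dt"] * config["max_steps"]
print(episode_length_s)
```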


© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Liu, H.; Suzuki, S.; Wang, W.; Liu, H.; Wang, Q.
Robust Control Strategy for Quadrotor Drone Using Reference Model-Based Deep Deterministic Policy Gradient. *Drones* **2022**, *6*, 251.
https://doi.org/10.3390/drones6090251
