# Deep Reinforcement Learning-Based Accurate Control of Planetary Soft Landing


## Abstract


## 1. Introduction

## 2. Preliminaries and Problem Formulation

#### 2.1. Soft Landing Problem Formulation

#### 2.2. RL Basis

- (1) DDPG

Algorithm 1: DDPG-based soft landing.

- (2) TD3

**Clipped Double-Q Learning.** TD3 learns two value function networks simultaneously. When computing the target, the next state ${s}^{\prime}$ and the smoothed target action ${\tilde{a}}^{\prime}({s}^{\prime})$ are fed into both target value function networks, and the smaller of the two outputs is used in the Bellman-error loss of the value function update:$$y(r,{s}^{\prime},d)=r+\gamma (1-d)\underset{i=1,2}{\mathrm{min}}\,{Q}_{{\varphi}_{\mathrm{targ},i}}({s}^{\prime},{\tilde{a}}^{\prime}({s}^{\prime}))$$**Target Policy Smoothing.** TD3 learns the value function in the same way as DDPG, except that when the value function network is updated, noise is added to the action output of the target policy network to avoid over-exploitation of the value function:$$\underset{{a}^{\prime}}{\mathrm{max}}\,{Q}_{\varphi}({s}^{\prime},{a}^{\prime})\approx {Q}_{{\varphi}_{\mathrm{targ}}}({s}^{\prime},{\pi}_{{\theta}_{\mathrm{targ}}}({s}^{\prime})+\epsilon )$$**Delayed Policy Updates.** Because the output of the target policy network is used to compute the value function target, frequent policy updates can make the agent brittle. TD3 therefore updates the policy network less frequently than the value function network, which suppresses training fluctuations and makes the learning process more stable.
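The three tricks above can be sketched in a few lines. The following is a minimal NumPy illustration of the TD3 target computation (clipped double-Q with target policy smoothing); the networks, the replay buffer, and the delayed-update schedule are omitted, and all function names and default values here are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def td3_target(r, s_next, done, q1_targ, q2_targ, pi_targ,
               gamma=0.99, sigma=0.2, noise_clip=0.5, act_limit=1.0):
    """Compute the TD3 Bellman target y(r, s', d) described above.

    q1_targ, q2_targ: the two target value functions Q_{phi_targ,i}(s', a').
    pi_targ: the target policy pi_{theta_targ}(s').
    """
    # Target policy smoothing: clipped Gaussian noise on the target action.
    eps = np.clip(sigma * rng.standard_normal(), -noise_clip, noise_clip)
    a_next = np.clip(pi_targ(s_next) + eps, -act_limit, act_limit)
    # Clipped double-Q: use the smaller of the two target Q-values.
    q_min = min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
    return r + gamma * (1.0 - done) * q_min
```

With constant stand-in networks, e.g. `q1_targ = lambda s, a: 1.0` and `q2_targ = lambda s, a: 2.0`, the clipped target uses the smaller value 1.0, so `td3_target(1.0, s, 0.0, ...)` returns `1 + 0.99 * 1.0 = 1.99`.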

- (3) SAC

## 3. Soft Landing with DRL

#### 3.1. Reward Setting

**Goal achieving reward:** When the altitude of the lander drops below 0, the velocity is directed downward, the speed is below the soft-landing speed limit, and the attitude angles and angular rate are within their limits, the lander is considered to have achieved a soft landing and receives the reward$$r_{goal}=\lambda\,(h<0\ \mathrm{and}\ v_z<0\ \mathrm{and}\ \|v\|<v_{lim}\ \mathrm{and}\ \varphi<\varphi_{lim}\ \mathrm{and}\ \theta<\theta_{lim}\ \mathrm{and}\ \psi<\psi_{lim}\ \mathrm{and}\ \|\omega\|<\omega_{lim})$$**Velocity tracking reward:** At the beginning of the powered-descent phase, the lander is several kilometers from the landing zone with an initial velocity of around 100 m/s. If the agent were rewarded only upon achieving a soft landing in the target area, the reward would be so sparse that training would be nearly impossible to converge. Therefore, we transform the soft landing problem into a velocity tracking problem: a process reward is introduced during descent, with a reference velocity determined by the real-time relative position between the lander and the target landing area$$v_{ref}=\begin{cases}-\dfrac{r-r_1}{k_{v1}}, & h\ge h_1\\ -\dfrac{r-r_2}{k_{v2}}, & 0\le h<h_1\end{cases}$$$$r_{vel}=\beta\,\|v-v_{ref}\|$$**Crash penalty:** To avoid a crash of the lander, a penalty is included in the reward. When an attitude angle or the velocity deviation exceeds its threshold, the episode terminates and the environment returns a large negative reward as a penalty$$r_{crash}=\eta\,(\varphi>\varphi_{lim}\ \mathrm{or}\ \theta>\theta_{lim}\ \mathrm{or}\ \psi>\psi_{lim})$$**Fuel consumption penalty:** In planetary exploration missions, the fuel carried by the lander is limited, so fuel consumption should be minimized. A reward term for fuel consumption is defined as$$r_{fuel}=\alpha\,\frac{1}{I_{sp}}\sum_{i=1}^{6}T_i$$**Constant reward:** Note that the rewards $r_{vel}$, $r_{crash}$, and $r_{fuel}$ are all negative. To encourage the agent to explore, a positive constant reward is added at every step:$$r_{constant}=\kappa$$
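As a concrete illustration, the reward terms above can be combined into a single per-step reward. The NumPy sketch below uses placeholder coefficients and limits (`LAMBDA`, `BETA`, `ETA`, `ALPHA`, `KAPPA` and the `*_LIM` values are hypothetical, not the paper's tuned values) and takes the reference velocity as an input, since its computation from the relative position is covered by the piecewise formula above.

```python
import numpy as np

# Minimal sketch of the shaped reward described above. All coefficients and
# limits are illustrative placeholders, not the paper's tuned values.
LAMBDA, BETA, ETA, ALPHA, KAPPA = 100.0, -0.05, -50.0, -0.001, 0.2
V_LIM, ANG_LIM, OMEGA_LIM, I_SP = 2.0, np.deg2rad(5.0), 0.1, 225.0

def step_reward(h, v, v_ref, angles, omega, thrusts):
    """Sum of r_goal, r_vel, r_crash, r_fuel and r_constant for one step.

    h: altitude; v, v_ref: velocity and reference velocity vectors;
    angles: (phi, theta, psi); omega: angular rate vector;
    thrusts: thrust of each of the six engines.
    """
    phi, theta, psi = np.abs(angles)
    soft_landing = (h < 0 and v[2] < 0 and np.linalg.norm(v) < V_LIM
                    and phi < ANG_LIM and theta < ANG_LIM and psi < ANG_LIM
                    and np.linalg.norm(omega) < OMEGA_LIM)
    crashed = phi > ANG_LIM or theta > ANG_LIM or psi > ANG_LIM
    r_goal = LAMBDA * soft_landing                         # terminal success bonus
    r_vel = BETA * np.linalg.norm(np.subtract(v, v_ref))   # tracking penalty (<= 0)
    r_crash = ETA * crashed                                # attitude-violation penalty
    r_fuel = ALPHA * np.sum(thrusts) / I_SP                # fuel consumption penalty
    return float(r_goal + r_vel + r_crash + r_fuel + KAPPA)
```

Because `BETA`, `ETA`, and `ALPHA` are negative, the tracking, crash, and fuel terms are penalties, while the constant term `KAPPA` keeps a non-terminal, well-behaved step slightly positive, matching the design rationale above.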

#### 3.2. Observation Space

#### 3.3. Action Space

#### 3.4. Network Architecture

## 4. Simulation Results and Discussion

#### 4.1. Simulation Settings

#### 4.2. Simulation Results

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

Abbreviation | Meaning
---|---
DRL | Deep reinforcement learning
DDPG | Deep deterministic policy gradient
TD3 | Twin delayed DDPG
SAC | Soft actor critic
6DOF | 6 degrees of freedom
NLP | Nonlinear programming

Symbol | Meaning
---|---
$\mathit{g}$ | planet gravity
${T}_{i}$ | thrust of each engine
${T}_{min}$ | minimum thrust of each engine
${T}_{max}$ | maximum thrust of each engine
$\varphi $ | cant angle of the engines
$\mathit{T}$ | thrust vector composed of each engine
${\mathit{T}}_{b}$ | thrust vector in lander body frame
${\mathit{M}}_{b}$ | torque vector in lander body frame
$\mathit{r}$ | surface-relative lander position vector
${\mathit{v}}_{\mathrm{b}}$ | velocity vector
${\mathit{C}}_{\mathrm{b}}^{\mathrm{e}},{\mathit{C}}_{\mathrm{e}}^{\mathrm{b}}$ | direction cosine matrices between lander body frame and planet surface frame
$\omega $ | angular rate vector in lander body frame
$\mathbf{Q}$ | attitude quaternion
$m$ | lander mass
$\mathit{I}$ | inertia matrix of the lander
${I}_{\mathrm{sp}}$ | specific impulse of the thrusters
$a,b,c$ | side lengths of the lander
$J$ | optimization objective
${G}_{t}$ | accumulated return of the agent in an episode
${v}_{lim},{\varphi}_{lim},{\psi}_{lim},{\theta}_{lim},{\omega}_{lim}$ | limits on velocity, attitude angles, and angular rate for a soft landing
${\mathit{v}}_{ref}$ | reference velocity
$\lambda ,\alpha ,\beta ,\eta ,\kappa $ | reward coefficients
${T}_{s}$ | simulation sample time
${T}_{f}$ | maximum simulation time in one episode


**Figure 4.** Network architecture. (**a**) Value function network. (**b**) Policy network of DDPG and TD3. (**c**) Policy network of SAC.

**Figure 6.** The curves of velocity deviation in the velocity tracking experiments. (**a**) DDPG. (**b**) TD3. (**c**) SAC.

**Figure 7.** Landing trajectories of the lander under the control of the trained velocity controller. (**a**) DDPG. (**b**) TD3. (**c**) SAC.

Parameters | Values
---|---
${m}_{0}$ | 1700 kg
$a\times b\times c$ | $3\,\mathrm{m}\times 3\,\mathrm{m}\times 1\,\mathrm{m}$
${I}_{sp}$ | 225 s
${T}_{max}$ | 2880 N
${T}_{min}$ | 1080 N
$L$ | 2 m
$\varphi $ | 27°
${T}_{s}$ | 0.2 s
${T}_{f}$ | 50 s
${k}_{v1}$ | 20
${k}_{v2}$ | 20
$\delta {v}_{\mathrm{bx}0}$ | $[-3,3]\ \mathrm{m/s}$
$\delta {v}_{\mathrm{by}0}$ | $[-3,3]\ \mathrm{m/s}$
$\delta {v}_{\mathrm{bz}0}$ | $[-12,5]\ \mathrm{m/s}$

Parameters | Values
---|---
$\sigma $ | 0.02
$\left|D\right|$ | ${10}^{6}$
$\rho $ | 0.998
Learning rate of policy networks | ${10}^{-3}$
Learning rate of value function networks | $5\times {10}^{-4}$

Parameters | Values
---|---
${\sigma}_{1}$ | 0.02
${\sigma}_{2}$ | 0.02
$c$ | 0.02
$\left|D\right|$ | ${10}^{6}$
$\rho $ | 0.995
Learning rate of policy networks | ${10}^{-3}$
Learning rate of value function networks | ${10}^{-3}$

Parameters | Values
---|---
$\left|D\right|$ | ${10}^{6}$
$\rho $ | 0.995
$\alpha $ | 0.05
Learning rate of policy networks | ${10}^{-3}$
Learning rate of value function networks | ${10}^{-3}$

Parameters | DDPG | TD3 | SAC
---|---|---|---
Number of experiments | 100 | 100 | 100
Success rate of soft landing | $74\%$ | $92\%$ | $96\%$



## Share and Cite

**MDPI and ACS Style**

Xu, X.; Chen, Y.; Bai, C.
Deep Reinforcement Learning-Based Accurate Control of Planetary Soft Landing. *Sensors* **2021**, *21*, 8161.
https://doi.org/10.3390/s21238161
