# Deep Dyna-Q for Rapid Learning and Improved Formation Achievement in Cooperative Transportation


## Abstract


## 1. Introduction

## 2. OpenAI Gym for a Simulated System

## 3. Configuration of Deep Dyna-Q for Formation Change

#### 3.1. Deep Dyna-Q

Deep Dyna-Q augments direct reinforcement learning with planning on a learned model of the environment. The model approximates the transition probability function

$$T(s_{t+1} \mid s_t, a_t) = p,$$

where $s_t$ is the current state, $s_{t+1}$ is the next state, $a_t$ is the action taken from $s_t$, $p$ is the transition probability, and $T$ is the transition probability function.

#### 3.2. Learning Method and Neural Network Construction

The input to the network consists of $P(x,y)$, $V(x,y)$, $G_{1-3}(x,y)$, $R_{1-2}(x,y)$, and $L_{1-16}$. $P(x,y)$ and $V(x,y)$ denote the position and velocity of the robots, respectively. $G_{1-3}(x,y)$ and $R_{1-2}(x,y)$ denote the relative coordinates of the target points and of the other robots. $L_{1-16}$ are the inputs from the virtual lidar sensor lines used to measure the distances between the robots and the transport objects, as shown in Figure 10. Each lidar sensor line is treated as one lidar ($L_i$). As Figure 10 shows, three lidar lines detect the transport object.

## 4. Simulation Results

#### 4.1. Transition of Rewards

#### 4.2. Discount Rate Tuning

#### 4.3. Learning Rate Tuning

## 5. Experiment for Cooperative Transportation

#### 5.1. Robot Mechanism

$v_{1i}$, $v_{2i}$, and $v_{3i}$ are the velocities of each wheel on the $i$th robot. The parallel velocities of each robot along the x-axis and y-axis are denoted by $v_{iX}$ and $v_{iY}$, respectively, and $\omega_i$ is the angular velocity. The relationship between each robot's wheel velocities and the parallel velocities is given by the kinematic relation below.

#### 5.2. Model Error Compensator (MEC)

$P_n(s)$ is the nominal model, and $C(s)$ is the compensator. The robot inputs are $\omega_{iref}$ and $v_{iref}$, where $\omega_{iref}$ is the reference angular velocity of the $i$th robot. The reference velocities along the x-axis and y-axis for each robot are given by (3). $v_{in}$ and $\omega_{in}$ are the outputs of the nominal model $P_n(s)$. The MEC was designed to compensate for the model error between the actual plant $P(s)$ and the nominal model $P_n(s)$, as in (4); the compensated inputs applied to each robot are $\omega_i$ and $v_i$. The effectiveness of the MEC was evaluated by comparing the robots with the MEC applied against the same robots without it.

#### 5.3. Virtual Lidar and Actual Experiment

## 6. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References


**Figure 1.** Initial formation: The arrows represent the length and width of the cooperative transportation area, which is 4 m × 3 m.

**Figure 5.** OpenAI Gym random layout 1: The robots’ start positions are shown by the yellow, red, and blue circles, and the target set points by the black circles.

**Figure 6.** OpenAI Gym random layout 2: The robots’ start positions are shown by the yellow, red, and blue circles, and the target set points by the black circles.

**Figure 14.** Trajectories of Deep Dyna-Q with discount rate of 0.85: The robots’ initial points are marked by the rectangle markers, while the target points are marked by the circle markers numbered 1–3.

**Figure 15.** Trajectories of Deep Dyna-Q with discount rate of 0.9: The robots’ initial points are marked by the rectangle markers, while the target points are marked by the circle markers numbered 1–3.

**Figure 16.** Trajectories of Deep Dyna-Q with discount rate of 0.95: The robots’ initial points are marked by the rectangle markers, while the target points are marked by the circle markers numbered 1–3.

**Figure 17.** Trajectories of Deep Dyna-Q with learning rate of 0.001: The robots’ initial points are marked by the rectangle markers, while the target points are marked by the circle markers numbered 1–3.

**Figure 18.** Trajectories of Deep Dyna-Q with learning rate of 0.01: The robots’ initial points are marked by the rectangle markers, while the target points are marked by the circle markers numbered 1–3.

**Figure 19.** Trajectories of Deep Dyna-Q with learning rate of 0.04: The robots’ initial points are marked by the rectangle markers, while the target points are marked by the circle markers numbered 1–3.

**Figure 20.** Trajectories of Deep Dyna-Q with learning rate of 0.08: The robots’ initial points are marked by the rectangle markers, while the target points are marked by the circle markers numbered 1–3.

**Figure 27.** Trajectories of simulation experiment: The robots’ initial points are marked by the rectangle markers, while the target points are marked by the circle markers numbered 1–3.

**Figure 28.** Trajectories of actual experiment with the MEC: The robots’ initial points are marked by the rectangle markers, while the target points are marked by the circle markers numbered 1–3.

**Figure 29.** Trajectories of actual experiment without the MEC: The robots’ initial points are marked by the rectangle markers, while the target points are marked by the circle markers numbered 1–3.

**Figure 30.** Trajectories of actual experiment with the MEC on layout 1: The robots’ initial points are marked by the rectangle markers, while the target points are marked by the circle markers numbered 1–3.

**Figure 31.** Trajectories of actual experiment with the MEC on layout 2: The robots’ initial points are marked by the rectangle markers, while the target points are marked by the circle markers numbered 1–3.

| Evaluation | Discount rate 0.85 | Discount rate 0.9 | Discount rate 0.95 |
|---|---|---|---|
| Formation change achievement rate | 100% | 100% | 100% |
| Number of collisions among agents | 0 | 0 | 0 |
| Number of collisions between agents and transport objects | 1.35 | 1.01 | 1.23 |

DDQ (x) denotes Deep Dyna-Q trained with learning rate x.

| Evaluation | DDQ (0.001) | DDQ (0.01) | DDQ (0.04) | DDQ (0.08) | DQN |
|---|---|---|---|---|---|
| Formation change achievement rate | 94% | 100% | 72% | 10% | 91% |
| Number of collisions among agents | 0 | 0 | 0.76 | 0.92 | 0.06 |
| Number of collisions between agents and transport objects | 2.15 | 1.01 | 1.75 | 3.01 | 3.69 |


