# Cascaded Fuzzy Reward Mechanisms in Deep Reinforcement Learning for Comprehensive Path Planning in Textile Robotic Systems


## Abstract


## 1. Introduction

## 2. Background

#### 2.1. Fuzzy Reward System

#### 2.2. End-to-End DDPG

**Algorithm 1** End-to-End DDPG Algorithm

1. **Initialize:** Actor network $A(\theta^{A})$, Critic network $C(\theta^{C})$
2. **Initialize:** Target networks $A'(\theta^{A'})$, $C'(\theta^{C'})$
3. **Initialize:** Replay buffer $R$
4. **Initialize:** Learning rate $\alpha$, discount factor $\gamma$
5. **for** each episode **do**
6. Observe initial state $s_{0}$
7. **for** each step $t$ **do**
8. Extract features $f_{t}$ from state $s_{t}$
9. Select action $a_{t} = A(f_{t}\,|\,\theta^{A}) + \text{noise}$
10. Execute action $a_{t}$; observe reward $r_{t}$ and new state $s_{t+1}$
11. Adjust reward $r_{t}$ using the fuzzy reward system
12. Store transition $(s_{t}, a_{t}, r_{t}, s_{t+1})$ in $R$
13. Sample a mini-batch of $N$ transitions $(s, a, r, s')$ from $R$
14. Update the Critic by minimizing the loss $L = \frac{1}{N}\sum \big(r + \gamma\, C'(s', A'(s'\,|\,\theta^{A'})\,|\,\theta^{C'}) - C(s, a\,|\,\theta^{C})\big)^{2}$
15. Update the Actor using the sampled policy gradient: $\nabla_{\theta^{A}} J \approx \frac{1}{N}\sum \nabla_{a} C(s, a\,|\,\theta^{C})\, \nabla_{\theta^{A}} A(s\,|\,\theta^{A})$
16. Soft-update the targets: $\theta^{A'} \leftarrow \tau\theta^{A} + (1-\tau)\theta^{A'}$, $\theta^{C'} \leftarrow \tau\theta^{C} + (1-\tau)\theta^{C'}$
17. **end for**
18. **end for**
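The update rules of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the linear actor/critic, the toy transition dynamics, the mini-batch of size 1, and the `fuzzy_adjust` stub standing in for the fuzzy reward system are all assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM = 4, 2
GAMMA, TAU, LR = 0.99, 0.005, 1e-3

# Linear stand-ins for the actor/critic networks (illustrative only)
theta_A = rng.normal(0, 0.1, (STATE_DIM, ACTION_DIM))    # actor weights
theta_C = rng.normal(0, 0.1, (STATE_DIM + ACTION_DIM,))  # critic weights
theta_A_t, theta_C_t = theta_A.copy(), theta_C.copy()    # target networks

def actor(s, w):
    return s @ w

def critic(s, a, w):
    return np.concatenate([s, a]) @ w

def fuzzy_adjust(r):
    # Stand-in for the fuzzy reward system (Algorithm 1, line 11)
    return np.clip(r, -1.0, 1.0)

replay = []
s = rng.normal(size=STATE_DIM)
for t in range(200):
    a = actor(s, theta_A) + 0.1 * rng.normal(size=ACTION_DIM)  # exploration noise
    s_next = rng.normal(size=STATE_DIM)        # toy transition dynamics
    r = fuzzy_adjust(-np.linalg.norm(s_next))  # fuzzy-shaped reward
    replay.append((s, a, r, s_next))

    # Sample a transition and form the TD target with the target networks
    si, ai, ri, sn = replay[rng.integers(len(replay))]
    y = ri + GAMMA * critic(sn, actor(sn, theta_A_t), theta_C_t)
    td = y - critic(si, ai, theta_C)
    theta_C += LR * td * np.concatenate([si, ai])  # critic gradient step

    # Deterministic policy gradient: dQ/da propagated into the actor weights
    dq_da = theta_C[STATE_DIM:]
    theta_A += LR * np.outer(si, dq_da)

    # Soft target updates (Algorithm 1, line 16)
    theta_A_t = TAU * theta_A + (1 - TAU) * theta_A_t
    theta_C_t = TAU * theta_C + (1 - TAU) * theta_C_t
    s = s_next

print(theta_A.shape, theta_C.shape)
```

In the paper's setting, the linear maps would be replaced by the deep actor/critic networks and the stub by the cascaded fuzzy reward system; the flow of transitions, targets, and soft updates is the same.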

## 3. Cascaded Fuzzy Reward System (CFRS)

#### 3.1. Additional Factors

#### 3.2. Fuzzy System

#### 3.3. Cascaded Structure

## 4. Simulation Environment and Tasks

## 5. Experiments

#### 5.1. Baseline Comparison with End-to-End DDPG

The baseline experiments were conducted separately on trajectories 1, 2, and 3, named BS$^{1}$, BS$^{2}$, and BS$^{3}$, respectively. These three segments were sequentially concatenated to form BS$^{0}$, whose experimental results were determined solely from the data of BS$^{1}$, BS$^{2}$, and BS$^{3}$, without independent experiments. Subsequently, the end-to-end DDPG described in Section 2 was trained on trajectories 1, 2, and 3, yielding ETE$^{1}$, ETE$^{2}$, and ETE$^{3}$, respectively. Finally, end-to-end DDPG was used for comprehensive training on trajectories 1, 2, and 3 together, yielding ETE$^{0}$. The parameters of the DDPG are provided in Table 5.

The result for BS$^{0}$ in the table is the sum of the time taken by the three baseline segments, with its success rate and collision rate being the averages of the three baseline performances. In the test scenarios, our end-to-end DRL model achieved a success rate of 90.1% and a collision rate of only 5.7%. Despite a slight increase in training time, the end-to-end DDPG showed significant advantages in task success rate and collision reduction over the baseline model. This can be attributed to the deep network's precise mapping of state–action relationships and its efficient execution of more complex tasks. Next, we comprehensively assess the model's adaptability to environmental changes in complex environments.
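The aggregation rule for BS$^{0}$ described above (segment times summed, success and collision rates averaged) can be made concrete with a short sketch. The segment numbers below are placeholders for illustration, not the paper's measured values.

```python
# Each segment: (training_time_h, success_rate, collision_rate) -- illustrative values
segments = [
    (2.0, 0.90, 0.06),  # BS1
    (2.5, 0.88, 0.07),  # BS2
    (1.5, 0.92, 0.05),  # BS3
]

# BS0: total time is the sum; rates are the mean over the three segments
time_bs0 = sum(t for t, _, _ in segments)
succ_bs0 = sum(s for _, s, _ in segments) / len(segments)
coll_bs0 = sum(c for _, _, c in segments) / len(segments)
print(time_bs0, round(succ_bs0, 3), round(coll_bs0, 3))
```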

#### 5.2. Generalization Ability

#### 5.3. Cascade Fuzzy Reward System

#### 5.4. Real World Experiment

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest


**Figure 3.** RGB-D images captured using Kinect in the simulation environment; (**a**,**c**,**e**) are RGB images and (**b**,**d**,**f**) are grayscale maps representing depths.

**Figure 5.** Output of the fuzzy logic system during the initial stage: the left figure shows the first layer and the right figure the second layer.

**Figure 6.** Output of the fuzzy logic system during the mid-course obstacle-avoidance stage: the left figure shows the first layer and the right figure the second layer.

**Figure 7.** Output of the fuzzy logic system during the alignment and placement stage: the left figure shows the first layer and the right figure the second layer.

| Input2 \ Input1 | VG | G | M | B | VB |
|---|---|---|---|---|---|
| VG | $R_{1}$ | $R_{2}$ | $R_{3}$ | $R_{4}$ | $R_{5}$ |
| G | $R_{6}$ | $R_{7}$ | $R_{8}$ | $R_{9}$ | $R_{10}$ |
| M | $R_{11}$ | $R_{12}$ | $R_{13}$ | $R_{14}$ | $R_{15}$ |
| B | $R_{16}$ | $R_{17}$ | $R_{18}$ | $R_{19}$ | $R_{20}$ |
| VB | $R_{21}$ | $R_{22}$ | $R_{23}$ | $R_{24}$ | $R_{25}$ |
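A single inference layer over such a 5 × 5 rule table can be sketched as follows. The triangular membership functions, the term centers on $[0, 1]$, and the weighted-centroid defuzzification are assumptions made for illustration; the rule matrix is copied from the first stage-specific first-layer table below (rows indexed by $r_{lim}$, columns by $r_{aff}$).

```python
import numpy as np

TERMS = ["VG", "G", "M", "B", "VB"]
CENTERS = np.array([1.0, 0.75, 0.5, 0.25, 0.0])  # assumed term centers on [0, 1]

def memberships(x):
    """Triangular memberships of crisp x in the five linguistic terms."""
    return np.clip(1.0 - np.abs(x - CENTERS) / 0.25, 0.0, None)

# Rule table: rows = r_lim terms (VG..VB), columns = r_aff terms (VG..VB)
RULES = [
    ["VG", "VG", "G", "M", "B"],
    ["VG", "G",  "G", "M", "B"],
    ["G",  "G",  "M", "M", "VB"],
    ["M",  "M",  "M", "B", "VB"],
    ["B",  "B",  "B", "VB", "VB"],
]

def infer(r_aff, r_lim):
    """Mamdani-style inference: product t-norm, weighted-centroid defuzzification."""
    mu_a, mu_l = memberships(r_aff), memberships(r_lim)
    num = den = 0.0
    for i in range(5):          # r_lim terms
        for j in range(5):      # r_aff terms
            w = mu_l[i] * mu_a[j]                    # rule firing strength
            num += w * CENTERS[TERMS.index(RULES[i][j])]
            den += w
    return num / den if den else 0.0

print(round(infer(0.9, 0.9), 3))
```

Cascading means the crisp output of one such layer becomes an input to the next, so each stage's reward emerges from two chained 5 × 5 tables rather than one 5⁴ table.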

| $r_{lim}$ \ $r_{aff}$ | VG | G | M | B | VB |
|---|---|---|---|---|---|
| VG | VG | VG | G | M | B |
| G | VG | G | G | M | B |
| M | G | G | M | M | VB |
| B | M | M | M | B | VB |
| VB | B | B | B | VB | VB |

| $r_{pos}$ \ $r_{t}$ | VG | G | M | B | VB |
|---|---|---|---|---|---|
| VG | VG | VG | VG | G | M |
| G | VG | VG | G | M | B |
| M | G | G | M | B | VB |
| B | M | M | B | VB | VB |
| VB | B | B | VB | VB | VB |

| $r_{lim}$ \ $r_{aff}$ | VG | G | M | B | VB |
|---|---|---|---|---|---|
| VG | VG | VG | G | M | M |
| G | VG | VG | G | M | B |
| M | VG | G | M | B | VB |
| B | G | M | B | VB | VB |
| VB | M | M | B | VB | VB |

| $r_{pos}$ \ $r_{t}$ | VG | G | M | B | VB |
|---|---|---|---|---|---|
| VG | VG | VG | VG | VG | G |
| G | VG | VG | G | G | M |
| M | G | G | M | M | B |
| B | M | M | B | VB | VB |
| VB | B | B | VB | VB | VB |

| $r_{lim}$ \ $r_{aff}$ | VG | G | M | B | VB |
|---|---|---|---|---|---|
| VG | VG | VG | G | M | B |
| G | VG | VG | M | B | B |
| M | VG | G | M | B | VB |
| B | G | M | B | VB | VB |
| VB | M | M | VB | VB | VB |

| $r_{pos}$ \ $r_{t}$ | VG | G | M | B | VB |
|---|---|---|---|---|---|
| VG | VG | VG | VG | G | M |
| G | VG | VG | G | M | B |
| M | VG | G | M | B | VB |
| B | G | M | B | VB | VB |
| VB | M | M | VB | VB | VB |

**Table 5.** Parameters of the end-to-end DDPG.

| Parameter | Value |
|---|---|
| Action space dimensions | 6 |
| Training episodes | 1000 |
| Maximum steps | 2000 |
| Learning rate | 0.003 |
| Discount factor $\gamma$ | 0.99 |
| Exploration factor $\epsilon$ | 0.9 |
| Soft update factor $\tau$ | 0.005 |
| Batch size | 64 |
| Exploration noise | OU noise ($\mu^{ou} = 0$, $\theta^{ou} = 0.2$) |
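The exploration noise listed above is an Ornstein-Uhlenbeck process with $\mu^{ou} = 0$ and $\theta^{ou} = 0.2$. A minimal sketch follows; the `sigma`, `dt`, and seed values are assumptions, since they are not specified in the table.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process for temporally correlated exploration noise."""

    def __init__(self, dim, mu=0.0, theta=0.2, sigma=0.1, dt=1.0, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(dim, mu, dtype=float)
        self.rng = np.random.default_rng(seed)

    def sample(self):
        # dx = theta * (mu - x) dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * self.rng.normal(size=self.x.shape))
        self.x += dx
        return self.x.copy()

noise = OUNoise(dim=6)  # six-dimensional action space, as in Table 5
samples = np.array([noise.sample() for _ in range(1000)])
print(samples.shape, round(float(samples.mean()), 2))
```

The mean-reverting drift pulls the noise back toward $\mu^{ou}$, which gives smoother exploration than independent Gaussian noise on a continuous action space.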

| Number of obstacles | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|
| Baseline success rate | 0.892 | 0.765 | 0.643 | 0.562 | 0.513 |
| End-to-end success rate | 0.940 | 0.931 | 0.924 | 0.901 | 0.876 |


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Zhao, D.; Ding, Z.; Li, W.; Zhao, S.; Du, Y.
Cascaded Fuzzy Reward Mechanisms in Deep Reinforcement Learning for Comprehensive Path Planning in Textile Robotic Systems. *Appl. Sci.* **2024**, *14*, 851.
https://doi.org/10.3390/app14020851
