# Toward Self-Driving Bicycles Using State-of-the-Art Deep Reinforcement Learning Algorithms


## Abstract


## 1. Introduction

## 2. Background

#### 2.1. Reinforcement Learning and Policy Gradient Based Method

#### 2.2. Deep Deterministic Policy Gradient Algorithm

## 3. Extended Bicycle Dynamics

## 4. Method to Control Bicycle

#### 4.1. Network Structure

#### 4.2. Network Training

**Critic network training**. The critic networks include a main critic network ($Q$) parameterized by ${\theta}^{Q}$ and a target critic network (${Q}^{\prime}$) parameterized by ${\theta}^{{Q}^{\prime}}$. At every discrete time step, the main network is updated using a batch of samples drawn from the replay memory. In particular, ${\theta}^{Q}$ is optimized to minimize the following loss function:
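The critic update can be sketched in plain NumPy (a minimal illustration, not the authors' implementation; `next_q` is a hypothetical stand-in for the target networks' output ${Q}^{\prime}({s}_{t+1},{\mu}^{\prime}({s}_{t+1}))$):

```python
import numpy as np

def critic_targets(rewards, next_q, dones, gamma=0.99):
    """TD targets y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})); no bootstrap at terminal states."""
    return rewards + gamma * (1.0 - dones) * next_q

def critic_loss(q_values, targets):
    """Mean-squared Bellman error, minimized with respect to theta^Q."""
    return float(np.mean((targets - q_values) ** 2))

# Toy batch of N = 3 transitions
rewards = np.array([1.0, 0.0, -1.0])
next_q = np.array([2.0, 2.0, 2.0])   # target-network estimates for s_{t+1}
dones = np.array([0.0, 0.0, 1.0])    # last transition is terminal

y = critic_targets(rewards, next_q, dones)        # [2.98, 1.98, -1.0]
loss = critic_loss(np.array([1.0, 1.0, 1.0]), y)  # averaged over the batch
```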

**Actor network training**. Similarly, actor networks include a main actor network ($\mu $) parameterized by ${\theta}^{\mu}$ and a target actor network (${\mu}^{\prime}$) parameterized by ${\theta}^{{\mu}^{\prime}}$. The main actor network is updated using the deterministic policy gradient theorem [15] as follows:
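The shape of the actor update can be shown with a toy linear actor and critic (hypothetical stand-ins for the actual networks, chosen so the gradients are easy to follow by hand); the `soft_update` helper is the standard DDPG target-network update with rate $\tau$:

```python
import numpy as np

# Toy linear actor mu(s) = theta_mu * s and critic Q(s, a) = w_s*s + w_a*a
theta_mu, w_s, w_a = 0.5, 1.0, 2.0
states = np.array([1.0, 2.0, 3.0])

# Deterministic policy gradient:
#   (1/N) * sum_i grad_a Q(s_i, a)|_{a=mu(s_i)} * grad_theta mu(s_i)
# For this linear pair, grad_a Q = w_a and grad_theta mu = s, so:
policy_grad = np.mean(w_a * states)  # ascend this direction to improve the actor

def soft_update(theta_target, theta_main, tau=0.001):
    """Target networks slowly track the main networks: theta' <- tau*theta + (1 - tau)*theta'."""
    return tau * theta_main + (1.0 - tau) * theta_target
```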

#### 4.3. Learning Process

**(1)** The agent observes state ${s}_{t}$ from the bicycle and feeds it to the actor network (step [1] in Figure 8), which estimates the next action ${a}_{t}$ (step [2] in Figure 8) as follows:

**(2)** The bicycle transitions to the next state ${s}_{t+1}$ and returns reward ${r}_{t}$ to the agent.

**(3)** The sampled transition $\left({s}_{t},{a}_{t},{r}_{t},{s}_{t+1}\right)$ is then stored in the experience replay memory for later use (step [3] in Figure 8).

**(4)** From the experience replay memory, we randomly select a batch of N samples and use them to train the networks (step [4] in Figure 8).

**(5)** Train the critic network (steps [5] and [6] in Figure 8) by minimizing the loss function in Equation (14).
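The steps above form a single interaction loop, sketched below. The `actor` policy and `env_step` dynamics are trivial placeholders (not the paper's networks or bicycle model); the replay-memory capacity and batch size follow the hyperparameter table later in the paper:

```python
import random

class ReplayMemory:
    """Fixed-capacity experience replay storing (s_t, a_t, r_t, s_{t+1}) tuples."""
    def __init__(self, capacity=500_000):
        self.capacity, self.buffer = capacity, []

    def store(self, transition):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)  # discard the oldest transition
        self.buffer.append(transition)

    def sample(self, n=64):
        return random.sample(self.buffer, min(n, len(self.buffer)))

def actor(state):                # steps [1]-[2]: placeholder policy mu(s)
    return -0.1 * state

def env_step(state, action):     # placeholder dynamics and reward
    next_state = state + action
    return next_state, -abs(next_state)

memory = ReplayMemory()
state = 1.0
for t in range(200):
    action = actor(state)                               # steps [1]-[2]
    next_state, reward = env_step(state, action)        # environment transition
    memory.store((state, action, reward, next_state))   # step [3]
    batch = memory.sample(64)                           # step [4]
    # steps [5]-[6]: train the critic on `batch` by minimizing Equation (14),
    # then update the actor with the deterministic policy gradient
    state = next_state
```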

#### 4.4. Reward Function

## 5. Experiments

#### 5.1. Settings

#### 5.2. Baselines

#### 5.3. Results and Discussion

#### 5.3.1. Simulation without Controlling the Velocity

**Comparison with baselines**. In the first evaluation, we compare the performance of a bicycle trained with the DDPG algorithm against bicycles trained with other algorithms, namely NAF (Normalized Advantage Function) and PPO (Proximal Policy Optimization). Both are state-of-the-art deep reinforcement learning algorithms that can handle continuous action spaces. NAF, like DDPG, learns a deterministic policy, which is expected to have lower variance and more predictable performance than a stochastic policy such as the one PPO optimizes. Figure 10 shows the performance of a bicycle that starts at $(50,50)$ m and must reach a goal at $(60,65)$ m. The speed of the bicycle is fixed at 10 km/h, and the displacement d ranges from −20 cm to 20 cm. We report the performance averaged over three runs of 5000 episodes, each with 400 time steps.

**Fixed start-fixed goal**. The second evaluation shows the performance of the bicycle for different values of d. The bicycle starts at $(50,50)$ m with a random direction and must reach a goal at $(60,65)$ m. The speed is fixed at 10 km/h, and we train for 5000 episodes of 400 time steps. The result reported in Figure 12a shows that an agent that uses a large displacement to adjust the center of mass outperforms both an agent that uses a small displacement and an agent without displacement (steering the handlebar only). Intuitively, without adjusting d, the bicycle easily falls down when it tries to turn at high speed. Figure 12 shows the trajectories of the back wheel of the bicycle during the learning process. Over this time, the bicycle gradually learns to reach the goal (blue lines are early trajectories; red lines are late trajectories). However, a bicycle that starts facing away from the goal location often falls down.

**Random start-fixed goal**. The next evaluation is performed for a bicycle that starts at a random location and learns to reach a goal at $(150,100)$ m. After around 700 episodes, the bicycle reaches the goal from almost any starting location. The reward converges to zero, and the average number of steps to reach the goal decreases from the initial 4000 to around 1000 (Figure 13a). Figure 13 shows the trajectories of the bicycle during the learning process.

**Random start-random goal**. In the last evaluation of this section, we report the performance of a controller trained to start at a random location and reach a random goal. Figure 14a shows the same behavior as in the previous evaluations: the average cumulative reward converges to zero and the average number of steps decreases. Figure 14b shows 100 trajectories of the trained bicycle starting at random locations and reaching pre-defined random goals. In all cases, the bicycle reaches the goal.

#### 5.3.2. Simulation with Controlling the Velocity

**Learning to turn**. Controlling the force applied to the pedal is necessary to adjust the velocity of the bicycle. This evaluation shows the efficiency of a controller that simultaneously learns the pedal force. The bicycle in this evaluation uses only a small displacement (from −2 cm to 2 cm) and randomly starts at a position facing the opposite direction of the goal at $(60,65)$ m. We compare a controller that learns the velocity with one that does not. The performance is reported over 10,000 episodes of 500 steps each, with an initial velocity of 2.5 m/s. Figure 15a shows the average reward, while Figure 15b,c show the trajectories of the bicycle during the learning process. From the figure, we can see that learning the pedal force helps the bicycle turn when it is facing away from the goal. The way the bicycle turns by adjusting the pedal force is shown in Figure 15d. If the bicycle faces away from the goal, it reduces the pedal force to lower its speed; at low speed, the bicycle can turn easily. Once it faces the goal, it increases its speed to reach the goal as quickly as possible. If the bicycle already faces the goal, it simply increases its speed gradually and heads toward the goal.

**Random start-random goal-learned velocity**. In the final evaluation, we increase the complexity of the domain’s state space by allowing the bicycle to start at any location with any direction and requiring it to reach a random goal. The speed of the bicycle ranges from 1 m/s to 5 m/s and is adjusted by the force applied to the pedal, which ranges from $-100$ N to 100 N. The initial speed is 2 m/s. We average the performance of the controller over three runs of 10,000 episodes. The results in Figure 16 indicate that the controller gradually improves: the average reward increases while the average number of steps and the distance to the goal decrease. The speed of the bicycle decreases to 2 m/s in the first 100 steps, then increases slightly before stabilizing at around 2 m/s. Intuitively, the trained controller decreases the velocity to keep the bicycle stable while turning. However, due to the curse of dimensionality, the controller needs to explore more of the state and action spaces, requiring training over millions of episodes to obtain an optimal controller.

## 6. Conclusions and Future Works

## Author Contributions

## Funding

## Conflicts of Interest

## References

- Nederland, G. Introducing the self-driving bicycle in The Netherlands. Available online: https://www.youtube.com/watch?v=LSZPNwZex9s (accessed on 10 December 2018).
- Keo, L.; Yamakita, M. Controlling balancer and steering for bicycle stabilization. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, St. Louis, MO, USA, 10–15 October 2009; pp. 4541–4546.
- Meijaard, J.P.; Papadopoulos, J.M.; Ruina, A.; Schwab, A.L. Linearized dynamics equations for the balance and steer of a bicycle: a benchmark and review. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences; The Royal Society: London, UK, 2007; Volume 463, pp. 1955–1982.
- Schwab, A.; Meijaard, J.; Kooijman, J. Some recent developments in bicycle dynamics. In Proceedings of the 12th World Congress in Mechanism and Machine Science; Russian Academy of Sciences: Moscow, Russia, 2007.
- Tan, J.; Gu, Y.; Liu, C.K.; Turk, G. Learning bicycle stunts. ACM Trans. Gr. (TOG) **2014**, 33, 50.
- Lu, M.; Li, X. Deep reinforcement learning policy in Hex game system. In Proceedings of the IEEE Chinese Control And Decision Conference (CCDC), Shenyang, China, 9–11 June 2018; pp. 6623–6626.
- Bejar, E.; Moran, A. Deep reinforcement learning based neuro-control for a two-dimensional magnetic positioning system. In Proceedings of the IEEE 4th International Conference on Control, Automation and Robotics (ICCAR), Auckland, New Zealand, 20–23 April 2018; pp. 268–273.
- Yasuda, T.; Ohkura, K. Collective behavior acquisition of real robotic swarms using deep reinforcement learning. In Proceedings of the Second IEEE International Conference on Robotic Computing (IRC), Laguna Hills, CA, USA, 31 January–2 February 2018; pp. 179–180.
- Randløv, J.; Alstrøm, P. Learning to drive a bicycle using reinforcement learning and shaping. ICML **1998**, 98, 463–471.
- Le, T.P.; Chung, T.C. Controlling bicycle using deep deterministic policy gradient algorithm. In Proceedings of the 14th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI), Jeju, Korea, 28 June–1 July 2017; pp. 413–417.
- Sutton, R.; Barto, A. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998; Volume 1.
- Peters, J.; Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural Netw. **2008**, 21, 682–697.
- Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv **2015**, arXiv:1509.02971.
- Le, T.P.; Quang, N.D.; Choi, S.; Chung, T. Learning a self-driving bicycle using deep deterministic policy gradient. In Proceedings of the 18th International Conference on Control, Automation and Systems (ICCAS), Pyeongchang, Korea, 17–20 October 2018; pp. 231–236.
- Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 387–395.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature **2015**, 518, 529.
- Hwang, C.L.; Wu, H.M.; Shih, C.L. Fuzzy sliding-mode underactuated control for autonomous dynamic balance of an electrical bicycle. IEEE Trans. Control Syst. Technol. **2009**, 17, 658–670.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv **2014**, arXiv:1412.6980.
- Uhlenbeck, G.E.; Ornstein, L.S. On the theory of the Brownian motion. Phys. Rev. **1930**, 36, 823.
- Gu, S.; Lillicrap, T.; Sutskever, I.; Levine, S. Continuous deep Q-learning with model-based acceleration. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 2829–2838.
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv **2017**, arXiv:1707.06347.
- Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1889–1897.

**Figure 3.** The bicycle seen from behind. The thick line represents the bicycle [9].

**Figure 4.** The bicycle seen from above. The thick line represents the front tire [9].

**Figure 5.** Axes for the moments of inertia of a tire [9].

**Figure 8.** Workflow of the deep deterministic policy gradient (DDPG) algorithm applied to the bicycle.

**Table 1.** Parameters of bicycle dynamics [9].

Notation | Description | Value
---|---|---
c | Horizontal distance between the point where the front wheel touches the ground and the CM | 66 cm
$CM$ | Center of mass of the bicycle and cyclist as a whole |
d | The agent’s choice of the displacement of the CM perpendicular to the plane of the bicycle |
${d}_{CM}$ | Vertical distance between the CM of the bicycle and that of the cyclist | 30 cm
h | Height of the CM above the ground | 94 cm
l | Distance between the points where the front and back tires touch the ground | 111 cm
${M}_{c}$ | Mass of the bicycle | 15 kg
${M}_{d}$ | Mass of a tire | 1.7 kg
${M}_{p}$ | Mass of the cyclist | 60 kg
r | Radius of a tire | 34 cm
$\dot{\sigma}$ | Angular velocity of a tire | $\dot{\sigma}=\frac{v}{r}$
T | Torque the agent applies to the handlebars |
$dt$ | Time step | 0.025 s

Name | Actor | Critic
---|---|---
Input layer | State vector (${s}_{t}$) | State vector and action vector (${s}_{t},{a}_{t}$)
1st fully-connected layer | 400 units | 200 units
2nd fully-connected layer | 300 units | 200 units
Output layer | Action vector (${a}_{t}$) | Q-value
Initial parameters | Uniform random in $\left[-3\times {10}^{-3},3\times {10}^{-3}\right]$ | Uniform random in $\left[-3\times {10}^{-3},3\times {10}^{-3}\right]$
Learning rate | 0.001 | 0.001
Optimizer | Adam [18] | Adam [18]

Name | Value
---|---
Input dimension | 6 (states)
Output dimension | 3 (actions)
Discount factor | 0.99
Random noise | Ornstein-Uhlenbeck process [19] with $\theta =0.15$ and $\sigma =0.2$
Experience memory capacity | 500,000 transitions
Batch size | 64 samples
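The Ornstein-Uhlenbeck exploration noise with these parameters can be sketched as follows (a minimal NumPy version in the style common to DDPG implementations; the zero mean and unit time step are assumptions, since the paper specifies only $\theta$ and $\sigma$):

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: x_{t+1} = x_t + theta*(mu - x_t) + sigma*N(0, 1),
    producing temporally correlated exploration noise for the three actions."""
    def __init__(self, dim=3, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.x = np.full(dim, mu)
        self.rng = np.random.default_rng(seed)

    def sample(self):
        self.x = self.x + self.theta * (self.mu - self.x) \
                 + self.sigma * self.rng.standard_normal(self.x.shape)
        return self.x
```

During training, a sample of this noise is added to the actor's output; because the process mean-reverts toward `mu`, consecutive perturbations are correlated rather than independent, which suits exploration in physical control tasks.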

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Choi, S.; Le, T.P.; Nguyen, Q.D.; Layek, M.A.; Lee, S.; Chung, T. Toward Self-Driving Bicycles Using State-of-the-Art Deep Reinforcement Learning Algorithms. *Symmetry* **2019**, *11*, 290. https://doi.org/10.3390/sym11020290