Article

Sim-to-Real: Mapless Navigation for USVs Using Deep Reinforcement Learning

1 School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China
2 Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China
3 Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang 110169, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2022, 10(7), 895; https://doi.org/10.3390/jmse10070895
Submission received: 21 April 2022 / Revised: 21 June 2022 / Accepted: 22 June 2022 / Published: 28 June 2022
(This article belongs to the Special Issue Advances in Autonomous Underwater Robotics Based on Machine Learning)

Abstract

In recent years, mapless navigation using deep reinforcement learning algorithms has shown significant advantages in improving robot motion planning capabilities. However, the majority of past works have focused on aerial and ground robotics, with very little attention being paid to unmanned surface vehicle (USV) navigation and ultimate deployment on real platforms. In response, this paper proposes a mapless navigation method based on deep reinforcement learning for USVs. Specifically, we carefully design the observation space, action space, reward function, and neural network for a navigation policy that allows the USV to reach the destination collision-free when equipped with only local sensors. Aiming at the sim-to-real transfer and slow convergence of deep reinforcement learning, this paper proposes a dynamics-free training and consistency strategy and designs domain randomization and adaptive curriculum learning. The method was evaluated using a range of tests applied to simulated and physical environments and was proven to work effectively in a real navigation environment.

1. Introduction

Unmanned surface vehicles (USVs) play an essential role in marine monitoring, emergency rescue, and environmental protection due to their strong autonomy and versatility [1,2]. As different missions will deploy them in different environments, USVs frequently face complicated and changing marine conditions. As a result, effective methods are needed to guide USVs to their target location without any collisions during the actual voyage, particularly in large-scale, harsh, and dangerous maritime environments.
There has recently been a surge of interest in mapless navigation using deep reinforcement learning (DRL) [3]. When DRL is used as a general framework for real-time autonomous navigation, a common approach is to train in a physics simulator and then transfer the policy to the real world, which accelerates learning and reduces cost [4]. However, a simulator cannot accurately reproduce real scenarios, particularly the hydrodynamics of water environments, so differences remain between the virtual environment and the real world, and policies trained in simulation often lose their effectiveness when migrated to the real environment. This is a typical problem, often referred to as sim-to-real or the reality gap, and it causes trained policies to rarely transfer successfully to the real world.
In addition, the reward function in deep reinforcement learning is often difficult to design; it is usually crafted by each researcher for the specific problem at hand, with no uniform guiding rules, and the quality of the reward design directly affects training effectiveness. Navigation, in particular, is a sparse-reward task: since there is, in most cases, only one target location, the robot rarely obtains a reward during the early stages of training, which significantly reduces learning efficiency and can even prevent convergence.
Moreover, deep reinforcement learning models are generally regarded as black boxes with fixed structures, so their generalization performance cannot be guaranteed: a model trained well for a specific task is difficult to apply to targets or environments that were not encountered during training.
The above-mentioned problems limit the wide application of reinforcement learning and other learning-based methods in the field of robotics, especially in unmanned surface vehicles. In this paper, we conduct an in-depth study on this issue and propose a deep-reinforcement-learning-based method for mapless navigation of USVs, which can directly migrate models trained in virtual environments to the real world without retraining and fine-tuning and shows excellent performance in terms of generalization, convergence, and trajectory smoothness. In summary, our contributions are mainly four-fold:
  • A dynamics-free environment for training navigation policies was developed and was experimentally shown to not only save the tedium of modeling the maritime environment, but also to reduce the learning of imprecise dynamics information and improve the training speed of deep reinforcement learning models.
  • We introduced domain randomization to improve the generalization performance and transferability of our method by randomizing not only the locations of the obstacles in each training episode, but also the types, colors, sizes, and poses of the obstacles so that the data collected during training cover a wide range of real-world variations.
  • Considering practical applications, this paper proposes a consistency strategy to improve the loss function of deep reinforcement learning so that the action commands output by the navigation policy are smoother and easier for the controller to track.
  • To address the problem of difficult convergence of deep reinforcement learning, we improved adaptive curriculum learning, making the curriculum switching more rational while reducing the convergence time of deep reinforcement learning by about 1/4.
This article is divided into six parts. Section 2 reviews the related works. Section 3 gives the problem formulation. Then, our method is elaborated in Section 4, and Section 5 reports the experimental results. Finally, in Section 6, the paper ends with a conclusion.

2. Related Works

In fact, the use of reinforcement learning for policy search in robotics (e.g., actuation, navigation, manipulation) has been studied for decades [5]. However, earlier methods were incapable of directly processing high-dimensional inputs such as images. It was not until the emergence of deep reinforcement learning [6] that it became widely used for solving motion planning problems in robotics. Nevertheless, robot navigation based on reinforcement learning usually suffers from the problems of sparse rewards and sim-to-real transfer, both of which are essential to our research; therefore, we discuss relevant work on these topics.

2.1. Sparse Rewards

For complex tasks, DRL algorithms usually require massive trial-and-error experience to converge, and some algorithms even need tens of millions of interactions with the environment to learn a successful policy, which is largely caused by sparse rewards.
The most commonly employed remedy is manual reward shaping [7], but it introduces subjective human perceptions. Another option is to improve sample efficiency [8] so as to continually acquire new knowledge, for instance through curiosity-based [9,10] and information-based [11] methods; however, excessive exploration may be pointless for tasks with a vast state space. Apart from the above methods, imitation learning [12,13] is also a related solution for improving data efficiency, but its key challenge is that the learned policy is strongly influenced by the expert policy, and even minor deviations from human demonstrations may cause a crash.
The essence of reward sparsity is that it is difficult for an agent to complete tasks by accidental means. Curriculum learning is a method of training from simple to difficult, which is analogous to human learning. It has been proven in numerous works [14,15,16,17,18] to increase the probability of task completion and alleviate the problem of sparse rewards. In order to speed up training in a complex environment, we carefully designed a series of lessons for our navigation framework, which can gradually increase the difficulty of the lesson according to the performance of the agent.

2.2. Sim-to-Real

Direct training in the real world is usually time consuming and can easily and irreversibly damage the robot. A more common approach is to learn in a physics simulator and then migrate to real robots. Unfortunately, a policy learned in a simulator often performs poorly in the real world due to the reality gap [19].
As a consequence, previous work using DRL mostly involved both training and evaluation in a virtual environment [20,21,22]. Even if a system is deployed in the real environment, it will inevitably need retraining or fine-tuning when the real world differs significantly from the simulated environment [23,24], or it will just be tested in structured indoor environments (e.g., hallways and offices) [25]. Overcoming the reality gap is challenging, but more and more researchers are committed to this problem.
One approach adopted by most studies is to build a high-fidelity simulated environment and add perturbations to enhance robustness [22,23,26,27,28]. However, this approach has the drawback of requiring the training environment to be designed in accordance with the real world, making it difficult to generalize to other environments. A related approach to bridging the reality gap is domain adaptation (DA) [29], which allows machine learning models trained on samples from a source domain to generalize to a target domain [30]; a key assumption behind this approach is that different domains share common characteristics, so representations and behaviors learned in one will be useful in others [31]. Another alternative is domain randomization (DR) [32,33,34,35], a simple technique for increasing uncertainty during training by adding noise to sensors [19,33], changing dynamics information [36,37], and adding disturbances to actuators [26]. When the simulation environment varies in many ways, the real environment may appear to the policy as just one more variation; viewed from another angle, this enhances the robustness of the reinforcement learning policy, allowing it to adapt to multiple environmental changes.
Although DRL has recently shown promise for autonomous navigation of mobile ground robots and unmanned aerial vehicles, its usefulness in unmanned surface vehicles is not yet clear. Therefore, this paper proposes a sim-to-real method for USV navigation, which is a step toward real environments.

3. Problem Modeling

In this section, we will first model the mapless navigation problem based on deep reinforcement learning and then introduce the observation space, action space, and reward function of the navigation policy.

3.1. Problem Formulation

The problem of target-driven navigation can be described as follows: In unknown environments with widely distributed terrain, the USV needs only to be equipped with local sensors to move safely to a given destination under continuous control commands. This is described by the following equation:
$a_t = \pi(x_t)$
where $x_t$ represents the raw sensor information, $\pi$ represents the mapless navigation policy, and $a_t$ represents the output command of the navigation policy. Formally, we are trying to solve the navigation policy $\pi$ using deep reinforcement learning.
The reinforcement learning problem can usually be formalized as a Markov decision process (MDP). A partially observable Markov decision process (POMDP), an extension of the MDP, describes the process by which an agent in a state obtains an observation of that state, takes an action, moves to another state, and receives a reward; this formulation suits the uncertain interactions between a robot and its environment. Therefore, we model the problem of USV navigation in an unknown environment as a POMDP [38], which captures the hidden states in the environment through the analysis of environmental observation data. It can be described as a 7-tuple $\{S, A, P, R, \Omega, O, \gamma\}$, where $S$ is a set of states with partially observable elements; $A$ is the set of actions available in each state; $P: S \times A \to S$ is the state transition probability; $R: S \times A \to \mathbb{R}$ gives the feedback received after taking action $a$ in state $s$; $\Omega$ is the set of observations, with observation $o$ generated from the state according to the probability distribution $o \sim O(s)$; and $\gamma$ is the discount factor that weighs immediate rewards against future rewards. The observation space $\Omega$, action space $A$, and reward function $R$ are described in detail below.
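As a concrete illustration of this formulation (our own sketch, not the authors' implementation), the POMDP interaction loop can be written as a minimal environment interface in Python; the class name, state layout, and placeholder values below are hypothetical.
```python
import numpy as np


class USVNavEnv:
    """Minimal sketch of the POMDP interaction loop for USV navigation.

    The full state s (USV pose, obstacles, target) is hidden; the agent only
    receives an observation o generated from the state, matching the 7-tuple
    {S, A, P, R, Omega, O, gamma} above. All names and values are illustrative.
    """

    def __init__(self, gamma=0.99):
        self.gamma = gamma              # discount factor
        self.state = None               # hidden state s

    def reset(self):
        # Randomly initialize the hidden state, then return only the partial
        # observation o ~ O(s).
        self.state = {"usv": np.zeros(3), "target": np.array([2.0, 2.0])}
        return self._observe()

    def step(self, action):
        # Transition the hidden state according to P(s'|s, a), compute the
        # feedback R(s, a), and return the next observation.
        v, omega = action
        x, y, yaw = self.state["usv"]
        self.state["usv"] = np.array(
            [x + v * np.cos(yaw), y + v * np.sin(yaw), yaw + omega])
        reward, done = 0.0, False       # placeholder; see Section 3.4
        return self._observe(), reward, done

    def _observe(self):
        # Partial observation: relative target position only (other sensors omitted).
        dx, dy = self.state["target"] - self.state["usv"][:2]
        return np.array([np.hypot(dx, dy),
                         np.arctan2(dy, dx) - self.state["usv"][2]])
```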

3.2. Observation Space

The observation space o consists of range sensory data, the relative position between the target and the USV, and the motion information of the USV.
For range sensory data, monocular cameras provide rich information about the unstructured physical world and have potential for depth perception [39]. However, because visual sensing typically carries environmental noise from illuminance, texture, color contrast, etc., relying only on RGB images is insufficient to deliver reliable depth information. Consequently, considering the USV used for the experiments, we also introduce a laser into the sensor module; these accurate data allow the policy trained in simulation to migrate directly to the real world without any adjustments or other special processing. Since RGB images and LiDAR scans are high-dimensional, which would lead to a model with many parameters that is not easy to fit, we downscale the RGB images and subsample the raw LiDAR data; Figure 1a shows the sampled LiDAR data.
Because the environment is unknown, we give up the absolute position of the target, and the relative position is employed to characterize the relationship between the target and the USV, as shown in Figure 1b. The target coordinates $(d_{tar}, \theta_{tar})$ include the Euclidean distance $d_{tar}$ between the USV and the destination, as well as the angle $\theta_{tar}$ between the USV direction and the line linking the USV and the target.
The motion information of the USV includes the current linear and angular velocities.
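A minimal sketch (our illustration, with assumed function names and sampling scheme) of how the observation described above could be assembled from the downsampled image, the sampled LiDAR beams, the relative target coordinates, and the USV velocities:
```python
import numpy as np

def build_observation(rgb_image, lidar_ranges, usv_pos, usv_yaw, target_pos,
                      v, omega, num_beams=20, image_size=30):
    """Assemble the observation described in Section 3.2 (illustrative only)."""
    # Downscale the RGB frame to image_size x image_size (nearest neighbour).
    h, w, _ = rgb_image.shape
    rows = np.linspace(0, h - 1, image_size).astype(int)
    cols = np.linspace(0, w - 1, image_size).astype(int)
    small_image = rgb_image[np.ix_(rows, cols)] / 255.0

    # Uniformly sample num_beams beams from the raw LiDAR scan.
    idx = np.linspace(0, len(lidar_ranges) - 1, num_beams).astype(int)
    beams = np.asarray(lidar_ranges)[idx]

    # Relative target coordinates (d_tar, theta_tar) in the USV body frame.
    dx, dy = target_pos[0] - usv_pos[0], target_pos[1] - usv_pos[1]
    d_tar = np.hypot(dx, dy)
    theta_tar = np.arctan2(dy, dx) - usv_yaw
    theta_tar = np.arctan2(np.sin(theta_tar), np.cos(theta_tar))  # wrap to [-pi, pi]

    # Low-dimensional part: sampled beams, relative target, current velocities.
    low_dim = np.concatenate([beams, [d_tar, theta_tar, v, omega]])
    return small_image, low_dim
```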

3.3. Action Space

The output actions in robot navigation are usually expressed in one of two ways: (1) discrete actions, such as going forward, turning left, and turning right, where each action specifies a rated value and runs for a certain amount of time; (2) continuous actions, such as linear and steering speeds. Experimental evaluations have shown that discrete-action deep reinforcement learning for mapless navigation is a feasible alternative to the more computationally expensive continuous action space [40]. However, a discrete action space limits the solution space of the problem, and the best policy may not be included. Moreover, discontinuous actions also lead to unstable driving of the USV. Therefore, in order to achieve more realistic and accurate control, the more challenging continuous action outputs, i.e., linear and angular velocities, are selected.
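For illustration (our sketch; the normalized bounds follow the actor outputs described in Section 4.2, while the clipping helper itself is an assumption), the continuous action space can be represented as:
```python
import numpy as np

# Continuous action space (illustrative): normalized linear velocity v in [0, 1]
# and angular velocity omega in [-1, 1], later scaled to the physical limits
# used in the experiments (assumed scaling, see Section 5.1).
ACTION_LOW = np.array([0.0, -1.0])
ACTION_HIGH = np.array([1.0, 1.0])

def clip_action(raw_action):
    """Clip a raw policy output to the valid continuous action range."""
    return np.clip(np.asarray(raw_action, dtype=float), ACTION_LOW, ACTION_HIGH)

# Example: a raw network output of (1.3, -1.6) is clipped to (1.0, -1.0).
```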

3.4. Reward Design

Since the design of the reward function has a significant impact on learning speed and quality, we carefully analyze the following scenarios that the USV may face throughout the navigation: (1) The USV is involved in a collision; (2) the USV safely reaches the target position; (3) the USV navigates freely without colliding. Based on the aforementioned analysis, we divide the rewards into two parts: process rewards and termination rewards.
Termination rewards include achievement incentives and collision penalties. Either of these conditions is sufficient to end the episode, and the reward is expressed as follows.
$r_{\mathrm{termination}} = \begin{cases} 2 & \text{if } \lVert x_t - x_{\mathrm{target}} \rVert < \epsilon_{\mathrm{target}} \\ -10 & \text{if } \lVert x_t - x_{\mathrm{object}} \rVert < \epsilon_{\mathrm{collision}} \\ 0 & \text{otherwise} \end{cases}$
where $x_t$ is the USV's current position, $x_{\mathrm{target}}$ is the target position, and $x_{\mathrm{object}}$ is the position of the obstacle or wall closest to the USV. $\epsilon_{\mathrm{collision}}$ is the collision tolerance, i.e., the safe distance between the USV and an obstacle, and $\epsilon_{\mathrm{target}}$ is the reach threshold. Note that the magnitude of the collision factor is greater than that of the success factor. This is because, in practical applications, a collision is a worse outcome than failing to reach the target point. In addition, since we want the success rate to be higher than the failure rate, successful episodes outnumber failures, creating an imbalance in the training data. Therefore, it is vital that the collision factor outweighs the success factor.
The process reward is divided into two parts. One part is determined by comparing the current distance to the target position with the distance at the previous step: when the USV gets closer to the target position, it receives a small reward, and when it moves further away, it receives a small penalty. This urges the USV to gradually approach the target position and arrive in the shortest time.
$r_{\mathrm{position}} = \begin{cases} 0.01 & \text{if } d_t < d_{t-1} \\ -0.01 & \text{if } d_t > d_{t-1} \\ 0 & \text{otherwise} \end{cases}$
where $d_t$ is the distance between the USV and the target. The other part of the process reward is used to constrain the USV to drive autonomously in a safe and common-sense manner:
$r_{\mathrm{velocity}} = \lambda_1 v^2 \cos(\lambda_2 v \omega) - \lambda_3$
where $\lambda_1$ and $\lambda_3$ are the scale factor and deviation of the reward function, respectively, used to define an appropriate reward distribution, while $\lambda_2$ is used to regulate the effect of linear velocity on the reward at different angular velocities. The overall reward is defined as the sum of termination rewards and process rewards:
$r_t = r_{\mathrm{termination}} + r_{\mathrm{position}} + r_{\mathrm{velocity}}$
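The following Python sketch (our illustration, not the paper's code; the collision threshold, the λ values, and the sign of the collision penalty are assumptions drawn from the text, while the 0.3 m reach threshold follows Section 5.1.1) shows how the three reward terms combine into $r_t$:
```python
import numpy as np

def compute_reward(dist_to_target, prev_dist_to_target, dist_to_nearest_object,
                   v, omega, eps_target=0.3, eps_collision=0.2,
                   lam1=1.0, lam2=1.0, lam3=0.0):
    """Sketch of the total reward r_t = r_termination + r_position + r_velocity.

    eps_collision and the lambda values are illustrative, not the tuned
    parameters of the paper.
    """
    # Termination rewards: +2 on reaching the target, -10 on collision
    # (the minus sign reflects that the text describes it as a penalty).
    if dist_to_target < eps_target:
        r_term, done = 2.0, True
    elif dist_to_nearest_object < eps_collision:
        r_term, done = -10.0, True
    else:
        r_term, done = 0.0, False

    # Process reward 1: small bonus/penalty for approaching/leaving the target.
    if dist_to_target < prev_dist_to_target:
        r_pos = 0.01
    elif dist_to_target > prev_dist_to_target:
        r_pos = -0.01
    else:
        r_pos = 0.0

    # Process reward 2: encourage sensible combinations of linear and angular velocity.
    r_vel = lam1 * v ** 2 * np.cos(lam2 * v * omega) - lam3

    return r_term + r_pos + r_vel, done
```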

4. Methodology

In this section, we present our mapless navigation method for USVs. We first give a brief introduction to the DRL algorithm and the network architecture used to train our navigation policy. Then, we discuss our optimization details.

4.1. Reinforcement Learning

The goal of reinforcement learning is to find the optimal policy π * for maximizing the expected reward.
$\pi^* = \arg\max_{\theta_\pi} \mathbb{E}\left[\sum_{t=1}^{T} \gamma^{t-1} r_t\right]$
The policy $\pi(a_t \mid o_t; \theta)$ is determined by a neural network parameterized with weights $\theta$ and will be trained using proximal policy optimization (PPO) [41] to find the value of $\theta$ that maximizes the objective.
$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta) A_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) A_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid o_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid o_t)}$
where $A_t$ represents the advantage estimate and $\epsilon$ represents the clipping range.
In practical applications, if the action commands output by the navigation policy in adjacent decision steps differ greatly, the USV not only wobbles considerably and becomes difficult to control, but also exhibits a large steady-state error in each execution. We take the step response of a first-order system as an example: the system is given by Equation (8), and the step input by Equation (9).
$G(s) = \frac{1}{Ts + 1} \qquad (8)$
$R(s) = \frac{\delta}{s} \qquad (9)$
where $\delta$ represents the difference between two adjacent action commands. Combining Equations (8) and (9) and taking the inverse Laplace transform gives the output and the error:
$c(t) = \delta\left(1 - e^{-t/T}\right), \qquad e(t) = \delta - c(t) = \delta e^{-t/T}$
As we can see, the larger the difference $\delta$ between two adjacent action commands, the greater the absolute error. In addition, for higher-order systems, there is also the potential for a high overshoot and a long settling time. Therefore, we propose a consistency strategy, introduced into the loss function as Equation (11), to minimize the difference between two neighboring decisions. In this way, the previously executed action command is used to inform the current action, which essentially improves the feasibility in real-world applications. We reshape the PPO loss function as:
$L(\theta) = L^{CLIP}(\theta) + \lambda \left(\pi_\theta(a_t \mid o_t) - \pi_\theta(a_{t-1} \mid o_{t-1})\right)^2 \qquad (11)$
where λ is a coefficient.
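As an illustrative sketch (not the authors' implementation), the clipped PPO surrogate together with a consistency penalty on consecutive action commands could be written as follows; the coefficient value and the choice to penalize output actions rather than action probabilities are assumptions.
```python
import torch

def ppo_consistency_loss(log_probs_new, log_probs_old, advantages,
                         actions_t, actions_t_minus_1, clip_eps=0.2, lam=0.1):
    """Sketch of the clipped PPO surrogate plus the consistency term of Eq. (11).

    The penalty here acts on the squared difference between consecutive action
    commands; lam and this exact formulation are illustrative assumptions.
    """
    # Probability ratio r_t(theta) = pi_theta(a_t|o_t) / pi_theta_old(a_t|o_t).
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()

    # Consistency penalty: discourage large jumps between adjacent action commands.
    consistency = ((actions_t - actions_t_minus_1) ** 2).sum(dim=-1).mean()

    # PPO maximizes the surrogate, so the minimized loss is its negative plus
    # the consistency penalty.
    return -surrogate + lam * consistency
```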

4.2. Network Structure

In this paper, the Actor–Critic network structure is selected, and Figure 2 illustrates the network structure that we finally designed according to experience and experiments, including the fully connected layers, convolutional layers, and recurrent layers. The top three networks in blue are input embeddings. Images and laser data, as high-dimensional data inputs, have a negative impact on overall real-time performance. As a result, each image frame is downsampled into a 30 × 30 frame, and 20 laser beams are uniformly sampled from the laser range findings that cover 180 degrees in front of the USV. The critic outputs a state value, and the output of the actor is linear and angular velocities with bounds of [0, 1] and [−1, 1], respectively.
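The following PyTorch sketch illustrates one possible Actor–Critic network with the inputs of Figure 2; the exact layer sizes, the choice of a GRU for the recurrent layer, and the output squashing functions are our assumptions, since the text only specifies the 30 × 30 image, the 20 laser beams, the layer types, and the output bounds.
```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Illustrative Actor-Critic network for the inputs of Figure 2."""

    def __init__(self, n_beams=20, n_state=4, hidden=128):
        super().__init__()
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.laser_encoder = nn.Sequential(nn.Linear(n_beams, 64), nn.ReLU())
        self.state_encoder = nn.Sequential(nn.Linear(n_state, 32), nn.ReLU())
        conv_out = 32 * 6 * 6  # 30 -> 14 -> 6 with the strides above
        self.rnn = nn.GRU(conv_out + 64 + 32, hidden, batch_first=True)
        self.actor = nn.Linear(hidden, 2)   # linear and angular velocity
        self.critic = nn.Linear(hidden, 1)  # state value

    def forward(self, image, laser, state, h=None):
        feats = torch.cat([self.image_encoder(image),
                           self.laser_encoder(laser),
                           self.state_encoder(state)], dim=-1)
        out, h = self.rnn(feats.unsqueeze(1), h)
        out = out.squeeze(1)
        raw = self.actor(out)
        # Bound the outputs: v in [0, 1] via sigmoid, omega in [-1, 1] via tanh.
        v = torch.sigmoid(raw[:, :1])
        omega = torch.tanh(raw[:, 1:])
        return torch.cat([v, omega], dim=-1), self.critic(out), h
```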

4.3. Dynamics-Free Training

Most studies pursue an environment simulated with fidelity when training a model with DRL. They believe that the more similar the simulator is to the real world, the better the model’s transferability. However, as demonstrated in Figure 3a, a policy trained in this manner not only learns the mapping from inputs to outputs, but also implicitly learns dynamics, flaws, and biases of the simulator, which are absorbed by the experience data. Even experienced experts cannot accurately describe a robot model, and no simulator exists to capture all of the physical phenomena of the real world [42], especially in a water environment, so a slight simulation deviation, such as frictional error or unmodeled dynamics, can cause a policy learned from the simulator to perform poorly after migrating to the real environment.
In our framework, we trained the navigation policy in a way that is diametrically opposed to previous works. We used Gazebo to build training environments and developed a Gazebo plugin that allows the policy to be trained only considering kinematic and geometric constraints, as shown in Figure 3b. In this way, the learned policy will not contain inaccurate dynamics information, and modeling of the water environment is also avoided. When deploying on the USV, we carefully designed an adaptive controller as the tracking controller according to the USV dynamics, which made it possible for the navigation policy from the dynamics-free training to be transferred to the real environment.
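As an illustration of dynamics-free training (our sketch, not the authors' Gazebo plugin), the simulator update can integrate only the kinematics of the commanded velocities, subject to collision checks, rather than simulating hydrodynamics:
```python
import numpy as np

def kinematic_step(pose, action, dt):
    """Dynamics-free update: integrate only the kinematics of the commanded
    velocities, ignoring hydrodynamics, inertia, and thruster dynamics.

    pose = (x, y, yaw); action = (v, omega). This mimics what a kinematic
    Gazebo plugin could do by setting the model pose directly (illustrative;
    collision checking against obstacle geometry is left to the simulator).
    """
    x, y, yaw = pose
    v, omega = action
    x += v * np.cos(yaw) * dt
    y += v * np.sin(yaw) * dt
    yaw = np.arctan2(np.sin(yaw + omega * dt), np.cos(yaw + omega * dt))
    return np.array([x, y, yaw])
```
At deployment time, the adaptive tracking controller on the real USV is then responsible for realizing the commanded velocities despite the unmodeled dynamics.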

4.4. Domain Randomization

In most previous work [14,25,43], policies were trained in a simple and fixed environment containing single and structured obstacles, which might increase the risk of over-fitting and make it difficult to generalize to unknown and more complex environments. Domain randomization, a simple but promising approach to addressing the reality gap, has been demonstrated in many reinforcement learning efforts. The success of domain randomization rests on the assumption that if the variability in a simulation is sufficiently extensive, then the real world will just be a variation of the simulated environment. Therefore, the resulting model trained in the simulation will be generalizable to the real world without additional training.
Although, in principle, domain randomization can be applied to any part of the reality gap, such as observation space, action space, dynamics parameters, etc., in our training environment, dynamics have been removed, so we focus on the challenges of low-fidelity simulation and overfitting in a single training environment. Our domain randomization during training includes the following:
  • The shape, size, and color of obstacles.
  • The position and pose of obstacles.
  • The time interval between adjacent decisions.
  • The noise of RGB images and LiDAR.
We construct the scene layout at random to improve the robustness of the navigation policy. Given that differences in obstacle placement between the virtual environment and the real environment may cause the USV to undertake improper navigation behaviors in the real world, we arrange obstacles in accordance with uniform distribution so as to cover a wide range of changes during training. In addition, we also prepare a variety of obstacles and randomly initialize the pose, colors, and size of obstacles to increase the unstructured characteristics of the environment.
The USV should sample observations and output actions at an appropriate frequency to control its motion in time during navigation. If decisions are made too frequently, the Markov chain becomes too long, and the decay of the discount factor causes the termination reward to have little effect on the early data. If the decision frequency is too low, the USV may not respond in time and may collide with obstacles while traveling. In practical applications of the navigation policy, we usually set a suitable frequency, but the time interval cannot be strictly guaranteed to remain constant due to hardware limitations and computational resource constraints. Therefore, randomizing the time interval between adjacent decisions is a simple and effective way to simulate realistic delays.
Since real sensors often carry bias and noise, we also add Gaussian noise to lasers and cameras in the training process to improve the robustness of our policy, which helps to bridge the gap between the virtual and reality.
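A minimal sketch of the randomization loop (our illustration; the obstacle count, size ranges, decision-interval range, and noise standard deviations are assumed values, not the paper's settings):
```python
import random
import numpy as np

def randomize_episode(rng=np.random):
    """Sample one randomized training episode configuration (illustrative)."""
    obstacles = []
    for _ in range(rng.randint(2, 9)):                      # 2-8 obstacles
        obstacles.append({
            "shape": random.choice(["sphere", "cube", "cylinder", "pyramid"]),
            "size": rng.uniform(0.1, 0.5),                  # metres (assumed range)
            "color": rng.uniform(0.0, 1.0, size=3),         # RGB
            "position": rng.uniform(-2.5, 2.5, size=2),     # uniform over the field
            "yaw": rng.uniform(-np.pi, np.pi),
        })
    decision_interval = rng.uniform(0.15, 0.35)             # seconds (assumed range)
    return obstacles, decision_interval

def add_sensor_noise(lidar, image, rng=np.random):
    """Add Gaussian noise to LiDAR ranges and camera pixels (std values assumed)."""
    noisy_lidar = lidar + rng.normal(0.0, 0.02, size=lidar.shape)
    noisy_image = np.clip(image + rng.normal(0.0, 0.01, size=image.shape), 0.0, 1.0)
    return noisy_lidar, noisy_image
```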

4.5. Curriculum Learning

It is extremely difficult for agents to learn a policy directly from scratch in a complex and changing environment containing a large number of obstacles with uneven distribution and diverse shapes. Manually designing dense rewards is a common way to alleviate this issue, but too much manual tuning introduces subjective human bias, and it is easy to fall into local optima [44]. By contrast, appropriate curriculum learning is a better choice for agents to reach their full potential and master difficult tasks quickly. Therefore, we designed an adaptive curriculum learning scheme for navigation in a complex environment.
According to our experience, our curriculum ranges from easy to difficult and is separated into three lessons, with the degree of difficulty determined by the number of obstacles in the environment. Many metrics could be used to decide how many episodes to spend on each lesson, such as the loss function, the entropy value, or the mean reward. However, these metrics do not fully describe mastery of the lesson: they may be influenced by random exploration, the agent may simply perform well in the last few episodes, or, although a set threshold has been reached, the agent could still perform better in that lesson. For this reason, we turn to measuring the stability of performance, which is difficult to detect given the noise in the learning curve. Therefore, we use a least-squares fit of the success rate curve to measure the policy's mastery of the current lesson. When the slope of the fitted line is greater than a positive threshold, we believe that the agent does not yet fully master the current lesson and that there is room for further improvement. When the slope is less than a negative threshold, the agent is still exploring and learning the environment. When the fitted line approaches horizontal, the policy has mastered this lesson and can move on to the next lesson of training. In this way, each lesson is learned gradually until a satisfactory result is achieved. This type of evaluation takes into account not only the average performance of the agent, but also its learning trend.
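A minimal sketch of this lesson-switching rule (our illustration; the window length and slope threshold are assumed values):
```python
import numpy as np

def should_switch_lesson(success_rates, window=200, slope_threshold=1e-4):
    """Decide whether to advance to the next lesson (illustrative thresholds).

    Fits a least-squares line to the recent success-rate curve; when the fitted
    slope is close to horizontal, the current lesson is considered mastered.
    """
    if len(success_rates) < window:
        return False
    recent = np.asarray(success_rates[-window:])
    x = np.arange(window)
    slope, _ = np.polyfit(x, recent, deg=1)
    # Positive slope above the threshold: still improving; negative slope below
    # -threshold: still exploring. A near-zero slope means performance has stabilized.
    return abs(slope) < slope_threshold
```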
The first lesson is a relatively simple task with only two obstacles, and the purpose is to lay a solid foundation for reaching the target. With the gradual adaptation of lesson 1, the agent will enter the second lesson. The number of obstacles increases to five, and the goal of this lesson is to cultivate awareness of basic obstacle avoidance skills. In the last lesson, eight obstacles are added to the scene, and the agent will pursue an optimal navigation policy that meets reward preferences.

5. Experiment

In this section, we will first introduce the USV platform used for the experiments and the experimental details, followed immediately by the evaluation metrics and the baselines. Finally, USV navigation experiments were performed in simulated and real environments to evaluate the proposed method and the importance of each component in the pipeline.

5.1. Experimental Setups

5.1.1. Simulated Environments

The simulated world was constructed using Gazebo and is depicted in Figure 4. The environment was a 5 × 5 m square field with various obstacles, such as spheres, cubes, cylinders, etc., and all of the obstacles were randomly placed, as mentioned in Section 4. The information of the USV and the target was gained from Gazebo in real time.
For each episode, the USV started from a random position and heading at the bottom of the map, and the destination was also randomly selected at the top of the map. The USV needed to cross the obstacle field to reach the target. An episode was considered successful only when the distance between the USV and the target was less than 0.3 m and the number of steps was less than 100. The maximum linear and angular velocities were scaled to 0.1 m/s and 0.1 rad/s, respectively, and Gaussian noise $\mathcal{N}(\mu = 0, \sigma = 0.03)$ was added to the actions to facilitate policy exploration. The parameters of deep reinforcement learning and their corresponding values are listed in Table 1.

5.1.2. Real Environments

We demonstrate the proposed method using the USV that we designed. As shown in Figure 5, the USV measures 19 cm × 19 cm × 24 cm and weighs 3.274 kg, which is very suitable for navigation in a narrow and dense obstacle environment. It is controlled by six thrusters, and an IMU is carried inside to measure linear and angular velocities. A front-facing Hokuyo 2D LiDAR is mounted on the top, and an RGB camera is mounted in front of the USV. The relative position of the destination is obtained through UWB [45].
The overall framework had two computing units. The primary computing unit was a PC with 8 GB of RAM and a Core i7 processor, while the auxiliary computing unit was an onboard STM32 ARM microcontroller on the USV. The navigation policy was deployed in the main computing unit, and an adaptive controller was deployed in the auxiliary computing unit to track navigation commands. The details of the information interactions between two units can be seen in Figure 6.

5.2. Baselines and Evaluation Metric

5.2.1. Baselines

To the best of our knowledge, no prior studies have focused on sim-to-real navigation for USVs using deep reinforcement learning; thus, we mostly use ablated versions of our method and a traditional local navigation method as baselines. (1) Random is the simplest baseline, where the agent performs a random action at each timestep. (2) Ours is the method proposed in Section 4. (3) Ours-NoCL is an ablated variant of Ours in which the policy learns the complex task directly without curriculum learning. (4) Ours-NoCS is another ablated variant of Ours that does not include the consistency strategy. (5) Ours-NoDR is the model trained without domain randomization. (6) PPO-UUV was trained with the UUV Simulator [46], which provides a package for simulating the water environment, and was implemented with deep reinforcement learning setups identical to those of Ours. (7) APF, i.e., the artificial potential field method [47], is an effective traditional method for local path planning.

5.2.2. Evaluation Metrics

The performance metrics are as follows: (1) The cumulative reward: from the cumulative reward obtained during training, we can see how much the agent learns in the environment, because the reward is directly related to the agent's performance in the training environment. (2) The success rate (SR) is the ratio of the number of episodes in which the agent successfully reaches the goal to the total number of episodes. (3) The timestep records the total number of decisions made by the USV in an episode. (4) The travel distance (TD) indicates the distance traveled by the USV in one episode. (5) The trajectory smoothness (TS) reflects the smoothness of the path and is defined as:
$M_{TS} = \frac{1}{T} \sum_{t=1}^{T} \left| \psi_t - \psi_{t-1} \right|$
where $\psi_t$ represents the heading direction of the USV at time $t$, and $T$ denotes the total time.
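For reference, the smoothness metric can be computed as follows (our sketch; wrapping heading differences to [−π, π] is an assumption):
```python
import numpy as np

def trajectory_smoothness(headings):
    """Trajectory smoothness M_TS: mean absolute change in heading per step.

    `headings` is the sequence psi_0, ..., psi_T of USV heading angles (rad).
    """
    psi = np.asarray(headings)
    diffs = np.diff(psi)
    diffs = np.arctan2(np.sin(diffs), np.cos(diffs))  # wrap angle differences
    return np.mean(np.abs(diffs))
```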

5.3. Simulation Evaluation

The proposed method was evaluated in terms of convergence speed, generalization performance, trajectory smoothness, etc. in a simulated environment, and an attempt was made to interpret the trained model.

5.3.1. Study of the Training Process

To increase the confidence of our conclusions, all methods used the same random seed, and for each method, the run with the highest average reward out of five experiments was selected for evaluation. The total training process of each method lasted 12,000 episodes (approximately half a million steps) and took around 20 h. We compared the learning curves of Ours, Ours-NoCL, Ours-NoCS, and PPO-UUV in terms of reward, timestep, and success rate; the learning curves of the four approaches are shown in Figure 7. The jitter of the curves was caused by the environmental change in each episode, and in order to observe the trends more intuitively, we applied a moving average to all curves. It can be seen that, during training, the trends of the reward curve and success rate curve were consistent, demonstrating that our reward design was reasonable and effective.
We analyzed the data efficiency of these methods using the learning curves in Figure 7b. It can be seen that PPO-UUV had a lower slope of the SR curve than those of the other methods that used curriculum learning. Even though we retrained it many times, there was no improvement in the convergence rate, and training often failed to converge to a valid policy. This was due to the influence of hydrodynamics: decisions from the policy could not be perfectly executed by the agent, and the collected data contained dynamics information that also had to be learned, which made it difficult for the policy to converge. On the contrary, Ours and all of its variants successfully learned an effective policy. This demonstrates that training in a physics simulator takes more time than training in a dynamics-free simulator.
In addition, the experiments demonstrated that Ours-NoCL suffered from a low convergence rate. In contrast, both Ours and Ours-NoCS had fast learning curves, reaching the final performance in roughly one-quarter fewer iterations (≈3000 episodes) than Ours-NoCL. A possible reason is that learning the complex task directly, without the help of curriculum learning, entailed a large observation space as well as sparse rewards, so effective information could not be extracted from the empirical data. This result justifies the advantage of curriculum learning in a complex and changing environment containing various obstacles.
Meanwhile, Figure 7c shows that the timestep of each method increased slightly at the beginning of training, which was due to the USV gradually learning obstacle avoidance to increase its traveling time. After that, the timestep was highly connected with the training episode, and it decreased as the number of training episodes increased, which shows that the methods gradually converged to the optimal path.
Ours and Ours-NoCS achieved close results in terms of SR and finally stabilized between 60% and 65%, as shown in Figure 7b. It is worth noting that the SR during training is lower than in testing, mainly for the following four reasons. (1) The USV could not always act in accordance with the policy output due to random exploration. (2) The environment was generated randomly, and there was no guarantee that a feasible path existed from the starting point to the destination. (3) Accelerated training reduced the frequency of USV decision making, so the USV could not always respond in time. (4) Timeouts when receiving information from Gazebo caused failures.

5.3.2. Study of the Generalization Performance

We verified the generalization performance by analyzing the methods’ cross-scene abilities in different scenarios. We constructed two 4 m × 4 m evaluation scenes, as shown in Figure 8.
In each experiment, the USV was randomly placed at the bottom of the map, and the destination was chosen at random at the top. We prepared 500 trials with different methods in each scene. Then, we assessed the performance of all methods on both scenes by calculating the average SR, TD, and TS, and Table 2 summarizes the results of all models.
As we can see in Table 2, all methods performed with good adaptability in Scene 1 except Ours-NoCL. The reason is that Ours-NoCL was trained directly in a complex environment and had no experience in dealing with simple scenarios. In addition, Ours, Ours-NoCL, and PPO-UUV obviously performed better than Ours-NoCS, with a 15.4–21.7% improvement in TS, which we attribute to the consistency strategy. Moreover, the similar SR between Ours and Ours-NoCS shows that introducing the consistency strategy did not degrade the performance of the policy.
The obstacles that appeared in Scene 2 were not included in the training process, and the obstacle textures and shapes were unstructured, so the SR of the deep reinforcement learning algorithm was slightly reduced compared to that in Scene 1.
The physical simulator was introduced in Scene 3 and Scene 4, and Ours was close to PPO-UUV, which was trained directly in the water environment, in terms of SR, and even Ours had a slightly higher SR than that of PPO-UUV in Scene 4. Considering that Ours was never trained in a physical simulator, this means that our approach allowed the transfer of knowledge gained from the dynamics-free environment to the physical simulator, demonstrating the effectiveness of training in a dynamics-free environment.
The results from the above test scenarios show that the navigation policy trained in a dynamics-free environment possessed good generalization to the physical simulator and performed similarly to the policy trained in the physical simulator, which demonstrated that dynamics-free training could transfer the learning outcomes to a physical simulator and was an effective alternative to a high-quality simulator. In addition, curriculum learning is important for deep reinforcement learning not only to improve the convergence speed of the algorithm, but also to improve the data richness and, thus, the navigation performance of the algorithm. Finally, it is also clear that the methods using the consistency strategy gave advantages in TS in simulated environments.

5.3.3. Qualitative Analysis

In the previous subsection, we performed a quantitative evaluation of the generalization performance of the methods in a variety of unknown environments, and in this subsection, we performed a qualitative analysis of the proposed method under multi-goal navigation. We constructed a virtual test environment of 5 m × 5 m, as shown in Figure 9, in which various obstacles, such as ellipsoids, cubes, cylinders, pyramids, etc., were placed in the scene.
We selected four target points in the environment with the coordinates (2, −2), (2, 2), (−2, 2), and (−1, 0), where the USV was initialized at (−2, −2) in the positive direction of the x-axis. The USV needed to reach all of the pre-set target points in turn without collision, and its travel trajectory is shown in Figure 10a, where the red numbers represent the target points. The USV navigated to the four target locations in turn, with a total path length of 14.6 m, a total travel time of 259.5 s, an average linear velocity of 0.056 m/s, and an average angular velocity of 0.011 rad/s.
We plotted the control command curves of angular velocity and linear velocity during the USV voyage, as shown in Figure 10b. Comparing the two curves, we find that an increase in linear velocity was accompanied by a decrease in the absolute value of the angular velocity, and vice versa. In addition, at around 70, 145, and 219 s, the linear velocity decreased considerably because, at these moments, the USV had just completed the current navigation subtask and needed a large angular velocity to adjust its direction towards the next target point. Before these three moments, the linear velocity of the USV climbed continuously because it was gradually approaching the target point and had determined that there were no more obstacles ahead.
Instead of considering the trained model as a black box, we tried to interpret the model obtained from training. We selected four representative moments to analyze the performance of the navigation policy, which were t = 9.7 s, t = 71.9 s, t = 100.4 s, and t = 207.7 s. As shown in Figure 11, the first image in each column represents the LiDAR data and relative position of the target, and the second represents the CNN feature map after interpolation. Regarding the global position information of the USV and the target, we can observe these with the four points A, B, C, and D in Figure 10a.
At t = 9.7 s, the USV had just started the navigation task, and it can be seen from the global map that the target was located in front of the USV, but with a cone-shaped obstacle in the path. At this time, the linear and angular velocities of the USV were 0.048 m/s and 0.014 rad/s, respectively, which shows that the USV was turning to the left to avoid the obstacle at the current moment. The reason for this action can be explained by the input of the navigation policy. The perceived distance of the LiDAR was obviously greater on the left than on the right, and in addition, the feature map was brighter on the left side.
At t = 71.9 s, the linear velocity of the USV was 0.026 m/s and the angular velocity was 0.094 rad/s. At this point, the linear velocity was significantly lower than the average linear velocity throughout the voyage process, while the angular velocity was higher than the average angular velocity, implying that the USV started to decelerate and turn left, which was caused by the heading error. Because the USV had just finished the last navigation task at this time, it was not facing the target position, so it needed to make a sharp turn to face the next target point.
At t = 100.4 s, it can be seen from the global map that the USV intended to go around the prismatic conical obstacle in front of it. The feature map at this time was brighter on the left and darker on the right, as at t = 9.7 s. However, the linear and angular velocities of the USV were 0.059 m/s and −0.041 rad/s, respectively, which means that the USV was turning to the right, contrary to the behavior at t = 9.7 s. This indicates that the USV did not rely only on the visual light–dark relationship for navigation, but integrated all sensor information to make decisions.
At t = 207.7 s, the linear velocity of the USV was 0.071 m/s and the angular velocity was −0.005 rad/s. The angular velocity was close to 0, and the linear velocity was higher than the average value during the drive. The global map shows that the target point was located directly in front of the USV, very close and without obstacles; therefore, the USV started to accelerate to complete the navigation task as soon as possible.

5.4. Real-World Experiment

The sim-to-real transferability of the proposed method will be verified through quantitative and qualitative analyses in real environments.

5.4.1. Quantitative Analysis

We first conducted the experiments in a simple real environment with an area of 3 m × 1.4 m, as shown in Figure 12.
We constructed two different scenarios, and in each trial the USV had to cross the obstacles from one side of the site to the other. In Scene 1, a spherical obstacle blocked the line between the USV and the destination. The gap between the obstacle and the edges of the site was only 0.35 m wider than the USV, which is challenging because, unlike ground robots, USVs usually have large inertia and weak driving power and cannot make sharp turns in a narrow space. In Scene 2, we increased the number of obstacles in the center of the field to two, and the gap between the obstacles and the field boundary was only 0.4 m, which placed higher demands on the navigation and control algorithms. Meanwhile, we also built the same environments in Gazebo to verify the sim-to-real capability of the algorithm.
We compared the performance of Ours, Ours-NoCS, and Ours-NoDR between simulation and reality. Figure 13 plots the trajectories generated by each method for the two scenarios.
It can be seen that most of the trajectories of Ours and Ours-NoCS in the real environment agreed well with those in the virtual environment, which implies that the decision processes of the two experiments were similar and that our framework has transferability. However, due to the inertia of the USV, the steering of the USV was more sluggish when it first started compared to that in the virtual environment, and its lateral travel was longer than that in the virtual environment. This phenomenon is an inherent property of the weak drive of surface vehicles and was not only present in Ours, but also in the other methods. Although Ours-NoCS was able to perform the navigation task in both scenarios and to choose a safer obstacle avoidance path compared to the other two methods, it is clear that it had a sharp steering operation during travel and the trajectory was not as smooth as that of Ours. Compared with the above two methods, Ours-NoDR had a larger trajectory difference between the virtual and real environment and did not maintain a safe distance from the obstacles during navigation.
For clarity, the lateral error (LE) was selected for analysis, as shown in Equation (13), which is defined as the lateral distance (difference in the y-direction) between two trajectories from the simulation and real world.
$\mathrm{LE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i^{sim} - y_i^{real} \right| \qquad (13)$
where $n$ represents the number of interpolation points, $y^{sim}$ represents the lateral movement distance in the simulated environment, and $y^{real}$ represents the lateral movement distance in the real environment. Considering that a boxplot can show the data distribution clearly, we present the lateral error results of each method in both scenarios through a boxplot, as shown in Figure 14, where the triangle indicates the mean lateral error and the solid brick-red line indicates the median lateral error.
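A small sketch of how the lateral error could be computed from two recorded trajectories (our illustration; the resampling scheme along the point index is an assumption):
```python
import numpy as np

def lateral_error(traj_sim, traj_real, n=100):
    """Lateral error (Eq. 13): mean |y_sim - y_real| over n interpolation points.

    Both trajectories are (x, y) point sequences; here they are resampled at n
    evenly spaced positions along their own point index before comparison.
    """
    traj_sim = np.asarray(traj_sim, dtype=float)
    traj_real = np.asarray(traj_real, dtype=float)
    t_sim = np.linspace(0.0, 1.0, len(traj_sim))
    t_real = np.linspace(0.0, 1.0, len(traj_real))
    t = np.linspace(0.0, 1.0, n)
    y_sim = np.interp(t, t_sim, traj_sim[:, 1])
    y_real = np.interp(t, t_real, traj_real[:, 1])
    return np.mean(np.abs(y_sim - y_real))
```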
In Scene 1, the medians of lateral error from Ours and Ours-NoCS were close, but it was obvious that the mean lateral error was lower for Ours. In addition, the upper bound and upper quartile of Ours were lower, which indicates that the errors were more concentrated to ensure the worst-case performance. Compared to Ours and Ours-NoCS, Ours-NoDR performed poorly, with median and mean values of the lateral error being 47.3% and 48.3% higher than Ours, respectively.
In Scene 2, the median lateral error of Ours-NoCS was the lowest among the three methods, but its data dispersion was higher compared to Ours, and it could not guarantee the minimization of the largest lateral error. The mean and median errors of Ours-NoDR were similar to the results in Scene 1 and were higher than those of the other two methods, which indicates that the sim-to-real performance of Ours-NoDR was not as good as that of Ours and Ours-NoCS, and the difference in its method implementation was that it did not use domain randomization. Therefore, the experimental results also confirm that our proposed domain randomization method can effectively improve the sim-to-real migration ability of the deep reinforcement learning algorithm.
In addition to the lateral error, we also plotted the trajectory smoothness for the three methods, as shown in Figure 15.
It can be seen that, in Scene 1, the trajectory smoothness values of the three methods in the virtual environment were similar, while in the real environment those of Ours-NoCS and Ours-NoDR were significantly higher than that of Ours. This result is consistent with our intuition: Ours-NoCS did not use the consistency strategy, which led to oscillation while driving. For Ours-NoDR, the trajectory smoothness in the virtual environment was the same as that of Ours, but the smoothness in the real environment was significantly higher; we deduce that this mainly stems from the lack of domain randomization during training, which led to a large difference between its performance in the real environment and that in the virtual environment.
In Scene 2, we increased the difficulty of the navigation task, so the differences in trajectory smoothness between the methods became significant even in the virtual environment. Despite the increased difficulty, Ours still showed good smoothness in both the virtual and real environments, although its trajectory was coarser than in real Scene 1. The trajectory smoothness of Ours-NoCS in both the virtual and real environments was worse than that of the other two methods; its trajectory shows sharp turns at 0.8 m and 1.7 m, respectively, whereas the trajectory of Ours remained smoother, which further illustrates the role of the consistency strategy.
Through quantitative analysis, we can see that the consistency strategy still allows the USV to exhibit a smooth trajectory in the real environment and prevents loss of control due to large velocity changes. In addition, it is clear that domain randomization plays a crucial role in migrating the model from virtual to real. However, deviations between virtual and real performance may still remain due to signal transmission delays, the settling time of the controller, inaccuracies of sensors and actuators, etc. In fact, these errors cannot be completely eliminated, and all we can do is narrow the reality gap as much as possible.

5.4.2. Qualitative Analysis

To further explore the performance of the proposed method in the real world, we conducted tests in a 10 m × 5 m pool with various floating obstacles arranged on the water surface, as shown in Figure 16.
The APF algorithm was selected for the comparative analysis, and the navigation trajectory was drawn as shown in Figure 17 and Figure 18. The black dot represents the starting position of the USV, and the yellow five-pointed star represents the destination.
It can be seen that Ours could successfully perform the navigation task in the static scenario, which verified the effectiveness of the proposed method in real environments. Although APF could also successfully complete the navigation task, compared with Ours, the trajectory of APF was not smooth enough. This was because the APF method obtained the current control action by balancing obstacle avoidance and approaching the target point in each decision, and this way of balancing risks and benefits would inevitably make it difficult to obtain the optimal path.
In addition, from the trajectories generated in the dynamic environment, both could adapt to low-speed dynamic scenarios, but APF was affected by the environmental potential field and had the problem of excessive turning, resulting in redundant trajectories and increased energy consumption. The algorithm proposed in this paper can achieve smoother navigation, which is beneficial for surface vehicles with limited energy.

6. Conclusions

In this work, we proposed a novel method for USV navigation in an unknown environment, which paves the way for the sim-to-real approach of USV point-to-point navigation using deep reinforcement learning and can, theoretically, be extended to all unmanned devices. Ablation experiments in the test environment, which was not covered in training, showed the importance of every module and highlighted the advantages of our proposed method in terms of training efficiency, trajectory smoothness, and success rate. Meanwhile, experiments in the physical world showed the sim-to-real transferability of our approach and confirmed the superiority of our method compared to the baselines.
Although our navigation policy gives up dynamics information during training, reliable dynamics will still help our model. For future work, we will try to introduce effective dynamics and kinematics information as network inputs to pursue the best performance in the real environment.

Author Contributions

Conceptualization, N.W. and Z.L.; methodology, N.W.; software, Y.Z.; validation, N.W., Y.W. (Yabiao Wang), and Y.W. (Yong Wang); formal analysis, N.W., Y.W. (Yabiao Wang), and Z.L.; investigation, N.W.; resources, Y.W. (Yabiao Wang); data curation, N.W. and Z.L.; writing—original draft preparation, N.W.; writing—review and editing, Y.W. (Yong Wang) and Z.L.; visualization, N.W.; supervision, Z.L.; project administration, Z.L.; funding acquisition, Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China (Grant No. 2021YFB3203000).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, Z.; Zhang, Y.; Yu, X.; Yuan, C. Unmanned surface vehicles: An overview of developments and challenges. Annu. Rev. Control 2016, 41, 71–93.
  2. Silva Junior, A.G.d.; Santos, D.H.d.; Negreiros, A.P.F.d.; Silva, J.M.V.B.d.S.; Gonçalves, L.M.G. High-level path planning for an autonomous sailboat robot using Q-Learning. Sensors 2020, 20, 1550.
  3. Grando, R.B.; de Jesus, J.C.; Kich, V.A.; Kolling, A.H.; Bortoluzzi, N.P.; Pinheiro, P.M.; Neto, A.A.; Drews, P.L. Deep reinforcement learning for mapless navigation of a hybrid aerial underwater vehicle with medium transition. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2021; pp. 1088–1094.
  4. Li, J.; Wang, X.; Tang, S.; Shi, H.; Wu, F.; Zhuang, Y.; Wang, W.Y. Unsupervised reinforcement learning of transferable meta-skills for embodied navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 12123–12132.
  5. Kober, J.; Bagnell, J.A.; Peters, J. Reinforcement learning in robotics: A survey. Int. J. Robot. Res. 2013, 32, 1238–1274.
  6. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
  7. Laud, A.; DeJong, G. The influence of reward on the speed of reinforcement learning: An analysis of shaping. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), Washington, DC, USA, 21–24 August 2003; pp. 440–447.
  8. Xie, L.; Miao, Y.; Wang, S.; Blunsom, P.; Wang, Z.; Chen, C.; Markham, A.; Trigoni, N. Learning with stochastic guidance for robot navigation. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 166–176.
  9. Pathak, D.; Agrawal, P.; Efros, A.A.; Darrell, T. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2778–2787.
  10. Burda, Y.; Edwards, H.; Pathak, D.; Storkey, A.; Darrell, T.; Efros, A.A. Large-Scale Study of Curiosity-Driven Learning. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
  11. Houthooft, R.; Chen, X.; Duan, Y.; Schulman, J.; De Turck, F.; Abbeel, P. Vime: Variational information maximizing exploration. arXiv 2016, arXiv:1605.09674.
  12. Duan, Y.; Andrychowicz, M.; Stadie, B.C.; Ho, J.; Schneider, J.; Sutskever, I.; Abbeel, P.; Zaremba, W. One-Shot Imitation Learning. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
  13. Pfeiffer, M.; Shukla, S.; Turchetta, M.; Cadena, C.; Krause, A.; Siegwart, R.; Nieto, J. Reinforced imitation: Sample efficient deep reinforcement learning for mapless navigation by leveraging prior demonstrations. IEEE Robot. Autom. Lett. 2018, 3, 4423–4430.
  14. Hodge, V.J.; Hawkins, R.; Alexander, R. Deep reinforcement learning for drone navigation using sensor data. Neural Comput. Appl. 2021, 33, 2015–2033.
  15. Zhang, J.; Springenberg, J.T.; Boedecker, J.; Burgard, W. Deep reinforcement learning with successor features for navigation across similar environments. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 2371–2378.
  16. Graves, A.; Bellemare, M.G.; Menick, J.; Munos, R.; Kavukcuoglu, K. Automated curriculum learning for neural networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1311–1320.
  17. Sukhbaatar, S.; Lin, Z.; Kostrikov, I.; Synnaeve, G.; Szlam, A.; Fergus, R. Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
  18. Matiisen, T.; Oliver, A.; Cohen, T.; Schulman, J. Teacher–student curriculum learning. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 3732–3740.
  19. Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.; Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 23–30.
  20. Liu, G.H.; Siravuru, A.; Prabhakar, S.; Veloso, M.; Kantor, G. Learning end-to-end multimodal sensor policies for autonomous navigation. In Proceedings of the Conference on Robot Learning, Mountain View, CA, USA, 13–15 November 2017; pp. 249–261.
  21. Josef, S.; Degani, A. Deep Reinforcement Learning for Safe Local Planning of a Ground Vehicle in Unknown Rough Terrain. IEEE Robot. Autom. Lett. 2020, 5, 6748–6755.
  22. Hu, H.; Zhang, K.; Tan, A.H.; Ruan, M.; Agia, C.; Nejat, G. A sim-to-real pipeline for deep reinforcement learning for autonomous robot navigation in cluttered rough terrain. IEEE Robot. Autom. Lett. 2021, 6, 6569–6576.
  23. Zhu, Y.; Mottaghi, R.; Kolve, E.; Lim, J.J.; Gupta, A.; Fei-Fei, L.; Farhadi, A. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 3357–3364.
  23. Zhu, Y.; Mottaghi, R.; Kolve, E.; Lim, J.J.; Gupta, A.; Fei-Fei, L.; Farhadi, A. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 3357–3364. [Google Scholar]
  24. Huang, X.; Deng, H.; Zhang, W.; Song, R.; Li, Y. Towards Multi-Modal Perception-Based Navigation: A Deep Reinforcement Learning Method. IEEE Robot. Autom. Lett. 2021, 6, 4986–4993. [Google Scholar] [CrossRef]
  25. Shi, H.; Shi, L.; Xu, M.; Hwang, K.S. End-to-end navigation strategy with deep reinforcement learning for mobile robots. IEEE Trans. Ind. Inform. 2019, 16, 2393–2402. [Google Scholar] [CrossRef]
  26. Tan, J.; Zhang, T.; Coumans, E.; Iscen, A.; Bai, Y.; Hafner, D.; Bohez, S.; Vanhoucke, V. Sim-to-Real: Learning Agile Locomotion For Quadruped Robots. arXiv 2018, arXiv:1804.10332. [Google Scholar]
  27. Savva, M.; Kadian, A.; Maksymets, O.; Zhao, Y.; Wijmans, E.; Jain, B.; Straub, J.; Liu, J.; Koltun, V.; Malik, J.; et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 9339–9347. [Google Scholar]
  28. Xia, F.; Zamir, A.R.; He, Z.; Sax, A.; Malik, J.; Savarese, S. Gibson env: Real-world perception for embodied agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 9068–9079. [Google Scholar]
  29. Daftry, S.; Bagnell, J.A.; Hebert, M. Learning transferable policies for monocular reactive mav control. In International Symposium on Experimental Robotics; Springer: Berlin/Heidelberg, Germany, 2016; pp. 3–11. [Google Scholar]
  30. Shrivastava, A.; Pfister, T.; Tuzel, O.; Susskind, J.; Wang, W.; Webb, R. Learning from simulated and unsupervised images through adversarial training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2107–2116. [Google Scholar]
  31. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; Lempitsky, V. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 2016, 17, 1–35. [Google Scholar]
  32. Peng, X.B.; Andrychowicz, M.; Zaremba, W.; Abbeel, P. Sim-to-real transfer of robotic control with dynamics randomization. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 3803–3810. [Google Scholar]
  33. Sadeghi, F.; Levine, S. Cad2rl: Real single-image flight without a single real image. arXiv 2016, arXiv:1611.04201. [Google Scholar]
  34. Bousmalis, K.; Irpan, A.; Wohlhart, P.; Bai, Y.; Kelcey, M.; Kalakrishnan, M.; Downs, L.; Ibarz, J.; Pastor, P.; Konolige, K.; et al. Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 4243–4250. [Google Scholar]
  35. Rajeswaran, A.; Ghotra, S.; Ravindran, B.; Levine, S. Epopt: Learning robust neural network policies using model ensembles. arXiv 2016, arXiv:1610.01283. [Google Scholar]
  36. Yu, W.; Tan, J.; Liu, C.K.; Turk, G. Preparing for the unknown: Learning a universal policy with online system identification. arXiv 2017, arXiv:1702.02453. [Google Scholar]
  37. Richard, A.; Aravecchia, S.; Schillaci, T.; Geist, M.; Pradalier, C. How to train your heron. IEEE Robot. Autom. Lett. 2021, 6, 5247–5252. [Google Scholar] [CrossRef]
  38. Wang, C.; Wang, J.; Shen, Y.; Zhang, X. Autonomous navigation of UAVs in large-scale complex environments: A deep reinforcement learning approach. IEEE Trans. Veh. Technol. 2019, 68, 2124–2136. [Google Scholar] [CrossRef]
  39. Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 270–279. [Google Scholar]
  40. Marchesini, E.; Farinelli, A. Discrete deep reinforcement learning for mapless navigation. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 10688–10694. [Google Scholar]
  41. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  42. Wahid, A.; Toshev, A.; Fiser, M.; Lee, T.W.E. Long range neural navigation policies for the real world. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 82–89. [Google Scholar]
  43. Tai, L.; Paolo, G.; Liu, M. Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 31–36. [Google Scholar]
  44. Wang, C.; Wang, J.; Wang, J.; Zhang, X. Deep-Reinforcement-Learning-Based Autonomous UAV Navigation With Sparse Rewards. IEEE Internet Things J. 2020, 7, 6180–6190. [Google Scholar] [CrossRef]
  45. Cossette, C.C.; Shalaby, M.; Saussié, D.; Forbes, J.R.; Le Ny, J. Relative position estimation between two uwb devices with imus. IEEE Robot. Autom. Lett. 2021, 6, 4313–4320. [Google Scholar] [CrossRef]
  46. Manhães, M.M.M.; Scherer, S.A.; Voss, M.; Douat, L.R.; Rauschenbach, T. UUV simulator: A gazebo-based package for underwater intervention and multi-robot simulation. In Proceedings of the OCEANS 2016 MTS/IEEE Monterey, Monterey, CA, USA, 19–23 September 2016; pp. 1–8. [Google Scholar]
  47. Khatib, O. Real-time obstacle avoidance for manipulators and mobile robots. In Autonomous Robot Vehicles; Springer: Berlin/Heidelberg, Germany, 1986; pp. 396–404. [Google Scholar]
Figure 1. Range sensory data and target coordinates.
Figure 2. Description of our network structure.
Figure 3. Schematic view of the simulator.
Figure 4. Training map, where obstacles of different shapes, sizes, and postures were randomly placed in each training episode so as to randomize the training data.
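The per-episode obstacle randomization summarized in the Figure 4 caption can be illustrated with a minimal sketch. The shape list, size range, map extent, and the spawn_obstacle callback below are assumptions introduced for illustration only; they are not the simulator interface actually used in this work.

```python
import math
import random

# A minimal sketch of per-episode obstacle randomization. SHAPES, the size
# range, the map extent, and spawn_obstacle() are illustrative placeholders,
# not the actual simulator API used in the paper.
SHAPES = ("box", "cylinder", "sphere")

def randomize_obstacles(spawn_obstacle, n_min=5, n_max=15, half_extent=50.0):
    """Spawn a random number of obstacles with random shape, size, position, and yaw."""
    for _ in range(random.randint(n_min, n_max)):
        spawn_obstacle(
            shape=random.choice(SHAPES),
            size=random.uniform(0.5, 3.0),                 # characteristic length (m)
            x=random.uniform(-half_extent, half_extent),   # position in the map frame
            y=random.uniform(-half_extent, half_extent),
            yaw=random.uniform(-math.pi, math.pi),         # random posture
        )
```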
Figure 5. Physical platform and simulated model used in the experiment.
Figure 6. Framework flowchart.
Figure 7. Convergence curves of Ours, Ours-NoCS, Ours-NoCL, and PPO-UUV. For clarity, each curve is smoothed with a sliding average, and the shaded band around each curve represents a 5% confidence interval.
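The sliding average mentioned in the Figure 7 caption is a simple moving-average filter over the training curves; a minimal sketch is given below, with the window size chosen arbitrarily for illustration.

```python
import numpy as np

def sliding_average(curve, window=50):
    """Smooth a training curve with a simple moving average (window size is an assumed value)."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(curve, dtype=float), kernel, mode="valid")
```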
Figure 8. Test scenes. (a) Scene 1 contains obstacles that appeared during training; (b) Scene 2 contains unstructured obstacles that were not encountered during training; (c) Scene 3 has the same layout as Scene 1, except the UUV simulator was used to simulate the water environment; (d) Scene 4 has the same layout as Scene 2, again using the UUV simulator.
Figure 9. Test environment.
Figure 10. Travel trajectory and velocity curves.
Figure 11. Observation sequences.
Figure 12. Test map for quantitative analysis.
Figure 13. Trajectories generated by the different methods. The stars denote target positions, the gray dots denote starting points, the blue trajectories come from real experiments, and the green trajectories are from simulation experiments.
Figure 14. Lateral error.
Figure 15. Trajectory smoothness.
Figure 16. Test map for qualitative analysis.
Figure 17. Trajectories in the static scenario.
Figure 18. Trajectories in the dynamic scenario.
Table 1. Reinforcement learning parameter settings.

Parameter              | Value    | Remark
Learning rate (Actor)  | 1 × 10⁻⁴ | Adam optimizer learning rate of the actor
Learning rate (Critic) | 1 × 10⁻³ | Adam optimizer learning rate of the critic
Update episodes        | 16       | Episodes collected per policy update
γ                      | 0.99     | Discount factor
K                      | 4        | Update epochs over a minibatch
λ                      | 0.1      | Consistency strategy loss factor
ϵ                      | 0.2      | Clipping threshold
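To show how the hyperparameters in Table 1 fit together, the following sketch wires them into a generic PPO-style clipped-surrogate update with an additional consistency term weighted by λ. The tiny networks, the observation/action sizes, and the placeholder consistency loss are assumptions for illustration only and do not reproduce the paper's actual architecture or training loop.

```python
import torch
import torch.nn as nn

# Hyperparameters taken from Table 1. GAMMA (the discount factor) would be
# used when computing returns and advantages, which is not shown here.
GAMMA, CLIP_EPS, LAMBDA_CS, K_EPOCHS = 0.99, 0.2, 0.1, 4
OBS_DIM, ACT_DIM = 24, 2          # assumed sizes, for illustration only

actor = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(), nn.Linear(64, ACT_DIM))
critic = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(), nn.Linear(64, 1))
log_std = nn.Parameter(torch.zeros(ACT_DIM))   # state-independent action noise

opt_actor = torch.optim.Adam(list(actor.parameters()) + [log_std], lr=1e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

def log_prob(obs, act):
    """Log-probability of actions under a Gaussian policy head."""
    dist = torch.distributions.Normal(actor(obs), log_std.exp())
    return dist.log_prob(act).sum(-1)

def ppo_update(obs, act, old_logp, returns, adv, consistency_loss=torch.tensor(0.0)):
    """One PPO-style update: clipped surrogate objective plus a λ-weighted consistency term."""
    for _ in range(K_EPOCHS):                                     # K update epochs over the minibatch
        ratio = torch.exp(log_prob(obs, act) - old_logp)
        clipped = torch.clamp(ratio, 1.0 - CLIP_EPS, 1.0 + CLIP_EPS)
        policy_loss = -torch.min(ratio * adv, clipped * adv).mean()
        value_loss = (critic(obs).squeeze(-1) - returns).pow(2).mean()

        opt_actor.zero_grad()
        (policy_loss + LAMBDA_CS * consistency_loss).backward()   # λ weights the consistency term
        opt_actor.step()

        opt_critic.zero_grad()
        value_loss.backward()
        opt_critic.step()
```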
Table 2. Evaluation of the different methods in test scenarios.

Environment | Method    | TS    | TD (m) | SR (%)
Scene 1     | Rand      | 0.516 | 1.38   | 3.0
Scene 1     | PPO-UUV   | 0.300 | 3.46   | 85.4
Scene 1     | Ours      | 0.318 | 3.46   | 85.6
Scene 1     | Ours-NoCS | 0.383 | 3.48   | 85.8
Scene 1     | Ours-NoCL | 0.324 | 3.51   | 83.0
Scene 2     | Rand      | 0.516 | 1.25   | 3.2
Scene 2     | PPO-UUV   | 0.310 | 3.39   | 77.8
Scene 2     | Ours      | 0.335 | 3.40   | 78.4
Scene 2     | Ours-NoCS | 0.405 | 3.47   | 81.0
Scene 2     | Ours-NoCL | 0.348 | 3.44   | 78.6
Scene 3     | Rand      | 0.517 | 1.35   | 2.8
Scene 3     | PPO-UUV   | 0.290 | 3.48   | 77.0
Scene 3     | Ours      | 0.312 | 3.50   | 76.6
Scene 3     | Ours-NoCS | 0.388 | 3.49   | 73.2
Scene 3     | Ours-NoCL | 0.341 | 3.52   | 70.4
Scene 4     | Rand      | 0.520 | 1.38   | 2.6
Scene 4     | PPO-UUV   | 0.306 | 3.51   | 71.0
Scene 4     | Ours      | 0.328 | 3.53   | 71.4
Scene 4     | Ours-NoCS | 0.403 | 3.53   | 69.4
Scene 4     | Ours-NoCL | 0.355 | 3.51   | 66.0
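Assuming the entries in Table 2 are averages over repeated test episodes per scene, a minimal sketch of the aggregation might look as follows; the per-episode record fields are hypothetical, and the exact per-episode definitions of TS and TD follow the paper and are not reproduced here.

```python
from statistics import mean

def aggregate(episodes):
    """Aggregate hypothetical per-episode records into a Table 2 style summary row."""
    ts = mean(e["ts"] for e in episodes)                               # mean TS over episodes
    td = mean(e["td"] for e in episodes)                               # mean TD over episodes (m)
    sr = 100.0 * sum(1 for e in episodes if e["success"]) / len(episodes)  # success rate (%)
    return {"TS": round(ts, 3), "TD (m)": round(td, 2), "SR (%)": round(sr, 1)}

# Example usage with made-up records:
# aggregate([{"ts": 0.32, "td": 3.5, "success": True}, {"ts": 0.31, "td": 3.4, "success": False}])
```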