Research on MASS Collision Avoidance in Complex Waters Based on Deep Reinforcement Learning

Liu, Jiao; Shi, Guoyou; Zhu, Kaige; Shi, Jiahui

doi:10.3390/jmse11040779


by Jiao Liu ¹,², Guoyou Shi ¹,²,*, Kaige Zhu ¹,² and Jiahui Shi ¹,²

¹ Navigation College, Dalian Maritime University, Dalian 116026, China

² Key Laboratory of Navigation Safety Guarantee of Liaoning Province, Dalian 116026, China

* Author to whom correspondence should be addressed.

J. Mar. Sci. Eng. 2023, 11(4), 779; https://doi.org/10.3390/jmse11040779

Submission received: 21 March 2023 / Revised: 27 March 2023 / Accepted: 29 March 2023 / Published: 3 April 2023

(This article belongs to the Section Ocean Engineering)


Abstract: Research on decision-making models for ship collision avoidance faces numerous challenges, including inadequate consideration of complex factors (for example, restricting scenarios to open water), the absence of static obstacles, and insufficient attention to collision avoidance between manned ships and maritime autonomous surface ships (MASSs). To overcome these limitations, a decision model for MASS collision avoidance is proposed that integrates the strengths of model-based and model-free methods in reinforcement learning. The model incorporates S-57 chart information, AIS data, and the Dyna framework. (1) When the MASS’s navigation task is known, a static navigation environment is built from S-57 chart information, and the Voronoi diagram and an improved A* algorithm are used to obtain the energy-saving optimal static path as the planned sea route. (2) Because an MASS has small main dimensions and is easily affected by wind and current, its motion model is established based on the MMG model with wind and current taken into account. At the same time, AIS data are used to extract the target ship (manned ship) data. (3) According to the characteristics of actual ship navigation at sea, the state space, action space, and reward function of the reinforcement learning algorithm are designed, and the MASS collision avoidance decision model based on Dyna-DQN is established. Based on the DQN algorithm, the agent (MASS) interacts continuously with the environment, and the real interaction data are used both to iteratively update the collision avoidance strategy and to train the environment model. The environment model is then used to generate simulated experience that further promotes the iterative update of the strategy. Using waters near the South China Sea for simulation verification, the navigation tasks are divided into three categories: only considering static obstacles, following the planned sea route while considering static obstacles, and following the planned sea route while considering both static and dynamic obstacles. Repeated simulation experiments show that an MASS can complete the navigation task without colliding with static or dynamic obstacles. Therefore, the proposed method is effective and can be used in the intelligent collision avoidance module of MASSs.

Keywords:

maritime autonomous surface ship (MASS); collision avoidance; Dyna-DQN; deep reinforcement learning (DRL); S-57

1. Introduction

1.1. Background

With the rapid development of economic globalization and the shipping industry, more than 82.5% of goods are transported by water. In recent years, the continuous progress of human science and technology has further promoted the development of ship science and technology, developing towards large-scale, specialized, high-speed, and unmanned ships. The safety of ships sailing at sea has also gradually attracted attention. The analysis conducted by EMSA during safety investigations [1] determined that from 2014 to 2020, at accident events or contributing factor levels, 89.5% of maritime safety accidents were related to human action. Therefore, with the development of the ship intelligence industry and to ensure that human factors will not affect the navigation safety of ships, the research and development of collision avoidance technology of intelligent ships/unmanned surface vehicles (USVs) have become a general trend. The International Maritime Organization (IMO) put forward a new work plan on Maritime Autonomous Surface Ships (MASSs) in 2017–2018 and formulated relevant conventions and regulations to solve a series of problems such as safety and environmental protection. MASSs is the term the IMO uses for ships that, to varying degrees, can operate in part or completely independent of human interaction [2,3]. MASS research has become a research hotspot in the international maritime field. The research, development, and adoption of MASSs are becoming the development trend in the shipbuilding industry. The shipping industry and relevant scientific research institutions have invested in relevant research on autonomous ships with different degrees of autonomy and different levels of intelligence.

The navigation environment of ships at sea is complex and changeable. There are static obstacles such as shorelines, islands, reefs, and sunken ships, as well as unidentified dynamic obstacles. Ships are also disturbed by time-varying and uncertain winds and currents at sea. Accidents of a navigational nature, such as ship collisions, contacts, and groundings/strandings, occur from time to time. As shown in Figure 1 and Figure 2 below, according to the investigation results of maritime accidents of the European Maritime Safety Agency (EMSA) [1], over the period 2014–2020, accidents of a navigational nature (collisions, contacts, and groundings/strandings) represented 43% of all occurrences related to the ship. These three types of accidents accounted for 13%, 17%, and 13% of the total number of accidents, respectively. Therefore, the ship collision avoidance maneuver is crucial for the safe navigation of ships at sea. At present, the increasing traffic flow of ships at sea leads to increasingly crowded navigation channels, further increasing the risk of ship collision. Ship collision accidents usually cause casualties, property losses, and even more severe damage to the ecological environment. To ensure the autonomous and safe navigation of an MASS at sea, it is urgent to develop its autonomous intelligent collision avoidance technology under complex navigation conditions.

The development of unmanned ships is still at an early stage, and manned and unmanned ships will coexist for a long time. However, the International Regulations for Preventing Collisions at Sea 1972 (COLREG) only address collision avoidance between manned ships. Note that unmanned ship, smart ship, and MASS currently refer to the same object under different names. Therefore, intelligent collision avoidance between MASSs and manned ships is an urgent problem to solve.

1.2. The Literature Review

There are few studies on collision avoidance between MASSs and manned ships; most focus on collision avoidance decision making between manned ships. According to the classification of research methods, ship collision avoidance models can be divided into traditional and intelligent ones.

The traditional ship collision avoidance model is based on traditional methods. The traditional methods include analytic geometry (AG), velocity obstacle (VO), fuzzy logic (FL), Swarm Intelligence Algorithm (SIA), and a mixture of these algorithms.

The application of the Marine Collision Avoidance System (MCAS) and Automatic Radar Plotting Aid (ARPA) system led to the subsequent proposal of the collision avoidance decision-making model based on AG. Wilson et al. [4] proposed a collaborative decision-making model of ship collision avoidance based on the line-of-sight method. Larson et al. [5,6] applied the Morphin algorithm to draw multiple arcs before the ship, covering the local obstacle map to consider all safe paths. Casalino et al. [7] proposed a local obstacle avoidance method based on the concept of the Bounding Box, where the Bounding Box is defined as a rectangular area that ships should avoid. Simetti et al. [8] introduced a safe Bounding Box around the original collision boundary box in the continuous workspace. Szlapczynski et al. [9] proposed a method for determining, organizing, and displaying collision avoidance information based on the Collision Threat Parameter Area (CTPA), which can improve the handling measures for MASSs in heavy weather conditions. Kim et al. [10] considered the COLREG. They proposed a collision avoidance algorithm based on the isochron method, including calculating collision risk and planning the optimal path for multi-vessel collision avoidance. Gail et al. [11] put forward the Collision Avoidance Dynamic Critical Area (CADCA) concept based on the minimum maneuvering area required by the two ships in the encounter. They explained the relevant hydrodynamic effects of ship dynamics and different rudder angles and forward speeds. Subsequently, Gail [12] improved the CADCA and proposed a collision avoidance model between ships and static obstacles.

The idea of the VO appeared in 1980 when it was named CTPA [13]. Later, Pederson et al. [14] applied it to the navigation field and proved that this method could provide better collision avoidance decision support for the officer on watch (OOW) compared with traditional ARPA equipment. Kuwata et al. (2014) [15] considered the COLREG and designed a ship local path planner based on the VO. Chen et al. [16] proposed a new risk detection method for ship collisions based on the VO and AIS data. Huang et al. [17] proposed a ship collision avoidance model based on the VO algorithm for collision avoidance scenarios with nonlinear motion characteristics and predictable target ship trajectories. Subsequently, this model was improved, and a maritime collision avoidance system for ships based on the Generalized VO (GVO) algorithm was proposed [18]. Li et al. [19] proposed a ship dynamic path planning method based on a multi-level Morphin adaptive search tree algorithm and VO. This method considers the ship’s maneuverability and the COLREG.

As a nonlinear control method independent of the controlled object, the fuzzy mathematics method can be applied well to the intelligent collision avoidance of ships. Liu et al. [20] combined the FL method with a neural network to quantify ship collision avoidance decisions. Perera et al. [21] considered the COLREG and proposed an intelligent collision avoidance decision-generation system based on FL. Subsequently, Perera et al. [22] proposed a model for generating collision avoidance decisions and executing collision avoidance behavior based on FL theory and the Bayesian network. Ahn et al. [23] used the adaptive network-based fuzzy inference system (ANFIS), an expert system, and a multilayer perceptron (MLP) in the ship collision avoidance system.

SIA methods mainly include Ant Colony Optimization (ACO), Particle Swarm Optimization (PSO), and so forth. Tsou et al. [24] proposed a path-planning system based on the ACO after considering the COLREG and navigation practice. Lazarowska [25] proposed an ACO-based path planning method for USVs in the dynamic marine environment, which can be applied to the decision support system on board or the intelligent obstacle detection and collision avoidance system.

The intelligent collision avoidance model includes the model established using advanced artificial intelligence (AI) technology to study the decision-making problem of ship collision avoidance. AI technology includes Artificial Neural Networks (ANNs) and Reinforcement Learning (RL).

The ANN has been a research hotspot in AI since the 1980s. It abstracts the neural network of the human brain from the perspective of information processing, establishes a simple model, and forms different networks according to different connection methods. As early as the 1990s, scholars began using this method to establish decision-making models of ship collision avoidance. Simsir et al. [26] established a ship position prediction model using the ANN to solve the collision avoidance problem of ships in narrow waterways, laying a foundation for subsequently determining the possibility of collision between two ships. Praczyk et al. [27] proposed an automatic multi-ship collision avoidance system based on an evolutionary neural network to solve the collision avoidance task in complex, multi-objective, and rapidly changing environments. Xu et al. [28] proposed an automatic collision avoidance method based on a deep convolutional neural network (CNN) using the strong visual processing ability of machine vision technology. Lin et al. [29] proposed a recurrent neural network (RNN) with a convolution unit to improve the autonomy and intelligence of obstacle avoidance planning for unmanned underwater vehicles. This method uses a convolution layer to replace the fully connected layer in the standard RNN, thus reducing the number of parameters and improving the ability to extract features. Johansen et al. [30] used a Bayesian belief network derived from systems theoretical process analysis (STPA) to establish an online risk assessment model for ships. As an advanced method in the field of AI, RL has made remarkable achievements in games and control. RL is suitable for solving sequential decision-making problems, which is consistent with the decision-making process of intelligent driving, so it is gradually being applied to the intelligent decision making of USVs, unmanned aerial vehicles, and unmanned vehicles.
RL can be divided into model-based RL and model-free RL according to whether an environment model is established. At present, model-free RL methods are primarily used in the field of intelligent collision avoidance, mainly including Q-learning, State Action Reward State Action (SARSA), Deep Q-Network (DQN), Policy Gradient (PG), Proximal Policy Optimization (PPO), Deep Deterministic Policy Gradient (DDPG), and Actor-Critic (AC) and its derivative algorithms. Yin et al. [31] proposed a simple obstacle avoidance algorithm based on Q-Networks to deal with complex navigation situations and unknown environmental dynamics. Shen et al. [32] put forward a multi-level intelligent collision avoidance algorithm for unmanned ships based on the DQN after fully considering the COLREG, experience in maritime navigation, and ship maneuverability. Zhang et al. [33] proposed a MASS autonomous navigation decision-making method based on the DQN and artificial potential field. Woo et al. [34] considered the COLREG and proposed a collision avoidance model for unmanned vessels based on machine vision and the DQN. Wu et al. [35] proposed an autonomous navigation and intelligent collision avoidance algorithm based on the Dueling DQN (DDQN). Liu et al. [36], addressing the problem that RL is time-consuming and infeasible for complex tasks, combined DQN with Transfer Learning (TL), which can transfer the knowledge learned in simple tasks to closely related but more complex tasks, providing a new idea for ship collision avoidance. Ryohei et al. [37] proposed an automatic collision avoidance system for ships based on grid sensors and the PPO, then improved this algorithm [38] and proposed a multi-ship collision avoidance and waypoint navigation model based on Long Short-Term Memory (LSTM) networks and the PPO. Zhao et al. [39] considered the COLREG and the ship’s maneuverability, used DQN to directly map the ship’s status to the ship’s rudder angle command, and used the PG-based AC algorithm to train the multi-ship collision avoidance model. Jiang et al. [40] considered the COLREG and proposed an autonomous ship collision avoidance method based on deep RL (DRL) and the attention mechanism. The method includes collision risk assessment and motion planning and analyzes the rationality and effectiveness of collision avoidance decisions from the perspective of collision risk and the nearest safe distance. Heiberg et al. [41] proposed a decision-making model of ship intelligent collision avoidance based on the DRL and the PPO. They applied the most advanced collision risk index to the reward design of the model.

Model-free RL methods can be applied to many fields. Because there is no need to establish a model, all of the agent’s decisions are obtained through interaction with the environment, so this method applies to scenes that are difficult to model or cannot be modeled at all. However, without a model, the agent needs to interact with and explore the environment constantly, requiring a lot of trial and error, which leads to the most significant disadvantage of model-free RL: low data efficiency. Such an algorithm often needs to interact with the environment hundreds of thousands, millions, or even tens of millions of times. There are few intelligent collision avoidance models based on model-based RL. In order to solve the problem of low efficiency of model-free RL, Xie et al. [42] proposed a compound learning method based on the asynchronous advantage actor-critic (A3C) algorithm, LSTM, and Q-learning. This method combines the advantages of model-based and model-free RL. The authors applied this method to multi-ship intelligent collision avoidance at sea, and the simulation results show that the decision-making performance of this method is superior to the standard A3C model and the traditional optimization model.

Through the analysis and comparison, it can be concluded that the main problems of the decision-making models of ship collision avoidance are: (1) They only consider static obstacles and either do not consider the shape of obstacles or simplify their shapes to a circle. (2) To simplify the collision avoidance situation and the generation of collision avoidance decisions, the motion state of the own ship and the target ship is excessively simplified. For example, if the ship’s motion model is simplified or the motion factors are ignored, the collision avoidance measures taken by ships in a close encounter may fail easily. The ship’s dynamic model rarely considers the effects of wind and current. (3) The interference of environmental factors, such as visibility, sea conditions, narrow or open waters, and the traffic flow density at the time, is not fully considered. (4) In the collision avoidance problem of MASSs, it is assumed that MASSs, like manned ships, should comply with the COLREG [43,44]. However, MASSs differ significantly from manned ships regarding intelligence, main dimensions, maneuverability, and tasks performed [45]. (5) For the decision algorithms of ship collision avoidance based on RL, the biggest problem of model-free RL is low data efficiency, whereas the biggest challenge of model-based RL is model error. (6) For the decision-making algorithms of ship collision avoidance based on the ANN, there is a problem of convergence to local optima.

1.3. Research Content

In order to solve the above problems, this study refers to Gao et al.’s hypothesis on MASS maritime navigation rules [45] (due to the strong maneuverability of an MASS, its navigation should not interfere with the standard navigation of manned ships) and applies the previous studies on collision risk [46] and track prediction model [47]. A MASS collision avoidance decision-making model based on the Dyna-DQN is proposed, considering wind and current:

(1): The static navigation environment is established using the S-57 chart and grid method, and the navigation safety weight of each grid is calculated. On the premise of known MASS navigation tasks, the optimal energy-saving static path is generated based on the Voronoi diagram and improved A* algorithm [48].
(2): The handling performance of an MASS is fully considered, taking into account its small main dimensions and that its maneuverability is easily affected by wind and current. The MASS motion model, under the influence of wind and current, is established based on the MMG model.
(3): To make the navigation track of the target ship more realistic, the AIS track of a region near the South China Sea in January 2018 is extracted and cleaned. A specific track is selected as the navigation track for dynamic obstacles. As the AIS data are sent and received at unequal intervals, the AIS trajectory of the target ship is linearly interpolated here to match the trajectory points of the MASS. To better simulate the real navigation environment, ensure a full understanding of the target ship’s navigation trajectory, and better determine the potential risks and collision avoidance decisions, the online multiple outputs Least-Squares Support Vector Regression model based on selection mechanism (SM-OMLSSVR) [47] is used to predict the future trajectory of the target ship.
(4): The RL model’s state space, action space, and reward function are designed. In the state space, the state of the own ship, the current and predicted state of the dynamic obstacle, the collision risk between two ships, and the environmental state are considered. The reward function considers whether the chart boundary is exceeded, the extent of deviation from the planned route, whether a collision with static or dynamic obstacles occurs, the extent of drift, and whether the target point is reached.
(5): Based on the advantages of model-based and model-free RL, a decision-generation model of collision avoidance based on the Dyna-DQN algorithm is established. In the model-free part, the agent continues to interact with the environment based on the DQN algorithm to generate real interactive data and uses these data to optimize decision-making behavior and build a world model simultaneously. In the model-based part, the virtual world model is used to generate virtual interactive data, and these data are also used to optimize decision-making behavior.
(6): To verify the effectiveness of the proposed model, three collision avoidance tasks are designed: only considering static obstacles; considering both static obstacles and planned sea route; and considering static obstacles, dynamic obstacles, and planned sea route. The experimental results show that the proposed model is an effective and efficient decision-making model of collision avoidance.
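The resampling step in item (3) can be illustrated with a short sketch. This is not the authors' code: the function name `resample_track`, the 10 s step, and the sample fixes are assumptions made for illustration, and interpolating latitude and longitude independently is only reasonable over short intervals away from the antimeridian.

```python
import numpy as np

def resample_track(t, lat, lon, step):
    """Linearly interpolate an irregularly sampled track onto a uniform time grid."""
    t = np.asarray(t, dtype=float)
    order = np.argsort(t)                      # AIS messages may arrive out of order
    t = t[order]
    lat = np.asarray(lat, dtype=float)[order]
    lon = np.asarray(lon, dtype=float)[order]
    grid = np.arange(t[0], t[-1] + 1e-9, step)  # uniform timestamps, end inclusive
    return grid, np.interp(grid, t, lat), np.interp(grid, t, lon)

# Hypothetical AIS fixes (seconds, degrees); the values are illustrative only.
t = [0, 12, 30, 60]
lat = [20.00, 20.012, 20.03, 20.06]
lon = [110.00, 110.006, 110.015, 110.03]
grid, lat_u, lon_u = resample_track(t, lat, lon, step=10)
```

The uniform grid can then be aligned with the time step of the own-ship simulation, so that both trajectories are compared at the same instants.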

The remainder of this study is organized as follows. In Section 2, theoretical knowledge and concepts related to the model are described. In Section 3, the structure of the collision avoidance decision model between MASSs and manned ships based on Dyna-DQN is described. The verification and case study are carried out in Section 4. In Section 5, conclusions and future research are presented.

2. Methods and Tools

2.1. DQN Algorithm

Q-learning is a temporal-difference (TD) learning method used to estimate the state-action value function Q, and it is an off-policy method. When the environment is known and the states and actions are limited, an exhaustive search can be performed on all possible state-action pairs (s, a) to find the optimal Q value (Q^{*}). If an environment has many states with multiple actions in each state, then traversing the entire environment will waste a lot of time. Therefore, it is better to find some parameter θ to approximate the Q function, namely:

Q (s, a; θ) \approx Q^{*} (s, a), (1)

The basic idea of the DQN algorithm comes from Q-learning. The difference is that the state-action value function Q is not calculated directly from the state s and action a but is computed by a Q network. The input of the network is the state s; the output is the state-action value Q of every action in this state. This Q network is the approximate Q function mentioned above. It is a neural network, which can be a DNN, CNN, or RNN.
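As a minimal sketch of this input-output relationship, the snippet below builds a small fully connected Q network in NumPy. The layer sizes, the ReLU activation, and the names `init_q_network` and `q_values` are assumptions chosen for illustration; the paper does not specify the architecture at this point.

```python
import numpy as np

def init_q_network(state_dim, n_actions, hidden, rng):
    """Random weights theta for a one-hidden-layer Q network (illustrative sizes)."""
    return {
        "W1": rng.normal(0.0, 0.1, (state_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, n_actions)),
        "b2": np.zeros(n_actions),
    }

def q_values(params, s):
    """Map a state (or a batch of states) to one Q value per action."""
    h = np.maximum(0.0, s @ params["W1"] + params["b1"])  # ReLU hidden layer
    return h @ params["W2"] + params["b2"]

rng = np.random.default_rng(0)
theta = init_q_network(state_dim=4, n_actions=3, hidden=16, rng=rng)
q = q_values(theta, np.array([0.5, -1.0, 0.2, 0.0]))   # shape (3,): one value per action
batch_q = q_values(theta, np.zeros((5, 4)))            # shape (5, 3) for a batch of 5 states
```

A single forward pass therefore yields the values of all actions at once, which is what makes the greedy action selection cheap.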

When a nonlinear function approximator (such as a neural network) is used to represent the state-action value function Q, training is often unstable or even divergent. The cause is the strong correlation between consecutive samples, which violates the independent and identically distributed (i.i.d.) assumption underlying neural network training. DQN takes the following two measures to solve this problem: (1) Experience Replay (ER) is used to save the transitions (states, actions, rewards, and state updates) obtained from each interaction between the agent and the environment. In subsequent Q value updates, data are sampled at random from this memory, which eliminates the correlation in the observation sequence and smooths changes in the data distribution. (2) A separate target network is added to calculate the target state-action value function Q', and the target network is updated only periodically, which reduces the correlation between the Q values and their targets.

The algorithm minimizes the loss function by adjusting θ and thereby improves the state-action value function Q (s, a; θ). The loss function is as follows:

L (θ) = E_{(s, a, r, s')} [(y_{i} - Q (s, a; θ))^{2}], (2)

y_{i} = r + γ \max_{a'} Q' (s', a'; θ'), (3)

where γ \in [0, 1] is the discount factor, and Q' (\cdot) is the target state-action value function. Differentiating the loss function with respect to θ, with the target y_{i} held fixed, gives the following gradient up to a constant factor:

\nabla_{θ} L (θ) = E_{(s, a, r, s')} [(r + γ \max_{a'} Q' (s', a'; θ') - Q (s, a; θ)) \nabla_{θ} Q (s, a; θ)], (4)
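A small numerical sketch may help make Equations (2) and (3) concrete. The function names `td_targets` and `dqn_loss`, the `done` mask for terminal transitions (taken from the terminal case in Algorithm 1), and all numbers are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def td_targets(r, q_next_target, done, gamma):
    """Equation (3): y = r + gamma * max_a' Q'(s', a'); y = r when s' is terminal."""
    return r + gamma * (1.0 - done) * q_next_target.max(axis=1)

def dqn_loss(q_sa, y):
    """Sample estimate of Equation (2): mean squared TD error over a minibatch."""
    return np.mean((y - q_sa) ** 2)

# Two hypothetical transitions; the second one ends the episode.
r = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])
q_next_target = np.array([[0.5, 2.0],     # Q'(s', .) from the target network
                          [3.0, 1.0]])
q_sa = np.array([2.0, 0.5])               # Q(s, a) from the online network

y = td_targets(r, q_next_target, done, gamma=0.9)   # [1 + 0.9 * 2.0, 0.0] = [2.8, 0.0]
loss = dqn_loss(q_sa, y)                            # (0.8**2 + 0.5**2) / 2 = 0.445
```

Because the targets are computed from the target network, they are treated as constants when the loss is differentiated with respect to θ.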

The pseudocode for DQN is shown in Algorithm 1, where ε is the parameter of the ε-greedy strategy. The ε-greedy strategy is a common action-selection strategy that balances exploration and exploitation: when the agent makes a decision, it selects a random action with a small probability ε and, with the remaining probability 1 - ε, selects the action with the greatest estimated value among the available actions.
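The ε-greedy rule can be written in a few lines. The sketch below assumes a discrete action set indexed from 0 and a vector of Q values for the current state; the helper name `epsilon_greedy` is illustrative. Note that the random branch samples uniformly over all actions, so the greedy action can also be drawn there.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniform over all actions
    return int(np.argmax(q_values))               # exploit: highest estimated value

rng = np.random.default_rng(0)
q = np.array([0.1, 0.9, 0.3])

greedy = epsilon_greedy(q, 0.0, rng)              # epsilon = 0 always exploits
picks = [epsilon_greedy(q, 0.2, rng) for _ in range(10000)]
frac_greedy = float(np.mean(np.array(picks) == 1))
```

With three actions and ε = 0.2, the greedy action is expected to be chosen with probability 0.8 + 0.2/3, roughly 0.87.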

Algorithm 1: DQN

1: Initialize replay memory D to capacity N
2: Initialize action-value function Q with random weights θ
3: Initialize target action-value function Q' with weights θ' = θ
4: For episode = 1, M do
5:   Initialize the sequence of state s
6:   For t = 1, T do
7:     With probability ε select a random action a_{t}
8:     otherwise select a_{t} = \arg \max_{a} Q (s_{t}, a; θ)
9:     Execute action a_{t} in emulator and observe reward r_{t} and s_{t + 1}
10:    Store transition (s_{t}, a_{t}, r_{t}, s_{t + 1}) in D
11:    Sample random minibatch of transitions (s_{j}, a_{j}, r_{j}, s_{j + 1}) from D
12:    Set y_{j} = \begin{cases} r_{j} & \text{for terminal } s_{j + 1} \\ r_{j} + γ \max_{a'} Q' (s_{j + 1}, a'; θ') & \text{for non-terminal } s_{j + 1} \end{cases}
13:    Perform a gradient descent step on Equation (2) with respect to the network parameter θ
14:    Every C steps reset Q' = Q
15:  End For
16: End For
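Two bookkeeping pieces of Algorithm 1, the replay memory D and the periodic target reset, can be sketched as follows. The class `ReplayBuffer`, the helper `sync_target`, and the dictionary representation of network weights are assumptions for illustration; a real implementation would copy the weights of whatever network library is in use.

```python
import copy
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity memory D; the oldest transitions are dropped first."""
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.data.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks temporal correlation.
        return rng.sample(list(self.data), batch_size)

    def __len__(self):
        return len(self.data)

def sync_target(step, period, online_params, target_params):
    """Every `period` steps, overwrite the target weights with a copy of the online ones."""
    if step % period == 0:
        target_params.clear()
        target_params.update(copy.deepcopy(online_params))
    return target_params

buf = ReplayBuffer(capacity=3)
for t in range(5):                       # push 5 transitions into a buffer of size 3
    buf.push(t, 0, 1.0, t + 1, False)
batch = buf.sample(2, random.Random(0))

online = {"w": [1.0, 2.0]}
target = {"w": [0.0, 0.0]}
sync_target(3, 4, online, target)        # step 3: no reset yet
before = copy.deepcopy(target)
sync_target(4, 4, online, target)        # step 4: target now equals online
```

Copying rather than aliasing the weights matters: the target network must stay frozen while the online network keeps changing between resets.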

2.2. Dyna Framework

RL can be divided into model-based and model-free methods. Here, the model represents the environmental dynamics as constructed by the agent, which is called the virtual world model in this paper. The model-free RL method makes the agent interact with the environment directly to obtain data and uses the interaction data directly to optimize the agent’s actions. The model-based RL method first learns the virtual world model from the data and then optimizes the strategy based on this model. Unlike model-free RL, which can only obtain values of unknown states by interacting with the environment, model-based RL can use the virtual world model learned from the data to predict values of unknown states. This method requires fewer trials and often has a strong generalization ability, which can significantly improve data utilization efficiency. However, model-based RL algorithms cannot be used if the system cannot be modeled, as in games or natural language processing. A system that can be modeled, such as a ship, is suited to this approach: the ship’s motion conforms to the fundamental laws of physics, and the system can be modeled using the established kinematics and dynamics of rigid bodies and fluids. This significantly improves the data utilization rate, reduces the model’s training time, and promotes real-time decision making of the ship. The biggest challenge encountered by this learning method is model error; that is, a model learned from data is inexact. Especially at the beginning of training, when little data is available, the learned model is bound to be inaccurate, and using an inaccurate model to make predictions will produce significant errors.

Unlike model-based RL methods, model-free RL algorithms have an excellent property: asymptotic convergence. The model-free RL algorithm can guarantee the optimal solution for the agent after countless interactions with the environment. However, because the model-free RL method does not fit a model, the agent can only perceive and recognize the environment through continuous interaction with it, and the number of interactions required is enormous, from tens of thousands to millions. So many interactions make model-free RL algorithms inefficient and difficult to apply to the physical world.

To address the shortcomings of both model-based and model-free RL while making full use of their advantages, the two are combined in the Dyna framework: real experience is used both to update the value or strategy function directly (the model-free part) and to learn the virtual world model, whereas virtual experience sampled from the virtual world model is used for planning, that is, for further updates of the value or strategy function, as shown in Figure 3. The flow of the Dyna-Q algorithm, which is based on the state-action value, is shown in Algorithm 2.

Algorithm 2: Dyna-Q

1: Initialise Q (s, a) and M (s, a) for all s \in S and a \in A (s)
2: Do forever:
3:   s \leftarrow current (nonterminal) state
4:   a \leftarrow ε-greedy (s, Q)
5:   Execute action a; observe resultant reward r, state s'
6:   Q (s, a) \leftarrow Q (s, a) + α [r + γ \max_{a'} Q (s', a') - Q (s, a)]
7:   M (s, a) \leftarrow r, s' (assuming deterministic environment)
8:   Repeat n times:
9:     s \leftarrow random previously observed state
10:    a \leftarrow random action previously taken in s
11:    r, s' \leftarrow M (s, a)
12:    Q (s, a) \leftarrow Q (s, a) + α [r + γ \max_{a'} Q (s', a') - Q (s, a)]
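The steps of Algorithm 2 can be sketched as a runnable tabular example. Everything specific here is an assumption for illustration: the five-state chain environment, the hyperparameters, and the episodic loop that stands in for "Do forever". Planning samples uniformly over observed state-action pairs, a small simplification of steps 9 and 10.

```python
import random
from collections import defaultdict

GOAL = 4   # chain of states 0..4; reaching state 4 ends the episode

def step(s, a):
    """Deterministic toy environment: action 1 moves right, action 0 moves left."""
    s2 = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return (1.0 if s2 == GOAL else 0.0), s2

def dyna_q(episodes=50, n_planning=10, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)     # Q[(s, a)], initialised to 0
    M = {}                     # world model: M[(s, a)] = (r, s')

    def greedy(s):
        best = max(Q[(s, 0)], Q[(s, 1)])
        return rng.choice([a for a in (0, 1) if Q[(s, a)] == best])  # random tie-break

    def update(s, a, r, s2):
        # Terminal states have no future value.
        target = r if s2 == GOAL else r + gamma * max(Q[(s2, 0)], Q[(s2, 1)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s = 0
        while s != GOAL:
            a = rng.randrange(2) if rng.random() < eps else greedy(s)
            r, s2 = step(s, a)
            update(s, a, r, s2)            # direct RL from real experience (step 6)
            M[(s, a)] = (r, s2)            # model learning (step 7)
            for _ in range(n_planning):    # planning with simulated experience (steps 8-12)
                ps, pa = rng.choice(list(M))
                pr, ps2 = M[(ps, pa)]
                update(ps, pa, pr, ps2)
            s = s2
    return Q

Q = dyna_q()
policy = [max((0, 1), key=lambda a: Q[(s, a)]) for s in range(GOAL)]
```

Each real step is followed by `n_planning` simulated updates, so the value function improves much faster per real interaction than plain Q-learning would on the same data.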

2.3. Dyna-DQN

The Q-learning algorithm only applies when the state space and the action space are discrete and both spaces are small. Because the state space of ships is relatively large and continuous, to better apply the RL algorithm to the collision avoidance decision making of ships, the Q-learning algorithm in the model-free part of the Dyna-Q algorithm is replaced with the DQN algorithm, which forms the Dyna-DQN model. The pseudocode of this model is shown in Algorithm 3. The algorithm is divided into three parts: (1) direct model-free RL, in which the agent interacts with the environment based on the DQN algorithm to collect real empirical data and improve the strategy; (2) learning of the virtual world model, in which real experience is used to learn and improve this model through two classification tasks, s' and done, and a regression task, r; and (3) planning with the virtual world model, in which the agent uses the empirical data obtained from this model to improve the strategy.

Algorithm 3: Dyna-DQN

Require: $N, ε, K, L, C, Z$

Ensure: $Q(s, a; θ_{Q})$, $M(s, a; θ_{M})$

1: Initialize $Q(s, a; θ_{Q})$ and $M(s, a; θ_{M})$

2: Initialize $Q'(s, a; θ_{Q'})$ with $θ_{Q'} = θ_{Q}$

3: Initialize real experience replay buffer $D^{u}$ using Replay Buffer Spiking (RBS), and simulated experience replay buffer $D^{s}$ as empty

4: for n = 1:N do

5: # Direct RL starts

6: generate an initial state $s$

7: while $s$ is not a terminal state do

8: with probability $ε$ select a random action $a$

9: otherwise select $a = \arg \max_{a'} Q(s, a'; θ_{Q})$

10: execute $a$, and observe reward $r$ and the state of the next moment $s'$

11: store $(s, a, r, s')$ to $D^{u}$

12: $s = s'$

13: end while

14: sample random minibatches of training samples $(s, a, r, s')$ from $D^{u}$

15: update $θ_{Q}$ via Z-step minibatch Q-learning according to Equation (2)

16: # Direct RL ends

17: # World Model Learning starts

18: sample random minibatches of training samples $(s, a, r, s')$ from $D^{u}$

19: update $θ_{M}$ via Z-step minibatch SGD of multi-task learning

20: # World Model Learning ends

21: # Planning starts

22: for k = 1:K do

23: done = FALSE, $l = 0$

24: generate an initial state $s$

25: while done is FALSE and $l \leq L$ do

26: with probability $ε$ select a random action $a$

27: otherwise select $a = \arg \max_{a'} Q(s, a'; θ_{Q})$

28: execute $a$

29: world model responds with $r$ and the state of the next moment $s'$

30: store $(s, a, r, s')$ to $D^{s}$

31: $l = l + 1$, $s = s'$

32: end while

33: sample random minibatches of training samples $(s, a, r, s')$ from $D^{s}$

34: update $θ_{Q}$ via Z-step minibatch Q-learning according to Equation (4)

35: end for

36: # Planning ends

37: every C steps reset $θ_{Q'} = θ_{Q}$

38: end for
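The planning phase (steps 22 to 32 of Algorithm 3) can be sketched as a rollout loop against a learned model. This is only an illustration: `world_model`, `q_values`, and `initial_state` are hypothetical callables standing in for $M(s, a; θ_{M})$, $Q(s, a; θ_{Q})$, and the initial-state generator, and the minibatch update of steps 33 and 34 is deliberately left out.

```python
import random


def planning_rollouts(world_model, q_values, actions, initial_state,
                      K=5, L=20, epsilon=0.1, rng=None):
    """Fill a simulated replay buffer D_s by rolling out a learned world model.

    world_model(s, a) -> (r, s_next, done) and q_values(s, a) -> float are
    hypothetical stand-ins for the trained networks.
    """
    rng = rng or random.Random(0)
    simulated_buffer = []                                # D_s
    for _ in range(K):                                   # step 22
        s, done, l = initial_state(), False, 0           # steps 23-24
        while not done and l <= L:                       # step 25
            if rng.random() < epsilon:                   # step 26: explore
                a = rng.choice(actions)
            else:                                        # step 27: exploit
                a = max(actions, key=lambda b: q_values(s, b))
            r, s_next, done = world_model(s, a)          # steps 28-29
            simulated_buffer.append((s, a, r, s_next))   # step 30
            s, l = s_next, l + 1                         # step 31
    return simulated_buffer
```

The cap $l \leq L$ bounds each simulated episode, so an inaccurate model that never predicts termination cannot generate unbounded rollouts.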

2.4. MMG Model Considering Wind and Flow Factors

The Manoeuvring Model Group (MMG) model, also known as the separated model, is an internationally popular mathematical model of ship motion [49].

The MMG model can simulate the ship’s motion with six degrees of freedom at most, but the research of pitching, rolling, and heaving motion is of little significance in the study of ship collision risk in open water, good weather, and sea conditions. Therefore, only the ship’s swaying, surging, and yawing motions are studied here, and the influence of wind and current is taken into account. According to the basic idea of the MMG model, the external force and external torque acting on the hull are decomposed into the fluid force of the bare hull, propeller force, rudder force, etc. When the origin of the attached coordinate system is the ship’s center of gravity, the details are as follows:

\begin{cases} (m + m_{x}) \dot{u} - (m + m_{y}) v r = X_{H} + X_{P} + X_{R} \\ (m + m_{y}) \dot{v} + (m + m_{x}) u r = Y_{H} + Y_{P} + Y_{R} \\ (I_{ZZ} + J_{ZZ}) \dot{r} = N_{H} + N_{P} + N_{R} - Y_{H} x_{c} \end{cases},

(5)

where $X_{i}$, $Y_{i}$, and $N_{i}$ $(i = H, P, R)$ are the longitudinal forces, transverse forces, and yaw moments of the bare hull, propeller, and rudder, respectively, acting on the ship; $m$ is the mass of the ship, and $m_{x}$ and $m_{y}$ are the added masses in the $x$ and $y$ directions; $I_{ZZ}$ and $J_{ZZ}$ are the moment of inertia of yawing and the additional moment of inertia, respectively; $u$, $v$, and $r$ are the speed along the $x$-axis, the speed along the $y$-axis, and the yaw rate, respectively; $\dot{u}$, $\dot{v}$, and $\dot{r}$ are the acceleration along the $x$-axis, the acceleration along the $y$-axis, and the yaw angular acceleration, respectively; and $x_{c}$ is the transverse coordinate of the center of the ship.
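To show how Equation (5) is used in a simulation step, the sketch below solves it for the three accelerations. The function name, the nested-dictionary layout of the force components, and the argument names are assumptions for illustration; how the hull, propeller, and rudder forces themselves are computed is outside this example.

```python
def mmg_accelerations(u, v, r, forces, m, m_x, m_y, I_zz, J_zz, x_c):
    """Solve Equation (5) for (u_dot, v_dot, r_dot).

    forces is a dict with keys 'X', 'Y', 'N'; each maps to a dict holding the
    hull ('H'), propeller ('P'), and rudder ('R') contributions (hypothetical layout).
    """
    X = sum(forces["X"].values())
    Y = sum(forces["Y"].values())
    N = sum(forces["N"].values())
    # Surge: (m + m_x) u_dot - (m + m_y) v r = X
    u_dot = (X + (m + m_y) * v * r) / (m + m_x)
    # Sway: (m + m_y) v_dot + (m + m_x) u r = Y
    v_dot = (Y - (m + m_x) * u * r) / (m + m_y)
    # Yaw: (I_zz + J_zz) r_dot = N - Y_H x_c
    r_dot = (N - forces["Y"]["H"] * x_c) / (I_zz + J_zz)
    return u_dot, v_dot, r_dot
```

A time-stepping scheme such as explicit Euler could then advance $u$, $v$, and $r$ with these accelerations; the choice of integrator is not specified in the text.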

Considering the influence of fluid on ship motion under uniform current, the velocity of the current $V_{e}$ is decomposed along the $G_{x}$-axis and $G_{y}$-axis of the ship’s moving coordinate system:

\begin{cases} u_{c} = V_{e} \cos (C_{f} - C) \\ v_{c} = V_{e} \sin (C_{f} - C) \end{cases},

(6)

where $u_{c}$ is the component of the current’s velocity on the $G_{x}$-axis; $v_{c}$ is the component of the current’s velocity on the $G_{y}$-axis; $C_{f}$ is the direction of the current (measured clockwise from true north at 0°, consistent with the direction of flow in navigation); and $C$ is the course of the ship.

Then, the components $u_{1}$ and $v_{1}$ of the speed over ground (after considering the current) on the $G_{x}$-axis and the $G_{y}$-axis are expressed as:

\begin{cases} u_{1} = u + u_{c} \\ v_{1} = v + v_{c} \end{cases} .

(7)
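Equations (6) and (7) amount to a rotation of the current vector into the ship frame followed by an addition. A minimal sketch, assuming angles are given in degrees and using a hypothetical function name:

```python
import math


def speed_over_ground(u, v, V_e, C_f_deg, C_deg):
    """Return (u_1, v_1): ship-frame speed over ground under a uniform current.

    u, v    : speed through water along the G_x and G_y axes
    V_e     : current speed
    C_f_deg : current direction, clockwise from true north, in degrees
    C_deg   : ship course, in degrees
    """
    delta = math.radians(C_f_deg - C_deg)
    u_c = V_e * math.cos(delta)   # Equation (6), G_x component
    v_c = V_e * math.sin(delta)   # Equation (6), G_y component
    return u + u_c, v + v_c       # Equation (7)
```

For a following current ($C_{f} = C$) the whole current speed adds to the surge component, while a beam current ($C_{f} - C = 90°$) adds only to the sway component.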

The ship’s yaw angular velocity is the same whether it is measured relative to the water or relative to the ground. The relation between the accelerations is obtained by differentiating Equation (7) with respect to time:

\begin{cases} \dot{u} = {\dot{u}}_{1} - r v_{c} \\ \dot{v} = {\dot{v}}_{1} + r u_{c} \end{cases} .

(8)
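The intermediate step can be written out as follows, assuming the current speed $V_{e}$ and direction $C_{f}$ are constant in time (uniform, steady current) and that the rate of change of the course equals the yaw rate, $\dot{C} = r$:

```latex
\begin{aligned}
\dot{u}_{c} &= \frac{d}{dt}\left[ V_{e} \cos (C_{f} - C) \right]
             = V_{e} \sin (C_{f} - C)\, \dot{C} = r\, v_{c}, \\
\dot{v}_{c} &= \frac{d}{dt}\left[ V_{e} \sin (C_{f} - C) \right]
             = - V_{e} \cos (C_{f} - C)\, \dot{C} = - r\, u_{c}, \\
\dot{u}_{1} &= \dot{u} + \dot{u}_{c}
             \;\Rightarrow\; \dot{u} = \dot{u}_{1} - r\, v_{c}, \\
\dot{v}_{1} &= \dot{v} + \dot{v}_{c}
             \;\Rightarrow\; \dot{v} = \dot{v}_{1} + r\, u_{c}.
\end{aligned}
```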

The force of the wind on a ship is mainly related to the superstructure and its layout, the wind direction, and the wind speed. The relationship is as follows (the wind pressure remains unchanged):

\begin{cases} X_{W} = \frac{1}{2} ρ_{a} A_{r} V_{W}^{2} C_{WX} (θ_{r}) \\ Y_{W} = \frac{1}{2} ρ_{a} A_{L} V_{W}^{2} C_{WY} (θ_{r}) \\ N_{W} = \frac{1}{2} ρ_{a} A_{L} L_{oa} V_{W}^{2} C_{WN} (θ_{r}) \end{cases},

(9)

where $X_{W}$, $Y_{W}$, and $N_{W}$ are the longitudinal wind force, the transverse wind force, and the yaw moment of the wind, respectively; $ρ_{a}$ is the air density; $A_{r}$ and $A_{L}$ are the forward projected area and the side projected area of the hull above the waterline, respectively; $V_{W}$ and $θ_{r}$ are the relative wind speed and the relative wind angle, respectively; and $C_{WX}$, $C_{WY}$, and $C_{WN}$ are the longitudinal wind pressure coefficient, the transverse wind pressure coefficient, and the yaw moment coefficient, respectively. Their values can be obtained from wind tunnel tests or, if no experimental data are available, from Isherwood’s regression equation.
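Equation (9) translates directly into code once the coefficient functions are available. In the sketch below, `coeffs` is a hypothetical callable returning $(C_{WX}, C_{WY}, C_{WN})$ for a given relative wind angle (for example from wind tunnel data or a regression), and the default air density of 1.225 kg/m³ is the standard sea-level value, used here as an assumption.

```python
def wind_forces(V_w, theta_r, A_r, A_L, L_oa, coeffs, rho_a=1.225):
    """Evaluate Equation (9) and return (X_W, Y_W, N_W).

    coeffs(theta_r) -> (C_WX, C_WY, C_WN) is a hypothetical callable;
    rho_a defaults to standard sea-level air density in kg/m^3.
    """
    C_wx, C_wy, C_wn = coeffs(theta_r)
    q = 0.5 * rho_a * V_w ** 2          # dynamic pressure of the relative wind
    X_w = q * A_r * C_wx                # longitudinal wind force
    Y_w = q * A_L * C_wy                # transverse wind force
    N_w = q * A_L * L_oa * C_wn         # yaw moment of the wind
    return X_w, Y_w, N_w
```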

2.5. Transformation of Coordinates

The ship position data from AIS equipment are latitude and longitude coordinates in the WGS-84 geodetic coordinate system. To calculate basic elements such as bearings and distances between two ships accurately, the latitude and longitude coordinates need to be converted into plane coordinates under the Mercator projection. The Mercator projection, also known as the “normal conformal cylindrical projection”, preserves angles, so the shape of a projected object is not distorted locally and bearings and relative positions remain correct; it is therefore often used in navigation and aviation. Assuming that the longitude and latitude coordinates of the ship position are $(λ, φ)$ and that the corresponding Mercator coordinates are $(x, y)$, the conversion formula is as follows:

\begin{cases} r_{0} = a \cos φ_{0} / \sqrt{1 - e^{2} \sin^{2} φ_{0}} \\ q = \ln \tan (\frac{π}{4} + \frac{φ}{2}) - \frac{e}{2} \ln \frac{1 + e \sin φ}{1 - e \sin φ} \\ x = r_{0} \times λ \\ y = r_{0} \times q \end{cases},

(10)

where $r_{0}$ is the radius of the latitude circle at the reference latitude $φ_{0}$, $a$ is the semi-major axis of the Earth ellipsoid, $q$ is the isometric latitude, and $e$ is the first eccentricity of the Earth ellipsoid.
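A minimal sketch of this conversion is given below. It uses the standard isometric-latitude form $q = \ln \tan (π/4 + φ/2) - \frac{e}{2} \ln \frac{1 + e \sin φ}{1 - e \sin φ}$ and evaluates $r_{0}$ at a reference latitude; the function name and the default reference latitude of 0° are assumptions for illustration, while the ellipsoid constants are the published WGS-84 values.

```python
import math

WGS84_A = 6378137.0            # semi-major axis of the WGS-84 ellipsoid, metres
WGS84_E2 = 0.00669437999014    # first eccentricity squared of WGS-84


def to_mercator(lon_deg, lat_deg, ref_lat_deg=0.0):
    """Convert WGS-84 longitude/latitude (degrees) to Mercator x, y (metres).

    ref_lat_deg is the reference (standard) latitude; 0.0 is an assumed default.
    """
    e = math.sqrt(WGS84_E2)
    lam = math.radians(lon_deg)
    phi = math.radians(lat_deg)
    phi0 = math.radians(ref_lat_deg)
    # Radius of the latitude circle at the reference latitude.
    r0 = WGS84_A * math.cos(phi0) / math.sqrt(1.0 - WGS84_E2 * math.sin(phi0) ** 2)
    # Isometric latitude.
    q = (math.log(math.tan(math.pi / 4.0 + phi / 2.0))
         - e / 2.0 * math.log((1.0 + e * math.sin(phi)) / (1.0 - e * math.sin(phi))))
    return r0 * lam, r0 * q
```

Distances and bearings between the own ship and a target ship can then be computed from the differences of the resulting plane coordinates.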

3. Intelligent Decision-Making Model of Ship Collision Avoidance Based on the Dyna-DQN

Based on the S-57 chart data, AIS data, the Dyna-DQN model, and the MMG model, the RL-based collision avoidance decision model is designed as shown in Figure 4. It is built in six steps:

(1): When the navigation task is known, the shoreline and water depth data are extracted from the S-57 electronic chart. The Voronoi diagram and improved A* algorithm are used to obtain the optimal energy-saving static path [48].
(2): To ensure that the MASS does not run into static obstacles during dynamic collision avoidance, the static environment extracted from S-57 is divided into two-dimensional plane grids. Grids containing obstacles are set as non-navigable areas, and the others are navigable. Each navigable grid is given a navigation safety weight according to the number of adjacent non-navigable grids (Equation (11)).
(3): Because the main dimensions of an MASS are small, its maneuverability is easily affected by wind and current. To model MASS motion accurately, an MASS motion model that accounts for wind and current is established based on the MMG model, using wind and current data from the National Centers for Environmental Prediction (NCEP) Climate Forecast System Version 2 (CFSv2).
(4): The AIS data of a particular area are used to extract the navigation trajectory of the target ship (manned ship). To better simulate real-time collision avoidance at sea and make forward-looking collision avoidance decisions, the ship track prediction model based on the SM-OMLSSVR algorithm [47] is used to predict the target ship’s track for some time into the future.
(5): The RL model’s state space, action space, and reward function are designed. The state space covers three aspects: the own ship, the target ship, and the navigation environment. The state of the target ship includes its current position and its predicted future positions. The action space is the difference between the own ship’s course that can be safely navigated in the encounter situation and its course at the previous moment. The reward function has six components: exceeding chart boundaries, tracking paths, reaching the goal, collision with static obstacles, collision with dynamic obstacles, and drift.
(6): The MASS collision avoidance decision model based on the Dyna-DQN model is established. The DQN algorithm lets the agent (ship) interact with the environment continuously. The interaction data generated are used simultaneously for the iterative update of the collision avoidance strategy and for the training of the environment model. The virtual world model then generates a series of simulated experience data to promote the iterative update of the strategy.

Finally, the MASS intelligent collision avoidance decision-making model based on the Dyna-DQN and the MASS collision resolution module are built on the rendered S-57 chart environment, the state space, the action space, and the reward function, integrating the ship’s handling characteristics and taking into account static obstacles, dynamic obstacles, and the COLREGs.

3.1. Assumptions and Conceptual Definitions

3.1.1. Assumptions

According to the maritime navigation regulations for MASSs formulated by Gao et al. [35], due to the strong maneuverability of MASSs, their avoidance priority should be lower than that of manned ships. That is, when an MASS navigating at sea meets a manned ship, the passage of the MASS shall not interfere with the normal passage of the manned ship. Therefore, we make the following assumptions:

(1): Whatever the encounter situation between the MASS and the target ship (manned ship), when there is a collision risk between them, it is assumed that the MASS takes the collision avoidance action while the target ship (manned ship) keeps its course and speed. The target ship (manned ship) here includes power-driven vessels, vessels not under command, vessels restricted in their ability to maneuver, vessels engaged in fishing, and sailing vessels.
(2): Due to the strong maneuverability of the MASS and its good communication performance, the MASS does not have to comply with Rule 16 of the COLREGs, that is, “Every vessel which is directed to keep out of the way of another vessel shall, so far as possible, take early and substantial action to keep well clear.”

3.1.2. Conceptual Definitions

(1): Complex waters: usually refers to waters with poor natural conditions, complex ship traffic flow, and great difficulty in navigation, including offshore construction waters, multi-channel intersection waters, channel bend sections, narrow channel sections, shoal navigation areas, reef waters, etc.
(2): Virtual world model: a model trained on real interaction data using a particular method. Given a state sequence and an action as input, the model outputs the corresponding next state sequence, the reward value, and a flag indicating whether a terminal state has been reached.

3.2. Construction of Environment

3.2.1. Analysis and Extraction of S-57 Electronic Chart Data

When an MASS navigates at sea, radar, camera, and other sensors cannot provide global navigation environment information. In addition, an MASS cannot directly use the electronic chart to avoid collision independently, so it is necessary to study the S-57 data structure of the electronic vector chart. Through analyzing the electronic chart file, the marine geographic information and the marine environment information required by the MASS for collision avoidance decision making are extracted, and irrelevant information is deleted. The chart information is rendered into an information pattern that the MASS can recognize.

The electronic chart comprises sea area elements such as submarine terrain, navigation obstacles, signs, and port facilities. The S-57 electronic chart is stored in the ISO/IEC 8211 international standard format. Using an open-source ISO 8211 library, this paper parses the electronic chart information according to the file structure of the S-57 chart. The format of the S-57 electronic chart standard library is shown in Figure 5. The region with longitude and latitude ranges of 112.50° E–113° E and 21.50° N–21.98° N is selected. The rendered S-57 chart of this area and the extracted chart information are shown in Figure 6.

3.2.2. Generation of Optimal Static Path

Following the previous research on global energy-saving path planning for MASSs [48], which considers water depth, tide, wind, and current, the Voronoi diagram and the improved A* algorithm are used to generate the optimal energy-saving static path for the MASS. Assuming that the starting point and end point coordinates of the MASS are (112°36′ E, 21°44′ N) and (112°55′ E, 21°40′ N), respectively, the optimal static path generated is shown in Figure 7.

3.2.3. Grid the Static Environment

When there is no dynamic obstacle in the environment, the navigation environment of the MASS can be regarded as free movement within a finite area of the two-dimensional sea surface, where the finite area is the navigable area without static obstacles. The number of static obstacles at sea is limited, but their shapes and distribution are irregular. To ensure that the MASS stays within the navigable area during the subsequent dynamic obstacle avoidance, this paper divides the original environment extracted from the S-57 chart into grids of equal size. Each grid is then checked in turn for static obstacles (such as land, islands, and shoals) parsed from the electronic chart, which divides the grid map of the environment into a navigable area and a non-navigable area.

In addition, the MASS may deviate from the planned route when avoiding collisions or performing tasks in navigable grids close to non-navigable grids. Even though these grids contain no obstacles, navigating in them still carries a potential risk. Therefore, following [50], this paper sets navigation safety weights to guide the MASS away from potential risk areas as far as possible. The navigation safety weight of a non-navigable grid is set to 0, and the navigation safety weight $w (C_{i}, R_{i})$ of a navigable grid $(C_{i}, R_{i})$ depends on the number of non-navigable grids among its eight adjacent grids. The specific relationship is as follows:

w (C_{i}, R_{i}) = 1 - \frac{1}{8} n,

(11)

where $n$ is the number of non-navigable grids among the eight grids adjacent to $(C_{i}, R_{i})$.
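The weighting rule of Equation (11) can be sketched on a boolean occupancy grid as follows. The function name and the list-of-lists representation are illustrative, and treating neighbours that fall outside the chart as non-navigable is an assumption of this example.

```python
def safety_weights(navigable):
    """Compute navigation safety weights from a grid of booleans (True = navigable).

    Non-navigable grids get weight 0; a navigable grid gets 1 - n/8, where n is
    the number of non-navigable grids among its eight neighbours. Neighbours
    outside the chart are counted as non-navigable (an assumption).
    """
    rows, cols = len(navigable), len(navigable[0])
    weights = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if not navigable[i][j]:
                continue
            n = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di == 0 and dj == 0:
                        continue
                    ni, nj = i + di, j + dj
                    inside = 0 <= ni < rows and 0 <= nj < cols
                    if not inside or not navigable[ni][nj]:
                        n += 1
            weights[i][j] = 1 - n / 8
    return weights
```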

Figure 8 shows the environment map after grid division, in which the darkest grids are non-navigable and the rest are navigable. The smaller the navigation safety weight of a navigable grid, the darker its color. To show the grid processing more clearly, the black box area in the figure is enlarged to produce the image on the right, where the darkest shaded area is the non-navigable area and the rest are navigable areas. The depth of the shading in the navigable area represents the navigation safety weight: the darker the color, the lower the navigation safety weight of the grid.

3.2.4. Adding Wind and Current Data

The study area selected in this paper is the sea area near the South China Sea, with longitude and latitude ranges of 112.5° E–113° E and 21.5° N–21.98° N, respectively. The wind data are from the National Centers for Environmental Prediction (NCEP) Climate Forecast System Version 2 (CFSv2), and the wind data from January 2018 in the study area are selected. The current data come from the global Hybrid Coordinate Ocean Model (HYCOM) and the Navy Coupled Ocean Data Assimilation (NCODA) analysis, with a temporal resolution of 3 h. The current data from January 2018 in the study area are selected. Because the selected area is small, few wind and current data points fall within it. To solve this problem, the Kriging interpolation method [51] is used to interpolate the wind and current data at the required spatial points. Kriging is a regression algorithm for spatial modeling and prediction (interpolation) of a random process or random field based on a covariance function. It is a typical geostatistical algorithm widely used in geographic science, environmental science, atmospheric science, and other fields. The interpolated wind and current data in this area are shown in Figure 9. Because this paper only uses the wind data over the sea, the wind data over land are not visualized here.

3.2.5. Extraction and Prediction of the Target Ship’s Trajectory

The AIS data of this water area from 1 January to 7 January 2018 are taken and visualized, as shown in Figure 10. AIS data are sent by shipborne AIS equipment at unequal intervals that depend on speed, course alteration, and other factors; the update intervals of Class A AIS information are shown in Table 1. To make it easier to set up encounter scenarios in which the MASS and the target ship are at risk of collision, the selected target ship AIS data are interpolated at equal intervals. Figure 11 shows the AIS track of a ship with the Maritime Mobile Service Identity (MMSI) 412000212 in the water area. The red curve in this figure is the original AIS track, and the blue dots are obtained by interpolating the original track points at equal intervals. In addition, to give the generated collision avoidance decisions a forward-looking element, the ship track prediction model based on the SM-OMLSSVR [47] is used to predict the future trajectory of the target ship.
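Equal-interval resampling of an irregularly reported track can be sketched as below. The text does not state which interpolation scheme was used, so piecewise-linear interpolation in projected coordinates is an assumption here, as are the function name and argument layout.

```python
def resample_track(times, xs, ys, step):
    """Resample a track at a fixed time step using piecewise-linear interpolation.

    times must be strictly increasing and contain at least two points;
    xs and ys are the projected coordinates reported at those times.
    Linear interpolation is an assumed scheme, not taken from the paper.
    """
    samples, k, i = [], 0, 0
    while times[0] + i * step <= times[-1]:
        t = times[0] + i * step
        # Advance to the segment [times[k], times[k + 1]] that contains t.
        while times[k + 1] < t:
            k += 1
        frac = (t - times[k]) / (times[k + 1] - times[k])
        x = xs[k] + frac * (xs[k + 1] - xs[k])
        y = ys[k] + frac * (ys[k + 1] - ys[k])
        samples.append((t, x, y))
        i += 1
    return samples
```

With reports at 0 s, 10 s, and 30 s and a 10 s step, the function fills in the missing 20 s sample halfway along the second segment.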

3.3. Setting of State Space, Action Space, and Reward Function of Collision Avoidance Decision Model Based on the Dyna-DQN

3.3.1. State Space

The states in the state space are divided into three categories: the state of the own ship $S_{OS}$, the state of the target ship $S_{TS}$, and the environmental information $S_{env}$. The environmental information $S_{env}$ comprises the navigation safety weights of the 24 grids around the grid where the ship is located, as shown in Figure 12. If all 24 grids are navigable and have no non-navigable neighbors, $S_{env}$ is composed of 24 ones; if none of the grids is navigable, $S_{env}$ is composed of 24 zeros. When the ship is at the edge of the chart, some of the surrounding grids may fall outside the boundary of the chart. In that case, the navigation safety weight at the corresponding position is set to 0, indicating that the area is not navigable.

S_{OS} = (x_{o}, y_{o}, C_{o}, δ_{o}, \tilde{φ}_{o}, C_{e}, y_{e}, ψ_{e}, ‖P_{goal} - P_{o}‖, t),

(12)

S_{TS} = (x_{t}, y_{t}, C_{t}, V_{t}, ‖P_{o} - P_{t}‖, ‖C_{o} - C_{t}‖, R_{o,t}, x_{p}^{1}, y_{p}^{1}, x_{p}^{2}, y_{p}^{2}, x_{p}^{3}, y_{p}^{3}),

(13)

S_{env} = (w_{0}, w_{1}, w_{2}, \dots, w_{23}),

(14)

where $x_{o}$, $y_{o}$ and $x_{t}$, $y_{t}$ are the positions of the own ship and the target ship, respectively; $C_{o}$ and $C_{t}$ are the courses of the own ship and the target ship, respectively; $V_{t}$ is the speed of the target ship; $‖P_{goal} - P_{o}‖$ is the distance between the current position of the own ship and the destination; $‖P_{o} - P_{t}‖$ is the distance between the own ship and the target ship; $‖C_{o} - C_{t}‖$ is the heading intersection angle of the own ship and the target ship; $δ_{o}$ is the rudder angle of the own ship; $y_{e}$ is the cross-track error, i.e., the distance between the own ship’s position and the planned sea route, as shown in Figure 13; $ψ_{e}$ is the bow error of the own ship, which is used to characterize the influence of wind and current on the ship motion model; $C_{e}$ is the heading deviation; and $\tilde{φ}_{o}$ is the relative angle between the own ship’s course and the bearing of the destination from the own ship, that is, the included angle from the own ship’s course line to the line between the own ship’s current position and the destination, positive clockwise and negative counterclockwise. $R_{o,t}$ is the risk of collision between the own ship and the target ship [46]; $t$ is the operation time of the own ship; $x_{p}^{1}, y_{p}^{1}, x_{p}^{2}, y_{p}^{2}, x_{p}^{3}, y_{p}^{3}$ are the track points of the target ship at the next three moments obtained from the ship track prediction model using the SM-OMLSSVR [47]; and $w_{0}, w_{1}, w_{2}, \dots, w_{23}$ are the navigation safety weights of the 24 grids around the own ship.
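Assembling $S_{env}$ can be sketched as reading a window of safety weights around the own ship. The text specifies 24 surrounding grids and zero padding outside the chart; interpreting these as a 5 × 5 window with the center excluded, read in row-major order, is an assumption of this example.

```python
def env_state(weights, row, col):
    """Return the 24 safety weights around grid (row, col).

    Assumes a 5 x 5 window with the centre grid excluded, read row by row.
    Positions outside the chart are filled with 0 (not navigable).
    """
    rows, cols = len(weights), len(weights[0])
    s_env = []
    for dr in range(-2, 3):
        for dc in range(-2, 3):
            if dr == 0 and dc == 0:
                continue  # skip the grid occupied by the own ship
            r, c = row + dr, col + dc
            inside = 0 <= r < rows and 0 <= c < cols
            s_env.append(weights[r][c] if inside else 0.0)
    return s_env
```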

3.3.2. Action Space

Because the main dimensions of the MASS are small and its maneuverability is better than that of large manned ships in general, it is assumed here, as in Section 3.1.1, that the MASS does not have to comply with Rule 16 of the COLREGs, that is, “Every vessel which is directed to keep out of the way of another vessel shall, so far as possible, take early and substantial action to keep well clear.”

The course difference $Δ C$ is taken as the output parameter of this model, that is, the parameter of the action space. The range of the course difference is set as $Δ C \in [- 10 °, 10 °]$ and discretized into:

Δ C \in \{- 10 °, - 8 °, - 6 °, - 4 °, - 2 °, 0 °, 2 °, 4 °, 6 °, 8 °, 10 °\} .

(15)
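The discrete action set of Equation (15) maps each network output index to a course change. A minimal sketch, in which the constant and function names are illustrative and wrapping the resulting course into [0°, 360°) is an assumption:

```python
# The eleven course changes of Equation (15), in degrees.
COURSE_CHANGES_DEG = list(range(-10, 11, 2))


def apply_action(course_deg, action_index):
    """Return the new course after applying the selected course change.

    The result is wrapped into [0, 360) degrees (an assumed convention).
    """
    return (course_deg + COURSE_CHANGES_DEG[action_index]) % 360
```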

3.3.3. Reward Function

Exceeding chart boundaries

R_{out} = \begin{cases} - r_{out}, & \text{if the boundary of the chart is exceeded} \\ 0, & \text{otherwise} \end{cases} .

(16)

Tracking paths

R_{path} = - r_{path} y_{e} .

(17)

Reaching the goal

R_{goal} = \begin{cases} r_{goal}, & \text{if } |P_{t} - P_{goal}| < 5 \\ - r_{goal} (|P_{t} - P_{goal}| - |P_{t-1} - P_{goal}|), & \text{otherwise} \end{cases} .

(18)

Here $P_{t}$ and $P_{t-1}$ denote the own ship’s position at the current and previous time steps.

Collision with dynamic obstacles

R_{dcol} = - r_{dcol} R_{o,t} .

(19)

Drift

R_{drift} = \begin{cases} - r_{drift}, & \text{if } |u| < |v| \\ 0, & \text{otherwise} \end{cases} .

(20)

Collision with static obstacles

R_{scol} = \begin{cases} - r_{scol}, & \text{if } w (C_{i}, R_{i}) = 0 \\ - r_{scol} (9 - \sum_{j=-1}^{1} \sum_{k=-1}^{1} w (C_{i+j}, R_{i+k})), & \text{otherwise} \end{cases} .

(21)

The total reward in the model is as follows:

R_{total} = R_{out} + R_{path} + R_{goal} + R_{dcol} + R_{drift} + R_{scol},

(22)

where $r_{out}$, $r_{path}$, $r_{goal}$, $r_{dcol}$, $r_{drift}$, and $r_{scol}$ are the weight parameters of the six rewards. The MASS has two navigation modes: (1) Autonomous navigation mode. When the distance between the two ships is large, the collision risk $R_{o,t}$ is 0, so the dynamic obstacle collision reward does not need to be considered. (2) Collision avoidance mode. When the two ships have a potential collision risk, the navigation task and the collision risk are considered simultaneously.
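The six reward terms of Equations (16) to (22) can be combined in a single function, sketched below. The argument names, the dictionary `k` of weight parameters, and the 3 × 3 window layout are assumptions for illustration; the drift penalty is taken to use the weight $r_{drift}$ listed among the six weight parameters, and the goal threshold of 5 is copied from Equation (18) without assuming its unit.

```python
def total_reward(out_of_chart, y_e, dist_now, dist_prev, risk, u, v,
                 window_weights, k):
    """Sum the six reward terms of Equation (22).

    window_weights : 3 x 3 nested list of safety weights centred on the own ship
    k              : dict of weight parameters with keys
                     'out', 'path', 'goal', 'dcol', 'drift', 'scol' (hypothetical layout)
    """
    r_out = -k["out"] if out_of_chart else 0.0               # Equation (16)
    r_path = -k["path"] * y_e                                # Equation (17)
    if dist_now < 5:                                         # Equation (18)
        r_goal = k["goal"]
    else:
        r_goal = -k["goal"] * (dist_now - dist_prev)
    r_dcol = -k["dcol"] * risk                               # Equation (19)
    r_drift = -k["drift"] if abs(u) < abs(v) else 0.0        # Equation (20)
    if window_weights[1][1] == 0:                            # Equation (21)
        r_scol = -k["scol"]
    else:
        r_scol = -k["scol"] * (9 - sum(map(sum, window_weights)))
    return r_out + r_path + r_goal + r_dcol + r_drift + r_scol
```

In autonomous navigation mode the caller would pass `risk=0`, which removes the dynamic obstacle term as described above.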

4. Simulation Experiment

The Dyna-DQN model is implemented in Python using PyTorch, the machine learning framework developed by Facebook AI Research (FAIR). A desktop computer with an i7 3.00 GHz CPU, 16 GB RAM, and an NVIDIA TITAN RTX GPU was used to train the network.

To verify the effectiveness of the algorithm proposed in this paper, three scenarios are considered: autonomous navigation of ships with static obstacles only, autonomous navigation of ships along planned routes with static obstacles only, and autonomous navigation of ships along planned routes with both static and dynamic obstacles. The algorithm is also compared with the DQN algorithm to verify its efficiency. The basic parameter settings are as follows: the learning rate of the DQN and of the neural network in the virtual world model is 10⁻³, the batch size in the DQN is set to 4000, the memory capacity is set to 10,000, and the discount factor $γ$ is set to 0.88. During training, the value of $ε$ decreases linearly from an initial value of 1.0 to a minimum value of 0.01. The purpose is to make the proportion of exploration larger in the initial stage of the navigation task; as training proceeds, the proportion of exploration decreases and the proportion of exploitation increases.
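The linear exploration schedule can be sketched as follows. The start and minimum values are those stated above; the number of steps over which $ε$ decays (`decay_steps`) is not given in the text and is a hypothetical parameter of this example.

```python
def linear_epsilon(step, decay_steps, eps_start=1.0, eps_min=0.01):
    """Linearly anneal epsilon from eps_start to eps_min over decay_steps.

    decay_steps is a hypothetical parameter; after it is reached,
    epsilon stays at eps_min.
    """
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return eps_start + frac * (eps_min - eps_start)
```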

4.1. Autonomous Navigation of Ships with Static Obstacles Only

The autonomous navigation of ships with static obstacles only refers to the navigation of an MASS when only static obstacles are considered. Once the navigation task of the ship has been determined, that is, the starting point and destination have been fixed, it is assumed that the speed of the own ship remains unchanged at 12.35 knots during the whole navigation process, so static obstacles are avoided only by altering course. The scene for this task is shown in Figure 6 above. In this scenario, both $ω_{dcol}$ and $ω_{path}$ in Formula (1) are set to 0. The training is divided into three rounds of 100 training runs each. Table 2 shows the number of times the ship exceeded the boundary, touched static obstacles, exceeded the time threshold, and succeeded in the three training rounds.

Because the training environment is relatively complex, only training tracks with a return value $R_{total} \geq - 2000$ are displayed, so that the training process can be visualized more clearly. The areas surrounded by black curves are static obstacles, such as land, islands, and reefs. The area where the small red flag is located is the destination. Here, the MASS is considered to have reached the destination once it is within 500 m of it. The track currently being trained is shown in pink, completed tracks with $R_{total} \geq - 2000$ that did not reach the endpoint are shown in blue, and tracks that successfully reached the endpoint are shown in bright red. The training results are shown in Figure 14. As seen from this figure, due to the complex navigation environment, the training effect is poor in Run1. As the number of training runs increases, the ship can reach its destination, and the success rate continues to increase.

4.2. Training of Ships Sailing According to the Planned Sea Route under the Condition of Only Considering Static Obstacles

In the case of only considering static obstacles, the autonomous navigation of the own ship is trained according to the planned sea route in Figure 8. The training goal is that the own ship will navigate along the planned sea route from the starting point to the destination. Compared with Section 4.1, the subtask of following the planned sea route is added. The following Figure 15, Figure 16 and Figure 17 show the training results, in which the bold black curve is the planned sea route of the MASS. The planned sea route is the optimal energy-saving path of the MASS generated using the Voronoi diagram and improved A* algorithm considering water depth, tide, wind, current, and other factors. The training is divided into three rounds, 500 times per round. The training results are shown in Table 3.

Similarly, the training environment is more complex due to the addition of the sub-task of following the planned sea route. To present the training results more clearly, only training tracks with $R_{total} \geq - 2500$ are displayed. The color settings are the same as those in Section 4.1. Figure 15 visualizes the training results of the first round. Because the tracks that failed in training are numerous and disordered, the display condition in Figure 16 and Figure 17 was tightened to show only training tracks with $R_{total} \geq - 2000$, and the color settings were changed: the areas surrounded by blue curves are static obstacles such as land, islands, and reefs; the bold pink broken line is the planned sea route designed for the known navigation task; tracks with $R_{total} \geq - 2000$ that did not reach the destination are shown as black curves; and tracks that successfully reached the destination are shown as bold red curves. Figure 15, Figure 16 and Figure 17 show that the training success rate continues to increase as the number of training runs increases.

4.3. Collision Avoidance of Ships Considering Both Static and Dynamic Obstacles

Figure 18 shows an encounter between two ships in the static environment. The red curve is the own ship’s planned sea route, and the blue curve is the extracted AIS track of the target ship. It is assumed that the own ship and the target ship will meet at point B and collide if both keep their course and speed. According to assumption (1), because the target ship is a manned ship and the own ship is an MASS, the own ship should take collision avoidance actions so as not to hinder the safe navigation of the target ship (manned ship). The training is divided into four rounds. Because this scenario considers dynamic obstacles and the conditions are more complex, the number of training runs per round is set to 1500, and the training results are shown in Table 4. The specific training situation is shown in Figure 19, Figure 20 and Figure 21. The curve settings in these figures are consistent with Figure 16 and Figure 17, except that the green curve represents the extracted target ship track. These figures show that the success rate gradually increases as the number of training runs increases.

4.4. Comparison Experiment with Decision Model of Ship Collision Avoidance Based on DQN Algorithm

To demonstrate the advantage of the model proposed in this paper, it is compared with the decision-making model of ship collision avoidance based on the DQN algorithm in the scenario that considers static obstacles, the planned sea route, and dynamic obstacles together. The parameter settings of the DQN algorithm are the same as in the Dyna-DQN. Each model is trained three times, for 3000 episodes each time. The average reward values of the two models over the three training runs are shown in Figure 22 and Figure 23.

As shown in Figure 22 and Figure 23, the average reward value obtained using the Dyna-DQN algorithm fluctuates within [−7500, −3500] and eventually converges to around −5000. The average reward value obtained by the DQN algorithm fluctuates within [−8000, −6000] throughout, and at 3000 episodes it returns to near its initial value. Thus, over 3000 episodes of training, the initial and late average reward values obtained by the Dyna-DQN-based model are larger than those of the DQN-based model, and the overall trend is upward. The reason is that the DQN is a model-free reinforcement learning method that requires a large amount of interaction data for strategy updates, resulting in low data utilization and a long training time. The Dyna-DQN algorithm uses the interaction data to build a virtual world model and updates the strategy with both the interaction data and the virtual data generated by the virtual world model, which improves data utilization and thus training efficiency. Although the amount of data is small in the initial stage and the training results of the two models overlap, the accuracy of the world model gradually improves as the number of episodes increases, and the average reward value of the Dyna-DQN-based model shows an overall upward trend.

5. Conclusions and Future Studies

5.1. Conclusions

This paper considers environmental factors such as wind and current, static obstacles, dynamic obstacles, and the ship’s maneuverability. It combines the advantages of model-based RL and model-free RL and designs a decision model of ship collision avoidance based on the Dyna-DQN. To verify the effect of this model, three levels of training tasks were designed: MASS autonomous navigation considering only static obstacles, MASS navigation according to the planned sea route considering only static obstacles, and MASS autonomous navigation considering both static and dynamic obstacles. The repeated trial-and-error training shows that an MASS can reach the destination without colliding with obstacles and that the task’s success rate gradually increases as the number of training episodes grows. To demonstrate the advantage of this model, the traditional DQN model is used for comparison. The test results show that the proposed model converges faster and obtains a higher total reward value than the DQN model.

The differences between this study and other studies are as follows:

(1): Most current research on ship collision avoidance decisions is set in open waters and does not consider static obstacles. This paper uses S-57 chart information to build a static navigation environment and, through grid processing, calculates the navigation safety weight of each grid so that static obstacles are not touched when collision avoidance actions are performed.
(2): Existing collision avoidance decision-making research does not consider ship maneuverability. Because the main dimensions of an MASS are relatively small and it is vulnerable to wind and current, this study establishes an MMG motion model that considers the influence of wind and current.
(3): Existing research assumes that MASSs should comply with the COLREGs when avoiding collision and does not consider collision avoidance between MASSs and manned ships. Given the strong maneuverability of an MASS, this study stipulates that an MASS should not interfere with the normal navigation of the target ship (manned ship).
(4): Most current collision avoidance decision models based on RL use model-free RL methods, for which the sample utilization rate is low and the training time is long. This paper combines model-based RL and model-free RL within the Dyna framework, which improves the sample utilization rate and reduces the training time.

To conclude, the model has great application potential and can be applied to the autonomous navigation of MASSs and the intelligent collision avoidance between MASSs and manned ships. It is an efficient MASS autonomous navigation and collision avoidance model.

5.2. Future Studies

Although the model presented in this paper has shown good training results in the simulation test, there are still some problems that need to be supplemented and improved in future research, as follows:

(1): Because of the time cost, the study only considers collision avoidance between two ships and does not consider collision avoidance decisions involving three or more ships. The next step is to expand the number of target ships gradually.
(2): Because a higher-dimensional collision avoidance behavior increases training complexity, only course alteration is considered as a collision avoidance behavior. The next step is to add speed changes to the action space.
(3): The model is currently at the simulation test stage. The next step is to carry out sea trials and deploy the model on an unmanned ship.
(4): The current frontier of RL includes imitation RL, inverse RL, meta RL, hierarchical DRL, multi-task transfer DRL, and DRL based on memory and reasoning. Future work should examine, from theoretical and practical perspectives, whether these methods can be better applied to intelligent collision avoidance for unmanned ships.
(5): The collision avoidance study between MASSs and manned ships is carried out under the rule that the “MASS should avoid manned ships”. The next step should also consider collision avoidance by manned ships complying with the COLREGs and collision avoidance between two MASSs.
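The growth in training complexity mentioned in item (2) can be made concrete with a small sketch. The discretization values below are hypothetical and chosen only for illustration; they are not the action values used in this paper. The sketch shows how a discrete action space built from course changes alone expands once speed changes are combined with it.

```python
from itertools import product

# Hypothetical discretizations, chosen only for illustration.
COURSE_CHANGES_DEG = (-10, -5, 0, 5, 10)
SPEED_CHANGES_KN = (-1, 0, 1)


def build_action_space(course_changes, speed_changes=(0,)):
    """Enumerate every (course change, speed change) pair as one discrete action."""
    return list(product(course_changes, speed_changes))


course_only = build_action_space(COURSE_CHANGES_DEG)
with_speed = build_action_space(COURSE_CHANGES_DEG, SPEED_CHANGES_KN)

print(len(course_only), len(with_speed))
```

With these assumed values the number of discrete actions triples, so the Q-network needs three times as many outputs and the agent has correspondingly more state-action pairs to explore.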

Author Contributions

Conceptualization, J.L.; methodology, J.L.; software, J.L. and K.Z.; validation, J.L. and K.Z.; formal analysis, J.L. and K.Z.; writing—original draft preparation, J.L. and K.Z.; writing—review and editing, J.L. and K.Z.; visualization, J.L.; supervision, G.S. and J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China [52201414; 51579025; 51709165]; the Provincial Natural Science Foundation of Liaoning [20170540090]; and supported by the Navigation College of Dalian Maritime University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors are grateful for the support of the Key Laboratory of Navigation Safety Guarantee of Liaoning Province, China.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. European Maritime Safety Agency. Annual Overview of Marine Casualties and Incidents; EMSA: Lisbon, Portugal, 2021.
2. IMO. Scoping Exercise on Autonomous Vessels Put on Agenda. 2017. Available online: http://www.imo.org/en/MediaCentre/IMOMediaAccreditation/Pages/MSC-98-preview.aspx (accessed on 17 April 2018).
3. IMO. Maritime Safety Committee (MSC). London: [s.n.]. 2019. Available online: http://www.imo.org/en/-MediaCentre/MeetingSummaries/MSC/Pages/Default.aspx (accessed on 1 January 2020).
4. Wilson, P.A.; Harris, C.J.; Hong, X. A Line of Sight Counteraction Navigation Algorithm for Ship Encounter Collision Avoidance. J. Navig. 2003, 56, 111–121.
5. Larson, J.; Bruch, M.; Ebken, J. Autonomous navigation and obstacle avoidance for unmanned surface vehicles. In Proceedings of the 2006 Defense and Security Symposium, Orlando, FL, USA, 17 April 2006; pp. 17–20.
6. Larson, J.; Bruch, M.; Halterman, R.; Rogers, J.; Webster, R. Advances in Autonomous Obstacle Avoidance for Unmanned Surface Vehicles; Space & Naval Warfare Systems Center: San Diego, CA, USA, 2007; pp. 1–15.
7. Casalino, G.; Turetta, A.; Simetti, E. A three-layered architecture for real time path planning and obstacle avoidance for surveillance USVs operating in harbour fields. In Proceedings of the OCEANS 2009-EUROPE, Bremen, Germany, 11–14 May 2009; pp. 1–8.
8. Simetti, E.; Torelli, S.; Casalino, G.; Turetta, A. Experimental results on obstacle avoidance for high speed unmanned surface vehicles. In Proceedings of the 2014 Oceans, St. John’s, NL, Canada, 14–19 September 2014; pp. 1–6.
9. Szlapczynski, R.; Krata, P. Determining and visualizing safe motion parameters of a ship navigating in severe weather conditions. Ocean Eng. 2018, 158, 263–274.
10. Kim, D.; Kim, J.S.; Kim, J.H.; Im, N.K. Development of ship collision avoidance system and sea trial test for autonomous ship. Ocean Eng. 2022, 266, 113120.
11. Gil, M.; Montewka, J.; Krata, P.; Hinz, T.; Hirdaris, S. Determination of the dynamic critical maneuvering area in an encounter between two vessels: Operation with negligible environmental disruption. Ocean Eng. 2020, 213, 107709.
12. Gil, M. A concept of critical safety area applicable for an obstacle-avoidance process for manned and autonomous ships. Reliab. Eng. Syst. Saf. 2021, 214, 107806.
13. Lenart, A.S. Collision Threat Parameters for a new Radar Display and Plot Technique. J. Navig. 1983, 36, 404–410.
14. Pedersen, E.; Inoue, K.; Tsugane, M. Simulator Studies on a Collision Avoidance Display that Facilitates Efficient and Precise Assessment of Evasive Manoeuvres in Congested Waterways. J. Navig. 2003, 56, 411–427.
15. Kuwata, Y.; Wolf, M.T.; Zarzhitsky, D.; Huntsberger, T.L. Safe Maritime Autonomous Navigation With COLREGS, Using Velocity Obstacles. IEEE J. Ocean. Eng. 2014, 39, 110–119.
16. Chen, P.; Huang, Y.; Mou, J.; van Gelder, P. Ship collision candidate detection method: A velocity obstacle approach. Ocean Eng. 2018, 170, 186–198.
17. Huang, Y.; van Gelder, P.; Wen, Y. Velocity obstacle algorithms for collision prevention at sea. Ocean Eng. 2018, 151, 308–321.
18. Huang, Y.; Chen, L.; van Gelder, P.H.A.J.M. Generalized velocity obstacle algorithm for preventing ship collisions at sea. Ocean Eng. 2019, 173, 142–156.
19. Li, M.; Mou, J.; He, Y.; Zhang, X.; Xie, Q.; Chen, P. Dynamic trajectory planning for unmanned ship under multi-object environment. J. Mar. Sci. Technol. 2021, 27, 173–185.
20. Liu, Y.H.; Shi, C.J. A fuzzy-neural inference network for ship collision avoidance. In Proceedings of the 4th International Conference on Machine Learning and Cybernetics, Guangzhou, China, 18–21 August 2005.
21. Perera, L.P.; Carvalho, J.P.; Soares, C.G. Fuzzy logic based decision making system for collision avoidance of ocean navigation under critical collision conditions. J. Mar. Sci. Technol. 2011, 16, 84–99.
22. Perera, L.P.; Carvalho, J.P.; Soares, C.G. Intelligent Ocean Navigation and Fuzzy-Bayesian Decision/Action Formulation. IEEE J. Ocean. Eng. 2012, 37, 204–219.
23. Ahn, J.-H.; Rhee, K.-P.; You, Y.-J. A study on the collision avoidance of a ship using neural networks and fuzzy logic. Appl. Ocean Res. 2012, 37, 162–173.
24. Tsou, M.-C.; Hsueh, C.-K. The Study of Ship Collision Avoidance Route Planning by Ant Colony Algorithm. J. Mar. Sci. Technol. 2010, 18, 16.
25. Lazarowska, A. Ship’s Trajectory Planning for Collision Avoidance at Sea Based on Ant Colony Optimisation. J. Navig. 2015, 68, 291–307.
26. Simsir, U.; Amasyalı, M.F.; Bal, M.; Çelebi, U.B.; Ertugrul, S. Decision support system for collision avoidance of vessels. Appl. Soft Comput. 2014, 25, 369–378.
27. Praczyk, T. Neural anti-collision system for Autonomous Surface Vehicle. Neurocomputing 2015, 149, 559–572.
28. Xu, Q.; Yang, Y.; Zhang, C.; Zhang, L. Deep Convolutional Neural Network-Based Autonomous Marine Vehicle Maneuver. Int. J. Fuzzy Syst. 2018, 20, 687–699.
29. Lin, C.; Wang, H.; Yuan, J.; Yu, D.; Li, C. An improved recurrent neural network for unmanned underwater vehicle online obstacle avoidance. Ocean Eng. 2019, 189, 106327.
30. Johansen, T.; Blindheim, S.; Torben, T.R.; Utne, I.B.; Johansen, T.A.; Sørensen, A.J. Development and testing of a risk-based control system for autonomous ships. Reliab. Eng. Syst. Saf. 2023, 234, 109195.
31. Cheng, Y.; Zhang, W. Concise deep reinforcement learning obstacle avoidance for underactuated unmanned marine vessels. Neurocomputing 2018, 272, 63–73.
32. Shen, H.; Hashimoto, H.; Matsuda, A.; Taniguchi, Y.; Terada, D.; Guo, C. Automatic collision avoidance of multiple ships based on deep Q-learning. Appl. Ocean Res. 2019, 86, 268–288.
33. Zhang, X.; Wang, C.; Liu, Y.; Chen, X. Decision-Making for the Autonomous Navigation of Maritime Autonomous Surface Ships Based on Scene Division and Deep Reinforcement Learning. Sensors 2019, 19, 4055.
34. Woo, J.; Kim, N. Collision avoidance for an unmanned surface vehicle using deep reinforcement learning. Ocean Eng. 2020, 199, 107001.
35. Wu, X.; Chen, H.; Chen, C.; Zhong, M.; Xie, S.; Guo, Y.; Fujita, H. The autonomous navigation and obstacle avoidance for USVs with ANOA deep reinforcement learning method. Knowl.-Based Syst. 2020, 196, 105201.
36. Liu, X.; Jin, Y. Reinforcement learning-based collision avoidance: Impact of reward function and knowledge transfer. Artif. Intell. Eng. Des. Anal. Manuf. 2020, 34, 207–222.
37. Sawada, R.; Sato, K.; Majima, T. Automatic ship collision avoidance using deep reinforcement learning with LSTM in continuous action spaces. J. Mar. Sci. Technol. 2020, 26, 509–524.
38. Sawada, R. Automatic collision avoidance using deep reinforcement learning with grid sensor. In Proceedings of the 23rd Asia Pacific Symposium on Intelligent and Evolutionary Systems, Tottori, Japan, 6–8 December 2019; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 17–32.
39. Zhao, L.; Roh, M.-I. COLREGs-compliant multiship collision avoidance based on deep reinforcement learning. Ocean Eng. 2019, 191, 106436.
40. Jiang, L.; An, L.; Zhang, X.; Wang, C.; Wang, X. A human-like collision avoidance method for autonomous ship with attention-based deep reinforcement learning. Ocean Eng. 2022, 264, 112378.
41. Heiberg, A.; Larsen, T.N.; Meyer, E.; Rasheed, A.; San, O.; Varagnolo, D. Risk-based implementation of COLREGs for autonomous surface vehicles using deep reinforcement learning. Neural Netw. 2022, 152, 17–33.
42. Xie, S.; Chu, X.; Zheng, M.; Liu, C. A composite learning method for multi-ship collision avoidance based on reinforcement learning and inverse control. Neurocomputing 2020, 411, 375–392.
43. Maza, J.A.G.; Argüelles, R.P. COLREGs and their application in collision avoidance algorithms: A critical analysis. Ocean Eng. 2022, 261, 112029.
44. Wróbel, K.; Gil, M.; Huang, Y.; Wawruch, R. The Vagueness of COLREG versus Collision Avoidance Techniques—A Discussion on the Current State and Future Challenges Concerning the Operation of Autonomous Ships. Sustainability 2022, 14, 16516.
45. Gao, M.; Kang, Z.; Zhang, A.; Liu, J.; Zhao, F. MASS autonomous navigation system based on AIS big data with dueling deep Q networks prioritized replay reinforcement learning. Ocean Eng. 2022, 249, 110834.
46. Liu, J.; Shi, G.-Y.; Zhu, K.-G. A novel ship collision risk evaluation algorithm based on the maximum interval of two ship domains and the violation degree of two ship domains. Ocean Eng. 2022, 255, 111431.
47. Liu, J.; Shi, G.; Zhu, K. Online Multiple Outputs Least-Squares Support Vector Regression Model of Ship Trajectory Prediction Based on Automatic Information System Data and Selection Mechanism. IEEE Access 2020, 8, 154727–154745.
48. Zhang, Y.; Shi, G.; Liu, J. Dynamic Energy-Efficient Path Planning of Unmanned Surface Vehicle under Time-Varying Current and Wind. J. Mar. Sci. Eng. 2022, 10, 759.
49. Jia, X.; Yang, Y. Mathematical Model of Ship Motion—Mechanism Modeling and Identification Modeling; Dalian Maritime University Press: Dalian, China, 1999.
50. Wang, Y.; Liang, X.; Li, B.; Yu, X. Research and Implementation of Global Path Planning for Unmanned Surface Vehicle Based on Electronic Chart. In Proceedings of the International Conference on Mechatronics & Intelligent Robotics, Kunming, China, 20–21 May 2017; Springer: Cham, Switzerland, 2017.
51. Le, N.D.; Zidek, J.V. Statistical Analysis of Environmental Space-Time Processes; Springer: Berlin/Heidelberg, Germany, 2006.

Figure 1. Histogram of the distribution of casualty events involving ships over 2014–2020 [1].

Figure 2. Pie chart of the distribution of casualty events involving ships over 2014–2020 [1].

Figure 3. General Dyna framework.

Figure 4. Schematic diagram of ship collision avoidance decision model based on the RL.

Figure 5. Original S-57 Chart.

Figure 6. Obstacle information extracted from S-57 chart.

Figure 7. Static path planning of an MASS based on the Voronoi diagram and improved A* algorithm.

Figure 8. Ship static environment after grid division.

Figure 9. Wind and current data of selected areas in January 2018. (a) Wind. (b) Current.

Figure 10. AIS data of selected waters in January 2018.

Figure 11. AIS track diagram of ships with MMSI 412000212 selected in designated waters before and after interpolation.

Figure 12. Navigation safety weights of 24 grids around the MASS as the environment state.

Figure 13. Schematic diagram of the cross error $y_{e}$ when the own ship is sailing.

Figure 14. Training results of two ships considering only static obstacles. (a) Training results of Run 1; (b) Training results of Run 2; (c) Training results of Run 3.

Figure 15. Training results of Run 1. (a) Trained 200 times; (b) Trained 300 times; (c) Trained 500 times.

Figure 16. Training results of Run 2. (a) Trained 200 times; (b) Trained 300 times; (c) Trained 500 times.

Figure 17. Training results of Run 3. (a) Trained 200 times; (b) Trained 300 times; (c) Trained 500 times.

Figure 18. Schematic diagram of the planned sea route of the own ship and the trajectory of the target ship under the static environment.

Figure 19. Training results of Run 1. (a) Trained 500 times; (b) Trained 1000 times; (c) Trained 1500 times.

Figure 20. Training results of Run 2. (a) Trained 500 times; (b) Trained 1000 times; (c) Trained 1500 times.

Figure 21. Training results of Run 3. (a) Trained 500 times; (b) Trained 1000 times; (c) Trained 1500 times.

Figure 22. Training results of the ship collision avoidance decision model based on Dyna-DQN.

Figure 23. Training results of the ship collision avoidance decision model based on DQN.

Table 1. Update interval of Class A AIS Information.

Ship Status	Report Interval
Anchored or berthed vessel, speed < 3 knots	3 min ¹
Anchored or berthed vessel, speed > 3 knots	10 s ¹
Speed < 14 knots	10 s ¹
Speed < 14 knots and changing course	$3 \frac{1}{3}$ s ¹
Speed 14–23 knots	6 s ¹
Speed is 14–23 knots and change course	2 s
Speed > 23 knots	2 s
Speed > 23 knots and change course	2 s

¹ When the shipboard AIS is confirmed as a synchronous sign station, its dynamic information update interval is 2 s.
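The Class A reporting schedule summarized in Table 1 can be read as a simple lookup, sketched below following the standard Class A schedule (own ship at 0–14 knots and changing course reports every 3 1/3 s). The function name, the argument names, and the handling of the exact boundary speeds (3, 14, and 23 knots) are assumptions made for illustration, and the 2 s exception in the footnote is not modelled.

```python
def class_a_report_interval(speed_kn, changing_course=False, moored_or_anchored=False):
    """Return the nominal Class A dynamic-report interval in seconds.

    Boundary speeds are assigned to the slower band; this is an assumption
    of the sketch, not a statement taken from the table.
    """
    if speed_kn < 0:
        raise ValueError("speed must be non-negative")
    if moored_or_anchored:
        return 180.0 if speed_kn <= 3 else 10.0
    if speed_kn <= 14:
        return 10.0 / 3.0 if changing_course else 10.0
    if speed_kn <= 23:
        return 2.0 if changing_course else 6.0
    return 2.0  # above 23 knots the interval is 2 s with or without a course change


# Example: a target ship at 12 knots on a steady course reports every 10 s,
# so predicting its position between reports requires interpolation.
steady = class_a_report_interval(12.0)
turning = class_a_report_interval(12.0, changing_course=True)
```

Because the interval depends on the target ship's own speed and maneuvering state, the raw AIS track is unevenly sampled in time, which is why interpolation is needed before the track is used in training.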

Table 2. Training results when only static obstacles are considered.

	Total Episodes	Out of the Boundary	Collision with the Static Obstacle	Out of Time	Number of Successes
Run 1	100	10	88	2	0
Run 2	100	12	86	1	1
Run 3	100	7	91	0	2

Table 3. Training results when sailing according to the planned sea route with only static obstacles considered.

	Total Episodes	Out of the Boundary	Collision with Static Obstacles	Out of Time	Number of Successes
Run 1	200	23	177	0	0
	300	31	260	4	5
	500	38	443	12	7
Run 2	200	16	184	0	0
	300	22	278	0	0
	500	40	456	2	2
Run 3	200	18	177	4	1
	300	36	256	6	3
	500	56	425	15	4

Table 4. The training results when both static and dynamic obstacles are considered.

	Total Episodes	Out of the Boundary	Collision with Static Obstacles	Collision with Ship	Out of Time	Number of Successes
Run 1	500	11	465	16	7	2
	1000	22	936	33	8	2
	1500	47	1390	49	10	5
Run 2	500	9	470	16	3	2
	1000	26	943	19	9	3
	1500	56	1382	39	18	5
Run 3	500	6	489	4	0	1
	1000	30	1246	11	5	8
	1500	42	1415	24	42	11
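As a worked reading of Table 4, the success rate of a run can be computed as the number of successes divided by the total number of episodes. The sketch below does this for the final (1500-episode) row of each run; the variable and function names are hypothetical, and the counts are copied from the table.

```python
# Final rows of Table 4: run -> (total episodes, number of successes).
TABLE_4_FINAL = {
    "Run 1": (1500, 5),
    "Run 2": (1500, 5),
    "Run 3": (1500, 11),
}


def success_rate(total_episodes, successes):
    """Fraction of episodes in which the MASS reached its destination."""
    if total_episodes <= 0:
        raise ValueError("total_episodes must be positive")
    return successes / total_episodes


rates = {run: success_rate(*row) for run, row in TABLE_4_FINAL.items()}
for run, rate in rates.items():
    print(f"{run}: {rate:.2%}")
```

The same function applies to the intermediate rows and to Table 2 and Table 3 if the progression of the success rate within a run is of interest.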

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
