Article

A Novel Path Planning Approach for Mobile Robot in Radioactive Environment Based on Improved Deep Q Network Algorithm

1 School of Computer Science, University of South China, Hengyang 421001, China
2 Science and Technology on Reactor System Design Technology Laboratory, Nuclear Power Institute of China, Chengdu 610213, China
3 School of Resource Environment and Safety Engineering, University of South China, Hengyang 421001, China
* Authors to whom correspondence should be addressed.
Symmetry 2023, 15(11), 2048; https://doi.org/10.3390/sym15112048
Submission received: 23 October 2023 / Revised: 7 November 2023 / Accepted: 9 November 2023 / Published: 11 November 2023

Abstract: The path planning problem of nuclear environment robots refers to finding a collision-free path under the constraints of path length and an accumulated radiation dose. To solve this problem, the Improved Dueling Deep Double Q Network algorithm (ID3QN) based on asymmetric neural network structure was proposed. To address the issues of overestimation and low sample utilization in the traditional Deep Q Network (DQN) algorithm, we optimized the neural network structure and used the double network to estimate action values. We also improved the action selection mechanism, adopted a priority experience replay mechanism, and redesigned the reward function. To evaluate the efficiency of the proposed algorithm, we designed simple and complex radioactive grid environments for comparison. We compared the ID3QN algorithm with traditional algorithms and some deep reinforcement learning algorithms. The simulation results indicate that in the simple radioactive grid environment, the ID3QN algorithm outperforms traditional algorithms such as A*, GA, and ACO in terms of path length and accumulated radiation dosage. Compared to other deep reinforcement learning algorithms, including DQN and some improved DQN algorithms, the ID3QN algorithm reduced the path length by 15.6%, decreased the accumulated radiation dose by 23.5%, and converged approximately 2300 episodes faster. In the complex radioactive grid environment, the ID3QN algorithm also outperformed the A*, GA, ACO, and other deep reinforcement learning algorithms in terms of path length and an accumulated radiation dose. Furthermore, the ID3QN algorithm can plan an obstacle-free optimal path with a low radiation dose even in complex environments. These results demonstrate that the ID3QN algorithm is an effective approach for solving robot path planning problems in nuclear environments, thereby enhancing the safety and reliability of robots in such environments.

1. Introduction

Nuclear energy, regarded as a significant clean energy source, is extensively utilized worldwide. However, its development and utilization have also introduced severe environmental challenges [1]. The nuclear environment is characterized by unique conditions, including high temperature, high pressure, and intense radiation emitted by nuclear reactors, which pose substantial safety risks to workers. To safeguard personnel safety, enhance work efficiency, and ensure environmental stability, nuclear environment robots have been introduced into nuclear energy facilities. As robots specifically designed for the nuclear environment, they can adapt to extreme conditions and replace humans in performing dangerous tasks. In the process of realizing the autonomous navigation of nuclear environment robots, path planning technology is particularly important. The path planning problem of nuclear environment robots refers to finding a collision-free path under the constraints of path length and accumulated radiation dose [2].
In recent years, path planning in nuclear environments has been a research hotspot of great interest, and numerous scholars have provided innovative ideas for this problem. Pei et al. [3] proposed a Dijkstra algorithm based on the cumulative dose to find the optimal evacuation path, and they used an equilibrium model considering road resistance to formulate optimal evacuation path planning under a nuclear accident; the obtained paths were rationally arranged to minimize the collective dose to all personnel. Chen et al. [4] proposed an improved version of the A* algorithm, which yields paths for the nuclear power plant environment that not only have a lower cumulative dose, but also have a suitable length. According to their experimental results, this improved A* algorithm has strong anti-interference ability and can effectively cope with environmental disturbances. Chao et al. [5] proposed a grid-based rapidly exploring random trees star (GB-RRT*) algorithm, which is more efficient than the traditional rapidly exploring random trees star (RRT*) algorithm and performs better in terms of convergence and the reliability of the minimum dose path; therefore, the GB-RRT* algorithm provides an efficient realization of the minimum dose path for occupational workers. Zhang et al. [6] proposed a path planning scheme with double constraints in which both the time and dose values are taken into account. In order to better minimize the effect of the radiation dose on path planning, they introduced a radiation dose weighting factor and incorporated it into an ant colony algorithm for correction. According to the experimental results, the algorithm provides new ideas and effective solutions for path planning in nuclear accident emergency rescue. Lai and Smith [7] proposed two bio-inspired minimum dose path planning methods using the particle swarm optimization (PSO) algorithm and the genetic algorithm (GA), respectively, which yielded optimal paths with low radiation exposure for robots in radioactive environments. Zhou et al. [8] proposed an improved hybrid elephant swarm ant colony algorithm. With the improved algorithm, the paths obtained in radioactive environments are more reliable and converge faster, especially in environments containing concave obstacles, where it yielded good search results. The results showed that the application of this algorithm to nuclear accident emergency rescue has broad prospects. Wang et al. [9] improved the particle swarm algorithm and established a mathematical dose computation model for the path planning problem. The experimental results showed that the algorithm is able to reduce the effective dose value while ensuring that the overall convergence speed is not affected.
However, although these algorithms have achieved some results in path planning in nuclear environments, they still have some limitations and disadvantages. For example, they often find only locally optimal solutions, and it is difficult for them to find the globally optimal solution for complex nonlinear problems; at the same time, they struggle to handle multiple constraints simultaneously and to take all of the constraints into account. In contrast, path planning algorithms based on deep reinforcement learning (DRL) [10,11,12] have significant advantages. First, deep reinforcement learning can model and solve complex nonlinear problems and can find better path planning schemes. Second, through machine learning and autonomous decision making, deep reinforcement learning is able to perform path planning with multiple constraints considered, thereby leading to better trade-offs and optimization. To the best of our knowledge, there is currently no research on solving multiobjective path planning problems in radioactive environments using deep reinforcement learning. This article proposes improvements upon traditional deep reinforcement learning algorithms and applies them to path planning in nuclear environments.
At present, the most widely used deep reinforcement learning method for path planning is the Deep Q Network (DQN) proposed by the DeepMind team [13]. This algorithm not only overcomes the curse of dimensionality in the traditional Q-learning algorithm, but can also use the environmental information acquired by the robot as training data for the network so as to achieve network convergence. However, the traditional DQN algorithm has the following drawbacks: (1) overestimation [14], i.e., the estimated value output by the network model is larger than the true value, and the degree of overestimation varies for each point of the value function, which can cause the robot to choose a suboptimal strategy and ignore the optimal strategy in many cases, thus affecting the effectiveness of path planning; and (2) low sample utilization [15], which occurs because the DQN algorithm draws samples randomly and uniformly from the experience base for training, causing it to repeatedly select one part of the samples for training while ignoring the rest, which reduces the convergence speed of the network and affects its training effect. In order to solve the above problems, many scholars have proposed various improved algorithms.
To address the overestimation problem, Van Hasselt et al. [16] proposed the Double Deep Q Network (DDQN) algorithm, which mitigates the overestimation caused by the greedy strategy by utilizing a dual network to estimate the value of an action twice. Liu et al. [17] proposed the EN-DQN algorithm with exploratory noise, which selects actions based on the change in the network output caused by the added noise and optimizes network training by using recurrent neural networks to memorize the multistep changes that affect the agents' decision making. Wang et al. [18] introduced a competitive network to balance the effects of actions on the Q values in the current and target networks, respectively, as well as to remove reward bias. Kim et al. [19] proposed a dynamic real-time fusion DDQN method, which trains the network parameters using a priori knowledge and adjusts the weights so that the overestimation is reduced, thereby ensuring that the selected policy is optimal. Dong et al. [20] proposed a dynamic target DDQN (DTDDQN) algorithm, which combines the DDQN with the averaged DQN algorithm to improve the network parameters and alleviate the overestimation problem.
To address the problem of low sample utilization, Schaul et al. [21] proposed a prioritized experience replay mechanism to improve the utilization of high-quality samples. Zhang et al. [22] proposed a DQN method for sample storage, which utilizes a replay pool to store network data to address low sampling efficiency, and they verified the effectiveness of the algorithm through analyses in both simulated and real environments. Zhao et al. [23] proposed a secondary active sampling method, which selects samples based on both sequence cumulative returns and the TD error and uses the samples obtained from the two samplings for training to obtain the optimal policy. The work of [24] proposed the RSV-Dueling DQN method for state value reuse, which standardizes the reward values and then combines them with the Q values obtained by a dueling DQN to improve the sample utilization rate. Li et al. [25] proposed the IDQNPER algorithm, in which the stored samples are given weights and sent to the network for training in a prioritized order, while sequences with high similarity are removed to improve the training effect. The aforementioned improved DQN algorithms have to some extent improved the performance of the DQN algorithm, but they still suffer from slow network convergence and poor training effectiveness.
This paper studies the global path planning problem in a static radioactive environment with known information about the radioactive sources and obstacles. That is, this information is detected by the sensors (e.g., radiation detectors and LiDARs) mounted on the mobile robot before path planning. In this paper, we propose the Improved Dueling Deep Double Q-Network (ID3QN) algorithm based on an asymmetric neural network structure to solve the global path planning problem of a mobile robot in a radioactive environment under the dual constraints of path length and cumulative radiation dose. The proposed algorithm not only accelerates convergence, but also improves the quality of the path planning solution. In general, the main contributions of this paper are as follows:
(1) Several improvements to the traditional DQN algorithm have been developed for the path planning of a mobile robot in a radioactive environment. In the proposed ID3QN algorithm, the neural network has been optimized into a decomposed structure consisting of a 'value network' for estimating the state value and an 'advantage network' for estimating the advantage of each action relative to the average action. This decomposed network structure can estimate the value of each action more accurately. Furthermore, we propose an improved action selection mechanism that balances exploration and exploitation to enhance the convergence speed of the algorithm. In addition, a prioritized experience replay mechanism has been introduced to maintain sample diversity and improve the utilization of important samples. Finally, the reward function has been redesigned by considering the path length and cumulative radiation dose so as to provide effective feedback for every action taken by the mobile robot. All these enhancement mechanisms improve the efficiency and flexibility of the proposed ID3QN algorithm and thus improve the solution quality of the dual-objective path planning problem in a complex radiation environment.
(2) The deterministic algorithm A*, the intelligent algorithms ACO and GA, the traditional DQN algorithm, the D3QN algorithm, and our proposed ID3QN algorithm were used in comparison experiments in two radioactive environments of different complexities, and the effectiveness and advantages of the proposed algorithm were demonstrated.
The remaining sections of this paper are structured as follows: Section 2 presents the ID3QN algorithm. Section 3 provides an analysis and comparison of the experimental results. In Section 4, we discuss the proposed algorithm and some advanced evolutionary algorithms. Finally, Section 5 presents a summary of our findings and suggestions for future research.

2. Materials and Methods

2.1. ID3QN Algorithm

2.1.1. Target Network Optimization

The traditional DQN algorithm uses a single Q network to estimate the value function of actions. This approach may result in overestimation of the values for certain actions, thereby leading to the selection of inappropriate actions and reducing the convergence speed of the algorithm. The calculation formula of the target network in the traditional DQN is shown in Equation (1):
$$y_i^{DQN} = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a, \theta_t^-) \tag{1}$$
In Equation (1), $y_i^{DQN}$ is the target Q value in the DQN algorithm; it represents the expected return obtained after performing an action. $\gamma$ is the discount factor, $s_{t+1}$ represents the next state, $a$ is the action performed, $\theta_t^-$ refers to the parameters of the target network at time step $t$, and $\theta_t$ refers to the parameters of the estimation network. $\max_{a} Q(s_{t+1}, a, \theta_t^-)$ represents the maximum Q value over the actions in state $s_{t+1}$ as estimated by the target network at time step $t$.
The ID3QN algorithm incorporates the advantages of the double DQN improvement and introduces a target network to mitigate the overestimation issue. Specifically, ID3QN uses the current main network to select the action with the maximum Q value for the next state and uses the target network to evaluate the Q value of that action, thereby reducing the overestimation of values. The calculation formula of the target value in the ID3QN algorithm is shown in Equation (2):
$$y_i^{ID3QN} = r_{t+1} + \gamma Q\bigl(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a, \theta_t), \theta_t^-\bigr) \tag{2}$$
In Equation (2), $y_i^{ID3QN}$ is the target Q value in the ID3QN algorithm; it represents the expected return obtained after performing an action. $\gamma$ is the discount factor, $s_{t+1}$ represents the next state, and $\theta_t$ and $\theta_t^-$ represent the parameters of the estimation network and the target network at time step $t$, respectively. $\arg\max_{a} Q(s_{t+1}, a, \theta_t)$ represents the action with the maximum Q value in state $s_{t+1}$ as estimated by the estimation network at time step $t$.
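To make the double-estimation target of Equation (2) concrete, the following is a minimal PyTorch sketch of one way to compute it for a batch of transitions. The function and tensor names (main_net, target_net, dones, etc.) are illustrative assumptions, not the authors' implementation.

```python
import torch

def compute_double_q_target(main_net, target_net, rewards, next_states, dones, gamma=0.9):
    """Double-estimation target as in Equation (2): the main (estimation) network
    selects the greedy action for the next state, and the target network evaluates it."""
    with torch.no_grad():
        # Action selection with the estimation network parameters (theta_t)
        next_q_main = main_net(next_states)                 # shape: (batch, n_actions)
        best_actions = next_q_main.argmax(dim=1, keepdim=True)
        # Action evaluation with the target network parameters (theta_t^-)
        next_q_target = target_net(next_states).gather(1, best_actions).squeeze(1)
        # Terminal transitions receive only the immediate reward
        targets = rewards + gamma * next_q_target * (1.0 - dones)
    return targets
```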

2.1.2. Optimize Network Structure

While the symmetric network structure of the traditional DQN provides model stability, it struggles to represent how the model's decisions arise and is less efficient at capturing the differences between actions. Therefore, building upon this foundation, this paper adopts an asymmetric neural network structure to accurately model different actions, catering to the requirements of path planning in radioactive environments and thereby enhancing the model's performance and efficiency. This network decomposes the Q value function into two components: 'value' and 'advantage'. It separately calculates the value of states and the advantage of actions, and it then combines them to obtain the Q value for each action in the current state. Specifically, the network structure of ID3QN includes two branches: a 'value' network and an 'advantage' network. The 'value' network estimates the value of states, while the 'advantage' network estimates the advantage of each action relative to the average action. The calculation of the Q value in the dueling network of ID3QN is shown in Equation (3):
$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \Bigl(A(s, a; \theta, \alpha) - \frac{1}{|A|}\sum_{a'} A(s, a'; \theta, \alpha)\Bigr) \tag{3}$$
Therein, $s$ represents the state, $a$ represents the action, $\theta$ represents the shared network parameters, $\alpha$ and $\beta$ respectively represent the parameters of the advantage network and the value network, $V(s; \theta, \beta)$ represents the value of state $s$, $A(s, a; \theta, \alpha)$ represents the advantage of the current action $a$, $|A|$ represents the size of the action space, and $\frac{1}{|A|}\sum_{a'} A(s, a'; \theta, \alpha)$ represents the average advantage over all actions. The final Q value denotes the estimated return for selecting action $a$ in state $s$.
Decomposing the Q value function in this way allows the value of each action to be estimated more accurately while reducing variance, improving learning efficiency, and stabilizing training. In addition, ID3QN adopts a learnable parameter sharing mechanism, which reduces the number of parameters in the network, speeds up training, and improves computational efficiency. The symmetric network structure of the traditional DQN and the network structure of the proposed algorithm are illustrated in Figure 1.
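As an illustration of the asymmetric value/advantage decomposition of Equation (3), the following is a minimal PyTorch sketch of such a network. The layer sizes (two-dimensional state input, 256 hidden units, eight actions) follow the configuration described in Section 3.1, but the class and attribute names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Asymmetric Q network: a shared feature layer feeds a 'value' branch V(s)
    and an 'advantage' branch A(s, a), combined as in Equation (3)."""
    def __init__(self, state_dim=2, n_actions=8, hidden=256):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s; theta, beta)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a; theta, alpha)

    def forward(self, state):
        h = self.feature(state)
        v = self.value(h)                              # (batch, 1)
        a = self.advantage(h)                          # (batch, n_actions)
        # Subtract the mean advantage so the value and advantage terms are identifiable
        return v + a - a.mean(dim=1, keepdim=True)
```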

2.1.3. Improved Action Selection Mechanism

When using Equation (1) to approximate the value function as the optimization target in DQN, the action corresponding to the maximum Q value in the next state is selected every time. Both action selection and evaluation are based on the parameters of the target value network, which can lead to overestimation of Q values, thus resulting in poor exploration ability, low training efficiency, and negative impact on the convergence speed of the network, as well as the effectiveness of path planning [14].
To address this issue, the ID3QN algorithm adopts an action selection strategy based on obstacle memory units and probabilities. Specifically, by modifying the calculation of the Q values, the number of times the robot enters obstacle states and repeated states during training is reduced, leaving it more capacity to explore new states. At the same time, the convergence speed of the network is improved, and the effectiveness of path planning is optimized. The obstacle memory unit is represented as $\mathrm{Obstacle}(s_t, a)$; after each action selection, it is determined whether the resulting state is an obstacle, and $\mathrm{Obstacle}(s_t, a)$ is updated accordingly: if it is an obstacle, $\mathrm{Obstacle}(s_t, a)$ is set to 1; otherwise, it is set to 0. Finally, the calculation of $Q(s_t, a)$ is modified as follows:
$$Q(s_t, a) = \begin{cases} |Q(s_t, a)| \times \mathrm{inf}, & s_{t+1} = \mathrm{Goal} \\ -|Q(s_t, a)| \times \mathrm{inf}, & \mathrm{Obstacle}(s_t, a) = 1 \\ Q(s_t, a) \times \mu + (1 - \mu) \times Q(s_t, a) \times P(s_t, a), & \mathrm{Obstacle}(s_t, a) = 0 \end{cases} \tag{4}$$
where 'inf' represents infinity, $\mu$ is the parameter for probability adjustment, and the probability of selecting action $a$ in state $s_t$ is defined as follows:
$$P(s_t, a) = 1 / (T + 1) \tag{5}$$
where $T$ is the number of times action $a$ has been selected in state $s_t$. As the number of times action $a$ is selected increases (i.e., $T$ becomes larger), the value of $P(s_t, a)$ becomes smaller. Since the robot selects the action with the maximum Q value, when $P(s_t, a)$ decreases, the robot tends to explore other actions and new states, thereby avoiding getting stuck in local optima.
In order to balance the exploration and exploitation trade-off in decision making, this method still adopts an $\varepsilon$-greedy strategy for action selection. Under this strategy, the algorithm randomly selects actions with a certain probability to explore unknown states while also retaining the currently known optimal action for exploitation. The new action selection strategy is defined as follows:
$$\pi(s_t) = \begin{cases} \arg\max_{a \in A} Q(s_t, a), & \mu < \varepsilon \\ \mathrm{random}(A), & \text{otherwise} \end{cases} \tag{6}$$
Here, $\arg\max_{a \in A} Q(s_t, a)$ denotes keeping the currently known optimal action: if the random value $\mu$ is less than $\varepsilon$, the robot keeps the currently known optimal action; otherwise, it randomly selects an action.
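The following is a minimal Python sketch of this selection mechanism: the Q values are first adjusted as in Equation (4) using the obstacle memory and the visit-count probability of Equation (5), and an action is then drawn ε-greedily. The data structures (dictionaries for the obstacle table and visit counter), the finite stand-in for the 'inf' terms, the sign convention for the obstacle case, and the use of Algorithm 1's convention of exploring randomly with probability ε are all illustrative assumptions.

```python
import random
import numpy as np

INF = 1e9  # finite stand-in for the 'inf' terms in Equation (4)

def adjusted_q(q_values, state, obstacle, visit_count, goal_mask, mu=0.6):
    """Adjust Q(s_t, a) as in Equation (4): push goal-reaching actions up, remembered
    obstacle actions down, and damp frequently chosen actions via P(s_t, a) = 1 / (T + 1)."""
    q = np.asarray(q_values, dtype=float).copy()
    for a in range(len(q)):
        if goal_mask[a]:                        # taking action a would reach the goal
            q[a] = abs(q[a]) * INF
        elif obstacle.get((state, a), 0) == 1:  # Obstacle(s_t, a) = 1
            q[a] = -abs(q[a]) * INF             # sign convention assumed
        else:                                   # Obstacle(s_t, a) = 0
            p = 1.0 / (visit_count.get((state, a), 0) + 1)   # Equation (5)
            q[a] = q[a] * mu + (1 - mu) * q[a] * p
    return q

def select_action(q_adjusted, epsilon, n_actions=8):
    """Epsilon-greedy selection: explore randomly with probability epsilon
    (the convention of Algorithm 1), otherwise exploit the adjusted Q values."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return int(np.argmax(q_adjusted))
```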

2.1.4. Prioritized Experience Replay Mechanism

In traditional DQN experience replay, previous experiences (including the state, action, reward, and next state) are continuously stored in an experience pool during training. When the experience pool is full, new data overwrite old data. During learning, a batch of experiences is randomly sampled from the experience pool to train the neural network. The traditional sampling method is uniform sampling from the experience pool, which is simple to implement but cannot differentiate the importance of different samples. Additionally, the limited memory of the experience replay pool leads to low sampling efficiency. Therefore, it is necessary to improve the experience replay mechanism of DQN by increasing the frequency of sampling helpful samples and reducing the frequency of sampling unhelpful ones, thereby improving learning efficiency. The prioritized experience replay mechanism is an effective approach for this purpose. Specifically, the traditional experience replay mechanism can be modified into prioritized experience replay as follows:
Step 1: Add a prioritized sample memory pool, which is specifically used for storing samples of path planning.
Step 2: During the learning process, whenever a better path is learned, update the prioritized sample memory pool; add the new sample to the prioritized sample memory pool and remove old samples from it.
Step 3: During training of the neural network, randomly sample a batch of data from the normal sample memory pool and combine it with all the data from the prioritized sample memory pool for training the neural network.
Step 4: Adopt a prioritized sampling method by assigning higher priority to samples in the prioritized sample memory pool, increasing their likelihood of being selected for training the neural network.
This method can improve learning efficiency, better differentiate sample importance, and increase sampling efficiency.
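The following is a minimal Python sketch of the dual-pool replay described in Steps 1–4. The class and method names are illustrative assumptions, and the priority weighting is simplified to "always replay the prioritized pool alongside a uniform minibatch from the normal pool".

```python
import random
from collections import deque

class DualPoolReplay:
    """Normal replay pool plus a small prioritized pool holding the transitions of the
    best path found so far (Steps 1-4 of Section 2.1.4)."""
    def __init__(self, capacity=3200):
        self.normal = deque(maxlen=capacity)   # ordinary transitions; old data overwritten
        self.priority = []                     # replaced wholesale when a better path appears

    def store(self, transition):
        self.normal.append(transition)

    def update_priority_pool(self, best_path_transitions):
        # Called when a newly completed path has a lower cost F(n) than any previous one
        self.priority = list(best_path_transitions)

    def sample(self, batch_size=64):
        # Combine a uniform minibatch from the normal pool with every prioritized sample,
        # so best-path transitions are replayed at a higher frequency
        batch = random.sample(list(self.normal), min(batch_size, len(self.normal)))
        return batch + self.priority
```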

2.1.5. Improvement of the Reward Function

The agent in reinforcement learning must interact with the environment in order to obtain reward values and update network parameters to optimize its behavior policy. Therefore, the reward function plays a crucial role in reinforcement learning path planning, and its quality directly affects the effectiveness of the algorithm and the convergence of the network. Based on the traditional DQN algorithm, the reward function is set as follows:
$$R_t = \begin{cases} C_1, & \mathrm{Goal} \\ C_2, & \mathrm{Collision\ or\ Timeout} \\ 0, & \mathrm{Others} \end{cases} \tag{7}$$
In Equation (7), when the robot selects an action and reaches the target state, we provide a positive reward by setting $C_1$ to a positive integer. However, if the robot reaches a state occupied by an obstacle after selecting an action, we consider it a collision and impose a penalty by setting $C_2$ to a negative number. The traditional DQN algorithm's reward function implies that the robot receives a reward value only when it reaches the goal point or fails to plan, with no timely and effective feedback in other states. Even with training, the robot still needs many rounds to plan the desired path, and the planned path is often not optimal. In order to plan a comprehensively optimal path with a short length and a small cumulative radiation dose, this paper redesigns the reward function, as shown in Equation (8).
$$R_t = \begin{cases} C_1, & \mathrm{Goal} \\ C_2, & \mathrm{Collision/Restricted\ Area} \\ K_1 \times \dfrac{D_{ig} - D_{jg}}{|D_{ig} - D_{jg}|} \times J^{\,D_{ig} - D_{jg}} + K_2 \times R_{ij}, & \mathrm{Others} \end{cases} \tag{8}$$
In Equation (8), $C_1$ is typically chosen as a positive value, while $C_2$ is assigned a negative value. $K_1$ is the parameter used to adjust the distance term, while $K_2$ is the parameter used to adjust the irradiation term. The restricted area refers to the region where the radiation dose exceeds the specified dose limit. $J$ is a constant. Furthermore, $D_{ig}$ denotes the distance between the current state and the endpoint, $D_{jg}$ denotes the distance between the next state and the endpoint, and $R_{ij}$ denotes the cumulative irradiation dose between the current position and the next position.
We use Euclidean distance to measure the cost of movement between two points, and the calculation formula is as follows:
$$D_{ij} = \sqrt{(x_j - x_i)^2 + (y_j - y_i)^2} \tag{9}$$
where $(x_i, y_i)$ and $(x_j, y_j)$ respectively represent the coordinates of the current state and the next state, and $D_{ij}$ represents the distance between the two points.
The calculation formula for the radiation dose at the current position and the next position is defined as follows:
$$R_{ij} = D_{ij} \times \bigl(R_i(x_i, y_i) + R_j(x_j, y_j)\bigr) / 2 \tag{10}$$
where $R_i(x_i, y_i)$ represents the average radiation dose rate of grid $i$, $R_j(x_j, y_j)$ represents the average radiation dose rate of grid $j$, and $D_{ij}$ represents the Euclidean distance between the two positions.
In order to better evaluate the goodness of a path, we have defined the following evaluation function:
$$F(n) = D(n) + R(n) \tag{11}$$
$$D(n) = \sum_{i=1}^{n-1} D_{i,i+1} \tag{12}$$
$$R(n) = \sum_{i=1}^{n-1} R_{i,i+1} \tag{13}$$
In Equations (11)–(13), $F(n)$ represents the evaluation function of reaching the target point from the starting point via node $n$, $D(n)$ represents the cumulative distance of reaching the target point from the starting point via node $n$, and $R(n)$ represents the cumulative radiation dose of reaching the target point from the starting point via node $n$. When the robot plans a collision-free path, we calculate $F(n)$. If $F(n)$ is smaller than the cost of the previously planned best path, then we consider the new path to be a better path.
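The per-step costs and the path evaluation above can be computed directly from the grid coordinates and dose rates. The following is a short Python sketch of Equations (9)–(13); the dose_rate lookup function is an assumed placeholder for the grid's dose-rate map.

```python
import math

def step_distance(p_i, p_j):
    """Euclidean distance between two grid positions, Equation (9)."""
    return math.hypot(p_j[0] - p_i[0], p_j[1] - p_i[1])

def step_dose(p_i, p_j, dose_rate):
    """Accumulated dose over one step, Equation (10): distance times the mean of the
    dose rates of the two grids (dose_rate is an assumed lookup function)."""
    return step_distance(p_i, p_j) * (dose_rate(p_i) + dose_rate(p_j)) / 2.0

def path_cost(path, dose_rate):
    """F(n) = D(n) + R(n), Equations (11)-(13): total length plus total dose of a path."""
    d = sum(step_distance(path[i], path[i + 1]) for i in range(len(path) - 1))
    r = sum(step_dose(path[i], path[i + 1], dose_rate) for i in range(len(path) - 1))
    return d + r
```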

2.1.6. The Proposed ID3QN Algorithm for Path Planning of Mobile Robot in Nuclear Environment Flowchart

The pseudocode of the ID3QN algorithm proposed in this paper for the path planning of mobile robots in nuclear environments is given in Algorithm 1.
Algorithm 1 The proposed ID3QN algorithm for path planning of a mobile robot in a nuclear environment
1: Input: radioactive environment information, obstacle information, starting point, endpoint
2: Output: optimal path
3: Initialize replay memory D and prioritized replay memory D' to capacity N
4: Initialize action-value function Q with random weights θ
5: Initialize target action-value function Q_target with weights θ⁻ = θ
6: Initialize the memory units for obstacles
7: for episode = 1 to K do
8:     Initialize state s = s_1
9:     for t = 1 to T do
10:        if RandomValue > ε then
11:            Select action a_t = argmax_a Q(s_t, a, θ)
12:        else
13:            Randomly select action a_t
14:        end if
15:        Execute action a_t in the emulator and observe the reward r_t calculated by Equation (15)
16:        Observe the next state s_{t+1} and update the obstacle memory units
17:        Store transition (s_t, a_t, r_t, s_{t+1}) in D
18:        Sample a random minibatch of transitions (s_t, a_t, r_t, s_{t+1}) from D and sample all transitions from D'
19:        Update the Q values for all the samples using Equation (2)
20:        Use the MSELoss loss function and update all the parameters of the Q network through gradient backpropagation
21:        Every C steps reset Q_target = Q
22:        if s_{t+1} == s_end then
23:            Calculate the cost value F(n) by Equation (11)
24:            if F(n) < F(n)_min then
25:                Update the prioritized replay memory D'
26:                Set F(n)_min = F(n)
27:            else
28:                Turn to the eighth line
29:            end if
30:        else
31:            Turn to the tenth line
32:        end if
33:    end for
34: end for

2.2. Robot Path Planning Based on ID3QN

This paper proposes a path planning algorithm based on ID3QN, the overall process of which is shown in Figure 2 and mainly involves the following steps:
Step 1: Sensor collects state information. The robot perceives the status information of the surrounding environment, such as the position of obstacles and the current position of the robot, through the laser radar and sensors and inputs it into the ID3QN network.
Step 2: Calculate Q values and select actions. This paper proposes a state exploration mechanism based on obstacle memory units and probabilities. By improving the Q value calculation method and the method of selecting actions, the final Q value and the current action a are obtained.
Step 3: Reward evaluation. After the robot executes an action, it obtains a reward R based on the new state of the robot and the changes in the environment. The reward R refers to the contribution of the robot to the path planning goal, such as avoiding obstacles, reducing cumulative radiation dose, and reaching the target position as soon as possible. Evaluate the quality of the selected actions based on the reward R to provide guidance for the next step of path planning.
Step 4: Update experience pool. Store the current state information, selected actions, rewards, and other data in the experience pool. At the same time, use the improved prioritized experience replay mechanism to randomly select samples from the experience pool for training, and iteratively update the network parameters.
Step 5: Iterate. By constantly repeating the above process with the goal of maximizing the cumulative reward value, the optimal action corresponding to the optimal action value function can be obtained.
In grid map path planning tasks, each state of the robot is unique and can be represented by the following formula:
$$s_i = (x_i, y_i) \tag{14}$$
In the experiment, the robot adopts an eight-directional movement scheme [18], comprising left-up, up, right-up, left, right, left-down, down, and right-down moves, as shown in Figure 3; the corresponding movements are labeled as actions 0, 1, 2, 3, 4, 5, 6, and 7, respectively.
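A small Python sketch of this action encoding is given below; the coordinate convention (x increasing to the right, y increasing upward) is an assumption for illustration.

```python
# Action indices 0-7 mapped to grid offsets (dx, dy):
# left-up, up, right-up, left, right, left-down, down, right-down
ACTIONS = {
    0: (-1,  1), 1: (0,  1), 2: (1,  1),
    3: (-1,  0),             4: (1,  0),
    5: (-1, -1), 6: (0, -1), 7: (1, -1),
}

def apply_action(state, action):
    """Return the next grid cell after executing one of the eight moves."""
    dx, dy = ACTIONS[action]
    return (state[0] + dx, state[1] + dy)
```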
The experiment utilizes the improved reward function proposed in this paper. The parameters in Equation (8) are assigned specific values so as to maximize the expected cumulative reward from the initial state to the target state, thereby directly influencing the outcome of path planning. In this study, when the robot reaches the endpoint, a reward is assigned with $C_1$ set to 5 [20]. Conversely, encountering an obstacle incurs a penalty, with $C_2$ set to $-1$ [20]. To strike a balance between the influence of path length and cumulative dose on path selection, $K_1$ and $K_2$ are both set to 1 when the robot is in free grid areas. Additionally, the constant $J$ is set to $e$. The resulting reward function is expressed in Equation (15).
$$R_t = \begin{cases} 5, & \mathrm{Goal} \\ -1, & \mathrm{Collision/High\ Radiation\ Area} \\ \dfrac{D_{ig} - D_{jg}}{|D_{ig} - D_{jg}|} \times e^{\,D_{ig} - D_{jg}} + R_{ij}, & \mathrm{Others} \end{cases} \tag{15}$$
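The following is a Python sketch of this instantiated reward. The sign of the dose term and the handling of the degenerate case $D_{ig} = D_{jg}$ follow Equation (15) as printed above; the boolean flags describing the next state are assumed to be supplied by the environment model.

```python
import math

def reward(next_is_goal, next_is_blocked, d_ig, d_jg, r_ij):
    """Reward of Equation (15). next_is_blocked covers collisions and high-radiation
    (restricted) grids; d_ig / d_jg are the distances of the current and next states
    to the goal; r_ij is the step dose from Equation (10)."""
    if next_is_goal:
        return 5.0                              # C1 = 5
    if next_is_blocked:
        return -1.0                             # C2 = -1
    if d_ig == d_jg:                            # no change in distance to the goal
        return r_ij                             # assumption: sign term treated as zero
    sign = (d_ig - d_jg) / abs(d_ig - d_jg)     # +1 when the robot moves closer to the goal
    return sign * math.exp(d_ig - d_jg) + r_ij  # K1 = K2 = 1, J = e, as in Equation (15)
```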

3. Results

3.1. Experimental Environment and Parameter Configuration

Based on a decommissioned plant of a nuclear facility, an experimental environment was established, as shown in Figure 4. In order to establish the navigation map of the robot, the grid method of reference [2] was used to model the environment. The 3D space was first divided into an obstacle grid and a free grid, where the obstacle grid included nuclear facilities and (non-nuclear) obstacles. In the process of mapping the 3D scene to a 2D map, different colors were used to indicate the irradiation intensity: the black grid represents obstacles, the white grid represents the free grid, red indicates the strongest irradiation, yellow the next strongest, and green the lowest. With such a modeling approach, the robot can be provided with critical navigation information and assisted decision making to ensure its safe operation in the nuclear energy experimental environment. In order to demonstrate the adaptability of the algorithm to environments of different complexities, a simple radioactive grid map and a complex radioactive grid map were set up, as illustrated in Figure 4a,b, respectively. In the simple radioactive grid map, four radioactive sources were placed with sparse obstacles. In the complex radioactive grid map, seven radioactive sources of varying intensities were set up, and the obstacles were configured more densely.
In this paper, we conducted two sets of path planning experiments in grid map environments. We compared the ID3QN algorithm with the traditional A* [4], GA [20], improved ACO [2], DQN [13], and D3QN [26] algorithms in both the simple and complex radioactive grid environments, using the algorithmic score [27], path length, number of corners, number of collisions [20], and accumulated radiation dose [2] as comparison indexes to verify the superiority, effectiveness, and robustness of the ID3QN algorithm over the other algorithms.
The number of iterations of the DQN, D3QN, and ID3QN algorithms was set to 6000. The robot started from the starting point, and an iteration ended when it reached the end point or when the number of movement steps exceeded the maximum. In the path planning experiments of this paper, a run in which the robot reaches the end point without collision is regarded as successful path planning and receives one point, while any collision or timeout is regarded as failed path planning and receives zero points [27].
The experimental configuration of this paper is shown in Table 1:
The experimentally relevant parameters for all the algorithms in this paper are shown in Table 2:
In the experiment, several key parameters influence the training effectiveness of algorithms such as the ID3QN, D3QN, and DQN. These parameters include the neural network architecture, learning rate, sample size for random replay, experience pool capacity, exploration factor, discount factor, iteration count for updating target network parameters, and probability exploration parameter. We carefully set these parameters based on the characteristics of the grid map environment and the insights gained from previous studies [20,21,22,23,24,25,26,27].
To begin with, our neural network adopts a three-layer fully connected structure comprising an input layer, a hidden layer with 256 neurons, and an output layer. The input layer consists of two neurons, which receive input information in the form of two-dimensional coordinates. The output layer comprises eight neurons, which represent the Q values of the eight actions within the action space. This output layer generates an eight-dimensional tensor. We utilized the rectified linear unit (ReLU) activation function [20], Adam optimizer [27], and mean squared error (MSE) loss function [27].
The DQN uses a single neural network to estimate the value function for each action. The ID3QN and D3QN introduce dual estimation (the Q value and the target Q value) to reduce estimation bias, and they also include 'advantage' and 'value' branches to separately estimate the advantage of each action and the value function of the state. The neural network structures of the models therefore differ and have different complexities. The time complexity of a neural network model depends on the number of layers, the number of neurons per layer, and the dimensionality of the input states [28]. The time complexity can be estimated as follows:
$$O\left(\sum_{l=1}^{D} \left(n_l^2 + C_{l-1} \times n_l\right)\right) \tag{16}$$
Here, $D$ represents the number of network layers, i.e., the depth of the network; $l$ denotes the $l$th layer; $n_l$ represents the number of neurons in the $l$th layer; and $C_{l-1}$ represents the number of features of the input from the previous layer. The term $n_l^2$ corresponds to the forward propagation complexity of the network, and $C_{l-1} \times n_l$ corresponds to the backward propagation complexity of the gradient descent algorithm.
Since the DQN uses a single neural network, the overall time complexity is defined in Equation (16). In the network structures of the D3QN and ID3QN, there are two branches, namely, the advantage branch and the value branch. The forward propagation time complexity of these branches is twice that of the DQN. Additionally, since the D3QN requires two neural networks (the main network and target network) for double estimation, the overall time complexity becomes four times that of the DQN, which is defined in Equation (17).
$$O\left(\sum_{l=1}^{D} \left(4 n_l^2 + C_{l-1} \times n_l\right)\right) \tag{17}$$
Taking into account the unique properties of the 20 × 20 grid map, we have set specific values for the following parameters: the learning rate (0.01), sample size for random replay (64), experience pool capacity (3200), discount factor (0.9), and iteration count for updating the target network parameters (10). The exploration factor is crucial for striking a balance between exploration and exploitation during the early stages of the algorithm. Initially, we set the exploration factor to 0.99 and gradually decreased it using a simulated annealing strategy until it reached 0.01. By applying this annealing process, the robot continually explored unfamiliar environments, gathered more diverse samples for network training, and ultimately achieved superior path planning results. Additionally, the probability exploration parameter was set to 0.6 to leverage the benefits of the Q network training results for action selection while avoiding the pitfall of getting trapped in local optima.
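As an illustration of the exploration schedule described above (an exploration factor starting at 0.99 and decaying to 0.01), the following is a tiny Python sketch. The exponential decay form and rate are assumptions, since only the two endpoints of the simulated-annealing schedule are stated.

```python
import math

def annealed_epsilon(episode, eps_start=0.99, eps_end=0.01, decay_rate=0.001):
    """Exploration factor decayed from eps_start toward eps_end as training proceeds.
    The exponential form stands in for the simulated-annealing schedule described above."""
    return eps_end + (eps_start - eps_end) * math.exp(-decay_rate * episode)
```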
In the path planning experiment using the GA, there are several key parameters that affect the experimental results, including the population size, maximum number of evolution generations, crossover factor, and mutation factor. We set these main parameters that impact the GA based on the characteristics of the grid map path planning and the experiences of some scholars in parameter settings [7,29,30]. After continuously adjusting through experiments, we determined the values for these parameters to be 100 for the population size, 500 for the maximum number of evolution generations, 0.8 for the crossover factor, and two for the mutation factor.
In the path planning experiment using the ACO algorithm, there are several key parameters that affect the experimental results, including the number of iterations, number of ants, pheromone influence factor, heuristic factor, pheromone evaporation coefficient, and pheromone reinforcement coefficient. We set these main parameters that impact the ACO algorithm based on the characteristics of the grid map path planning and the experiences of some scholars in parameter settings [2,31,32]. After continuously adjusting through experiments, we determined the values for these parameters to be 500 for the number of iterations, 100 for the number of ants, three for the pheromone influence factor, seven for the heuristic factor, 0.3 for the pheromone evaporation coefficient, and one for the pheromone reinforcement coefficient.

3.2. Experimental Results Comparison and Analysis

3.2.1. Comparison of Scenario 1

The ID3QN algorithm proposed in this paper was simulated and tested in the environment shown in Figure 4a and compared with the A*, GA, ACO, DQN, and D3QN algorithms. The comparison metrics were the algorithmic score [27], path length, number of corners [20], number of collisions [20], accumulated radiation dose [2], and training duration. Based on the comparison data, the path planning route map is presented in Figure 5, and the comparison metrics are summarized in Table 3.
According to the data in Table 3 and Figure 5, compared with the traditional A*, GA, and ACO algorithms, the ID3QN algorithm achieves the shortest path length (29.7990 m), the lowest cumulative irradiation dose (15.5711 mSv), and 10 inflection points, which is the best overall path planning result. It can be seen that the ID3QN algorithm proposed in this paper outperformed traditional algorithms such as the A*, GA, and ACO in terms of both path length and cumulative irradiation dose. In real environments, the proposed path planning algorithm is conducive to improving the safety and reliability of robots in nuclear environments by reducing the cumulative irradiation dose.
Compared to the DQN and D3QN algorithms, the ID3QN algorithm proposed in this paper has the shortest path at only 29.799 m, whereas the DQN algorithm has the longest path at 35.3137 m. In terms of the cumulative irradiation dose, the ID3QN algorithm has the smallest value at 15.5717 mSv, while the DQN algorithm has the largest at 20.3404 mSv. In terms of the number of inflection points, the ID3QN has the highest number and the DQN the lowest, because the main objectives of the ID3QN are the shortest path and the lowest cumulative irradiation dose, which increases the number of inflection points. In terms of collisions, the ID3QN algorithm has the fewest at 25,908, and the DQN algorithm has the most at 79,789. In terms of the algorithmic score, the ID3QN algorithm scores the highest at 5025, and the DQN algorithm the lowest at 7. Finally, in terms of training time, the ID3QN algorithm took the longest to train, followed by the D3QN algorithm, while the DQN algorithm took the shortest time.
Based on these results, it is evident that the ID3QN algorithm surpasses the original DQN algorithm in path planning. This improvement is achieved through the optimization of the neural network structure, the enhancement of action exploration, and the refinement of the experience replay mechanism. Compared to the DQN algorithm, the ID3QN algorithm reduced the path length by 15.6%, the cumulative radiation dose by 23.5%, and the collision frequency by 67.5%, resulting in a 717-fold increase in the final score. The objective of path planning in this paper is to provide a comprehensively optimal path with the minimum path length and cumulative radiation dose for a robot operating in a static radiative environment. The proposed ID3QN algorithm significantly improves the convergence speed and the quality of the path planning solution through the optimized neural network structure, improved action exploration strategy, and enhanced experience replay, as shown in Table 3. At the same time, however, it incurs additional time cost compared to the traditional DQN algorithm. Nevertheless, this paper aims to provide a multiobjective offline path planning method for robots entering dangerous radiative environments; therefore, obtaining shorter and safer paths at the expense of some additional time cost is acceptable.
Figure 6a plots the reward values of the DQN, D3QN, and ID3QN algorithms for a single round of path planning. It can be seen that the reward values of all three algorithms stabilized within 6000 training rounds. However, the ID3QN algorithm converged around 200 rounds, while both the DQN and D3QN algorithms converged around 2500 rounds, indicating that the ID3QN algorithm converged faster. Moreover, the reward value of the ID3QN algorithm after convergence is higher than those of the DQN and D3QN algorithms, which shows that the ID3QN algorithm performs better.
The collision count variation during each iteration of training for the DQN algorithm, D3QN algorithm, and ID3QN algorithm is shown in Figure 6b. It can be observed that within 6000 training episodes, the collision counts for all three algorithms remained stable within a certain range. However, the ID3QN algorithm converged around 400 episodes with a much lower collision count compared to both the DQN and D3QN algorithms. This indicates that the obstacle avoidance performance of the ID3QN algorithm is superior to that of the DQN and D3QN algorithms.

3.2.2. Comparison of Scenario 2

The ID3QN algorithm proposed in this paper was also tested in simulation in the complex grid map environment shown in Figure 4b, which adds several radioactive sources and a large number of obstacles, in order to further validate the effectiveness and robustness of the ID3QN algorithm. The ID3QN algorithm was again compared with the A*, GA, ACO, DQN, and D3QN algorithms, with the comparison metrics being the algorithmic score [27], path length, number of corners, number of collisions [20], accumulated radiation dose [1], and training duration. Based on the comparison data, the path planning route map is presented in Figure 7, and the comparison metrics are summarized in Table 4.
According to the data in Table 4 and Figure 7, in the complex grid map environment and compared with the traditional A*, GA, and ACO algorithms, the ID3QN algorithm achieves the shortest path length (29.7990 m), the lowest cumulative irradiation dose (18.3488 mSv), and 13 inflection points; although its overall training time was the longest, it delivered the best overall path planning result.
In the scenarios with increasing radiation sources and obstacles, the performance scores and training speeds of the ID3QN, D3QN, and DQN algorithms all decreased. However, the ID3QN still performed very well in comparison to the D3QN and DQN algorithms. With the highest score, the shortest path length, and the lowest accumulated radiation dose, the ID3QN demonstrated superior performance. This indicates that under complex map conditions, the ID3QN still has good adaptability and can plan a comprehensive optimal path.
Figure 8a plots the reward value changes of the DQN, D3QN, and ID3QN algorithms in the complex radioactive grid environment. It can be seen that within 6000 training rounds, the ID3QN algorithm still converged, at around 800 rounds, and its reward value after convergence is higher than those of the DQN and D3QN algorithms, which shows that the ID3QN algorithm performs better than both.
Figure 8b illustrates the variation in the collision count during each iteration of training for the DQN algorithm, D3QN algorithm, and ID3QN algorithm in a complex radioactive grid environment. It can be observed that both the ID3QN and D3QN algorithms outperformed the DQN algorithm in terms of obstacle avoidance in the complex environment. The ID3QN algorithm reached stability after 600 iterations, and the collision count remained lower than that of both the DQN and D3QN algorithms. This indicates that the ID3QN algorithm exhibits superior obstacle avoidance performance compared to the DQN and D3QN algorithms.
In summary, the ID3QN algorithm proposed in this paper has better performance in path planning, which enables the robot to obtain higher scores and plan better paths in fewer training rounds.

4. Discussion

The simulation results demonstrate that in environments of different complexities, the performance of the ID3QN algorithm is significantly superior to algorithms such as the A*, GA, ACO, DQN, and D3QN. It can obtain a comprehensive optimal path with a shorter path length and lower accumulated radiation dose. Firstly, the ID3QN algorithm optimizes the neural network structure and utilizes the dual-network double estimation of action values. This method decomposes the Q value function, thereby allowing for more accurate estimation of the value of each action while reducing variance. It improves learning efficiency and training stability. It effectively avoids overestimating the Q values, which may lead to a bias toward selecting overestimated actions and neglecting actions that actually have higher values. This enables better decision making in path planning, reduces the number of training iterations, and improves the training speed. Secondly, the algorithm adopts a prioritized experience replay mechanism by adding a priority replay sample pool. This effectively utilizes important sample data, thus improving learning efficiency and better distinguishing the importance of samples to enhance sampling efficiency. Lastly, considering the constraints of path length and accumulated radiation dose, the reward function was redesigned. This ensures that every decision made by the robot receives better feedback, thereby accelerating the convergence speed of the algorithm.
The ID3QN algorithm proposed in this article offers a high-quality solution for path planning problems in radioactive environments. It outperforms traditional algorithms like the A*, GA, and ACO in terms of path planning effectiveness, as well as exhibits faster convergence speed in training compared to the DQN, D3QN, and other algorithms. To enhance the quality of finding the optimal path, the algorithm incorporates strategies such as dual-network dual-estimation of the action value and prioritized sample replay. However, these strategies result in slightly longer training times compared to traditional algorithms like the DQN and D3QN, and there is also a challenge with respect to adjusting numerous parameters. In the future, the algorithm’s parameters can be optimized, and decision making can be assisted through excellent evolutionary and decision-making algorithms, thereby ultimately reducing the training time and speeding up the training process.

5. Conclusions

To address the path planning problem of mobile robots in radioactive environments, this paper proposes an improved approach based on the traditional DQN algorithm. The proposed method, the ID3QN algorithm, utilizes an asymmetric neural network structure for path planning. By enhancing the action selection mechanism, introducing a prioritized experience replay mechanism, and improving the reward function, the performance of path planning is greatly improved. The ID3QN algorithm was compared with traditional algorithms such as the A*, GA, and ACO, as well as with deep reinforcement learning algorithms, to validate its effectiveness and robustness.
The experiments show that in the simple radioactive grid map environment, the proposed ID3QN algorithm outperforms traditional algorithms such as the A*, GA, and ACO in terms of path length and cumulative radiation dose, and the overall planned path is better. Compared to the DQN and partially improved DQN algorithms, the path length was reduced by 15.6%, the cumulative radiation dose was reduced by 23.5%, the collision count was reduced by 67.5%, and the algorithm score was improved by a factor of 717, enabling the planning of a collision-free optimal path in a shorter training session. In the complex radioactive grid map environment, the ID3QN algorithm also outperformed the A*, GA, ACO, and other algorithms in terms of the planned path, and it was more capable than the DQN and partially improved DQN algorithms of stably planning a comprehensively optimal path with a shorter length and lower radiation dose.
In future work, we will consider how to improve the efficiency of the algorithm and reduce its training time. We will also aim to enhance the adaptability of the algorithm by optimizing it and extending its application to path planning problems involving dynamic obstacles.

Author Contributions

Conceptualization, resources, and project administration, Y.Y., J.L., W.J. and J.C.; methodology and software, Z.W., Y.Y. and J.C.; data curation, Z.W. and Y.Y.; formal analysis, W.J. and J.C.; writing—original draft preparation, Z.W. and Y.Y.; writing—review and editing, J.L. and D.Z.; supervision, J.L. and D.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Scientific Research Foundation of the Hunan Provincial Education Department of China (grant No. 22C0223), the National Natural Science Foundation of China (grant No. 62003157), the State Key Laboratory of Nuclear Reactor Systems Design for Open Projects (grant No. SQ-KFKT-24-2021-006), and The Scientific Research Foundation of the Hunan Provincial Education Department of China [grant No. 19B502].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

Acknowledgments

The authors acknowledge the support of the China Nuclear Power Design and Research Institute.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tsitsimpelis, I.; Taylor, C.J.; Lennox, B.; Joyce, M.J. A review of ground-based robotic systems for the characterization of nuclear environments. Prog. Nucl. Energy 2019, 111, 109–124. [Google Scholar] [CrossRef]
  2. Zhang, D.; Luo, R.; Yin, Y.; Zou, S. Multi-objective path planning for mobile robot in nuclear accident environment based on improved ant colony optimization with modified A*. Nucl. Eng. Technol. 2023, 55, 1838–1854. [Google Scholar] [CrossRef]
  3. Pei, Q.Y.; Hao, L.J.; Chen, C.H.; Zheng, X.L.; Tao, H. Minimum collective dose based optimal evacuation path-planning method under nuclear accidents. Ann. Nucl. Energy 2020, 147, 107644. [Google Scholar] [CrossRef]
  4. Chen, C.; Cai, J.; Wang, Z.; Yi, W. An improved A* algorithm for searching the minimum dose path in nuclear facilities. Prog. Nucl. Energy 2020, 126, 103394. [Google Scholar] [CrossRef]
  5. Chao, N.; Liu, Y.; Xia, H.; Ayodeji, A.; Bai, L. Grid-based RRT* for minimum dose walking path-planning in complex radioactive environments. Ann. Nucl. Energy 2018, 115, 73–82. [Google Scholar] [CrossRef]
  6. Zhang, G.C.; He, T.; Zheng, X.L.; Chen, C.H.; Wang, J.; Tao, L.L. Simulation of nuclear emergency path planning based on dose weight factor and ant colony algorithm coupling. J. Radiat. Res. Radiat. Process. 2020, 38, 020601. [Google Scholar]
  7. Lai, Y.C.; Smith, S. Metaheuristic minimum dose path planning for nuclear power plant decommissioning. Ann. Nucl. Energy 2022, 166, 108800. [Google Scholar] [CrossRef]
  8. Zhou, H.; Zhang, H.; Qiu, M. Radiation avoiding algorithm for nuclear robot path optimization. Ann. Nucl. Energy 2022, 169, 108948. [Google Scholar] [CrossRef]
  9. Wang, Z.; Cai, J. The path-planning in radioactive environment of nuclear facilities using an improved particle swarm optimization algorithm. Nucl. Eng. Des. 2018, 326, 9–86. [Google Scholar] [CrossRef]
  10. Hu, R.; Zhang, Y. Fast path planning for long-range planetary roving based on a hierarchical framework and deep reinforcement learning. Aerospace 2022, 9, 101. [Google Scholar] [CrossRef]
  11. Liu, X.; Liu, S.; Zhuang, Y.K.; Gao, Y. Exploring foundational problems and methods for interpretability in reinforcement learning: A comprehensive review. J. Softw. 2023, 34, 2300–2316. [Google Scholar]
  12. Ma, T.; Lyu, J.; Yang, J.; Xi, R.; Li, Y.; An, J.; Li, C. CLSQL: Improved Q-Learning Algorithm Based on Continuous Local Search Policy for Mobile Robot Path Planning. Sensors 2022, 22, 5910. [Google Scholar] [CrossRef] [PubMed]
  13. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
  14. Zheng, S.; Luo, F.; Gu, C.; Ding, W.; Lu, H. Improved Speedy Q-learning algorithm based on dual estimators. J. Comput. Sci. 2020, 47, 179–185. [Google Scholar]
  15. Liu, Y.; Chen, Z.; Li, Y.; Lu, M.; Chen, C. Robot search path planning method based on prioritized deep reinforcement learning. Int. J. Control. Autom. Syst. 2022, 20, 2669–2680. [Google Scholar] [CrossRef]
  16. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, Arizona, 12–17 February 2016; p. 30. [Google Scholar]
  17. Quan, L.; Yan, Y.; Fei, Z. A Deep Recurrent Q Network with Exploratory Noise. Chin. J. Comput. 2019, 42, 1588–1604. [Google Scholar]
  18. Xia, Z.; Qin, J. An improved algorithm for deep Q-network. J. Comput. Appl. Res. 2019, 36, 3661–3665. [Google Scholar]
  19. Kim, K.S.; Kim, D.E.; Lee, J.M. Deep learning based on smooth driving for autonomous navigation. In Proceedings of the 2018 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), Auckland, New Zealand, 9–12 July 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 616–621. [Google Scholar]
  20. Dong, Y.F.; Yang, C.; Dong, Y.; Qu, X.Q.; Xiao, H.X.; Wang, Z.Q. Path Planning for Robot Based on Improved DQN. Comput. Eng. Des. 2021, 42, 552–558. [Google Scholar]
  21. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay for Deep Reinforcement Learning. In Proceedings of the 4th International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016; pp. 322–355. [Google Scholar]
  22. Zhang, J.; Springenberg, J.T.; Boedecker, J.; Burgard, W. Deep reinforcement learning with successor features for navigation across similar environments. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2371–2378. [Google Scholar]
  23. Xiliang, C.; Lei, C.; Chenxi, L. Deep reinforcement learning method based on resampling optimization cache experience playback mechanism. Control Decis. 2018, 33, 600–606. [Google Scholar]
  24. Zhao, Y.; Liu, P.; Zhao, W. Secondary active sampling method for deep Q learning. Acta Autom. Sin. 2019, 45, 1870–1882. [Google Scholar]
  25. Li, J.; Chen, Y.; Zhao, X.N.; Huang, J. An improved DQN path planning algorithm. J. Supercomput. 2022, 78, 616–639. [Google Scholar] [CrossRef]
  26. Yuan, S.; Zhang, L.; Gu, Q.; Zhang, F.; Lv, J.Q. Research on Path Planning Method of Mobile Robot Priority Sampling D3QN. J. Small Microcomput. Syst. 2023, 44, 923–929. [Google Scholar]
  27. Li, Q.; Geng, X. Research on Path Planning Method Based on Improved DQN Algorithm. Comput. Eng. 2023, 1–11. [Google Scholar] [CrossRef]
  28. Lee, R.; Chen, I.-Y. The time complexity analysis of neural network model configurations. In Proceedings of the 2020 International Conference on Mathematics and Computers in Science and Engineering (MACISE), Madrid, Spain, 14–16 January 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 178–183. [Google Scholar]
  29. Châari, I.; Koubâa, A.; Trigui, S.; Bennaceur, H.; Ammar, A.; Al-Shalfan, K. SmartPATH: An efficient hybrid ACO-GA algorithm for solving the global path planning problem of mobile robots. Int. J. Adv. Robot. Syst. 2014, 11, 94. [Google Scholar] [CrossRef]
  30. Lamini, C.; Benhlima, S.; Elbekri, A. Genetic algorithm based approach for autonomous mobile robot path planning. Procedia Comput. Sci. 2018, 127, 180–189. [Google Scholar] [CrossRef]
  31. Wu, L.; Huang, X.; Cui, J.; Liu, C.; Xiao, W. Modified adaptive ant colony optimization algorithm and its application for solving path planning of mobile robot. Expert Syst. Appl. 2023, 215, 119410. [Google Scholar] [CrossRef]
  32. Duan, H.; Yu, Y.; Zhang, X.; Shao, S. Three-dimension path planning for UCAV using hybrid meta-heuristic ACO-DE algorithm. Simul. Model. Pract. Theory 2010, 18, 1104–1115. [Google Scholar] [CrossRef]
Figure 1. Comparison of network structures between traditional DQN and ID3QN. (a) DQN. (b) ID3QN.
Figure 2. ID3QN path planning process.
Figure 3. Eight-neighborhood path search strategy.
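To illustrate the eight-neighborhood search strategy of Figure 3 (which matches the eight network outputs listed in Table 2), each discrete action can be mapped to a grid displacement, with diagonal moves costing √2 grid units. The action indexing below is a hypothetical ordering for illustration only, not necessarily the one used by the authors:

```python
import math

# Hypothetical action-index -> (dx, dy) mapping for an eight-neighborhood grid.
ACTIONS = {
    0: (0, 1),    # up
    1: (1, 1),    # up-right (diagonal)
    2: (1, 0),    # right
    3: (1, -1),   # down-right (diagonal)
    4: (0, -1),   # down
    5: (-1, -1),  # down-left (diagonal)
    6: (-1, 0),   # left
    7: (-1, 1),   # up-left (diagonal)
}

def apply_action(cell, action):
    """Return the neighbouring cell reached by `action` and the step length."""
    x, y = cell
    dx, dy = ACTIONS[action]
    return (x + dx, y + dy), math.hypot(dx, dy)  # straight moves cost 1, diagonals sqrt(2)
```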
Figure 4. Grid map of nuclear environment. (a) Scenario 1: simple radioactive environment with equal intensity radioactive sources and sparse obstacles. (b) Scenario 2: complex radioactive environment with different equal intensity radioactive sources and dense obstacles.
Figure 5. Path planning routes for six algorithms under Scenario 1. (a) ID3QN. (b) D3QN. (c) DQN. (d) A*. (e) GA. (f) ACO.
Figure 6. Comparison of reward and number of collisions values for three algorithms (ID3QN, D3QN, DQN) under Scenario 1. (a) Reward. (b) Number of collisions.
Figure 7. Path planning routes for six algorithms under Scenario 2.
Figure 8. Comparison of reward and number of collisions values for three algorithms (ID3QN, D3QN, DQN) under Scenario 2. (a) Reward. (b) Number of collisions.
Table 1. Experimental hardware and software configuration.
Experimental Environment | Configuration
CPU | Intel Core i7
GPU | NVIDIA GeForce RTX 3060
RAM | 16 GB
Hard Disk Volume | 512 GB
OS | Windows 10, 64-bit
Programming Environment | Python 3.6
Torch Version | 1.10.2
Table 2. Algorithm parameter settings.
Algorithm | Parameters | Value
ID3QN, D3QN, DQN | Input Layer | 2
 | Hidden Layer | 256
 | Output Layer | 8
 | Learning Rate, α | 0.01
 | Replay Memory Size, β | 64
 | Capacity of Experience Pool, M | 3200
 | Exploration Factor, ε | 0.99 → 0.01
 | Discount Factor, γ | 0.9
 | Updating Target Network Parameters, η | 10
 | Training Times, N | 6000
 | Maximum Radiation Dose, I_limit | 10
 | Probability Exploration Parameter, μ | 0.6
GA | Population Size, N_p | 100
 | Maximum Number of Evolutions, Maxgen | 500
 | Crossover Factor, p_c | 0.8
 | Mutation Factor, p_m | 0.2
ACO | Number of Iterations, K | 500
 | Number of Ants, M | 100
 | Pheromone Impact Factor, Alpha | 3
 | Heuristic Factor, Beta | 7
 | Pheromone Evaporation Factor, Rho | 0.3
 | Pheromone Enhancement Factor, Q | 1
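To make the DRL settings in Table 2 concrete, the following is a minimal PyTorch sketch of a dueling Q-network consistent with the listed layer sizes (input dimension 2, one hidden layer of 256 units, 8 action outputs) together with the tabulated hyperparameters. It is an illustrative approximation only, not the authors' exact ID3QN implementation; the split of the hidden layer into value and advantage streams and the use of the Adam optimizer are assumptions.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Dueling Q-network sketch: 2-D grid state in, 8 action values out."""

    def __init__(self, state_dim=2, hidden_dim=256, n_actions=8):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU())
        self.value = nn.Linear(hidden_dim, 1)              # state-value stream V(s)
        self.advantage = nn.Linear(hidden_dim, n_actions)  # advantage stream A(s, a)

    def forward(self, state):
        h = self.feature(state)
        v = self.value(h)
        a = self.advantage(h)
        # Subtracting the mean advantage keeps the V/A decomposition identifiable.
        return v + a - a.mean(dim=1, keepdim=True)

# Hyperparameter values taken from Table 2 (their exact usage is assumed).
LEARNING_RATE = 0.01             # alpha
BATCH_SIZE = 64                  # beta, replay sample size
MEMORY_CAPACITY = 3200           # M
EPS_START, EPS_END = 0.99, 0.01  # epsilon, exploration factor decay range
GAMMA = 0.9                      # gamma, discount factor
TARGET_UPDATE_INTERVAL = 10      # eta
MAX_EPISODES = 6000              # N

online_net = DuelingQNetwork()
target_net = DuelingQNetwork()
target_net.load_state_dict(online_net.state_dict())  # separate target network for double Q-learning
optimizer = torch.optim.Adam(online_net.parameters(), lr=LEARNING_RATE)
```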
Table 3. Comparison of six algorithms under Scenario 1.
Algorithm | ID3QN | D3QN [26] | DQN [13] | A* [4] | GA [7] | ACO [2]
Path Length (m) | 29.7990 | 30.3848 | 35.3137 | 30.9706 | 29.7990 | 31.5563
Number of Corners | 10 | 9 | 6 | 11 | 9 | 8
Accumulated Radiation Dose (mSv) | 15.5717 | 16.8829 | 20.3404 | 20.2544 | 27.3817 | 16.1276
Number of Collisions | 25,908 | 78,338 | 79,789 | / | / | /
Scores | 5025 | 19 | 7 | / | / | /
Training Duration (s) | 1593 | 1211 | 747 | / | / | /
Table 4. Comparison of six algorithms under Scenario 2.
Algorithm | ID3QN | D3QN [26] | DQN [13] | A* [4] | GA [7] | ACO [2]
Path Length (m) | 29.7990 | 31.5563 | 32.1412 | 30.9706 | 31.5563 | 31.5563
Number of Corners | 13 | 11 | 12 | 13 | 11 | 9
Accumulated Radiation Dose (mSv) | 18.3488 | 30.3663 | 27.8025 | 19.222 | 33.4697 | 27.6534
Number of Collisions | 28,995 | 151,908 | 152,888 | / | / | /
Scores | 4413 | 12 | 3 | / | / | /
Training Duration (s) | 1568 | 1216 | 761 | / | / | /
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
