Article

Sequence-to-Sequence Multi-Agent Reinforcement Learning for Multi-UAV Task Planning in 3D Dynamic Environment

School of Electronics and Communication Engineering, Sun Yat-Sen University, Shenzhen 518107, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(23), 12181; https://doi.org/10.3390/app122312181
Submission received: 27 October 2022 / Revised: 18 November 2022 / Accepted: 22 November 2022 / Published: 28 November 2022

Abstract

Task planning involving multiple unmanned aerial vehicles (UAVs) is one of the main research topics in the field of cooperative UAV control systems. It is a complex optimization problem in which task allocation and path planning are usually dealt with separately. However, recalculating optimal results in dynamic environments is too slow for real-time operation because of the large amount of computation required, and traditional algorithms struggle to handle scenarios whose scale varies. Moreover, traditional approaches confine task planning to a 2D environment, which deviates from the real world. In this paper, we design a 3D dynamic environment and propose a task-planning method based on the sequence-to-sequence multi-agent deep deterministic policy gradient (SMADDPG) algorithm. First, we formulate the task-planning problem as a multi-agent system based on the Markov decision process. Then, we combine DDPG with a sequence-to-sequence network so that the system learns to solve task assignment and path planning simultaneously according to the corresponding reward function. We compare our approach with traditional reinforcement learning algorithms in this system. The simulation results show that our approach satisfies the task-planning requirements and accomplishes tasks more efficiently in both competitive and cooperative scenarios with dynamic or constant scales.

1. Introduction

With the rapid development of unmanned aerial vehicles (UAVs) in recent years, they have attracted increasing attention and become indispensable in modern society. UAVs can perform various tasks previously performed by humans in dangerous and complex environments, such as topographic surveying [1,2], power-line inspection [3], military operations [4,5], environmental monitoring and disaster response [6,7,8], smart agriculture [9,10], commercial transportation [11,12], and so on. As tasks become more complex and task environments become more volatile, a single UAV can no longer meet the demands, and UAV cluster technology has emerged. For complex tasks, the cooperation of UAV clusters has become an important means of improving efficiency. The key to solving the cooperation problem is multi-UAV task planning, which has attracted much attention recently.
Task planning is a complex combinatorial optimization problem that includes two sub-problems: task assignment and path planning. Task assignment is an NP-hard optimization problem. There are many ways to solve it, mainly based on centralized or distributed models. Centralized models include the co-evolutionary multi-population genetic algorithm [13], the distributed task assignment algorithm [14], the multiple traveling salesman problem module [15], the stochastic game theory task assignment module [16], and so on. Distributed models include the decentralized auction algorithm [17], the group-based distributed auction algorithm [18], ant colony optimization [19], particle swarm optimization [20], the genetic algorithm [21], etc. Common solutions to the path-planning problem include clustering-based algorithms [22], Voronoi diagrams with the Dijkstra algorithm [23], the adaptive random search algorithm [24], model predictive control [25], mixed-integer linear programming [26,27], etc.
These algorithms cannot effectively deal with dynamic environments because they need global information about the environment to calculate the optimal result. Real environments, however, change rapidly; for example, a target's position may change randomly, and once the environment changes, the original results become invalid. In addition, these algorithms require a large amount of computation to re-derive optimal results after the environment changes, and the calculation process is too slow for real-time operation [28]. Meanwhile, these methods are only simulated in 2D environments, which differ significantly from the real world. These shortcomings limit the application of such algorithms to UAV clusters in real, dynamic environments.
With the rise of reinforcement learning, it has attracted the attention of researchers and provides a new approach to task planning. Zhang et al. [29] applied the deep deterministic policy gradient (DDPG) to single-UAV path planning and achieved good results. However, there are frequent interactions between UAVs in multi-UAV collaboration scenarios, so such a completely independent algorithm cannot obtain good results. Considering the dynamic scale changes that may occur during multi-UAV flight, and the insensitivity of the sequence-to-sequence (Seq2Seq) network [30,31] to input and output length, this paper combines DDPG [32] with Seq2Seq. DDPG is used to build a local network that takes the local observation of each agent as input and outputs common actions that are unrelated to the dynamic change in agent scale. Seq2Seq is used to build a global network that takes as input the state containing the complex interactions between agents and outputs individual actions that are related to the agent scale.
Based on the above ideas, the contributions of this study are as follows. First, a 3D dynamic environment is created specifically for task planning with UAV movement. Using a reward function, the environment evaluates the UAV's motion for task planning in unknown space and, likewise, penalizes the UAV for adverse events such as obstacle collisions. Another contribution is the sequence-to-sequence multi-agent deep deterministic policy gradient (SMADDPG) algorithm we propose, which combines Seq2Seq with DDPG. Considering the good performance of DDPG for a single UAV, we use DDPG to establish a local network that outputs common actions indicating the direction of UAV movement according to the observation. Considering the insensitivity of Seq2Seq to the length of input and output data, we use Seq2Seq to establish a global network that takes global observations as input and outputs individual actions indicating the UAV's target. Finally, we compare SMADDPG with common reinforcement learning algorithms, which shows the advantages of our algorithm in scenarios where the scale changes dynamically or remains unchanged.
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the formulation of multi-UAV task planning. Section 4 describes the SMADDPG algorithm. Section 5 designs the MDP model of multi-UAV task planning. Section 6 is about the experiments in the 3D dynamic environment. Section 7 summarizes our conclusions and future work.

2. Related Work

With task environments becoming more and more complex, the autonomy of UAV clusters urgently needs to be improved. Task planning is the key to ensuring that multiple UAVs can execute complex tasks efficiently and in a coordinated manner. In order to improve the autonomy of UAVs, we need to design an algorithm to deal with the task planning of multiple UAVs. The SMADDPG algorithm we propose combines reinforcement learning with a sequence-to-sequence network to solve problems in the field of task planning. For this reason, Section 2.1 focuses on the mainstream methods in the field of task planning and some existing problems; Section 2.2 introduces the Markov decision model, the basis of reinforcement learning; Section 2.3 introduces the development of reinforcement learning; and Section 2.4 introduces the sequence-to-sequence model used in SMADDPG.

2.1. Task-Planning Problem

Task planning plays the role of the top command center in the whole autonomous decision-making and control system and determines the general direction of task execution for the UAVs in the cluster. Therefore, establishing an effective task-planning system is the key requirement for achieving high autonomy and intelligence of multiple UAVs. The task-planning problem of multiple UAVs consists of several subproblems, usually including task assignment, path planning, etc. Task assignment establishes the association and mapping between UAVs and tasks. It is a discrete-space combinatorial optimization problem under multiple constraints, and the key to its implementation is establishing mathematically expressed objective functions and effectively handling the various constraints. Path planning establishes an executable path for each UAV to perform its assigned tasks. In essence, it is a path optimization problem in continuous space and is the underlying realization of the hierarchical task-planning strategy. Task assignment and path planning are thus closely related, and both are complex optimization problems. How to design a task-planning algorithm that jointly optimizes these complex subproblems is therefore the focus of current academic research. Bellingham et al. proposed a mathematical method based on mixed-integer linear programming, which is flexible and can include detailed constraints [27]; however, it can only be solved with professional mathematical software because the solution process is very complex. Shima et al. proposed a new task-planning algorithm based on a genetic algorithm [33], which can plan tasks effectively by combining constraints such as task priority, coordination, time constraints, and trajectory. Ou et al. designed a chaos optimization method [34] based on particle swarm optimization theory and applied it to multi-UAV task assignment, achieving good results. Recently, Babel designed a new algorithm [35] that achieves task planning for UAV clusters under the condition that the UAVs arrive at the destination in a specified order, while minimizing the task execution time. Liu et al. designed a model of UAV priority attack and target threat based on cell-winning suppression [36] and proved the efficiency of the algorithm through experiments. The algorithms described above can achieve efficient task planning under different constraints, but they cannot be applied to dynamic environments because they rely on global environment information.
In the field of task planning, relatively few studies consider dynamic environments. Bellingham et al. designed a collaborative path-planning method for multi-UAV in a dynamic environment based on mixed-integer linear programming [37]. Alighanbari [38] proposed the receding-horizon task assignment algorithm, a replanning algorithm, applied it to the dynamic environment, and achieved good results. When dealing with a new environment, these algorithms recalculate to obtain a new task-planning scheme, which incurs a huge computing cost.

2.2. Markov Decision Process

In the Markov decision process (MDP), the dynamic change of the environment is completely determined by its current state. The difference between the MDP and the partially observable Markov decision process (POMDP) is that the former assumes that agents can observe the global state of the environment, while the latter assumes that each agent can only observe a partial state. Considering that multiple UAVs can only observe the states of their neighborhoods during operation, the POMDP is used. The POMDP consists of a seven-tuple $\langle S, A, T, R, \Omega, O, \gamma \rangle$, where $S$ is the state space, $A$ is the action space, $T$ is the state transfer probability, $R$ is the reward, $\Omega$ is the observation space, $O$ is the observation probability, and $\gamma$ is the discount factor.
Figure 1 shows the flow of a POMDP at runtime. The initial state of the system is $s_0 \in S$, from which the agent receives the observation $o_0 \in \Omega$ with observation probability $O(o_0 \mid s_0, a_0)$, and the agent chooses the action $a_0 \in A$ according to the policy. The policy $\pi(a \mid o)$ is the rule that determines how the agent interacts with the environment given the current observation $o$; a policy can be expressed as a probability distribution or as a deterministic function. The agent injects the action into the environment, and the environment returns the next state $s_1$ according to the transfer probability $T(s_1 \mid s_0, a_0)$, and so on. In this process, the agent gains rewards $r_0, r_1, \ldots$ based on the current state and the injected action. Over the POMDP iterations, the agent aims to maximize the reward gained by updating the policy; that is, when the policy is updated to the optimal policy $\pi^*$, Equation (1) holds.
$$\mathbb{E}_{\pi(a \mid o)}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}\right] = \mathbb{E}_{\pi^{*}}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}\right], \tag{1}$$
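For intuition, the quantity on each side of Equation (1) is the expectation of the discounted return. The following is a minimal numeric sketch of that return (our own illustration, not from the paper), using $\gamma = 0.95$:

```python
# Minimal illustration (our own sketch, not from the paper): the discounted
# return whose expectation the optimal policy in Equation (1) maximizes.
def discounted_return(rewards, gamma=0.95):
    """Compute sum_{k=0}^{K-1} gamma^k * r_{t+k+1} for a finite reward sequence."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.95**2 * 2 = 2.805
```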

2.3. Reinforcement Learning

The purpose of a reinforcement learning algorithm is to use this procedure to iteratively update the policy $\pi(a \mid o)$ so that the agent receives the maximum reward sum, approaching the optimal policy $\pi^*$. As the basic method for solving this MDP, reinforcement learning obtains the current observation by observing the environment, selects an action to maximize the total reward using iterative optimization, and updates the strategy through the interaction between the agent and the environment [39]. Reinforcement learning methods can be divided into three categories. The first is value-based methods, such as the deep Q-learning network (DQN) [40] proposed by Mnih et al. and the deep recurrent Q-network (DRQN) proposed by Hausknecht et al. [41], which use a deep neural network as a function approximator to solve large-scale problems, store interaction information in an experience replay buffer, and add a target network responsible for constructing the temporal-difference target value to ensure training stability. Value-based methods are inconvenient for continuous actions, and their learning ability is limited under certain constraints. Therefore, some researchers proposed policy-based methods, which approximate the policy rather than the value function. Trust region policy optimization (TRPO) [42], proposed by Schulman, limits the update step by using Kullback–Leibler (KL) divergence to ensure that the difference between the old and new policies is not too large. Proximal policy optimization (PPO) [43] removes the KL penalty term and the alternating update of TRPO and adds a KL divergence regularization term to the original objective function to represent the difference between the two distributions, which greatly simplifies the computation of TRPO and improves training speed. The third category combines the above two ideas: the actor–critic (AC) algorithm [44]. The algorithm consists of two parts. The actor is the policy network, which takes the state as input and outputs the probability of each action; this probability distribution is the policy. The critic is the value function network, which evaluates the performance of the policy network and guides its actions in the next stage. The AC algorithm combines the advantages of value-based and policy-based methods and can be used in high-dimensional and continuous action spaces. DPG adopts a deterministic policy gradient [45]: instead of outputting a probability distribution over actions, it outputs a single deterministic action and uses an off-policy method to explore and update. Lillicrap et al. then proposed the DDPG [32] algorithm by combining DQN and DPG, using double actor and double critic neural networks with experience replay to improve the convergence of actor–critic methods. Asynchronous advantage actor–critic (A3C) [46] adopts multi-threading: the main thread is responsible for updating the parameters of the actor and critic, while secondary threads create multiple agents that learn asynchronously in multiple environments to increase exploration; their results are summarized to update the parameters of the main thread, playing a role similar to the experience replay in DQN.
In the field of multi-agent reinforcement learning, Lowe et al. [47] proposed the multi-agent deep deterministic policy gradient (MADDPG), which is suitable for both cooperation and confrontation scenarios. It uses centralized training and decentralized execution to enable agents to learn better cooperation and improve overall performance. Foerster et al. [48] proposed counterfactual multi-agent policy gradients (COMA), which also uses centralized training and decentralized execution; the network consists of a single critic and multiple actors, and the actor network uses a gated recurrent unit to improve the cooperation of the whole team. Wei et al. [49] proposed multi-agent soft Q-learning, in which the agents take joint actions and use the global reward to judge the action, improving team cooperation. The above algorithms can improve the performance of multi-agent cooperation and confrontation, but they struggle to handle dynamic changes in agent scale. Given the successful application of DDPG to single-UAV path planning and its ability to handle continuous action spaces, we design our algorithm based on DDPG to deal with cooperative competition scenarios with dynamic scale changes.

2.4. Sequence to Sequence

Since it is difficult for a recurrent neural network (RNN) to handle sequences of unequal length, Seq2Seq was devised to overcome this problem using an encoder–decoder structure. More specifically, the encoder is composed of RNNs, which extract features from the variable-length input sequence and encode them into a fixed-length vector that summarizes its content. The decoder then restores the fixed-length vector obtained by the encoder into the corresponding output sequence; generally, the decoder uses the same structure as the encoder, that is, RNNs. Because Seq2Seq can handle variable-length sequences, it is widely used in natural language processing [30,31]. In this paper, Seq2Seq is used to build a global network whose input is the state containing the complex interactions between agents and whose output is the individual action indicating each UAV's target; the individual action is related to the scale of agents.
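As an illustration of the encoder–decoder idea described above, the following is a minimal PyTorch sketch (our own simplification, not the network used in the paper); the GRU cells, the layer sizes, and feeding the observation sequence directly to the decoder are assumptions:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: a variable-length observation sequence in,
    one discrete action (landmark index) out per input element."""
    def __init__(self, obs_dim, hidden_dim, n_landmarks):
        super().__init__()
        self.encoder = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_landmarks)

    def forward(self, obs_seq):              # obs_seq: (batch, N, obs_dim)
        _, h = self.encoder(obs_seq)         # fixed-length context vector
        out, _ = self.decoder(obs_seq, h)    # decode conditioned on the context
        return self.head(out)                # (batch, N, n_landmarks) logits

# The sequence length N (the number of agents) can vary between forward passes.
model = Seq2Seq(obs_dim=8, hidden_dim=64, n_landmarks=6)
logits = model(torch.randn(1, 6, 8))         # works equally with N = 12
```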

3. Problem Formulation

Multi-UAV task planning requires a group of $N$ UAVs with holonomic or non-holonomic constraints to perform the assigned tasks and reach their targets along the corresponding trajectories while avoiding collisions with each other and with obstacles in the environment. In particular, at each time step $t$, each UAV $U_i \in U$ ($0 \le i \le N-1$) receives its position $p_{i,t}^{U}$. The targets $T_i \in T$ ($0 \le i \le N_T-1$) and obstacles $D_i \in D$ ($0 \le i \le N_D-1$) are distributed across the environment; the position of target $T_i$ is denoted by $p_{i,t}^{T}$, and that of obstacle $D_i$ by $p_{i,t}^{D}$. For each UAV $U_i$, a trajectory $\mathcal{T}_i$ is planned, given by Equation (2), and its length is denoted $d_i$.
$$\mathcal{T}_i = \left( p_{i,0}^{U}, p_{i,1}^{U}, \ldots, p_{i,n}^{U} \right), \tag{2}$$
The optimization goal of the task planning is $\min \sum_{i=0}^{N-1} d_i$, and it must satisfy the following constraints:
  • Each target $T_i$ is assigned to a UAV, i.e., $\bigcup_{i=0}^{N-1} T_i = T$, $i \in \{0, \ldots, N_T-1\}$.
  • Each target is assigned to only one UAV, i.e., $T_i \cap T_j = \emptyset$ for all $i \neq j$, $i, j \in \{0, \ldots, N_T-1\}$.
  • Each UAV's path is collision-free with obstacles, i.e., $D_i \cap \mathcal{T}_j = \emptyset$ for all $i \in \{0, \ldots, N_D-1\}$, $j \in \{0, \ldots, N-1\}$.
  • Each UAV's path is collision-free with the other UAVs, i.e., $p_{i,t}^{U} \neq p_{j,t}^{U}$ for all $i \neq j$ at any time $t$.
Meanwhile, in the dynamic environment, the number of UAVs $N$, the number of targets $N_T$, and the number of obstacles $N_D$ are variables.
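To make the objective and constraints concrete, here is a small sketch (our own, with hypothetical helper functions, not part of the paper) that computes the total path length to be minimized and checks the collision-free constraints for stored trajectories:

```python
import numpy as np

def total_path_length(trajectories):
    """Sum of d_i over all UAVs; each trajectory is an (n_i, 3) array of waypoints."""
    return sum(np.linalg.norm(np.diff(traj, axis=0), axis=1).sum() for traj in trajectories)

def collision_free(trajectories, obstacles, safe_dist):
    """Check the obstacle and inter-UAV constraints at every stored time step."""
    for t in range(min(len(traj) for traj in trajectories)):
        positions = [traj[t] for traj in trajectories]
        # UAV-obstacle constraint
        if any(np.linalg.norm(p - o) < safe_dist for p in positions for o in obstacles):
            return False
        # UAV-UAV constraint
        for i in range(len(positions)):
            for j in range(i + 1, len(positions)):
                if np.linalg.norm(positions[i] - positions[j]) < safe_dist:
                    return False
    return True
```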

4. Sequence-to-Sequence Multi-Agent Deep Deterministic Policy Gradient

As discussed in Section 3, our formulation of multi-UAV task planning can be considered a joint optimization of task assignment and path planning under uncertainty, with both competition and cooperation between UAVs. Considering the excellent performance of multi-agent reinforcement learning algorithms in competition and cooperation scenarios, we propose a multi-agent reinforcement learning algorithm called SMADDPG that can be applied to scenarios with dynamic scale changes.
SMADDPG separates the local observations from the state and inputs them into the local network and the global network, respectively, to handle dynamic changes in the size of the UAV cluster (because it is difficult to obtain the global state of the UAV cluster while it is in motion, we follow MADDPG and use the set of current observations to describe the global state). The local network handles the part that is not affected by scale changes: its input is the local observation of a UAV, and its output is the UAV's common action. The global network handles the part that changes with the scale of the UAV cluster: its input is the global observation, and its output is each UAV's individual action. The local network is composed of DDPG, and the global network is composed of Seq2Seq. The block diagram of the algorithm is shown in Figure 2.

4.1. Local Network

From the framework of SMADDPG, it can be seen that the local network is composed of multiple DDPG networks, one per agent. The DDPG structure is shown in Figure 3. In the DDPG algorithm, there are four networks: the online policy network with parameters $\theta^{\mu}$, the target policy network with parameters $\theta^{\mu'}$, the online Q network with parameters $\theta^{Q}$, and the target Q network with parameters $\theta^{Q'}$. The online policy network and the target policy network constitute the actor $\mu(o_j, \theta^{\mu})$, which directly maps the observation $o_j$ to the action $a_j$ that maximizes the total reward. The online Q network and the target Q network constitute the critic $Q(o_j, a_j, \theta^{Q})$, which estimates the long-term reward that will be received after action $a_j$ is taken under observation $o_j$. The actor and critic together constitute DDPG: the environment sends the observation to the actor and critic, the actor selects an appropriate action according to the observation and sends it to the critic, and the critic estimates the Q value of this choice so that the action corresponding to the maximum Q value under the current condition is preferred.
The actor and critic in DDPG are updated iteratively, i.e., each version is updated from the previous one. Therefore, the most important role of the target policy network $\mu'(o_j, \theta^{\mu'})$ and the target Q network $Q'(o_j, a_j, \theta^{Q'})$ is to freeze the previous version and maintain the stability of the learning process. The parameters $\theta^{\mu'}$ and $\theta^{Q'}$ are updated by the equations below:
$$\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1-\tau)\theta^{\mu'}, \qquad \theta^{Q'} \leftarrow \tau \theta^{Q} + (1-\tau)\theta^{Q'}, \tag{3}$$
The critic is updated by minimizing the loss between the target $y$ and the Q value through the following equation:
$$\min_{\theta^{Q}} L = \frac{1}{B}\sum_{j}\left( y_j - Q(o_j, a_j, \theta^{Q}) \right)^{2}, \tag{4}$$
In Equation (4), $B$ is the mini-batch size, $(o_j, a_j, r_j, o_{j+1})$ is the data sampled from the experience replay buffer, and $y_j$ is calculated as follows, where $\gamma$ is the discount factor:
$$y_j = r_j + \gamma Q'\!\left(o_{j+1}, \mu'(o_{j+1}, \theta^{\mu'}), \theta^{Q'}\right), \tag{5}$$
The actor network is updated in the way of policy gradient, as shown in the following equation:
$$\nabla_{\theta^{\mu}} J \approx \frac{1}{B}\sum_{j} \nabla_{a} Q(o_j, a, \theta^{Q})\big|_{a=\mu(o_j,\theta^{\mu})} \, \nabla_{\theta^{\mu}} \mu(o_j, \theta^{\mu}), \tag{6}$$
Meanwhile, as can be seen in Figure 3, stochastic noise $\mathcal{N}_i$ is added to the action selected by the actor according to $o_t$ in order to increase exploration, as shown in the following formula.
$$a_t = \mu(o_t, \theta^{\mu}) + \mathcal{N}_i, \tag{7}$$
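Equations (3)–(7) together describe one DDPG update step. The following PyTorch sketch is our own simplified illustration, not the authors' implementation; the actor/critic module interfaces, the optimizer objects, and the replay-buffer batch format are assumptions:

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, batch, gamma=0.95, tau=0.01):
    o, a, r, o_next = batch                    # tensors sampled from the replay buffer

    # Equation (5): temporal-difference target computed with the target networks.
    with torch.no_grad():
        y = r + gamma * critic_targ(o_next, actor_targ(o_next))

    # Equation (4): minimize the mean-squared error between y and Q(o, a).
    critic_loss = F.mse_loss(critic(o, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Equation (6): deterministic policy gradient, i.e. maximize Q(o, mu(o)).
    actor_loss = -critic(o, actor(o)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Equation (3): soft update of the target networks.
    for net, targ in ((actor, actor_targ), (critic, critic_targ)):
        for p, p_targ in zip(net.parameters(), targ.parameters()):
            p_targ.data.mul_(1 - tau).add_(tau * p.data)

def act_with_noise(actor, o, noise_std=0.1):
    # Equation (7): exploration noise added to the deterministic action.
    a = actor(o)
    return a + noise_std * torch.randn_like(a)
```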
The framework of the local network is shown in Figure 4. Each agent has its own DDPG network: the parameters of the policy network are $\theta_i^{\mu}$, those of the target policy network are $\theta_i^{\mu'}$, those of the Q network are $\theta_i^{Q}$, and those of the target Q network are $\theta_i^{Q'}$, with $i = 0, 1, \ldots, N-1$. A single DDPG network only receives local observations from its corresponding agent, with the agent itself as the "coordinate origin"; it is therefore meaningful to use a single strategy to control the actions of all agents. Hence, following asynchronous advantage actor–critic (A3C) [46], this paper additionally sets a central parameter network in the local network that is not updated by gradients; the parameters of its policy network are $\theta_N^{\mu}$, and the parameters of its Q network are $\theta_N^{Q}$. The central network receives the parameters of the other DDPG networks via soft updates and then updates the other DDPG networks, also via soft updates, so that the parameters of all DDPG networks converge to the same single strategy. The central parameter network and the other networks update each other as follows:
$$\theta_N^{\mu} \leftarrow \tau \theta_i^{\mu} + (1-\tau)\theta_N^{\mu}, \qquad \theta_N^{Q} \leftarrow \tau \theta_i^{Q} + (1-\tau)\theta_N^{Q}, \tag{8}$$
$$\theta_i^{\mu} \leftarrow \tau \theta_N^{\mu} + (1-\tau)\theta_i^{\mu}, \qquad \theta_i^{Q} \leftarrow \tau \theta_N^{Q} + (1-\tau)\theta_i^{Q}, \tag{9}$$
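For concreteness, here is a minimal sketch (our own, assuming an `agents` container whose elements expose `actor` and `critic` modules) of the two-way soft updates in Equations (8) and (9) between the agents' DDPG networks and the gradient-free central parameter network:

```python
def soft_update(dst_params, src_params, tau=0.01):
    """dst <- tau * src + (1 - tau) * dst, applied parameter-wise."""
    for d, s in zip(dst_params, src_params):
        d.data.mul_(1 - tau).add_(tau * s.data)

def sync_with_central(agents, central_actor, central_critic, tau=0.01):
    for agent in agents:
        # Equation (8): push each agent's parameters into the central network.
        soft_update(central_actor.parameters(), agent.actor.parameters(), tau)
        soft_update(central_critic.parameters(), agent.critic.parameters(), tau)
    for agent in agents:
        # Equation (9): pull the central parameters back into every agent.
        soft_update(agent.actor.parameters(), central_actor.parameters(), tau)
        soft_update(agent.critic.parameters(), central_critic.parameters(), tau)
```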

4.2. Global Network

The global network is composed of a memory and Seq2Seq; its framework is shown in Figure 5. The memory plays the role of the Q network in reinforcement learning and is responsible for recording the mapping between global observations and individual actions together with the corresponding reward values. The Seq2Seq part is equivalent to the actor network in reinforcement learning; it learns the mapping from the best global observation sequences to action sequences and predicts the action sequence corresponding to a new observation sequence.
The length of the global network's input sequence equals the size of the agent cluster, and each element of the sequence is the observation of one agent. The length of the output sequence is also the size of the agent cluster, and each element of the output sequence is the index of a landmark.
Following the idea of reinforcement learning, the global network first selects from the memory the observation-to-action sequence mappings with the largest reward values, and the Seq2Seq network is then trained on them. The network outputs the individual actions of the agents, representing the cooperation between agents. In addition, an attention mechanism can be added to the Seq2Seq network to improve its performance. The parameters of the network are $\delta$, and the update rule is as follows:
$$\max_{\delta} L_{\mathrm{seq}} = \frac{1}{N}\sum_{n=0}^{N-1} \ln p\!\left(a_0^{s}, a_1^{s}, \ldots, a_{N-1}^{s} \mid o_0^{s}, o_1^{s}, \ldots, o_{N-1}^{s}, \delta\right), \tag{10}$$
In this equation, $(o_0^{s}, o_1^{s}, \ldots, o_{N-1}^{s})$ is the input sequence extracted from the memory by Seq2Seq, and $(a_0^{s}, a_1^{s}, \ldots, a_{N-1}^{s})$ is the corresponding output sequence.
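The following sketch illustrates one way the update in Equation (10) could be realized; it is our own simplification, and the memory layout, the `top_k` selection, and the use of cross-entropy as the negative log-likelihood are assumptions rather than details given in the paper:

```python
import torch.nn.functional as F

def train_global_network(seq2seq, optimizer, memory, top_k=32):
    """memory: list of (obs_seq, action_seq, total_reward) tuples collected per episode.
    Fits the Seq2Seq network to the highest-reward mappings (cf. Equation (10))."""
    best = sorted(memory, key=lambda m: m[2], reverse=True)[:top_k]
    for obs_seq, action_seq, _ in best:
        logits = seq2seq(obs_seq.unsqueeze(0))          # (1, N, n_landmarks)
        # Cross-entropy = negative log-likelihood of the stored action sequence.
        loss = F.cross_entropy(logits.squeeze(0), action_seq)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
```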
In the SMADDPG algorithm, the output action of each agent is composed of an individual action and a common action. According to the above analysis, the local network and the global network are trained alternately: in each step, the local network extracts experience from the replay buffer for training, while the global network only extracts data from the memory for training after each episode. The two networks are never trained at the same time, which keeps the environment stable for each learner. The pseudo-code of the algorithm is shown in Algorithm 1:
Algorithm 1 SMADDPG
1: Initialize the policy networks and Q networks with $\theta_i^{\mu}$ and $\theta_i^{Q}$, $i = 0, 1, \ldots, N-1$
2: Initialize the target policy networks and target Q networks with $\theta_i^{\mu'} \leftarrow \theta_i^{\mu}$ and $\theta_i^{Q'} \leftarrow \theta_i^{Q}$
3: Initialize the central parameter network with $\theta_N^{\mu}$ and $\theta_N^{Q}$
4: Initialize the sequence-to-sequence network with $\delta$
5: Set the maximum number of episodes $E_{\max}$, the maximum steps per episode $S_{\max}$, and the mini-batch size $B$
6: for $e = 0$ to $E_{\max}$ do
7:  for $t = 0$ to $S_{\max}$ do
8:    Receive the observations of all agents $O_t^{\mathrm{ddpg}} = O_t^{\mathrm{seq}} = (o_{t,0}, o_{t,1}, \ldots, o_{t,N-1})$
9:    Input $O_t^{\mathrm{ddpg}}$ into the actor networks and obtain $A_t^{\mathrm{ddpg}} = (a_{t,0}^{d}, a_{t,1}^{d}, \ldots, a_{t,N-1}^{d})$
10:    Input $O_t^{\mathrm{seq}}$ into the sequence-to-sequence network and obtain $A_t^{\mathrm{seq}} = (a_{t,0}^{s}, a_{t,1}^{s}, \ldots, a_{t,N-1}^{s})$
11:    Execute the actions $A_t = A_t^{\mathrm{ddpg}} + A_t^{\mathrm{seq}}$
12:    Receive the rewards $R_t^{\mathrm{ddpg}} = R_t^{\mathrm{seq}} = (r_{t,0}, r_{t,1}, \ldots, r_{t,N-1})$
13:    Receive the new observations $O_{t+1}^{\mathrm{ddpg}} = O_{t+1}^{\mathrm{seq}} = (o_{t+1,0}, o_{t+1,1}, \ldots, o_{t+1,N-1})$
14:    Store $(O_t^{\mathrm{ddpg}}, A_t^{\mathrm{ddpg}}, R_t^{\mathrm{ddpg}}, O_{t+1}^{\mathrm{ddpg}})$ in the replay buffer $D$
15:    Store $(O_t^{\mathrm{seq}}, A_t^{\mathrm{seq}}, R_t^{\mathrm{seq}})$ in the memory $M$
16:    for $i = 0$ to $N-1$ do
17:      Sample a random mini-batch of $B$ samples $(o_j, a_j, r_j, o_{j+1})$ from $D$
18:      Compute $y_j = r_j + \gamma Q'\!\left(o_{j+1}, \mu'(o_{j+1}, \theta_i^{\mu'}), \theta_i^{Q'}\right)$
19:      Update the critic network of agent $i$ by minimizing $L = \frac{1}{B}\sum_j \left(y_j - Q(o_j, a_j, \theta_i^{Q})\right)^2$
20:      Update the actor network of agent $i$ using $\nabla_{\theta_i^{\mu}} J \approx \frac{1}{B}\sum_j \nabla_a Q(o_j, a, \theta_i^{Q})\big|_{a=\mu(o_j, \theta_i^{\mu})} \nabla_{\theta_i^{\mu}} \mu(o_j, \theta_i^{\mu})$
21:    end for
22:     Update the local network for each agent $i$:
23:     $\theta_i^{\mu'} \leftarrow \tau \theta_i^{\mu} + (1-\tau)\theta_i^{\mu'}$, $\theta_i^{Q'} \leftarrow \tau \theta_i^{Q} + (1-\tau)\theta_i^{Q'}$
24:     $\theta_N^{\mu} \leftarrow \tau \theta_i^{\mu} + (1-\tau)\theta_N^{\mu}$, $\theta_N^{Q} \leftarrow \tau \theta_i^{Q} + (1-\tau)\theta_N^{Q}$
25:     $\theta_i^{\mu} \leftarrow \tau \theta_N^{\mu} + (1-\tau)\theta_i^{\mu}$, $\theta_i^{Q} \leftarrow \tau \theta_N^{Q} + (1-\tau)\theta_i^{Q}$
26:  end for
27:  Select $(O_m^{\mathrm{seq}}, A_m^{s}, R_m^{\mathrm{seq}})$ with the maximum $R^{\mathrm{seq}}$, and then generate the training collection
28:  Train the sequence-to-sequence network on this data set
29:  The core equation in sequence-to-sequence:
30:   $\max_{\delta} L_{\mathrm{seq}} = \frac{1}{N}\sum_{n=0}^{N-1} \ln p\!\left(a_0^{s}, a_1^{s}, \ldots, a_{N-1}^{s} \mid o_0^{s}, o_1^{s}, \ldots, o_{N-1}^{s}, \delta\right)$
31: end for
The settings of the key parameters of the algorithm are shown in Table 1.

5. MDP Model for the Multi-UAV Task Planning

In this section, we introduce the MDP model of multi-UAV task planning. In this model, the SMADDPG algorithm controls the agents to find the policy that maximizes the reward by interacting with the environment. The MDP model is shown in Figure 6.
In the process of interaction with the environment, the environment is updated with the movement of the UAV, so it is very important to define the flight model of the UAV. We will introduce this part in detail in Section 5.1. The optimal policy that can plan tasks as required should be obtained by maximizing the reward. Therefore, the setting of the reward function directly affects the effect of the policy, which is the key to reinforcement learning. This part will be described in detail in Section 5.2.

5.1. UAV Flight Model

In the simulation environment, to facilitate the analysis of the problem, the UAVs and the airborne obstacles are modeled as spheres; the radius of a UAV is $r_{\mathrm{uav}}$, and the radius of an obstacle is $r_{\mathrm{adv}}$. A landmark is modeled as a cube to distinguish it from the UAVs, and one of the landmarks is set as the target, whose effective range has radius $r_{\mathrm{aim}}$. When the UAV approaches the target area, i.e., when $D \le D_{\mathrm{aim}} = r_{\mathrm{uav}} + r_{\mathrm{aim}}$, the UAV is determined to have reached the target area. At time $t$, the position of the UAV is $P_t = (x_t, y_t, z_t)^{T}$ and its velocity is $V_t = (v_{t,x}, v_{t,y}, v_{t,z})^{T}$.
In the 3D dynamic environment, the following motion model is defined so that UAVs and obstacles can move within the environment:
$$\begin{pmatrix} v_x' \\ v_y' \\ v_z' \end{pmatrix} = \begin{pmatrix} v_x \\ v_y \\ v_z \end{pmatrix} + a \times \Delta t, \tag{11}$$
where $v_x'$, $v_y'$, and $v_z'$ are the new velocity components of the UAV along the X-, Y-, and Z-axes, respectively, $a$ is the acceleration of the UAV, and $\Delta t$ is the motion time interval. In the event of a collision, $a = f/m$, where $f$ is the force generated during the collision and $m$ is the mass of the UAV. The position is then updated to $P' = (x', y', z')$:
$$\begin{pmatrix} x' \\ y' \\ z' \end{pmatrix} = \begin{pmatrix} x \\ y \\ z \end{pmatrix} + \begin{pmatrix} v_x' \\ v_y' \\ v_z' \end{pmatrix} \times \Delta t, \tag{12}$$
In addition, the turning angle and the climb (or descent) angle of the UAV are defined by Equations (13) and (14), respectively.
$$\theta = \arccos \frac{(x_t - x_{t-1})(x_{t-1} - x_{t-2}) + (y_t - y_{t-1})(y_{t-1} - y_{t-2})}{\sqrt{(x_t - x_{t-1})^2 + (y_t - y_{t-1})^2}\,\sqrt{(x_{t-1} - x_{t-2})^2 + (y_{t-1} - y_{t-2})^2}}, \tag{13}$$
$$\psi = \arctan \frac{z_t - z_{t-1}}{\sqrt{(x_t - x_{t-1})^2 + (y_t - y_{t-1})^2}}, \tag{14}$$
At the same time, the movement of the UAVs and obstacles must satisfy the constraints on speed, acceleration, flight altitude, turning angle, and climb (or glide) angle given in Equation (15), where $h$ is the flight altitude of the UAV.
$$\begin{cases} \|v\| \le v_{\max} \\ \|a\| \le a_{\max} \\ h_{\min} \le h \le h_{\max} \\ \theta \le \theta_{\max} \\ \psi \le \psi_{\max} \end{cases} \tag{15}$$
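To make the motion model concrete, here is a minimal sketch (our own, not the simulator code) of the velocity and position updates of Equations (11) and (12) combined with the speed, acceleration, and altitude limits of Equation (15); the turning- and climb-angle constraints are omitted for brevity, and consistent units are assumed:

```python
import numpy as np

def step_uav(pos, vel, accel, dt, v_max=10.0, a_max=3.0, h_min=0.1, h_max=9.2):
    """Advance one UAV by one time step under speed/acceleration/altitude limits."""
    # Clip the acceleration magnitude (Equation (15)).
    a_norm = np.linalg.norm(accel)
    if a_norm > a_max:
        accel = accel * a_max / a_norm
    # Equation (11): velocity update, then clip the speed.
    vel = vel + accel * dt
    v_norm = np.linalg.norm(vel)
    if v_norm > v_max:
        vel = vel * v_max / v_norm
    # Equation (12): position update, then clip the flight altitude.
    pos = pos + vel * dt
    pos[2] = np.clip(pos[2], h_min, h_max)
    return pos, vel
```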
Some parameters of the UAV 3D model in the experiment are set as shown in Table 2.

5.2. Reward Function for UAV Flight Effect Evaluation

In reinforcement learning, the agent aims to maximize its cumulative reward, so we need to design the reward function to specify what agents need to achieve. In this section, we describe the reward function in detail to guide the agent to move towards the target, shorten the range, avoid collisions, reach the target area, and stay within the task area.
We designed a reward function $r_t^{g}$ to guide the agent to the target, as Equation (16) shows.
$$r_t^{g} = -\min \left\| p_t^{\mathrm{uav}} - p_t^{\mathrm{aim}} \right\|, \tag{16}$$
After each step, the multi-agent system calculates the distance between the agent and every target to find the target closest to the agent; the negative of this distance is then added to the reward.
Minimizing the flying distance is of great significance for the UAV cluster, for example to save resources and time. We therefore add a small negative penalty term $r_t^{s} < 0$ at every time step to encourage the UAV to reach the goal via a shorter path.
Next, we designed a reward function to avoid collisions: when a collision occurs, the agent obtains a negative reward, so it learns to avoid collisions. Inspired by [50], we add critical areas around agents and obstacles, where the critical area is an extension of the agent body. When critical areas overlap, we define this as a pseudo-collision; the agent then obtains a negative reward, which acts as a collision warning mechanism, increasing the reaction time available to avoid collisions and effectively reducing the probability of collisions. When agents or obstacles actually collide, we regard this as a real collision, which is punished more severely than a pseudo-collision. A schematic diagram of pseudo-collisions, real collisions, and the critical area is shown in Figure 7. The specific reward function is given in Equation (17).
$$r_t^{c} = \begin{cases} r_{\mathrm{pseudo}}, & \left\| p_t^{\mathrm{uav}} - p_t^{\mathrm{adv}} \right\| \le D_{\mathrm{critic}} \\ r_{\mathrm{col}}, & \left\| p_t^{\mathrm{uav}} - p_t^{\mathrm{adv}} \right\| \le D_{\mathrm{col}} \\ 0, & \text{otherwise} \end{cases} \tag{17}$$
In Equation (17), $D_{\mathrm{col}}$ is the collision range and $D_{\mathrm{critic}}$ is the pseudo-collision range, where $D_{\mathrm{critic}} = D_{\mathrm{col}} + 2 r_{\mathrm{critic}}$ and $r_{\mathrm{critic}}$ is the range of the critical area.
We also design a reward term $r_t^{a}$ that rewards an agent when it reaches the target area.
$$r_t^{a} = \begin{cases} r_{\mathrm{arr}}, & \left\| p_i^{\mathrm{uav}} - p_i^{\mathrm{aim}} \right\| \le D_{\mathrm{aim}} \\ 0, & \text{otherwise} \end{cases} \tag{18}$$
Meanwhile, we set up a reward function $r_t^{o}$ to ensure that the agent stays within the task environment.
$$r_t^{o} = \begin{cases} r_{\mathrm{out}}, & \text{if out of range} \\ 0, & \text{if in range} \end{cases} \tag{19}$$
We add up all the reward terms to build a reward function $r_t$ that enables agents to achieve the various tasks.
$$r_t = r_t^{g} + r_t^{s} + r_t^{c} + r_t^{a} + r_t^{o}, \tag{20}$$
In this work, we set $r_{\mathrm{pseudo}} = -10$, $r_{\mathrm{col}} = -20$, $r_{\mathrm{arr}} = 30$, $r_{\mathrm{out}} = -2$, and $r_t^{s} = -0.1$ in the training procedure.
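The following sketch (our own illustration, with assumed helper arguments such as `in_range`) shows how the terms of Equations (16)–(19) combine into Equation (20) with the values above, writing the penalty terms as negative values:

```python
import numpy as np

def step_reward(p_uav, targets, obstacles, d_col, d_critic, d_aim, in_range,
                r_pseudo=-10.0, r_col=-20.0, r_arr=30.0, r_out=-2.0, r_step=-0.1):
    dists_t = [np.linalg.norm(p_uav - p) for p in targets]
    r_g = -min(dists_t)                     # Eq. (16): pull toward the nearest target
    r_s = r_step                            # shortest-path penalty term r_t^s
    r_c = 0.0                               # Eq. (17): collision / pseudo-collision
    for p_adv in obstacles:
        d = np.linalg.norm(p_uav - p_adv)
        if d <= d_col:
            r_c += r_col
        elif d <= d_critic:
            r_c += r_pseudo
    r_a = r_arr if min(dists_t) <= d_aim else 0.0   # Eq. (18): target area reached
    r_o = 0.0 if in_range else r_out                # Eq. (19): stay inside the task area
    return r_g + r_s + r_c + r_a + r_o              # Eq. (20)
```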

6. Simulation Experiments and Verification

6.1. Experimental Setup

In order to verify the effectiveness of SMADDPG in the field of UAV cluster task planning, we design a dynamic random environment based on pyglet and gym. The environment is three-dimensional, with continuous space and discrete time, and contains agents, landmarks, and airborne obstacles. Agents in the environment may need to compete with each other or cooperate to achieve their goals. In these scenarios, the flying direction and speed of an agent form its common action, while the landmark to fly to is its individual action. We provide the details of several common scenarios below.

6.1.1. Cooperative Communication

In this task scenario, there are two types of agents, speakers and listeners, placed around several landmarks of different colors. Through its observation, the listener knows the distance between each landmark and itself and the color of each landmark, but it does not know which color the target landmark is; the speaker, however, can identify the target by observing the listener's color. Therefore, the listener must communicate with the speaker and reach the target according to the speaker's guidance. The listener obtains a reward according to its distance from the target, which guides it toward the target. In Section 6.2, we compare DDPG, MADDPG, and SMADDPG in this task environment with a fixed or dynamically changing agent scale and give the experimental results.

6.1.2. Physical Deception

In this scenario, there are N cooperative agents, N landmarks, and an opponent. The opponent obtains rewards according to its distance from the target, but it does not know which landmark is the target. The cooperative agents obtain rewards according to the minimum distance between any agent and the target, and are penalized according to the opponent's distance from the target. The cooperative agents are therefore expected to learn to spread out over the landmarks so as to deceive the opponent.
In the cooperative communication scenario, the gray agent is the speaker, and the color of the listener indicates its target landmark. In the physical deception scenario, the blue agents try to cover each landmark to deceive the red agents. The two task scenarios are shown in Figure 8.

6.2. Experimental Result

This paper evaluates SMADDPG, MADDPG, and DDPG in the scenarios described in Section 6.1. For the two scenarios with a fixed agent scale, the experimental process of SMADDPG and DDPG is shown in Figure 9. SMADDPG completes the tasks specified in both scenarios better. Although the DDPG algorithm learns very well in the single-agent setting, it fails to meet the task requirements in multi-agent cooperation and competition scenarios. In the cooperative communication scenario, DDPG cannot handle the communication between agents, so it cannot determine which landmark is the target and can only move toward the center of the landmarks to maximize its reward. In the physical deception scenario, DDPG does not consider the overall reward of the agent cluster but greedily maximizes individual rewards, so all agents move toward the target landmark instead of spreading out over the landmarks to deceive the opponent. It can be seen that the SMADDPG algorithm learns better task-planning behavior than the DDPG algorithm in cooperative competition scenarios.
Next, the quantitative results for reward and success rate are shown in Figure 10 and Figure 11 for the fixed-scale and changing-scale cases of the cooperative communication scenario. In the changing-scale case, the number of UAVs follows the formula below, where $E$ is the index of the current episode; that is, the agent scale changes every 300 episodes, alternating between 6 and 12 agents.
$$N = 6\left\lfloor \frac{E \bmod 600}{300} \right\rfloor + 6. \tag{21}$$
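For clarity, a one-line sketch of the schedule in Equation (21): the cluster alternates between 6 and 12 agents every 300 episodes.

```python
def num_agents(episode):
    # Equation (21): N = 6 * floor((E mod 600) / 300) + 6, i.e. 6, ..., 6, 12, ..., 12, 6, ...
    return 6 * ((episode % 600) // 300) + 6

assert [num_agents(e) for e in (0, 299, 300, 599, 600)] == [6, 6, 12, 12, 6]
```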
It can be seen from Figure 10 that the SMADDPG algorithm obtains a higher reward than MADDPG and DDPG when the agent scale is fixed, and its success rate is relatively stable. Figure 11 shows that SMADDPG can also cope with the changing-scale scenario: although the reward and success rate fluctuate to a certain extent as the scale changes, they remain at a high level, and the algorithm still learns good task-planning behavior. In summary, SMADDPG achieves the best results in cooperation scenarios with both fixed and changing scales.
Then, in the physical deception scenario, SMADDPG and DDPG are used as the training algorithms for the cooperative agents and the adversary in different combinations. The success rates of both sides are shown in Table 3.
The experimental results in Table 3 explain the behavior of the SMADDPG and DDPG algorithms in Figure 9. When the cooperative agents use SMADDPG and the adversary uses DDPG, the gap between the two sides is the largest. This is because SMADDPG can cover each landmark in a decentralized way, leaving the DDPG adversary unable to determine which landmark to go to, so the deception effect is best in this case. In summary, the SMADDPG algorithm achieves better decentralized coverage in the physical deception scenario and thus deceives the adversary more effectively.
Therefore, the experiments show that, compared with other reinforcement learning algorithms, the SMADDPG algorithm is better able to derive an effective task-planning method from the overall reward of the cluster, and thus handles cooperation and competition scenarios better and improves the task execution efficiency of the UAV cluster.

7. Summary and Conclusions

This paper proposes a deep multi-agent reinforcement-learning-based task-planning algorithm using Seq2Seq for multiple UAVs under both fixed and dynamic scales. Since the POMDP under consideration is a high-dimensional problem with a continuous action space, DDPG is employed for reinforcement learning. To apply DDPG to the task-planning problem, the POMDP is first defined appropriately. Then, to handle dynamic scale changes, we transplant Seq2Seq, which is commonly used in natural language processing, into the multi-agent reinforcement learning setting to learn the mapping between global observations and individual actions, a mapping that changes with the number of agents. In view of the successful application of DDPG to single-UAV path planning, we use DDPG to build a local network that inputs the local observations of the agents and outputs common actions determining their flight direction and speed. Finally, we add the individual action and the common action together and apply them in the simulation environment for the next update. We verify our algorithm in competition and cooperation scenarios; the results, shown in Figure 10 and Figure 11, are all good. This also proves that our algorithm achieves better task-planning results than other reinforcement learning algorithms in environments with dynamic scale changes.
During the experiments, we noticed that the dimension of the global observation in the 3D environment increased significantly, which increased the training time. We also observed that when there were many agents, each agent's policy update depended mostly on its neighboring agents. To reduce the training time of multi-agent reinforcement learning, we will later introduce graph concepts into reinforcement learning: each agent updates its policy according to the states of its adjacent agents, which can reduce the input dimension of the central network and the training time of the multi-agent system.
At the same time, in the collaborative confrontation environment we studied, we noticed that the dynamics are reflected not only in the dynamic changes of the agent scale but also in the behavior changes of unknown agents. Therefore, in subsequent research we hope to introduce long short-term memory (LSTM) to predict unknown agents, which can help us make correct task-planning decisions in advance.

Author Contributions

The authors jointly proposed the idea and contributed equally to the writing of manuscript. Writing: Z.L.; Coding: Z.Z. and C.Q.; supervised the research and revised the manuscript: C.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Huang, T.P.; Huang, D.Q.; Wang, Z.K.; Shah, A. Robust tracking control of a quadrotor UAV based on adaptive sliding mode controller. Complexity 2019, 29, 37–42. [Google Scholar] [CrossRef]
  2. Sun, F.J.; Wang, X.C.; Zhang, R. Task scheduling system for UAV operations in agricultural plant protection environment. J. Ambient Intell. Humaniz. Comput. 2020, 17, 37–46. [Google Scholar] [CrossRef]
  3. Meng, H.W.; Guo, Y.M. Automatic safety routing inspection of the electric circuits based on UAV light detection and ranging. Destech Trans. Eng. Technol. Res. 2017, 23, 102–113. [Google Scholar] [CrossRef] [Green Version]
  4. Scherer, J.; Rinner, B. Persistent multi-UAV surveillance with energy and communication constraints. In Proceedings of the IEEE International Conference on Automation Science and Engineering, Fort Worth, TX, USA, 21–25 August 2016; pp. 1225–1230. [Google Scholar]
  5. Chen, X.Y.; Nan, Y.; Yang, Y. Multi-UAV reconnaissance task assignment for heterogeneous targets based on modified symbiotic organism search algorithm. Sensors 2019, 19, 734. [Google Scholar] [CrossRef] [Green Version]
  6. Green, D.R.; Hagon, J.J.; Gómez, C.; Gregory, B.J. Chapter 21—Using Low-Cost UAVs for Environmental Monitoring, Mapping, and Modelling: Examples from the Coastal Zone; Krishnamurthy, R.R., Jonathan, M.P., Srinivasalu, S., Glaeser, B., Management, C., Eds.; Academic Press: Cambridge, MA, USA, 2019; pp. 465–501. ISBN 9780128104736. [Google Scholar]
  7. Grayson, S. Search & Rescue Using Multi-Robot Systems; School of Computer Science and Informatics, University College Dublin: Dublin, Ireland, 2014; pp. 1–14. [Google Scholar]
  8. Oh, B.H.; Kim, K.; Choi, H.L.; Hwang, I. Cooperative multiple agent-based algorithm for evacuation planning for victims with different urgencies. J. Aerosp. Inf. Syst. 2018, 15, 382–395. [Google Scholar] [CrossRef]
  9. Jung, S.; Ariyur, K.B. Strategic cattle roundup using multiple quadrotor UAVs. Int. J. Aeronaut. Space Sci. 2017, 18, 315–326. [Google Scholar] [CrossRef]
  10. Lottes, P.; Khanna, R.; Pfeifer, J.; Siegwart, R.; Stachniss, C. UAV-based crop and weed classification for smart farming. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May 2017–3 June 2017; pp. 3024–3031. [Google Scholar]
  11. Barmpounakis, E.N.; Vlahogianni, E.I.; Golias, J.C. Unmanned Aerial Aircraft Systems for transportation engineering: Current practice and future challenges. Int. J. Transp. Sci. Technol. 2016, 5, 111–122. [Google Scholar] [CrossRef]
  12. Omagari, H.; Higashino, S.I. Provisional-ideal-point-based multi-objective optimization method for drone delivery problem. Int. J. Aeronaut. Space Sci. 2018, 19, 262–277. [Google Scholar] [CrossRef]
  13. Bai, X.; Yan, W.; Ge, S.S.; Cao, M. An integrated multi-population genetic algorithm for multi-vehicle task assignment in a drift field. Inf. Sci. 2018, 453, 227–238. [Google Scholar] [CrossRef]
  14. Bai, X.; Yan, W.; Cao, M.; Xue, D. Distributed multi-vehicle task assignment in a time-invariant drift field with obstacles. Inst. Eng. Technol. 2019, 13, 2886–2893. [Google Scholar] [CrossRef]
  15. Bektas, T. The multiple traveling salesman problem: An overview of formulations and solution procedures. Omega 2006, 34, 209–219. [Google Scholar] [CrossRef]
  16. Abd-Elrahman, E.; Afifi, H.; Atzori, L.; Hadji, M.; Pilloni, V. IoT-D2D task allocation: An award-driven game theory approach. In Proceedings of the 2016 23rd International Conference on Telecommunications, Thessaloniki, Greece, 16–18 May 2016; pp. 1–6. [Google Scholar]
  17. Bai, X.; Yan, W.; Ge, S.S. Distributed Task Assignment for Multiple Robots Under Limited Communication Range. IEEE Trans. Syst. Man Cybern. Syst. 2022, 52, 4259–4271. [Google Scholar] [CrossRef]
  18. Bai, X.; Fielbaum, A.; Kronmüller, M.; Knoedler, L.; Alonso-Mora, J. Group-Based Distributed Auction Algorithms for Multi-Robot Task Assignment. IEEE Trans. Autom. Sci. Eng. 2022. [Google Scholar] [CrossRef]
  19. Chen, X.; Liu, Y. Cooperative Task Assignment for Multi-UAV Attack Mobile Targets. In Proceedings of the 2019 Chinese Automation Congress (CAC), Hangzhou, China, 22–24 November 2019; pp. 2151–2156. [Google Scholar]
  20. Li, M.; Liu, C.; Li, K.; Liao, X.; Li, K. Multi-task allocation with an optimized quantum particle swarm method. Appl. Soft Comput. 2020, 96, 106603. [Google Scholar] [CrossRef]
  21. Ye, F.; Chen, J.; Tian, Y.; Jiang, T. Cooperative task assignment of a heterogeneous multi-UAV system using an adaptive genetic algorithm. Electronics 2020, 9, 687. [Google Scholar] [CrossRef] [Green Version]
  22. Bai, X.; Yan, W.; Cao, M. Clustering-Based Algorithms for Multivehicle Task Assignment in a Time-Invariant Drift Field. IEEE Robot. Autom. Lett. 2017, 2, 2166–2173. [Google Scholar] [CrossRef] [Green Version]
  23. Mclain, T.W.; Chandler, P.R.; Rasmussen, S.; Pachter, M. Cooperative control of UAV rendezvous. In Proceedings of the 2001 American Control Conference, Arlington, VA, USA, 25–27 June 2001; Volume 3, pp. 2309–2314. [Google Scholar]
  24. Kumaer, R.; Hyland, D.C. Control law design using repeated trials. In Proceedings of the 2001 American Control Conference, Arlington, VA, USA, 25–27 June 2001; Volume 2, pp. 837–842. [Google Scholar]
  25. Singh, L.; Fuller, J. Trajectory generation for a UAV in urban terrain, using nonlinear MPC. In Proceedings of the American Control Conference, Arlington, VA, USA, 25–27 June 2001; Volume 3, pp. 2301–2308. [Google Scholar]
  26. Bellingham, J.; Richards, A.; How, J.P. Receding horizon control of autonomous aerial vehicles. In Proceedings of the American Control Conference, Anchorage, AK, USA, 8–10 May 2002; Volume 5, pp. 3741–3746. [Google Scholar]
  27. Richards, A.; Bellingham, J.; Tillerson, M.; How, J. Coordination and control of multiple UAVs. In Proceedings of the AIAA Guidance, Navigation, and Control Conference and Exhibit, San Francisco, CA, USA, 15–18 August 2002; p. 4588. [Google Scholar]
  28. Xiao, Z.; Zhu, B.; Wang, Y.; Miao, P. Low-complexity path planning algorithm for unmanned aerial vehicles in complicated scenarios. IEEE Access 2018, 6, 57049–57055. [Google Scholar] [CrossRef]
  29. Zhang, K.; Li, K.; Shi, H.T.; Zhang, Z. Autonomous guidance maneuver control and decision making algorithm based on deep reinforcement learning UAV route. Syst. Eng. Electron. 2020, 42, 1567–1574. [Google Scholar]
  30. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; The MIT Press: Cambridge, MA, USA, 2014; pp. 3104–3112. [Google Scholar]
  31. Cho, K.; Van Merrienboer, B.; Gulcehre, C.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014; pp. 1724–1734. [Google Scholar]
  32. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the ICLR (Poster), San Juan, PR, USA, 2–4 May 2016. [Google Scholar]
  33. Shima, T.; Rasmussen, S.J.; Sparks, A.G. UAV cooperative multiple task assignments using genetic algorithms. In Proceedings of the American Control Conference, Portland, OR, USA, 8–10 June 2005; pp. 2989–2994. [Google Scholar]
  34. Ou, W.; Zou, F.; Xu, X.; Zheng, G. Targets assignment for cooperative multi-UAVs based on chaos optimization algorithm. In Proceedings of the 9th International Conference for Young Computer Scientists, Hunan, China, 18–21 November 2008; pp. 2852–2856. [Google Scholar]
  35. Babel, L. Coordinated target assignment and UAV path planning with timing constraints. Intell. Robot. Syst. 2018, 94, 857–869. [Google Scholar] [CrossRef]
  36. Liu, J.L.; Shi, Z.G.; Zhang, Y. A new method of UAVs multi-target task assignment. DEStech Trans. Eng. Technol. Res. 2018, 388–394. [Google Scholar] [CrossRef]
  37. Bellingham, J.; Tillerson, M.; Richards, A.; How, J.P. Multi-task allocation and path planning for cooperating UAVs. In Cooperative Control: Models, Applications and Algorithms; Springer: Berlin/Heidelberg, Germany, 2003; pp. 23–41. [Google Scholar]
  38. Alighanbari, M. Task Assignment Algorithms for Teams of UAVS in Dynamic Environments. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2004. [Google Scholar]
  39. Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1994. [Google Scholar]
  40. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  41. Hausknecht, M.; Stone, P. Deep recurrent q-learning for partially observable mdps. In Proceedings of the 2015 AAAI Fall Symposium Series, Arlington, VA, USA, 12–14 November 2015. [Google Scholar]
  42. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 1889–1897. [Google Scholar]
  43. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  44. Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, Cambridge, MA, USA, 29 November–4 December 1999; pp. 1057–1063. [Google Scholar]
  45. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on International Conference on Machine Learning, Beijing, China, 21–26 June 2014; Volume 32, pp. 387–395. [Google Scholar]
  46. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 1928–1937. [Google Scholar]
  47. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-agent Actor-Critic for Mixed Cooperative-Competitive Environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; The MIT Press: Cambridge, MA, USA, 2017; pp. 6382–6393. [Google Scholar]
  48. Foerster, J.N.; Farquhar, G.; Afouras, T.; Nardelli, N. Counterfactual Multi-Agent Policy Gradients. In Proceedings of the Association for the Advancement of Artificial Intellugence (AAAI), Hilton New Orleans Riverside, New Orleans, LA, USA, 2–7 February 2018; pp. 2974–2982. [Google Scholar]
  49. Wei, E.; Wicke, D.; Freelan, D.; Luke, S. Multiagent soft q-learning. In Proceedings of the 2018 AAAI Spring Symposium Series, Palo Alto, CA, USA, 26–28 March 2018. [Google Scholar]
  50. Qie, H.; Shi, D.; Shen, T.; Xu, X.; Li, Y.; Wang, L. Joint Optimization of Multi-UAV Target Assignment and Path Planning Based on Multi-Agent Reinforcement Learning. IEEE Access 2019, 7, 146264–146272. [Google Scholar] [CrossRef]
Figure 1. Description of the POMDP model.
Figure 2. Framework of SMADDPG.
Figure 3. Framework of DDPG.
Figure 4. Framework of local network.
Figure 5. Framework of global network.
Figure 6. The MDP model of the multi-UAV task planning.
Figure 7. Critical area, pseudo-collision and real collision.
Figure 8. Task scenarios adopted in the experiments: (a) cooperative communication; (b) physical deception.
Figure 9. Comparison between SMADDPG (a) and DDPG (b) in the cooperative communication (first row) and physical deception (second row) environments at the start, middle, and end.
Figure 10. Two indicators are used to evaluate algorithms in the cooperative communication environment with six agents: (a) reward, (b) success rate.
Figure 11. Two indicators are used to evaluate algorithms in the cooperative communication environment with N agents: (a) reward, (b) success rate.
Table 1. Key parameters in the algorithm.

Name | Parameter | Initial Value Setting
Online policy network | $\theta_i^{\mu}$ | Random initialization
Target policy network | $\theta_i^{\mu'}$ | $\theta_i^{\mu}$
Central policy network | $\theta_N^{\mu}$ | Random initialization
Online Q network | $\theta_i^{Q}$ | Random initialization
Target Q network | $\theta_i^{Q'}$ | $\theta_i^{Q}$
Central Q network | $\theta_N^{Q}$ | Random initialization
Seq2Seq network | $\delta$ | Random initialization
Discount factor | $\gamma$ | 0.95
Learning rate | $\alpha$ | 0.001
Soft update factor | $\tau$ | 0.01
Maximum episodes | $E_{\max}$ | 30,000
Maximum steps | $S_{\max}$ | 40
Mini-batch size | $B$ | 1024
Table 2. Parameters of the UAV model in the experiment.

Parameter | Value | Unit
Maximum flight speed $v_{\max}$ | 10 | m/s
Maximum flight acceleration $a_{\max}$ | 3 | m/s²
Minimum flight altitude $h_{\min}$ | 0.1 | km
Maximum flight altitude $h_{\max}$ | 9.2 | km
Maximum turning angle $\theta_{\max}$ | 60 | ° (degree)
Maximum climbing angle $\psi_{\max}$ | 45 | ° (degree)
Table 3. Results on the physical deception task, with N = 6, 12 cooperative agents.

Agent $\pi$ | Adversary $\pi$ | AG succ% (N = 6) | ADV succ% (N = 6) | AG succ% (N = 12) | ADV succ% (N = 12)
SMADDPG | DDPG | 85.2 | 23.4 | 81.3 | 19.9
SMADDPG | SMADDPG | 70.1 | 14.8 | 68.4 | 11.6
DDPG | DDPG | 15.6 | 34.7 | 12.9 | 30.2
DDPG | SMADDPG | 30.7 | 27.5 | 25.8 | 22.7
