Article

Hierarchical Maneuver Decision Method Based on PG-Option for UAV Pursuit-Evasion Game

1 School of Electronics and Information, Northwestern Polytechnical University, Xi'an 710072, China
2 School of Robotic and Intelligent Systems, Moscow Aviation Institute, 125993 Moscow, Russia
* Author to whom correspondence should be addressed.
Drones 2023, 7(7), 449; https://doi.org/10.3390/drones7070449
Submission received: 23 April 2023 / Revised: 30 June 2023 / Accepted: 4 July 2023 / Published: 6 July 2023
(This article belongs to the Special Issue Intelligent Recognition and Detection for Unmanned Systems)

Abstract

Aiming at the autonomous decision-making problem in the unmanned aerial vehicle (UAV) pursuit-evasion game, this paper proposes a hierarchical maneuver decision method based on the PG-option. Firstly, comprehensively considering the possible relative situations of the two sides, this paper designs four maneuver decision options: advantage game, quick escape, situation change and quick pursuit; the four options are trained with Soft Actor-Critic (SAC) to obtain the corresponding meta-policies. In addition, to avoid a high-dimensional state space in the hierarchical model, this paper combines the policy gradient (PG) algorithm with the traditional option-based hierarchical reinforcement learning algorithm, and the PG algorithm is used to train the policy selector as the top-level policy. Finally, to solve the problem of frequent switching of meta-policies, this paper introduces delayed selection for the policy selector and uses expert experience to design the termination functions of the meta-policies, which improves the flexibility of policy switching. Simulation experiments show that the PG-option algorithm performs well in the UAV pursuit-evasion game and adapts to various environments by switching to the appropriate meta-policy according to the current situation.

1. Introduction

Unmanned aerial vehicles (UAVs) [1,2,3,4,5,6,7] are used in many fields, such as intelligent confrontation [8], target rounding [9] and intelligent transportation [10], because they are unmanned, offer good concealment and avoid casualties. UAV pursuit-evasion [11] involves a game between two UAVs with competing interests. In the process of UAV pursuit-evasion, making effective maneuvering decisions [12] to destroy or capture the other side is the key to victory, and real-time intelligent maneuvering decision-making is the core of the problem. The maneuvering decision-making mechanism reflects the intelligence level of a UAV in the pursuit-evasion game. Therefore, it is necessary to design an effective maneuvering policy for the UAV pursuit-evasion game.
At present, decision algorithms for UAV pursuit-evasion mainly include differential game theory [13], the influence graph method [14] and heuristic search algorithms [15]. F. Yu et al. [13] take into account the impact of environmental impediments in the pursuit-evasion game between UAVs and UGVs, qualitatively analyze the pursuit problem as a differential game and apply differential game theory to the pursuit-evasion game. Q. Pan et al. [14] propose a cooperative maneuver decision method for multiple unmanned aerial vehicles based on influence graph theory, in which a state-predicted influence diagram model is used to analyze situational elements and an unscented Kalman filter is used for belief state updating. Mikhail et al. [15] propose schemes based on Apollonius circles to solve the pursuit-evasion problem for UAVs operating in a non-deterministic environment. However, traditional UAV pursuit-evasion decision methods rely on complicated rule bases that can hardly cover all pursuit-evasion situations and therefore offer poor flexibility.
Since the beginning of the 21st century, artificial intelligence [16] has developed rapidly. Artificial intelligence methods such as deep learning [17] and reinforcement learning [18] have been applied successfully in many high-tech fields, and their application to UAV pursuit-evasion has also achieved good results. Researchers regard the UAV as an agent in reinforcement learning, enabling it to gradually acquire optimized policies in complex environments through trial-and-error learning [19]. Deep reinforcement learning combines the perceptual ability of deep learning with the decision-making ability of reinforcement learning, providing a way to solve the UAV pursuit-evasion problem [20,21,22,23,24]. Zhang et al. [20] design a guiding reward function based on the Deep Deterministic Policy Gradient (DDPG) algorithm for the UAV pursuit-evasion task and introduce a soft update strategy based on a sliding average, with which a UAV swarm can successfully carry out pursuit missions. However, the game between multiple UAVs and one UAV results in a significant advantage gap between the opposing sides, and the two-dimensional simulation environment differs greatly from actual pursuit-evasion tasks. Zhang et al. [21] construct a multi-agent coronal bidirectional coordinated target prediction network (CBC-TP network) by vectorial extension of the multi-agent deep deterministic policy gradient formulation to realize the UAV pursuit-evasion process. However, the pursuit-evasion environment is a single scenario, and various situations are not considered comprehensively. Fu et al. [22] propose IL-DDPG, which combines the DDPG algorithm with imitation learning and introduces the proportional guidance law as a guidance strategy. Compared with DDPG, it improves search efficiency and realizes fast tracking of an evading UAV by the pursuing UAV. However, the two-dimensional simulation environment differs significantly from the real pursuit-evasion environment, and the trained UAV can only perform the pursuit task, resulting in limited functionality. Sun et al. [23] study the pursuit-evasion game of multiple UAVs with multiple static obstacles in a two-dimensional bounded environment and propose an attention-based multi-agent deep deterministic policy gradient algorithm to solve the multi-UAV pursuit problem. However, the threat posed by the opponent UAV is not taken into account, and a game process between the opponent UAV and our UAV is lacking. Vlahov et al. [24] discuss a framework for developing reactive strategies that can learn to exploit opponent behaviors, apply the A3C algorithm to UAV pursuit decision-making and verify the effectiveness of the learned strategies through Monte Carlo experiments. However, the generalization of the algorithm is not considered: when the environment changes, the performance advantage of the algorithm is not evident, and multiple situational scenarios are not considered comprehensively.
Although reinforcement learning performs well in the UAV pursuit-evasion game, it also faces problems. In the training process, the large state space may lead to the curse of dimensionality, which in turn slows the convergence of training. In addition, the pursuit-evasion environment changes in real time. The above algorithms do not consider the overall situation comprehensively: only one strategy is trained, so the trained UAV can only complete the task of pursuit or evasion, its behavior is relatively simple, and it cannot turn defeat into victory.
At present, hierarchical reinforcement learning algorithms are widely used in many fields [25,26,27,28]. Hierarchical reinforcement learning performs multiple sub-tasks in a hierarchical manner, which improves the decision-making efficiency of the model. Based on the above motivations, this paper applies hierarchical reinforcement learning to intelligent decision-making for UAV pursuit-evasion. Building on the option-based hierarchical reinforcement learning framework [29], this paper uses a policy gradient (PG) algorithm to train the top-level policy selector and proposes a UAV hierarchical maneuver decision-making method based on the PG-option. Four options, advantage game (AG), quick escape (QE), situation change (SC) and quick pursuit (QP), are designed by comprehensively considering the various situations in the UAV pursuit-evasion game, and the corresponding meta-policies are trained by the SAC algorithm. A meta-policy termination function is designed to improve the flexibility of meta-policy switching. The experimental results show that the UAV trained with the PG-option can flexibly switch meta-policies in the pursuit-evasion game and can use different meta-policy combinations to deal with different complex scenarios, reflecting the superiority and robustness of the proposed hierarchical maneuver decision-making method.
The main contributions of this paper are summarized as follows:
(1)
Comprehensively considering the constraint information in the process of UAV pursuit-evasion, this paper constructs a flight control model for the UAV and introduces the concept of a threat zone, which makes the simulation more realistic;
(2)
Considering the various situations in the process of UAV pursuit-evasion comprehensively, four meta-policies are designed for UAV flight decision-making, which not only enrich the maneuver library but also improve the performance of the algorithm;
(3)
A hierarchical maneuvering decision-making method for the UAV is designed to ensure that the UAV can flexibly switch meta-policies under different situations and achieve victory.

2. Problem Formulation and Preliminaries

In this section, to simplify the UAV pursuit-evasion scenario and remove some unnecessary influencing factors, some assumptions are made for the scenario. The UAV flight control model and the threat zone model are established for the subsequent experiments.

2.1. Scenario Description and Assumptions

In this paper, we focus on the problem of UAV maneuvering decisions in a one-to-one UAV pursuit-evasion game. In the process of pursuit-evasion, many factors may affect the flight of the UAV and make the model more complex, but it is not necessary to take all of them into account. Therefore, to focus on our research problem and simplify the scenario, this paper makes the following assumptions:
(a)
The UAV is assumed to be a rigid body, and the gravity acceleration is a unified value;
(b)
This paper ignores the influence of earth curvature, earth rotation and earth revolution on the UAV flight;
(c)
To simplify the complexity of the UAV flight, this paper only considers the kinematics model.
The scenario is defined as a three-dimensional pursuit-evasion scenario. Therefore, a three-dimensional kinematics model of the UAV is established, and the UAV is described by physical quantities such as position, speed and attitude. The OXYZ northeast coordinate system is established. The coordinate origin O is the center point of the scenario. The X-axis points north; the Z-axis points east; and the Y-axis points vertically upward. The situation diagram of both sides in the scenario is shown in Figure 1.
The situation information of the red UAV is $R_m = (X_m, Y_m, Z_m)$ and $V_m = (v_{xm}, v_{ym}, v_{zm})$. The situation information of the blue UAV is $R_t = (X_t, Y_t, Z_t)$ and $V_t = (v_{xt}, v_{yt}, v_{zt})$.
The relative position vector from the red UAV to the blue UAV is $D$, and the distance scalar is $d$. The azimuth angle $q$ represents the angle between the red UAV's velocity $V_m$ and $D$. The formulas are as follows:

$$D = (X_t - X_m,\; Y_t - Y_m,\; Z_t - Z_m) \tag{1}$$

$$q = \arccos\left(\frac{D \cdot V_m}{\|D\|\,\|V_m\|}\right) = \arccos\left(\frac{(X_t - X_m)v_{xm} + (Y_t - Y_m)v_{ym} + (Z_t - Z_m)v_{zm}}{d\sqrt{v_{xm}^2 + v_{ym}^2 + v_{zm}^2}}\right) \tag{2}$$

$$d = \|D\| \tag{3}$$
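For clarity, Formulas (1)–(3) can be evaluated with a few lines of code. The following NumPy sketch is our own illustration; the function name and argument layout are assumptions, not part of the paper:

```python
import numpy as np

def relative_geometry(pos_red, vel_red, pos_blue):
    """Return the relative position vector D, distance d and azimuth q (deg), Formulas (1)-(3)."""
    D = np.asarray(pos_blue, dtype=float) - np.asarray(pos_red, dtype=float)  # Formula (1)
    d = np.linalg.norm(D)                                                      # Formula (3)
    v = np.asarray(vel_red, dtype=float)
    cos_q = np.dot(D, v) / (d * np.linalg.norm(v))                             # Formula (2)
    q = np.degrees(np.arccos(np.clip(cos_q, -1.0, 1.0)))
    return D, d, q

# Example: blue UAV ahead of and to the east of the red UAV
D, d, q = relative_geometry(pos_red=[0, 0, 0], vel_red=[100, 0, 0], pos_blue=[3000, 0, 3000])
print(d, q)  # roughly 4242.6 m and 45 degrees
```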

2.2. UAV Model

It is assumed that our UAV can obtain the position and speed of the opponent through the ground command center. Our UAV needs to evaluate the current situation and make maneuvering decisions to effectively capture the opponent.
In the process of UAV kinematics modeling, it is considered a rigid body and regarded as a mass point. The kinematics equation of a three-degree-of-freedom UAV is as follows:
$$\begin{cases} X_{t+dT} = X_t + V_{t+dT}\cos\theta_{t+dT}\cos\varphi_{t+dT}\cdot dT \\ Y_{t+dT} = Y_t + V_{t+dT}\sin\theta_{t+dT}\cdot dT \\ Z_{t+dT} = Z_t + V_{t+dT}\cos\theta_{t+dT}\sin\varphi_{t+dT}\cdot dT \end{cases} \tag{4}$$

where $\varphi$ denotes the heading angle; $\theta$ denotes the pitch angle; $V$ is the speed of the UAV; and $X$, $Y$, $Z$ denote the components of the UAV's position along the coordinate axes.
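A minimal sketch of one integration step of the point-mass model in Formula (4) follows; the discretization helper is our own and assumes angles given in radians:

```python
import math

def kinematics_step(x, y, z, v, theta, phi, dT):
    """Advance the three-degree-of-freedom point-mass model of Formula (4) by one step dT."""
    x += v * math.cos(theta) * math.cos(phi) * dT   # north component (X-axis)
    y += v * math.sin(theta) * dT                   # vertical component (Y-axis)
    z += v * math.cos(theta) * math.sin(phi) * dT   # east component (Z-axis)
    return x, y, z
```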
To make the model more realistic, this paper sets a threat range centered on the UAV. The threat range is a conical area whose apex is the UAV, whose axis is the UAV's velocity direction and whose half-angle is $q_{\max}$; it is called the threat zone in this paper. The threat zone contains three important indicators: the maximum threat distance $D_{\max}$, the minimum threat distance $D_{\min}$ and the maximum threat angle $q_{\max}$. A two-dimensional diagram of the threat zone is shown in Figure 2.
When the distance between the two sides is less than the minimum threat distance $D_{\min}$, the two sides are considered to have collided and crashed, and our UAV fails. When the continuous flight time of the blue UAV in our threat zone, defined as $t_{in}$, exceeds the maximum time threshold $t_{\max}$, the red UAV is considered to have successfully captured the blue UAV. The capture condition is as follows:

$$\begin{cases} t_{in} > t_{\max} \\ q < q_{\max} \\ D_{\min} < d < D_{\max} \end{cases} \tag{5}$$
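The capture test of Formula (5) can be implemented roughly as follows; the variable names and the way the in-zone timer is reset are our assumptions, with the threshold values taken from Table 2:

```python
def update_capture(q, d, t_in, dt, q_max=20.0, d_min=1000.0, d_max=3000.0, t_max=2.0):
    """Accumulate the time the opponent spends in the threat zone and test Formula (5).

    q: azimuth to the opponent (deg), d: distance (m), t_in: time already spent in the zone (s).
    Default thresholds follow Table 2 (20 deg, 1-3 km, 2 s).
    Returns the updated timer and True once the capture condition holds.
    """
    if q < q_max and d_min < d < d_max:
        t_in += dt          # opponent stays inside the conical threat zone
    else:
        t_in = 0.0          # leaving the zone resets the continuous-time counter (our assumption)
    return t_in, t_in > t_max
```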

3. Hierarchical Maneuver Decision Method for UAV Pursuit-Evasion Game

In this section, a hierarchical maneuver decision-making method for the UAV pursuit-evasion game based on the PG-option is proposed. Considering the situation of both sides comprehensively, UAV maneuver decision-making is divided into four meta-policies: advantage game (AG), quick escape (QE), situation change (SC) and quick pursuit (QP). Because different meta-policies exhibit different maneuvering characteristics, a corresponding reward function is designed for each and trained with the SAC algorithm. Then, the PG algorithm is integrated into the traditional hierarchical reinforcement learning framework to train the upper policy selector, and expert experience is introduced to design the meta-policy termination functions. The structure of the hierarchical decision-making model is shown in Figure 3. The hierarchical decision-making model designed in this paper consists of two policy layers. At the bottom of the hierarchical decision model are the four meta-policies, which are pre-trained so that they can perform the corresponding actions according to the current pursuit-evasion situation when selected. At the top of the hierarchical decision model is the policy selector, which selects the corresponding meta-policy according to the current situation.

3.1. Meta-Policy Model Training Algorithm Based on SAC

The four meta-policies of AG, QE, SC and QP are trained with the maximum entropy Soft Actor-Critic (SAC) algorithm. SAC [30,31,32,33] is a classical reinforcement learning algorithm based on the Actor-Critic framework that learns a stochastic policy. Entropy is introduced to represent the randomness of the policy. After training, the agent achieves a balance between the reward value and the entropy, so that it increases its exploration of the state space while maximizing the reward, which accelerates learning.
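For reference, the standard maximum-entropy objective that SAC optimizes (quoted from the general SAC literature rather than from this paper) trades the expected return against the policy entropy $\mathcal{H}$, weighted by a temperature coefficient $\alpha$:

$$J(\pi) = \sum_{t}\mathbb{E}_{(s_t, a_t)\sim\rho_\pi}\big[\, r(s_t, a_t) + \alpha\,\mathcal{H}(\pi(\cdot \mid s_t)) \,\big]$$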
The four meta-policies all use the same state space, which is represented by a nine-dimensional tuple, as shown in Formula (6).
$$state = [X_m, Y_m, Z_m, v, \theta, \varphi, d, q_m, q_t] \tag{6}$$

where $X_m, Y_m, Z_m$ represent the projection of our UAV's position on the three coordinate axes; $v$ is the speed of our UAV; $\theta$ denotes the pitch angle of our UAV; $\varphi$ is the heading angle of our UAV; $d$ denotes the distance between the opponent and our UAV; $q_m$ is the relative azimuth of the opponent UAV with respect to our UAV; and $q_t$ is the relative azimuth of our UAV with respect to the opponent UAV. $q_t$ and $q_m$ are shown in Figure 4, where the blue side is the opponent UAV and the red side is our UAV.
Our UAV is controlled through the speed change rate, pitch angle change rate and heading angle change rate. The action space of each meta-policy can therefore be represented by a triple in the form of Formula (7).

$$action\_space = [\dot{v}, \dot{\theta}, \dot{\varphi}] \tag{7}$$

where $\dot{v}$ denotes the acceleration; $\dot{\theta}$ denotes the change rate of the pitch angle; and $\dot{\varphi}$ denotes the change rate of the heading angle.
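As an illustration, the state and action vectors above could be assembled as follows; the helper names and the rate limits (taken loosely from Table 2) are our own assumptions rather than the paper's code:

```python
import numpy as np

def build_state(own_pos, own_speed, own_pitch, own_heading, d, q_m, q_t):
    """Nine-dimensional meta-policy state of Formula (6)."""
    x_m, y_m, z_m = own_pos
    return np.array([x_m, y_m, z_m, own_speed, own_pitch, own_heading, d, q_m, q_t],
                    dtype=np.float32)

# Action: [dv, dtheta, dphi] -- per-step rate limits are assumed values for illustration
ACTION_LOW  = np.array([-10.0, -4.0, -10.0], dtype=np.float32)   # m/s, deg, deg per step
ACTION_HIGH = np.array([ 10.0,  4.0,  10.0], dtype=np.float32)

def clip_action(action):
    """Keep the commanded rates within the assumed physical limits."""
    return np.clip(action, ACTION_LOW, ACTION_HIGH)
```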
When a reinforcement learning algorithm is used to train an agent, the design of the reward function is very important for the training result. A good reward function is intuitive and simple and guides the agent in the optimal direction. Different reward functions need to be set for the different meta-policies to explore the corresponding optimal policies.
To facilitate the setting of different reward functions for the four meta-policy trainings, the following reward and penalty items are set:
  • $R_{mq1}$ is the continuous angle penalty term for our UAV, which continuously penalizes the agent within a round: $R_{mq1} = -q_m/180$;
  • $R_{mq2}$ is the sparse angle reward term for our UAV, which rewards the agent under certain conditions within a round: $R_{mq2} = 1$, if $q_m < q_{\max}$;
  • $R_{tq1}$ is the continuous angle penalty term for the opponent UAV, which continuously penalizes the agent within a round: $R_{tq1} = -q_t/180$;
  • $R_{tq2}$ is the sparse angle reward term for the opponent UAV, which rewards the agent under certain conditions within a round: $R_{tq2} = 1$, if $q_t < q_{\max}$;
  • $R_{d1}$ rewards the agent when the opponent UAV is within the threat-distance band of our UAV: $R_{d1} = 1$, if $D_{\min} < d < D_{\max}$;
  • $R_{d2}$ penalizes the agent when the opponent UAV is outside the threat-distance band of our UAV: $R_{d2} = -3$, if $d \ge D_{\max}$ or $d \le D_{\min}$;
  • $R_{d3}$ denotes the ratio of the distance between the two sides to the maximum threat distance: $R_{d3} = d/D_{\max}$.
Then, we describe each meta-policy and set a different reward combination for each (a code sketch of the AG combination is given after this list):
  • Advantage Game (AG): AG refers to the situation in which the two UAVs are in each other's threat zone. In this situation, our UAV needs to escape from the opponent's threat zone as soon as possible while keeping the distance between the two sides within a small range so that our UAV can capture the opponent. For AG, the total reward is $R_{AG} = w_1(R_{mq1} + R_{mq2}) - w_2(R_{tq1} + R_{tq2}) + w_3(R_{d1} + R_{d2} + R_{d3}/5)$. This paper holds that the rewards of both sides are equally important to the AG task, so $w_1 = w_2 = 0.4$. Furthermore, the primary task of AG is to ensure that our UAV escapes from the opponent's threat zone quickly, so the weight of the distance reward is small, that is, $w_3 = 0.2$.
  • Quick Escape (QE): QE means that our UAV is located in the opponent's threat zone while the opponent UAV is not in our threat zone, so the opponent's advantage is greater than ours. The primary task of our UAV is to maneuver quickly to escape the opponent's threat zone. For the QE task, the total reward is $R_{QE} = w_1(R_{tq1} + R_{tq2}) - w_2(R_{d1} + R_{d3}/5)$. In the QE task, the angle reward and the distance reward are equally important, that is, $w_1 = w_2 = 0.5$.
  • Situation Change (SC): SC means that neither side is in the other's threat zone, but the opponent's advantage is greater than ours. Our primary task is to maneuver quickly to change the situation so that our advantage becomes greater than the opponent's. For the SC task, the total reward is $R_{SC} = w_1(R_{mq1} + R_{mq2}) - w_2(R_{tq1} + R_{tq2}) + w_3 R_{d3}$. This paper considers our angle reward and the distance reward to be equally important for the SC task, that is, $w_1 = w_3 = 0.4$, $w_2 = 0.2$.
  • Quick Pursuit (QP): QP means that neither side is in the other's threat zone and our UAV's advantage is greater than the opponent UAV's advantage. The primary task of our UAV is to capture the opponent UAV, so it needs to maneuver quickly to make the opponent UAV fall into our threat zone. The total reward is $R_{QP} = w_1(R_{mq1} + R_{mq2}) + w_2(R_{d1} + R_{d2}/5)$. In the QP task, the angle reward and the distance reward are equally important, that is, $w_1 = w_2 = 0.5$.
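As an illustration, the AG reward combination could be computed as follows, assuming the penalty terms enter with a negative sign as their descriptions suggest; the function signature and default thresholds are our own:

```python
def reward_ag(q_m, q_t, d, q_max=20.0, d_min=1000.0, d_max=3000.0,
              w1=0.4, w2=0.4, w3=0.2):
    """Advantage-game reward R_AG built from the reward/penalty items listed above."""
    r_mq1 = -q_m / 180.0                                 # continuous angle penalty (our azimuth)
    r_mq2 = 1.0 if q_m < q_max else 0.0                  # sparse angle reward (our azimuth)
    r_tq1 = -q_t / 180.0                                 # continuous angle penalty (opponent azimuth)
    r_tq2 = 1.0 if q_t < q_max else 0.0                  # sparse angle reward (opponent azimuth)
    r_d1 = 1.0 if d_min < d < d_max else 0.0             # inside the threat-distance band
    r_d2 = -3.0 if (d >= d_max or d <= d_min) else 0.0   # outside the threat-distance band
    r_d3 = d / d_max                                     # distance ratio term
    return w1 * (r_mq1 + r_mq2) - w2 * (r_tq1 + r_tq2) + w3 * (r_d1 + r_d2 + r_d3 / 5.0)
```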
The meta-policy uses the maximum entropy SAC algorithm for training. The network structure set in this paper is shown in Figure 5.
The input of the Actor neural network is the state value of both sides, and the outputs are the speed change rate, pitch angle change rate and heading angle change rate of our UAV. The inputs of the Critic neural network are the state value and the action value, and the output is the state-action value (Q value). Experiments have shown that, for the same number of neurons, a wide and shallow fully connected network performs better than a narrow and deep one [34]. Therefore, both the Actor and the Critic are fully connected neural networks with two hidden layers of 256 neurons each, and the activation function is ReLU. In addition, the Actor network has two output heads, which output the mean $\mu$ and the variance $\sigma$, respectively; the final action value is obtained by sampling and executed by the agent.
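A minimal PyTorch sketch of the Actor and Critic structures described above follows; the layer sizes match the text, while details such as the log-std clamp and the tanh squashing of sampled actions are our assumptions rather than the authors' exact implementation:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Gaussian policy: two 256-unit ReLU hidden layers, outputs mean and log-std."""
    def __init__(self, state_dim=9, action_dim=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.mu_head = nn.Linear(256, action_dim)         # mean of the action distribution
        self.log_sigma_head = nn.Linear(256, action_dim)  # log standard deviation

    def forward(self, state):
        h = self.body(state)
        return self.mu_head(h), self.log_sigma_head(h).clamp(-20, 2)

    def sample(self, state):
        mu, log_sigma = self(state)
        eps = torch.randn_like(mu)                   # reparameterization trick
        return torch.tanh(mu + eps * log_sigma.exp())  # squashed action in [-1, 1]

class Critic(nn.Module):
    """Q(s, a): two 256-unit ReLU hidden layers, scalar output."""
    def __init__(self, state_dim=9, action_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```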
Meta-policy training pseudo-code is shown in Algorithm 1.
Algorithm 1: Meta-Policy Training Algorithm
1. Randomly generate parameters: θ, φ1, φ2
2. Initialize the policy network π_θ and two Soft-Q networks Q_φ1, Q_φ2
3. Initialize the target networks of the Soft-Q networks: φ1′ ← φ1, φ2′ ← φ2
4. FOR t = 0, …, T:
5.   Get state s
6.   IF t < start_size:
7.     a = random()
8.   ELSE:
9.     μ, log σ = π_θ(s); a = tanh(μ + ε · exp(log σ)), with noise ε ~ N(0, 1)
10.  END IF
11.  The agent performs action a, gets the reward r and the next state s′
12.  Store the tuple <s, a, r, s′> in the experience pool
13.  IF the number of samples in the experience pool > batch_size:
14.    Sample batch_size tuples <s, a, r, s′> from the experience pool
15.    Update the Soft-Q networks' parameters: φ_i ← φ_i − λ_Q ∇_{φ_i} J_Q(φ_i)
16.    Update the policy network's parameters: θ ← θ − λ_π ∇_θ J_π(θ)
17.    Adjust the temperature parameter: α ← α − λ_α ∇_α J(α)
18.    Update the target networks: φ1′ ← τ φ1 + (1 − τ) φ1′, φ2′ ← τ φ2 + (1 − τ) φ2′
19.  END IF
20.  Let s ← s′
21. END FOR

3.2. Hierarchical Decision Method Based on PG-Option

The traditional option-based hierarchical reinforcement learning algorithm divides the target task into multiple options. Each option has its own policy, called a meta-policy, and the meta-policy is selected by the top-level policy selector. In the traditional option-based algorithm, the policy selector uses the ε-greedy algorithm to choose the meta-policy with the highest expected return in the current state. The problem with the ε-greedy algorithm is that it does not consider the situation as a whole: each choice is only a locally optimal solution, so it takes a long time to capture the opponent. In a complex scene, however, our UAV needs to remove the danger as soon as possible to achieve victory.
To better solve the complex UAV pursuit-evasion game, this paper designs a hierarchical decision model that combines the traditional hierarchical reinforcement learning framework with the policy gradient (PG) algorithm. Instead of the ε-greedy policy, a neural network trained by the PG algorithm is used as the policy selector. Compared with the ε-greedy algorithm, PG converges more easily and can handle high-dimensional continuous state data.
Since the policy selector evaluates the overall situation, the state space of the upper policy selector adds the position $[X_t, Y_t, Z_t]$ and the speed $v_t$ of the opponent UAV to the meta-policy state space, as shown in Formula (8).

$$all\_states = [X_m, Y_m, Z_m, X_t, Y_t, Z_t, v, v_t, \theta, \varphi, d, q_m, q_t] \tag{8}$$

The action space is the probability distribution over the four meta-policies. If selecting a meta-policy increases the final reward value, the probability of selecting that meta-policy is increased, and vice versa.
When the policy selector model is trained, the reward function is shown in Formula (9).

$$R = \begin{cases} r(\tau)/total\_step, & \text{if Success} \\ -3, & \text{if Failed} \end{cases} \tag{9}$$

where $r(\tau)$ is the sum of the rewards at the end of each round and $total\_step$ is the total number of simulation steps in the round.
The policy selector is rewarded only at the end of a round. The success mark is capturing the opponent UAV, and the failure mark is that the opponent UAV captures our UAV or that our UAV fails to capture the opponent UAV within the maximum number of steps.
In the training process, the upper policy selector model is optimized by maximizing the reward value, and the network parameters of the whole hierarchical model are denoted $\theta = (\phi, \varphi)$, where $\phi$ denotes the network parameters of the policy selector and $\varphi$ denotes the network parameters of the meta-policies. The expected reward of the pursuit-evasion process is shown in Formula (10).

$$J(\theta) = E_{\tau\sim\pi_\theta(\tau)}[r(\tau)] = \sum_{\tau}\pi_\theta(\tau)\,r(\tau) \tag{10}$$

Next, we take the derivative of Formula (10), which gives Formula (11):

$$\nabla_\theta J(\theta) = E_{\tau\sim\pi_\theta(\tau)}\big[\nabla_\theta \log\pi_\theta(\tau)\,r(\tau)\big] \tag{11}$$

where $\pi_\theta(\tau)$ is as follows:

$$\pi_\theta(\tau) = p(s_{0,0})\prod_{i=0}^{p-1}\pi_\phi(g_i \mid s_{i,0})\prod_{j=0}^{T_i}\pi_\varphi(a_{i,j} \mid s_{i,j}, g_i)\,p(s_{i,j+1} \mid s_{i,j}, a_{i,j}) \tag{12}$$

Each round is divided into $p$ stages. $s_{0,0}$ denotes the initial state of each round in the UAV pursuit-evasion game, and $s_{i,0}$ denotes the initial state of the UAV in the $i$-th stage. In each stage, the UAV performs $T_i$ steps; $\pi_\phi(g_i \mid s_{i,0})$ denotes the probability that the meta-policy $g_i$ is selected by the policy selector; $\pi_\varphi(a_{i,j} \mid s_{i,j}, g_i)$ denotes the probability of the current action; and $p(s_{i,j+1} \mid s_{i,j}, a_{i,j})$ denotes the state transition probability. Since the meta-policies are trained to convergence in Section 3.1, the state transition probability can be taken as 1, and Formula (12) can be rewritten as:

$$\pi_\theta(\tau) = p(s_{0,0})\prod_{i=0}^{p-1}\pi_\phi(g_i \mid s_{i,0}) \tag{13}$$

Taking the logarithm of Formula (13) and differentiating gives Formula (14):

$$\nabla_\theta \log\pi_\theta(\tau) = \sum_{i=0}^{p-1}\nabla_\phi \log\pi_\phi(g_i \mid s_{i,0}) \tag{14}$$

Therefore, Formula (11) can be written as:

$$\nabla_\theta J(\theta) = E_{\tau\sim\pi_\theta(\tau)}\left[\sum_{i=0}^{p-1}\nabla_\phi \log\pi_\phi(g_i \mid s_{i,0})\,r(\tau)\right] \tag{15}$$
In the process of model training, the optimization goal is to maximize $J(\theta)$ by updating the parameters along the gradient in Formula (15), so that the policy selector model converges.
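A minimal sketch of the resulting update rule of Formula (15) in PyTorch, treating the policy selector as a categorical distribution over the four meta-policies; the network interface and optimizer handling are assumptions, not the authors' code:

```python
import torch

def update_selector(selector_net, optimizer, stage_states, chosen_options, episode_return):
    """One REINFORCE-style update of the top-level policy selector (Formula (15)).

    stage_states: tensor [p, 13] of the states s_{i,0} at which options were chosen.
    chosen_options: long tensor [p] of the selected meta-policy indices g_i (AG/QE/SC/QP).
    episode_return: scalar r(tau) of the whole round.
    """
    logits = selector_net(stage_states)                      # [p, 4] scores over the four options
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(chosen_options)
    loss = -(log_probs.sum() * episode_return)               # maximize sum_i log pi(g_i | s_{i,0}) * r(tau)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```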
In addition, in the traditional option-based hierarchical reinforcement learning algorithm, each time the agent performs an action and obtains a new state, the policy selector reselects the meta-policy according to the current state. When the situation is complex, the meta-policy is then switched too frequently; the UAV cannot follow such high-frequency switching, which makes it difficult to complete the task.
To solve the above problem, the policy selector performs periodic state evaluation at a frequency of 10 Hz, and expert experience is introduced to design the meta-policies' termination functions, which define the signs of meta-policy success and failure shown in Table 1. Only after the current meta-policy ends is the next meta-policy selected. The framework of the model training is shown in Figure 6.
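A sketch of the delayed switching logic described above: the termination function stands in for the expert rules of Table 1, and the selector is consulted only at the periodic evaluation ticks after the current meta-policy has ended (the function names and this exact control flow are our illustration):

```python
def step_option(selector, current_option, state, step, termination_fn,
                sim_dt=0.1, eval_hz=10.0):
    """Keep executing the current meta-policy until its termination function fires.

    termination_fn(option, state) is assumed to implement the success/failure signs of
    Table 1; selector.select(state) is an assumed API returning one of AG/QE/SC/QP.
    """
    eval_interval = max(1, int(round(1.0 / (eval_hz * sim_dt))))  # simulation steps per evaluation
    if step % eval_interval == 0 and termination_fn(current_option, state):
        return selector.select(state)   # meta-policy ended: pick the next one
    return current_option               # otherwise keep the current meta-policy
```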
The policy selector training pseudo-code is shown in Algorithm 2.
Algorithm 2: Hierarchical Maneuver Decision Algorithm Based on PG-Option
1. Randomly generate the policy selector network's parameters ϕ; initialize the experience pool D
2. Initialize the state variable s_previous and the initial state of the agent s_0; let s_previous = s_0
3. FOR m = 1, …, M:
4.   Initialize the stage counter: p = 0
5.   Periodically select the meta-policy g_π at the 10 Hz frequency
6.   FOR t = 1, …, T:
7.     Generate the action a_t and execute it; get the reward value r_t and the new state s_{0+t}
8.     IF the current meta-policy execution ends:
9.       Save <s_previous, g_π, r_t> into the experience pool
10.      s_previous = s_{0+t}, p = p + 1
11.    END IF
12.    IF the current round ends:
13.      IF the opponent UAV is captured:
14.        Update the reward value r_t = r(τ)/t
15.      ELSE:
16.        Update the reward value r_t = −3
17.      END IF
18.      Update the network parameters ϕ of the policy selector; clear the experience pool D
19.      Reinitialize the environment to obtain the initial state of the agent
20.    END IF
21.  END FOR
22. END FOR

4. Experiments and Results

In this section, the proposed hierarchical maneuver decision method is numerically simulated in several scenarios. Firstly, the training parameters and hardware conditions of the simulation experiments are given. After the four meta-policies are trained to convergence, the policy selector is trained and tested in the experimental scenario, which shows the feasibility of the method. Finally, the algorithm is tested in multiple UAV pursuit-evasion situations, and the simulations are analyzed to illustrate the generalization of the method.

4.1. Parameters and Hardware

In this paper, to avoid unstable simulation results caused by a large performance gap between the two UAVs, the parameters of the two UAVs are assumed to be the same, as shown in Table 2. The heading angle variation is the maximum allowable change of the heading angle in a single step of the UAV, and the pitch angle variation is the maximum allowable change of the pitch angle in a single step. The maximum speed of the UAV is the upper limit of its speed. The maximum time threshold for which a UAV may remain in the other's threat zone is 2 s, and each step takes 0.1 s.
The four meta-policies are trained using the SAC algorithm, and the hyperparameters used during training are listed in Table 3. Batch_size is the number of samples drawn by the UAV from the experience pool in each training step. The discount rate is the discount applied to future rewards when calculating the loss value. The optimization algorithm is Adam. The regularization coefficient of the entropy is initialized to 1, and the target entropy is −3.
When training the model, the PyTorch (v1.2.1) framework is used to build the neural networks and the optimizer. The CPU is an Intel i5-10400F, and the GPU is an Nvidia RTX 3060 Ti, which accelerates the training of the deep neural networks. The GPU memory size is 12 GB, and the computer memory is 32 GB.

4.2. Results of Train and Simulation

In this paper, to train the PG-option hierarchical decision algorithm, the four meta-policies must first be trained to convergence. As shown in Figure 7, at the beginning of training our UAV adopts a random policy; after a period of training, the reward value converges and becomes stable, and our UAV is able to complete the corresponding tasks reliably.
To train the hierarchical decision algorithm based on the PG-option, the positions of both sides are randomly initialized within a certain range in the initial environment, and a disturbance is added to the opponent UAV's original maneuvering mode.
Our UAV maneuvers use hierarchical decisions based on the PG-option algorithm. The opponent UAV maneuvers according to the predetermined trajectory and will threaten our UAV.
In this paper, the success rate is used to judge whether the policy selector network has converged. Specifically, the success rate of the UAV is counted over the training rounds; if the final success rate converges within a certain range, the policy selector network is considered to be trained and converged.
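A small sketch of this convergence check, counting the success rate over fixed blocks of training rounds (50 rounds per block, as used below); the data layout is our assumption:

```python
def success_rates(outcomes, block=50):
    """outcomes: list of booleans, one per training round. Returns the per-block success rate."""
    return [sum(outcomes[i:i + block]) / block
            for i in range(0, len(outcomes) - block + 1, block)]
```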
Training is performed in the initial environment given in Table 4. The random initialization range is [−1000, 1000], and the success rate is counted every 50 rounds.
It can be seen from Figure 8 that the final success rate is stable at about 82% under the above training, indicating that the policy selector training has converged after about 7200 rounds.
The trained model is tested in the training scene. Figure 9 shows the flight trajectory of the opponent and our UAV in the training scene.
At the initial time, the two UAVs are far apart. It can be seen from the XOZ view that the height of the opponent UAV is lower than that of our UAV when the simulation begins. After that, our UAV changes its pitch angle, climbs and shortens the distance between the two sides. From the XOY view, our UAV reduces the opponent's azimuth angle against us, flies to the opponent's tail and keeps the opponent UAV within the threat zone. Finally, the relative distance between the two sides is gradually shortened until the capture conditions are met.
From Figure 10a, it can be seen that before 80 steps, our UAV reduces the relative azimuth of the opponent UAV by rapidly rotating the heading angle and the pitch angle. After that, the relative azimuth is always kept within a small range. From Figure 10b, it can be seen that our UAV approaches the target quickly; after the distance condition of the threat zone is met, the distance converges stably at about 2000 m.
The selection of the meta-policy is shown in Figure 11. At the beginning, the UAV chooses SC to change the situation. After 220 steps, our UAV chooses QP to approach and capture the opponent; the selected meta-policies are consistent with the current situation.

4.3. Generalization Simulation

To test whether the PG-option hierarchical decision algorithm can realize the intelligent decision-making task under different initial situations, our UAV is placed in the pursued situation and the active capture situation, respectively, and its flight trajectory is analyzed.

4.3.1. Active Capture Situation

The active capture situation means that the relative azimuth angle of both sides is small at the beginning of the operation, and the overall situation is that the opponent UAV and our UAV are maneuvering in opposite directions. A simulation will be carried out in this situation. The initial state is shown in Table 5.
The flight trajectory is shown in Figure 12. From Figure 12a, it can be seen that at the beginning of the simulation, the height of the opponent UAV is higher than that of our UAV. Our UAV adjusts its pitch angle and shortens the distance between the opponent and our UAV. From Figure 12b, it can be seen that at the beginning of the simulation, the opponent UAV and our UAV maneuver toward each other. Our UAV then continues to maneuver around to the opponent UAV's tail to threaten it.
It can be seen from Figure 13a that after the simulation begins, our UAV maneuvers toward the opponent and reduces the opponent's relative azimuth. The relative azimuth meets the requirements of the threat zone from beginning to end. After our UAV escapes the opponent's threat zone, our UAV keeps the opponent within its own threat zone.
From Figure 13b, it can be seen that after the simulation begins, the two sides maneuver in opposite directions, and the distance between the opponent UAV and our UAV decreases rapidly. When the distance between the opponent UAV and our UAV meets the requirements of the threat zone, our UAV continues to maneuver and keeps the distance between the opponent and our UAV in this range.
The selection of meta-policy in active capture situation is shown in Figure 14. Under the situation, our UAV begins to choose QP. At about 200 steps, because our UAV falls into the opponent’s threat zone in the process of pursuing the opponent, it chooses SC. At about 830 steps, our UAV chooses QP and finally wins.

4.3.2. Pursued Situation

The pursued situation means that at the beginning of the operation, our relative azimuth is small, and the opponent’s relative azimuth is large. The overall situation is that the opponent UAV is at the tail of our UAV. A simulation will be performed in this situation. The initial state of the pursued situation is shown in Table 6.
The trajectory in pursued situation is shown in Figure 15. From Figure 15a, it can be seen that after the simulation begins, our UAV increases its pitch angle and reduces the distance between the opponent and us. When approaching the height of the opponent, our UAV gradually reduces its pitch angle to prevent our UAV from being higher than the height of the opponent. It can be seen from Figure 15b that after the simulation begins, our UAV quickly adjusts the heading angle to reduce the relative azimuth of the opponent. After adjusting the azimuth, our UAV maneuvers toward the opponent and shortens the distance between the two sides. To further reduce the relative azimuth of the opponent, our UAV begins to rotate the heading angle and maintain the tail-chasing situation of our UAV against the opponent.
From Figure 16a, we can see that at the beginning of the simulation, the relative azimuth is very large, and our UAV is being pursued. After that, our UAV quickly reduces the relative azimuth and changes its unfavorable situation by maneuvering. After the angle requirement of the threat zone is met, the azimuth continues to satisfy this requirement for the rest of the simulation.
The distance curve is shown in Figure 16b. At the beginning of the simulation, because our UAV is being pursued by the opponent, our initial speed is higher, so the distance between the opponent and our UAV increases at first. After that, our UAV quickly changes its heading angle and maneuvers toward the opponent. Once the distance requirement of the threat zone is met, the distance between the opponent and our UAV is maintained within this range.
The selection of meta-policy in the pursued situation is shown in Figure 17. Under the pursued situation, our UAV begins to choose the SC. At about 220 steps, our UAV chooses the AG to occupy a favorable position. At about 1100 steps, our UAV chooses the QP and successfully captures the opponent.

5. Conclusions

Based on the background of intelligent decision-making in the UAV pursuit-evasion game, this paper proposes a hierarchical maneuver decision-making method based on the PG-option. The maneuver decision-making framework is divided into two parts: the bottom layer consists of the four meta-policies of advantage game (AG), quick escape (QE), situation change (SC) and quick pursuit (QP), and the top layer is the policy selector trained by the PG algorithm. Delayed evaluation and expert experience are introduced to effectively solve the problem of frequent meta-policy switching. The simulation and experimental results show that the method can choose reasonable meta-policies to control UAV maneuvering under different pursuit-evasion situations and capture the opponent. Future work includes adding stationary or moving obstacles to the simulation scenarios to make the simulation more complex.

Author Contributions

Funding acquisition, B.L.; Methodology, H.Z.; Resources, B.L. and G.W.; Validation, H.Z., P.H. and K.Y.; Writing—original draft, B.L. and H.Z.; Writing—review and editing, G.W. and E.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the National Natural Science Foundation of China under grant no. 62003267, the Fundamental Research Funds for the Central Universities under grant no. G2022KY0602, the Technology on Electromagnetic Space Operations and Applications Laboratory under grant no. 2022ZX0090, the Key Research and Development Program of Shaanxi Province under grant no. 2023-GHZD-33, and the key core technology research plan of Xi'an under grant no. 21RGZN0016.

Data Availability Statement

Data sharing not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Chen, B. Research on AI Application in the Field of Quadcopter UAVs. In Proceedings of the 2020 IEEE 2nd International Conference on Civil Aviation Safety and Information Technology (ICCASIT), Weihai, China, 14–16 October 2020; pp. 569–571.
2. Li, B.; Gan, Z.; Chen, D.; Sergey Aleksandrovich, D. UAV Maneuvering Target Tracking in Uncertain Environments Based on Deep Reinforcement Learning and Meta-Learning. Remote Sens. 2020, 12, 3789.
3. Li, B.; Song, C.; Bai, S.; Huang, J.; Ma, R.; Wan, K.; Neretin, E. Multi-UAV Trajectory Planning during Cooperative Tracking Based on a Fusion Algorithm Integrating MPC and Standoff. Drones 2023, 7, 196.
4. Liu, X.; Su, Y.; Wu, Y.; Guo, Y. Multi-Conflict-Based Optimal Algorithm for Multi-UAV Cooperative Path Planning. Drones 2023, 7, 217.
5. Li, S.; Wu, Q.; Du, B.; Wang, Y.; Chen, M. Autonomous Maneuver Decision-Making of UCAV with Incomplete Information in Human-Computer Gaming. Drones 2023, 7, 157.
6. Zhang, H.; He, P.; Zhang, M.; Chen, D.; Neretin, E.; Li, B. UAV Target Tracking Method Based on Deep Reinforcement Learning. In Proceedings of the 2022 International Conference on Cyber-Physical Social Intelligence (ICCSI), Nanjing, China, 18–21 November 2022; pp. 274–277.
7. Alanezi, M.A.; Haruna, Z.; Sha'aban, Y.A.; Bouchekara, H.R.E.H.; Nahas, M.; Shahriar, M.S. Obstacle Avoidance-Based Autonomous Navigation of a Quadrotor System. Drones 2022, 6, 288.
8. Shahid, S.; Zhen, Z.; Javaid, U.; Wen, L. Offense-Defense Distributed Decision Making for Swarm vs. Swarm Confrontation While Attacking the Aircraft Carriers. Drones 2022, 6, 271.
9. Awheda, M.D.; Schwartz, H.M. A fuzzy reinforcement learning algorithm using a predictor for pursuit-evasion games. In Proceedings of the 2016 Annual IEEE Systems Conference (SysCon), Orlando, FL, USA, 18–21 April 2016; pp. 1–8.
10. Gao, K.; Han, F.; Dong, P.; Xiong, N.; Du, R. Connected Vehicle as a Mobile Sensor for Real Time Queue Length at Signalized Intersections. Sensors 2019, 19, 2059.
11. Alexopoulos, A.; Kirsch, B.; Badreddin, E. Realization of pursuit-evasion games with unmanned aerial vehicles. In Proceedings of the 2017 International Conference on Unmanned Aircraft Systems (ICUAS), Miami, FL, USA, 13–16 June 2017; pp. 797–805.
12. Gan, Z.; Li, B.; Neretin, E.; Sergey Aleksandrovich, D. UAV Maneuvering Target Tracking based on Deep Reinforcement Learning. J. Phys. 2021, 1958, 12015.
13. Yu, F.; Zhang, X.; Li, Q. Determination of The Barrier in The Qualitatively Pursuit-evasion Differential Game. In Proceedings of the 2018 IEEE CSAA Guidance, Navigation and Control Conference (CGNCC), Xiamen, China, 10–12 August 2018; pp. 1–6.
14. Pan, Q.; Zhou, D.; Huang, J.; Lv, X.; Yang, Z.; Zhang, K.; Li, X. Maneuver decision for cooperative close-range air combat based on state predicted influence diagram. In Proceedings of the 2017 IEEE International Conference on Information and Automation (ICIA), Macao, China, 18–20 July 2017; pp. 726–731.
15. Mikhail, K.; Vyacheslav, K. Notes on the pursuit-evasion games between unmanned aerial vehicles operating in uncertain environments. In Proceedings of the 2021 International Conference Engineering and Telecommunication (En&T), Dolgoprudny, Russia, 24–25 November 2021; pp. 1–5.
16. Han, Z. The Application of Artificial Intelligence in Computer Network Technology. In Proceedings of the 2021 2nd International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT), Shanghai, China, 15–17 October 2021; pp. 632–635.
17. Zhu, X.; Wang, Z.; Li, C.; Sun, X. Research on Artificial Intelligence Network Based on Deep Learning. In Proceedings of the 2021 2nd International Conference on Information Science and Education (ICISE-IE), Chongqing, China, 26–28 November 2021; pp. 613–617.
18. Lyu, L.; Shen, Y.; Zhang, S. The Advance of Reinforcement Learning and Deep Reinforcement Learning. In Proceedings of the 2022 IEEE International Conference on Electrical Engineering, Big Data and Algorithms (EEBDA), Changchun, China, 25–27 February 2022; pp. 644–648.
19. Li, W.; Wu, J.; Chen, J.; Lia, K.; Cai, X.; Wang, C.; Guo, Y.; Jia, S.; Chen, W.; Luo, F.; et al. UAV countermeasure maneuver decision based on deep reinforcement learning. In Proceedings of the 2022 37th Youth Academic Annual Conference of Chinese Association of Automation (YAC), Beijing, China, 19–20 November 2022; pp. 92–96.
20. Zhang, Y.Z.; Xu, J.L.; Yao, K.J.; Liu, J.L. Pursuit missions for UAV swarms based on DDPG algorithm. Acta Aeronaut. Astronaut. Sin. 2020, 41, 314–326.
21. Zhang, R.; Zong, Q.; Zhang, X.; Dou, L.; Tian, B. Game of Drones: Multi-UAV Pursuit-Evasion Game With Online Motion Planning by Deep Reinforcement Learning. IEEE Trans. Neural Netw. Learn. Syst. 2022.
22. Fu, X.; Zhu, J.; Wei, Z.; Wang, H.; Li, S. A UAV pursuit-evasion strategy based on DDPG and imitation learning. Int. J. Aerosp. Eng. 2022, 2022, 1–14.
23. Sun, Y.; Yan, C.; Lan, Z.; Lin, B.; Zhou, H.; Xiang, X. A Scalable Deep Reinforcement Learning Algorithm for Partially Observable Pursuit-Evasion Game. In Proceedings of the 2022 International Conference on Machine Learning, Cloud Computing and Intelligent Mining (MLCCIM), Xiamen, China, 5–7 August 2022; pp. 370–376.
24. Vlahov, B.; Squires, E.; Strickland, L.; Pippin, C. On Developing a UAV Pursuit-Evasion Policy Using Reinforcement Learning. In Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 17–20 December 2018; pp. 859–864.
25. Li, Z. A Hierarchical Autonomous Driving Framework Combining Reinforcement Learning and Imitation Learning. In Proceedings of the 2021 International Conference on Computer Engineering and Application (ICCEA), Kunming, China, 25–27 June 2021; pp. 395–400.
26. Cheng, Y.; Wei, C.; Sun, S.; You, B.; Zhao, Y. An LEO Constellation Early Warning System Decision-Making Method Based on Hierarchical Reinforcement Learning. Sensors 2023, 23, 2225.
27. Qiu, Z.; Wei, W.; Liu, X. Adaptive Gait Generation for Hexapod Robots Based on Reinforcement Learning and Hierarchical Framework. Actuators 2023, 12, 75.
28. Li, Q.; Jiang, W.; Liu, C.; He, J. The Constructing Method of Hierarchical Decision-Making Model in Air Combat. In Proceedings of the 2020 12th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), Hangzhou, China, 22–23 August 2020; pp. 122–125.
29. Bacon, P.-L.; Harb, J.; Precup, D. The option-critic architecture. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 1726–1734.
30. Wu, Y.; Sun, G.; Xia, X.; Xing, M.; Bao, Z. An Improved SAC Algorithm Based on the Range-Keystone Transform for Doppler Rate Estimation. IEEE Geosci. Remote Sens. Lett. 2013, 10, 741–745.
31. Gao, M.; Chang, D. Autonomous Driving Based on Modified SAC Algorithm through Imitation Learning Pretraining. In Proceedings of the 2021 21st International Conference on Control, Automation and Systems (ICCAS), Jeju, Republic of Korea, 12–15 October 2021; pp. 1360–1364.
32. Xiao, T.; Qi, Y.; Shen, T.; Feng, Y.; Huang, L. Intelligent Task Offloading Method for Vehicular Edge Computing Based on Improved-SAC. In Proceedings of the 2022 IEEE 5th Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), Chongqing, China, 16–18 December 2022; pp. 1720–1725.
33. Zhu, Q.; Su, S.; Tang, T.; Xiao, X. Energy-efficient train control method based on soft actor-critic algorithm. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 2423–2428.
34. Ota, K.; Jha, D.K.; Kanezaki, A. Training larger networks for deep reinforcement learning. arXiv 2021, arXiv:2102.07920v1.
Figure 1. The situation in UAV pursuit-evasion.
Figure 2. Threat zone in 2D.
Figure 3. The structure of hierarchical decision.
Figure 4. Relative azimuth.
Figure 5. The structure of the Actor-Critic network.
Figure 6. Training framework of hierarchical decision model.
Figure 7. The reward function of meta-policies.
Figure 8. The curve of success rate in the training scene.
Figure 9. The flight trajectory in the training scene.
Figure 10. Relative situation between opponent and our UAV in the training scene.
Figure 11. Meta-policy selection in the training scene.
Figure 12. The flight trajectory in active capture situation.
Figure 13. Relative situation between opponent and our UAV in active capture situation.
Figure 14. Meta-policy selection in active capture situation.
Figure 15. The flight trajectory in pursued situation.
Figure 16. Relative situation between opponent and our UAV in pursued situation.
Figure 17. Meta-policy selection in pursued situation.
Table 1. The signs of meta-policy success and failure.

Meta-Policy | Success | Failure
AG | Out of the opponent UAV's threat zone by 1000 m | Failed to break away from the opponent's threat zone within 500 steps
QE | Out of the opponent UAV's threat zone by 2000 m | Failed to break away from the opponent's threat zone within 500 steps
SC | The relative azimuth of the opponent is less than 25° | Our relative azimuth does not become larger than the opponent's relative azimuth within 500 steps
QP | Successfully capturing the opponent | The opponent UAV does not enter our threat zone within 500 steps
Table 2. The parameters of the UAVs.

Parameter | Value
Our UAV pitch angle range | [0°, 4°]
Our UAV heading angle range | [0°, 10°]
Our UAV change range in speed | [0 m/s, 10 m/s]
Our UAV threat distance range | [1 km, 3 km]
Our UAV maximum speed | 200 m/s
Our UAV maximum threat angle | 20°
Our UAV maximum time threshold | 2 s
Opponent UAV pitch angle range | [0°, 4°]
Opponent UAV heading angle range | [0°, 10°]
Opponent UAV change range in speed | [0 m/s, 10 m/s]
Opponent UAV threat distance range | [1 km, 3 km]
Opponent UAV maximum speed | 200 m/s
Opponent UAV maximum threat angle | 20°
Opponent UAV maximum time threshold | 2 s
Table 3. The hyperparameters of SAC.

Hyperparameter | Value | Hyperparameter | Value
Actor network learning rate | 3 × 10−4 | Total number of rounds | 10,000
Critic network learning rate | 3 × 10−4 | Max steps in one round | 500
Experience pool size | 100,000 | Soft update parameter | 0.005
Batch_size | 64 | Discount rate | 0.99
Target entropy value | −3 | Regularization coefficient of entropy | 1
Table 4. Initialization information in the training scene.

UAV | X (km) | Y (km) | Z (km) | Pitch (°) | Yaw (°) | Speed (m/s) | Distance (km) | Azimuth (°)
Our UAV | 2 | 3.5 | −3 | 2 | 50 | 70 | 7.457 | 97.3
Opponent UAV | −3.5 | 3 | 2 | 1 | −40 | 75 | — | —
Table 5. Initialization information in active capture situation.

UAV | X (km) | Y (km) | Z (km) | Pitch (°) | Yaw (°) | Speed (m/s) | Distance (km) | Azimuth (°)
Our UAV | 2 | 2 | 3 | 2 | 50 | 70 | 7.991 | 29.6
Opponent UAV | −2 | 9 | 3.5 | 1 | −40 | 80 | — | —
Table 6. Initialization information in pursued situation.

UAV | X (km) | Y (km) | Z (km) | Pitch (°) | Yaw (°) | Speed (m/s) | Distance (km) | Azimuth (°)
Our UAV | 2.1 | 2.9 | 3.0 | 2 | 50 | 70 | 8.011 | 65.1
Opponent UAV | −1.8 | 8.7 | 3.6 | 1 | 70 | 90 | — | —
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
