Article

Research on Scheme Design and Decision of Multiple Unmanned Aerial Vehicle Cooperation Anti-Submarine Based on Knowledge-Driven Soft Actor-Critic

College of Marine Electrical Engineering, Dalian Maritime University, Dalian 116026, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(20), 11527; https://doi.org/10.3390/app132011527
Submission received: 25 September 2023 / Revised: 14 October 2023 / Accepted: 19 October 2023 / Published: 20 October 2023
(This article belongs to the Special Issue Intelligent Control of Unmanned Aerial Vehicles)

Abstract
To enhance the anti-submarine search capabilities of multiple Unmanned Aerial Vehicle (UAV) groups in complex marine environments, this paper proposes a flexible actor-critic algorithm, Knowledge-Driven Soft Actor-Critic (KD-SAC), which can effectively interact with real-time environmental information. KD-SAC is a reinforcement learning algorithm that consists of two main components: a UAV Group Search Knowledge Base (UGSKB) and a path planning strategy. Firstly, based on the UGSKB, we establish a cooperative search framework comprising three layers of information models: the data layer provides prior information and fundamental search rules to the system, the knowledge layer enriches the search rules and database during the continuous search process, and the decision layer uses the two layers above to enable autonomous decision-making by the UAVs. Secondly, we propose a rule-based deductive inference return visit (RDIRV) strategy to enhance the search knowledge base. The core idea of this strategy is to enable UAVs to learn from both successful and unsuccessful experiences, enriching the search rules by treating optimal decisions as exemplary cases. This approach significantly improves the learning performance of KD-SAC. We then design an event-based UGSKB calling mechanism at the decision layer, which calls an action template according to the target and the UAV's current motion, together with a punishment function that is employed to achieve optimal decision-making over UAV actions and states. The feasibility and superiority of the proposed algorithm are demonstrated through experimental comparisons with alternative methods. The results show that the proposed method achieves a success rate of 73.63% in multi-UAV flight path planning within complex environments, surpassing the other three algorithms by 17.27%, 29.88%, and 33.51%, respectively. In addition, the KD-SAC algorithm outperforms the other three algorithms in terms of synergy and average search reward.

1. Introduction

1.1. Background and Motivation

The continuous advancement of 5G technologies has significantly enhanced the communication and collaboration capabilities of multiple UAVs, enabling their extensive use in both military and civilian sectors, particularly in maritime anti-submarine missions focused on searching for, detecting, and monitoring enemy submarines [1,2]. UAVs encounter uncertain marine environments and weather factors during mission planning. In this context, they must dynamically adjust their motion state based on detection information and mission requirements to gain a decisive advantage in the search. Figure 1 shows a scenario in which multiple UAVs cooperate to carry out anti-submarine tasks. Throughout this process, the UAVs are not only dedicated to target detection (i.e., water and underwater targets) but must also remain vigilant about their proximity to hazardous areas (i.e., obstacles and threat zones) [3]. The UAV swarm must therefore make optimal flight decisions based on the dynamic mission environment and the information available within the swarm, to ensure safety and enhance search efficiency. Additionally, contributing search experiences to a shared search knowledge base can provide the UAVs with superior battlefield awareness. In other words, historical experience aids UAVs in making optimal decisions.
However, UAVs face several challenges. The first challenge lies in the difficulty of capturing the movement characteristics and navigation routes of enemy submarines, particularly because their re-emergence locations after diving are unpredictable. Additionally, both above- and below-water targets generally appear in formations; by flexibly changing the search strategy when a target is identified, the success rate of searching for the other targets can be significantly improved [4]. The second challenge lies in the dynamic evolution of the environment, where factors such as the threat areas depicted in Figure 1, sensor attributes, and the real-time movement of water targets impact the tactical arrangement of the UAV swarm (i.e., search mode and path planning) [2,5]. Furthermore, deploying UAVs in complex environments requires them to avoid dense obstacles while searching for targets, necessitating timely responses and collision-free path planning [3,6]. The establishment of a decision support system is therefore imperative to enhance the collective decision-making capability of UAVs, which is the key factor for improving mission success rates in complex environments. These challenges have motivated our research, and this paper's main contributions are outlined below:
(1)
From the perspective of environmental information processing, this study analyzes the factors that impact the efficiency of UAV search in maritime anti-submarine missions and establishes the UGSKB; the initial search rules are then defined according to the number of targets discovered. The proposed framework for multi-UAV collaborative search is based on a three-layer network architecture. The data layer is responsible for storing prior information and gradually supplementing the search rules during the search process. The knowledge layer generates new optimal experiences based on detection information and UAV status, while the decision layer invokes action templates or makes optimal decisions using KD-SAC.
(2)
The proposed deductive inference replay strategy utilizes the deep reinforcement learning (DRL) algorithm to transform optimal decisions. The strategy incorporates evaluation and playback capabilities, converting optimal decisions into valuable learning experiences and new search rules that enhance the knowledge base. To accomplish this, we establish corresponding spaces and reward functions tailored to different search objectives. This process enhances both the learning speed of KD-SAC and the success rate in achieving the desired goals. Additionally, a Markov-decision-based model for UAV motion and an event-triggering mechanism are proposed to facilitate autonomous invocation of the UGSKB by the UAV. Upon activation of an event, the UAV invokes the action template in the knowledge base and continues task execution accordingly.
(3)
The proposed KD-SAC path planning method combines a knowledge base with the SAC algorithm, enabling real-time modification of the prior search rules. This method enhances the exploration of random search strategies through interactive adaptation to the search environment, ultimately incorporating the decision with the highest return into the knowledge layer of the cooperative search framework as an optimal choice. The use of multiple UAVs with a shared knowledge base can enhance the efficiency of the cluster's exploration of unfamiliar environments, thereby facilitating informed decision-making by the UAVs throughout the search process.

1.2. Related Work

Anti-submarine search is a crucial combat task in the military field, where the searched targets possess strong concealment and counter-search capabilities and multiple vessels typically intrude in formation simultaneously [2]. When enemy submarines submerge, their movement direction and process become unpredictable, making it challenging to determine their course [2]. To plan optimal multi-agent collaborative search paths, heuristic algorithms have become prevalent among researchers. In paper [7], Ding et al. use a genetic algorithm to construct a mathematical planning model for optimal agent search paths, which can calculate the search radius and search width of the UAV spiral path based on the detection range, detection velocity, and the target's escape velocity. Paper [8] proposes a collaborative search strategy that utilizes the particle swarm optimization algorithm and prior information about the searched target to make optimal decisions. Chen et al. proposed a mathematical model based on the ant colony algorithm in paper [9] to plan the search path for multiple UAVs in an unknown environment; this model effectively plans the point-to-point flight trajectories of UAVs. However, the aforementioned methods ignore the influence of target motion on both the search environment and the UAV's search trajectory. In general, these effects can be expressed by employing Markov decision processes (MDP) or partially observable Markov decision processes [2,10,11], which rely on the UAV's perception of the target's motion state and its surroundings. Paper [12] uses the strategy iteration of an MDP to plan search paths. Paper [13] treats collaborative target search as an uncertainty minimization problem and further examines the contribution of the UAV motion state to this minimization problem through Markov decisions. The safety of the UAV can be ensured in complex environments by effectively avoiding obstacles and no-fly areas, thereby minimizing the risk of enemy detection.
Aiming at the obstacle avoidance problem in the search process, a dynamic discrete pigeon swarm optimization algorithm was proposed in paper [14]. This algorithm guides the coordinated movement of UAVs during the search task using an information probability graph and B-spline curves, enabling the generation of obstacle-avoiding paths. In paper [15], Fei et al. combine communication cost with formation search efficiency as optimization objectives and employ the sparrow search algorithm to solve the optimization problem; ultimately, a control architecture over the UAV observation positions is constructed to prevent collisions among the UAVs. In addition, some intelligence algorithms have also been applied to the above problems. Compared with heuristic algorithms, intelligence algorithms are better suited to multi-objective optimization problems [16,17]. In paper [18], a Clausius bio-inspired neural network and a bio-inspired cascaded tracking control approach were implemented for the search and obstacle avoidance tasks of the UAV. A bionic neural wave network was presented in paper [19], which built a neural node structure based on the theory of neural wave diffusion; in this structure, neurons correspond to the obstacles, agents, and searched targets in the environment. Liu et al. integrated the environment perception module with the UAV motion module to form a cooperative search algorithm based on reinforcement learning and probability maps [20]. Paper [21] decomposes the search problem into three components, wherein UAVs initially undertake a random cruising task, transition to dynamic coalition formation based on established search knowledge in the medium term, and ultimately form coalitions and search teams; real-time planning for target searching and obstacle avoidance can then be efficiently executed by employing the enhanced dolphin algorithm. The study in paper [22] examined the performance of several metaheuristics, namely the black hole optimization algorithm, firefly algorithm, cuckoo search algorithm, particle swarm optimization algorithm, and grey wolf optimizer; its primary focus is on addressing the challenge of UAVs searching for dynamic targets in complex environments. However, the aforementioned methods rely on prior information about the search environment and ignore real-time updates of the search strategy. Intelligence algorithms now prioritize real-time online learning in addition to offline training, enabling the conversion of high-quality decision results into effective search strategies.
This paper proposes a knowledge-driven SAC cooperative search strategy with the ability to return quality experience, for using UAVs to search submarine formations within unknown sea environments containing obstacles. The main idea is to allow multiple UAVs to learn from both failed and successful decisions in order to enrich the search strategies, where SAC is a stochastic strategy based on maximum entropy theory. In addition, we further enhance the initial search rules, and the UAV can retrieve action templates from the knowledge base according to different events. The search rules applied by the UAVs also vary depending on the number of targets detected.
The rest of this paper is organized as follows. The modeling and problem description are introduced in Section 2. Section 3 describes the knowledge-driven autonomous decision-making process. In Section 4, the path planning strategy based on KD-SAC is given. Simulation experiments comparing different search methods are presented in Section 5. The discussion and conclusions are given in Section 6. Finally, future work is described in Section 7.

2. Problem Description

2.1. Environmental Model

In the complex and unknown maritime mission environment E, the targets are typically present in a formation governed by certain rules [23,24], such as a formation consisting of submarines, ships, and aircraft carriers. By leveraging these rule-based characteristics, a search trajectory can be planned for the UAV group while accounting for obstacles and threat areas within E, where the obstacles are represented as $\text{OBS} = \{obs_1, obs_2, \ldots, obs_{N_o}\}$, with $N_o$ the number of obstacles, and the threat areas are represented as $\text{THR} = \{thr_1, thr_2, \ldots, thr_{N_t}\}$, with $N_t$ the number of threat areas. The UAV swarm comprises $N_v$ UAVs and is denoted as $\text{UAV} = \{V_1, V_2, \ldots, V_{N_v}\}$. With safety as a top priority for the UAVs involved in this mission scenario, the search task is expected to be completed with minimal cooperation cost and within an optimal timeframe.
Next, we develop mathematical models to represent the obstacles present in the environment as well as potential threats from water or underwater sources. In this paper, these areas are represented by models $\Omega_i$, $1 \le i \le (N_o + N_t)$, which are further specified as various regular geometric forms, including cylinders, hemispheres, and cones (as shown in Figure 1). The mathematical representation of $\Omega_i$ is provided below:
$$\Omega_i := \left\{ h \in \mathbb{R}^3 \;\middle|\; \frac{(x_i - x_{oi})^{2w}}{A} + \frac{(y_i - y_{oi})^{2e}}{B} + \frac{(z_i - z_{oi})^{2q}}{C} - 1 \le 0 \right\} \tag{1}$$
where $h = [x_i, y_i, z_i]^T$ is a surface coordinate of $\Omega_i$ in the three-dimensional coordinate system; $(x_{oi}, y_{oi}, z_{oi})$ is the central coordinate point of $\Omega_i$; $A, B, C$ are size coefficients, which determine the size of $\Omega_i$; and $w, e, q$ are shape parameters, which determine the specific shape of $\Omega_i$. $\Omega_i$ is a cone when $A = B$, $w = e = 1$, $q < 1$; $\Omega_i$ is a cylinder when $A = B$, $w = e = 1$, $q > 1$.
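To make the region test concrete, the following minimal Python sketch evaluates the membership condition of Equation (1) for a single point. The function name and the example values are illustrative, and the exponent/denominator placement follows the reconstruction of Equation (1) given above.

```python
def inside_region(h, center, size=(1.0, 1.0, 1.0), shape=(1.0, 1.0, 1.0)):
    """Evaluate the obstacle/threat test of Eq. (1) for a point h = (x, y, z).

    center = (x_o, y_o, z_o); size = (A, B, C); shape = (w, e, q).
    Returns True when h lies inside or on the region Omega_i.
    """
    (x, y, z), (xo, yo, zo) = h, center
    A, B, C = size
    w, e, q = shape
    value = (abs(x - xo) ** (2 * w)) / A \
          + (abs(y - yo) ** (2 * e)) / B \
          + (abs(z - zo) ** (2 * q)) / C
    return value - 1.0 <= 0.0

# Example: a point at the region centre is always inside (cylinder-like shape, q > 1)
print(inside_region((350.0, 500.0, 40.0), (350.0, 500.0, 40.0),
                    size=(60.0, 60.0, 40.0), shape=(1.0, 1.0, 4.0)))
```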

2.2. UAV Motion Model

This section considers the flight path planning of the UAV to avoid threats, which imposes high requirements on maneuvering performance. We establish the motion model of $V_i$ at time k as follows:
$$\begin{cases} x_i(k+1) = x_i(k) + v_i(k+1)\cos(\theta_i(k+1))\cos(\varphi_i(k+1))\,dk \\ y_i(k+1) = y_i(k) + v_i(k+1)\sin(\theta_i(k+1))\,dk \\ z_i(k+1) = z_i(k) + v_i(k+1)\cos(\theta_i(k+1))\sin(\varphi_i(k+1))\,dk \\ v_i(k+1) = v_i(k) + dv_i(k) \\ \theta_i(k+1) = \theta_i(k) + d\theta_i(k) \\ \varphi_i(k+1) = \varphi_i(k) + d\varphi_i(k) \end{cases} \tag{2}$$
where $(x_i(k), y_i(k), z_i(k))$ represents the location information; $v_i(k)$ the speed; $\theta_i(k)$ the pitch angle; $\varphi_i(k)$ the heading angle; $dv_i(k)$ the acceleration; $d\theta_i(k)$ the variation in pitch angle; and $d\varphi_i(k)$ the variation in heading angle. In addition, the following distance constraints must be met by the UAVs to prevent collisions and excessive communication distance:
$$D_c \le d_{ij} \le D_l, \qquad d_{ij} = \sqrt{(x_i(k) - x_j(k))^2 + (y_i(k) - y_j(k))^2 + (z_i(k) - z_j(k))^2} \tag{3}$$
where $d_{ij}$ represents the distance between $V_i$ and $V_j$, with $j \in \{1, 2, \ldots, N_v\}$, $j \ne i$; $D_c$ is the safe distance; and $D_l$ indicates the communication limit distance. Then, according to (2) and the relationship conditions $\mathcal{B}$ introduced in Section 2.4, the state set of $V_i$ at time k can be represented as follows:
$$s_i(k) = \{x_i(k), y_i(k), z_i(k), v_i(k), \theta_i(k), \varphi_i(k), \mathcal{B}_i(k)\} \tag{4}$$
where $\mathcal{B}_i(k)$ represents all relationship conditions held by $V_i$ at time k. In addition, the action set of $V_i$ at time k can be represented as follows:
$$a_i(k) = \{dv_i(k), d\theta_i(k), d\varphi_i(k), X_i(k)\} \tag{5}$$
where $X_i(k)$ represents all action templates available to $V_i$ at time k, such as the action sets with which $V_i$ switches communication mode, adjusts flight altitude, or adjusts flight speed.
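A minimal sketch of the kinematic update (2) and the distance constraint (3) is given below. The helper names and the default step size are assumptions for illustration, not definitions from the paper.

```python
import math

def step_uav(state, action, dk=1.0):
    """Advance one UAV by the discrete-time kinematic model of Eq. (2).

    state  = (x, y, z, v, theta, phi): position, speed, pitch angle, heading angle
    action = (dv, dtheta, dphi): increments chosen at time k
    """
    x, y, z, v, theta, phi = state
    dv, dtheta, dphi = action
    v_n, theta_n, phi_n = v + dv, theta + dtheta, phi + dphi
    x_n = x + v_n * math.cos(theta_n) * math.cos(phi_n) * dk
    y_n = y + v_n * math.sin(theta_n) * dk
    z_n = z + v_n * math.cos(theta_n) * math.sin(phi_n) * dk
    return (x_n, y_n, z_n, v_n, theta_n, phi_n)

def distance_ok(state_i, state_j, d_safe, d_link):
    """Check the pairwise constraint D_c <= d_ij <= D_l of Eq. (3)."""
    d_ij = math.dist(state_i[:3], state_j[:3])
    return d_safe <= d_ij <= d_link
```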

2.3. Knowledge-Driven Cooperative Search Framework

The success rate of traditional UAV anti-submarine missions is significantly hindered by the limited detection capabilities of sensors. This paper aims to enhance the conventional MUAV cooperative anti-submarine framework [22], which relies solely on sensor detection information. To achieve this, a knowledge-driven search path planning system framework is developed by integrating target and prior task information with the UAVs' detection and status data. This framework consists of three layers: the data layer, the knowledge layer, and the decision layer, as illustrated in Figure 2. Their specific functions are as follows (a schematic code sketch of the three layers is given after the list):
(1)
The data layer stores basic search rules [22] (refer to Table 1) and prior information in the database, thereby providing data support for the knowledge layer;
(2)
The knowledge layer analyzes the various factors influencing the search rules and establishes mapping relationships. It then builds the search knowledge base and designs a knowledge structure for the target search based on these mappings, which includes event attributes, relationship conditions, and action templates. Furthermore, as the search evolves, the knowledge layer must consistently update both the search rules and the knowledge base;
(3)
The decision layer can call the UGSKB and use the SAC algorithm [25] to train the actions and states of the UAV, constantly adjusting and selecting the optimal strategy.
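To make the three-layer structure concrete, the sketch below outlines one possible composition in Python. The class and method names (DataLayer, KnowledgeLayer, DecisionLayer, decide) are illustrative and not taken from the paper; the actual framework is the one defined in Figure 2.

```python
class DataLayer:
    """Stores prior information and the basic search rules of Table 1."""
    def __init__(self, prior_info, basic_rules):
        self.prior_info = prior_info
        self.rules = dict(basic_rules)          # rule/event name -> action template

class KnowledgeLayer:
    """Maps detection events and UAV states to (new) search rules."""
    def __init__(self, data_layer):
        self.data = data_layer
        self.experience = []                    # optimal decisions fed back by RDIRV

    def update(self, transition):
        self.experience.append(transition)      # enrich the UGSKB during the search

class DecisionLayer:
    """Either calls a matching action template or falls back to KD-SAC."""
    def __init__(self, knowledge_layer, kd_sac_policy):
        self.kb = knowledge_layer
        self.policy = kd_sac_policy

    def decide(self, event, state):
        template = self.kb.data.rules.get(event)
        return template if template is not None else self.policy(state)
```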

2.4. Knowledge Base Model

The knowledge base is a semantic network that reveals the intricate relationships between entities [26]. Through knowledge extraction, the knowledge base can be transformed into triplets of the form entity–attribute–attribute value [27]. In this paper, the focus lies on the mapping relationship between $\Gamma = \{I_r, O_c, M_t, T_o, T_{mo}, S_e, C_m\}$ and specific events, which constitutes the set of factors influencing the search rules in MUAV cooperative anti-submarine missions. Finally, the knowledge structure shown in Figure 3 is designed, that is, event attribute–relationship condition–action template. $\Gamma$ includes the following parts:
(1)
Mission requirement $I_r$, i.e., detection of all targets at minimum cost;
(2)
Marine environment $O_c$, which includes meteorological conditions and hydrological characteristics;
(3)
Task scenario M t , i.e., a no-fly zone exists in the scenario;
(4)
Single target attribute T o , that is, the motion state and aggression of the target to be searched;
(5)
Multi-target attribute T m o , that is, the distribution rule of the formation;
(6)
Sensor properties $S_e$, which include the sensor type and operating conditions;
(7)
Communication mode C m , including communication mode and routing protocol between the UAVs, as well as between the UAVs and the ground station.
$A = \{\alpha_1, \alpha_2, \alpha_3, \ldots, \alpha_i, \ldots, \alpha_{n_s}\}$ represents the event attributes, where $n_s$ signifies the number of extractable events. The relationship conditions can be denoted by $\mathcal{B} = \{\beta_1, \beta_2, \beta_3, \ldots, \beta_i, \ldots, \beta_{n_g}\}$, where $n_g$ represents the number of extractable relationship conditions. The action templates can be represented by $X = \{\chi_1, \chi_2, \chi_3, \ldots, \chi_i, \ldots, \chi_{n_z}\}$, where $n_z$ represents the number of action templates (covering the UAV's motion status, communication mode, sensor type, and flight path).
Remark 1.
The relationship between A and $\mathcal{B}$ is many-to-one, meaning that multiple elements in A can refer to the same element in $\mathcal{B}$; for example, "obstacle", "threat area", and "target attribute" in A correspond to "task requirements" in $\mathcal{B}$. The relationship between $\mathcal{B}$ and X is one-to-many, where "task requirements" correspond to "sensor properties", "search rules", and other actions. X and A have a one-to-many relationship, which can be interpreted as a set of action templates capable of triggering multiple events.
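The sketch below illustrates one way to hold these many-to-one and one-to-many mappings in code. The alpha/beta/chi labels follow Section 2.4, while the concrete entries are examples only, not the paper's full knowledge base.

```python
# Illustrative UGSKB triplet store.
event_to_condition = {                       # A -> B is many-to-one
    "alpha1_threat_area":       "beta1_detection",
    "alpha2_obstacle":          "beta1_detection",
    "alpha3_underwater_target": "beta1_detection",
}

condition_to_templates = {                   # B -> X is one-to-many
    "beta1_detection": ["chi1_lidar_sensor",
                        "chi2_underwater_sensor",
                        "chi3_search_rule_1"],
}

def templates_for_event(event):
    """Resolve an event attribute to its candidate action templates (possibly empty)."""
    condition = event_to_condition.get(event)
    return condition_to_templates.get(condition, [])

print(templates_for_event("alpha2_obstacle"))   # -> the three templates above
```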

2.5. Decision Model Based on Markov Chain

The Markov decision model incorporates stochastic strategies and rewards, providing a behavior-controllable learning framework for MUAV decision-making. According to the Markov decision process, we represent the search decision as a five-element model $(S, A, P, R, \gamma)$, where S represents the motion states of $V_i$; A is the set of actions that $V_i$ can choose from; P and R denote, respectively, the probability and reward of action $a_i(k)$ performed by $V_i$ at time k when moving from the current state $s_i(k)$ to the state $s_i(k+1)$, with $a_i(k) \in A$ and $s_i(k), s_i(k+1) \in S$. Additionally, we introduce a discount factor $\gamma \in [0, 1]$, which ensures convergence and solvability of this model while determining the relative importance of current and future rewards.
In a Markov decision process, the reward obtained by $V_i$ at time k is denoted $r(s_i(k), a_i(k))$. According to the way the action is generated, two cases can be distinguished:
(i)
If the knowledge base can provide decision support, V i acquires the corresponding action template X i ( k ) , and modifies the action to a i ( k + 1 ) .
(ii)
If the search efficiency of the decision based on the knowledge base is low, the SAC algorithm can be employed by UAVs to calculate an optimal action considering the information from the current environment.
We denote the set of strategies that $V_i$ can choose from in the above two scenarios as $\pi$; among all planning strategies, there exists an optimal strategy $\pi_{\max}$. When all UAVs select the optimal strategy, the cumulative reward function is expressed as follows:
$$R_{N_v}(s_i(k+1), a_i(k+1)) = \begin{cases} \dfrac{1}{N_v}\displaystyle\sum_{i=1}^{N_v}\sum_{k=1}^{K} \gamma_i(k)\, r(s_i(k), a_i(k)), & D_c \le d_{ij} \le D_l \\[2ex] -\dfrac{1}{N_v}\displaystyle\sum_{i=1}^{N_v}\sum_{k=1}^{K} \gamma_i(k)\, r(s_i(k), a_i(k)), & d_{ij} < D_c \ \text{or}\ d_{ij} > D_l \end{cases} \tag{6}$$
where γ [ 0 , 1 ] represents the discount factor, which ensures convergence and solvability of this model while determining the importance of current or future rewards.
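A minimal sketch of the team return in Equation (6) is given below. It assumes a common per-step discount and that violating the distance constraint simply flips the sign of the return as a punishment, which is one reading of the reconstructed equation above; the function name is illustrative.

```python
def team_return(step_rewards, d_ij, d_safe, d_link, gamma=0.99, n_uav=2):
    """Discounted team return in the spirit of Eq. (6).

    step_rewards: per-step rewards r(s_i(k), a_i(k)); d_ij: pairwise UAV distance.
    Violating D_c <= d_ij <= D_l turns the return into a penalty (assumption).
    """
    ret = sum((gamma ** k) * r for k, r in enumerate(step_rewards)) / n_uav
    return ret if d_safe <= d_ij <= d_link else -ret

# Example: two UAVs, five per-step rewards, pair distance 120 m within [50, 500] m
print(team_return([1.0, 0.8, 0.5, 0.2, 0.1], d_ij=120.0, d_safe=50.0, d_link=500.0))
```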

2.6. Search Objective

The paper investigates the utilization of multiple UAVs for searching submarine formations and navigating through complex waters while avoiding obstacles. To enhance the search capability of UAVs, a collaborative search method based on a knowledge base is proposed. In essence, the cooperative search problem in intricate environments can be viewed as an optimization problem with constrained conditions, aiming to maximize target detection while ensuring UAV safety.

3. Knowledge-Driven UAVs Autonomous Decision-Making

The knowledge layer depicted in Figure 2 stores the knowledge base processed by the knowledge extraction strategy, which should include not only the initial knowledge construction but also the feedback from the UAVs after each optimal decision is executed. Moreover, when invoking the knowledge base to address the autonomous decision problems of UAVs, whether or not to invoke it should be determined by the scenario. Therefore, building upon Section 2.4, this section proposes an RDIRV method for updating the knowledge base and an event-based mechanism for accessing it, to facilitate knowledge-driven independent decision-making by the UAVs.

3.1. Update Mechanism Based on RDIRV

The primary function of RDIRV is to acquire knowledge from the optimal decision cases made by the UAVs, establish new search rules using these optimal decisions as reference cases, and then convert this knowledge into usable search information through relational mapping with a knowledge extraction method. This information is subsequently stored in the knowledge base. To achieve RDIRV, we devise a modified reward function $r_i(k) = r(s_i(k), a_i(k), g, \kappa)$ for each planning decision made by $V_i$, where g represents the task requirement that must be fulfilled when an optimal decision exists and $\kappa$ denotes the adaptability value associated with g. Furthermore, the closer the state $s_i(k)$ is to g, the greater the obtained reward.
The specific flow of RDIRV is as follows. Firstly, we initialize an offline DRL model and clear the playback buffer D. Then, we set the initial state of $V_i$ to $s_i(0)$. Subsequently, the DRL algorithm interacts with the environment and obtains a transition tuple $(s_i(k), a_i(k), r_i(k), s_i(k+1), a_i(k+1), r_i(k+1), g, \kappa)$ for Markov decision-making. The transition tuple with the original task request g is stored in the playback buffer D, while RDIRV additionally stores the transition tuples $(s_i(k), a_i(k), r_i(k), s_i(k+1), a_i(k+1), r_i(k+1), g', \kappa)$ with the task requests $g' \in \{g_1, g_2, \ldots, g_m\}$ in D.
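This flow resembles hindsight-style goal relabelling. The sketch below stores each transition once with its original goal and once per supplementary goal, assuming a reward function of the form r(s, a, g, κ) as in Section 3.1; names and the buffer capacity are illustrative.

```python
import random
from collections import deque

replay = deque(maxlen=100_000)   # playback buffer D (capacity is an arbitrary choice here)

def store_with_rdirv(s, a, s_next, g, kappa, reward_fn, extra_goals):
    """Store one transition with its original goal g, then relabel it with the
    supplementary task requests g' in {g_1, ..., g_m}, mirroring the RDIRV flow."""
    replay.append((s, a, reward_fn(s, a, g, kappa), s_next, g, kappa))
    for g_alt in extra_goals:                          # deductive-inference return visit
        replay.append((s, a, reward_fn(s, a, g_alt, kappa), s_next, g_alt, kappa))

def sample_batch(batch_size=256):
    """Uniformly sample a minibatch from D for the KD-SAC gradient steps."""
    return random.sample(replay, min(batch_size, len(replay)))
```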
Remark 2.
In the initial phase of the search, the UAV's decisions are stochastic due to the unavailability of complete mission information, resulting in low reward values for each deductive inference. As the UAV continues its search, its cognitive understanding of the environmental information gradually improves, thereby amplifying the impact of each deductive inference on the knowledge base.

3.2. Event Triggered Knowledge Base Call Mechanism

The motion state of the UAV is determined by a maneuvering model based on Markov decisions, and it dynamically adapts to the UAV’s perception of the current environment. To access knowledge base information, an event-triggering mechanism needs to be designed. This mechanism ensures that the event capable of invoking the knowledge base is triggered only when a specific situation occurs; otherwise, it remains inactive. The process can be described as follows:
The UAV finding a dynamic water target serves as a trigger event $W_1$. The environmental factors in the designated search area of the UAV are defined as $\alpha_1$ (threat area within the mission zone), $\alpha_2$ (obstruction within the mission zone), and $\alpha_3$ (underwater target detection) in the event attribute A. The task requirement is specified as $\beta_1$ (detection) in $\mathcal{B}$. For anti-submarine missions, the available action templates for the UAV include $\chi_1$ (the LiDAR sensor), $\chi_2$ (underwater target detection sensor selection), and $\chi_3$ (search Rule 1). The UAV is required to perform Rule 1 (i.e., $\chi_3$) in Table 1. If an underwater target is not detected, this serves as a triggering event $W_2$ that satisfies condition $\alpha_4$ (search for other underwater targets) within the event attribute A. If an underwater target is detected, the optional action templates $\chi_1$, $\chi_2$, and $\chi_4$ (performing search Rule 2) are available. Subsequently, the UAV executes the corresponding actions based on Rule 2.
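The event-to-template lookup can be sketched as a simple dispatch table, shown below. The trigger names and template lists mirror the worked example above and are illustrative rather than an exhaustive knowledge base; when no event fires, the UAV keeps planning with KD-SAC.

```python
def select_action_templates(event):
    """Event-triggered UGSKB call: map a trigger event to candidate action
    templates, or return None so the UAV keeps planning with KD-SAC."""
    triggers = {
        # W1: a dynamic water target is found -> chi1 (LiDAR), chi2 (underwater
        #     detection sensor), chi3 (search Rule 1)
        "W1_water_target_found":   ["chi1", "chi2", "chi3"],
        # W2: no underwater target detected (alpha4) -> keep searching with Rule 1
        "W2_no_underwater_target": ["chi1", "chi2", "chi3"],
        # an underwater target is detected -> chi4 (search Rule 2) becomes available
        "underwater_target_found": ["chi1", "chi2", "chi4"],
    }
    return triggers.get(event)

print(select_action_templates("W1_water_target_found"))
```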

4. Path Planning Method Based on KD-SAC

Through the analysis of the UAV knowledge-driven autonomous decision-making process, the multi-UAV cooperative search problem necessitates continuously resolving the optimal action template, followed by an expansion of the knowledge base as described in Section 3.1. Therefore, this paper proposes the KD-SAC algorithm with two main functions: (1) exploring random strategies extensively and making optimal decisions through interaction with the environment; (2) adjusting the search rules via KD-SAC according to the magnitude of the decision income after each MUAV independently makes a decision using the knowledge base. The KD-SAC algorithm framework is shown in Figure 4.
The policies in SAC adhere to a specific distribution, and the stochasticity of the strategy $\pi(\cdot|s_i(k))$ of $V_i$ at time k is quantified by the strategy entropy $\mathcal{H}(\pi(\cdot|s_i(k)))$. To maximize the strategy's entropy, the SAC algorithm incorporates entropy into the following objective function:
$$J(\pi) = \sum_{k=0}^{T} \mathbb{E}_{(s_i(k), a_i(k)) \sim \rho_\pi}\left[ r(s_i(k), a_i(k)) + \xi\, \mathcal{H}(\pi(\cdot|s_i(k))) \right] \tag{7}$$
where the regularization coefficient $\xi$ of the entropy $\mathcal{H}$ is a crucial parameter that determines the degree of randomness in the strategy; a higher value of $\xi$ indicates a more exploratory approach. Moreover, the entropy $\mathcal{H}$ can be defined as follows:
$$\mathcal{H}(\pi(\cdot|s_i(k))) = \mathbb{E}_\pi\left[ -\log \pi(\cdot|s_i(k)) \right] = -\int_A \pi(a_i(k)|s_i(k)) \log \pi(a_i(k)|s_i(k))\, da \tag{8}$$
The next step involves assuming that the SAC algorithm’s optimal search strategy is π max , which can be expressed as follows:
$$\pi_{\max} = \arg\max_\pi \mathbb{E}_{(s_i(k), a_i(k)) \sim \rho_\pi}\left[ r(s_i(k), a_i(k)) + \xi\, \mathcal{H}(\pi(\cdot|s_i(k))) \right] \tag{9}$$
where π max can maximize J ( π ) in (7) by employing Soft strategy iteration of the Soft Q function and the state value function. The representation of the Soft Q function is as follows:
$$Q(s_i(k), a_i(k)) := r(s_i(k), a_i(k)) + \gamma\, \mathbb{E}_{s_i(k+1) \sim p}\left[ V(s_i(k+1)) \right] \tag{10}$$
where $p = p(s_i(k+1)|a_i(k), s_i(k))$ represents the state transition probability and $V(s_i(k))$ represents the Soft state value function, whose expression is as follows:
$$V(s_i(k)) = \mathbb{E}_{a_i(k) \sim \pi}\left[ Q(s_i(k), a_i(k)) - \xi \log \pi(a_i(k)|s_i(k)) \right] \tag{11}$$
In the so-called Soft strategy evaluation, the Bellman equation in (10) is used to evaluate the Soft Q value of the strategy $\pi(\cdot|s_i(k))$ at each time step.
The objective function for the update strategy can be defined as follows:
$$J_\pi(\pi) = \mathbb{E}_{s \sim p}\left[ D_{KL}\!\left( \pi(\cdot|s_i(k)) \,\middle\|\, \frac{\exp\!\left(\tfrac{1}{\xi} Q(s_i(k), \cdot)\right)}{Z(s_i(k))} \right) \right] \tag{12}$$
where the symbol $D_{KL}$ represents the Kullback–Leibler (KL) divergence and $Z(s_i(k))$ denotes the partition function used to normalize the distribution. In Equation (12), the improvement of the Soft strategy is primarily manifested in minimizing $J_\pi(\pi)$ with respect to $\pi(\cdot|s_i(k))$. Based on $D_{KL}$, we can ascertain that the maximization problem (7) is equivalent to the minimization problem (12).
The process of evaluation and improvement of Soft strategies is commonly known as Soft strategy iteration. In this manner, the SAC algorithm can determine the optimal strategy $\pi_{\max}$ and its corresponding Q value. However, when applying Soft strategy iteration directly to multi-UAV systems with continuous states, it becomes necessary to approximate the instance data provided by the knowledge base. Therefore, instead of employing Soft strategy iteration, we can utilize the Q function as a function approximator. Subsequently, we denote the Soft Q network parameters as $\theta$ and the Soft strategy network parameters as $\phi$. Then, we optimize the parameter $\theta$ of the Soft Q function using the Bellman residual $J_Q(\theta)$. The specific form of $J_Q(\theta)$ is given below:
$$J_Q(\theta) = \mathbb{E}_{(s_i(k), a_i(k)) \sim D}\left[ \frac{1}{2}\left( Q_\theta(s_i(k), a_i(k)) - \left( r(s_i(k), a_i(k)) + \gamma\, \mathbb{E}_{s_i(k+1)}\left[ V_{\bar{\theta}}(s_i(k+1)) \right] \right) \right)^2 \right] \tag{13}$$
where θ ¯ represents the parameter of the target Q function. The soft strategy parameter ϕ can be learned by the following formula to minimize the expected KL deviation in (12):
$$J_\pi(\phi) = \mathbb{E}_{s_i(k) \sim D}\left\{ \mathbb{E}_{a_i(k) \sim \pi_\phi}\left[ \xi \log \pi_\phi(a_i(k)|s_i(k)) - Q_\theta(s_i(k), a_i(k)) \right] \right\} \tag{14}$$
Finally, the Stochastic Gradient Descent (SGD) method is used to optimize the Soft Q networks, the Soft strategy network, and the weight $\xi$. We train two Soft Q networks parameterized by $\theta_1$ and $\theta_2$, updated with the exponential moving average method, to optimize $J_Q(\theta)$ in Equation (13). The minimum of the two Soft Q values is then employed in SGD to minimize the loss functions in Equations (13) and (14). At each gradient step of SGD, special emphasis is placed on optimizing the weight $\xi$ with the objective of minimizing the following formula:
$$J(\xi) = \mathbb{E}_{a_i(k) \sim \pi}\left[ -\xi \log \pi(a_i(k)|s_i(k)) - \xi\, \bar{\mathcal{H}} \right] \tag{15}$$
where $\bar{\mathcal{H}}$ is the desired target entropy. The SAC algorithm yields optimized values for $\theta_1$ and $\theta_2$, as well as the soft policy parameter $\phi$, after completing the learning process. Finally, by combining with RDIRV in Section 3.1, KD-SAC is capable of autonomously updating the knowledge base and revising the search rules. It should be emphasized that when invoking the knowledge base, KD-SAC solely evaluates the action templates matched to the UAVs; otherwise, KD-SAC is used to plan optimal decisions for the UAV. In Algorithm 1, we give the pseudo-code of KD-SAC with inference playback.
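The loss computations of Equations (13)-(15) can be sketched in PyTorch as below. This is a minimal sketch, not the paper's implementation: `actor(s)` is assumed to return an action and its log-probability, the network objects are placeholders, and the double-Q structure follows the standard SAC formulation the text describes.

```python
import torch
import torch.nn.functional as F

def kd_sac_losses(batch, actor, q1, q2, q1_targ, q2_targ, log_xi, target_entropy, gamma=0.99):
    """One gradient step worth of KD-SAC losses, sketched after Eqs. (13)-(15)."""
    s, a, r, s_next, done = batch
    xi = log_xi.exp()                                   # temperature weight

    # Soft Q target: r + gamma * (min Q_targ(s', a') - xi * log pi(a'|s'))
    with torch.no_grad():
        a_next, logp_next = actor(s_next)
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        target = r + gamma * (1.0 - done) * (q_next - xi * logp_next)

    loss_q = F.mse_loss(q1(s, a), target) + F.mse_loss(q2(s, a), target)      # Eq. (13)

    a_new, logp = actor(s)                                                    # Eq. (14)
    q_new = torch.min(q1(s, a_new), q2(s, a_new))
    loss_pi = (xi.detach() * logp - q_new).mean()

    loss_xi = -(log_xi * (logp.detach() + target_entropy)).mean()             # Eq. (15)
    return loss_q, loss_pi, loss_xi
```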
Remark 3.
In knowledge-driven UAV autonomous decision-making processes, both maximizing reward value and deeply exploring search rules are essential. Therefore, the introduction of a maximum entropy model by the KD-SAC algorithm enables strategy randomization, dispersing the probability of each output action as much as possible rather than concentrating on a single action to explore potential search rules.
Algorithm 1: The pseudo-code of KD-SAC
Input: initialize $\xi$; initialize the playback pool D; Soft Q networks parameterized by $\theta_1$ and $\theta_2$; soft policy parameter $\phi$.
  1.    for each episode do
  2.      Sample an initial state set $s_i(0)$ of $V_i$ and a goal g
  3.      for k = 0 to K do
  4.         $a_i(k) \sim \pi_\phi(a_i(k)|s_i(k), g)$, select the action from the policy set $\pi(\cdot|s_i)$
  5.         $s_i(k+1) \sim P(s_i(k+1)|a_i(k), s_i(k), g)$, select the next state based on probability P
  6.      end for
  7.      for k = 0 to K do
  8.         $r_i(k) = r(s_i(k), a_i(k), g, \kappa)$, compute the action reward
  9.         $D \leftarrow D \cup \{(s_i(k), a_i(k), r_i(k), s_i(k+1), a_i(k+1), r_i(k+1), g, \kappa)\}$, establish the playback buffer D
  10.        for $g' \in \{g_1, g_2, \ldots, g_m\}$ do
  11.           $r_i'(k) = r(s_i(k), a_i(k), g', \kappa)$, compute the supplementary reward for RDIRV
  12.           $D \leftarrow D \cup \{(s_i(k), a_i(k), r_i'(k), s_i(k+1), a_i(k+1), r_i'(k+1), g', \kappa)\}$
  13.        end for
  14.      end for
  15.      for gradient descent step n = 1 to N do
  16.         $\theta \leftarrow \theta - \lambda_Q \hat{\nabla}_\theta J_Q(\theta)$, update the Soft Q function
  17.         $\phi \leftarrow \phi - \lambda_\pi \hat{\nabla}_\phi J_\pi(\phi)$, update the policy
  18.         $\xi \leftarrow \xi - \lambda \hat{\nabla}_\xi J(\xi)$, update the temperature weight
  19.         $\bar{\theta} \leftarrow \delta\theta + (1 - \delta)\bar{\theta}$, update the target network parameters
  20.      end for
  21.    end for
Output: optimal $\theta_1$, $\theta_2$, $\phi$

5. Simulation Experiment and Analysis

We analyze the results from three aspects: multi-UAV collaborative path planning, algorithm performance, and knowledge correlation. The effectiveness of the KD-SAC algorithm proposed in this paper is verified by comparing it with RI-MAC [21], PSO [22], and SAC [23] in the same environment, where SAC is a model-free, off-policy DRL algorithm whose strategy is similar to the one used in this paper. We consider the following three experimental scenarios: (1) multi-UAV collaborative search trajectory planning based on invoking the knowledge base in a barrier-free environment; (2) two UAVs executing missions in a complex mission area; (3) four UAVs executing missions in a complex mission area. To verify and compare the learning performance of KD-SAC and SAC, we adopt simulation hyper-parameter values similar to those in paper [23], as shown in Table 2. In addition, to accelerate the learning rate of KD-SAC, we increase the values of $\lambda_Q$, $\lambda_\pi$, and $\lambda$.

5.1. Path Planning Analysis

5.1.1. Multi-UAV Cooperative Search Trajectories Based on Knowledge Base

  • Scenario 1:
We consider five searched targets moving in a square formation at sea, where the target at the central point, with position (5000, 0, 0), is the water target; its motion model is $v_{tx} = 19.6\sin(0.225k)$, $v_{ty} = 19.6\cos(0.225k)$, and $v_{tz} = 0$, where $[v_{tx}, v_{ty}, v_{tz}]^T$ is the motion vector. The start points of the two UAVs are (600, −1000, 500) and (−500, −300, 300), respectively.
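The water-target trajectory implied by this velocity model can be integrated numerically, as in the short sketch below. The step size dk and the forward-Euler summation are assumptions for illustration, since the paper does not state the integration scheme here.

```python
import numpy as np

def water_target_position(k, p0=(5000.0, 0.0, 0.0), dk=1.0):
    """Integrate v_tx = 19.6 sin(0.225 k), v_ty = 19.6 cos(0.225 k), v_tz = 0
    from step 0 up to step k with a simple forward-Euler sum."""
    steps = np.arange(k)
    x = p0[0] + np.sum(19.6 * np.sin(0.225 * steps)) * dk
    y = p0[1] + np.sum(19.6 * np.cos(0.225 * steps)) * dk
    return np.array([x, y, p0[2]])

print(water_target_position(50))   # target position after 50 steps
```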
The trajectories of the two UAVs are shown in Figure 5. It can be observed that the UAVs initially approach the target formation with an irregular flight path during the search phase. Upon first detecting the water target (i.e., target 1), the two UAVs activate search rule 1 in the knowledge base and simultaneously move along a spiral track centered on the water target. Subsequently, upon discovering the second target (i.e., target 4), search rule 2 is triggered in the knowledge base to explore other targets along the mid-vertical line connecting these two discovered targets. This experiment demonstrates the effectiveness of the event-triggering strategy described above. The UAV can select its search mode based on the fundamental search rules (refer to Table 1) to adapt to different cases, such as employing a spiral search or conducting a search along the central vertical axis. Specifically, the UAV employs different search rules from the knowledge base based on the number of targets discovered, such as utilizing rule 1 when one target is detected and rule 2 when two targets are detected. When event attributes and related conditions are associated with an action template in the knowledge base, the UAV can invoke and execute the corresponding action template. The search rules stored in the knowledge base thus serve as experiential guidance for the UAV, enabling it to directly obtain decision outcomes, which may comprise executable actions or search trajectories (such as spiral search trajectories).

5.1.2. Multi-UAV Cooperative Search Trajectories Planning Based on KD-SAC

  • Scenario 2:
This section establishes a 1000 m × 1000 m task area. It is assumed that within this designated area, multiple water and underwater targets travel in unknown directions at the same velocity while maintaining a certain distance from each other. The initial information for the movable targets is given in Table 3. Furthermore, we represent variations in the mission area's environmental complexity by setting obstacles and threat areas, which are detailed in Table 4. The experiment was conducted across two scenarios, with each scenario repeated 1000 times using the four algorithms. Subsequently, the path generated by each algorithm was thoroughly analyzed.
Scenario 2 satisfies the event attributes in A of the knowledge base: $\alpha_1$ (threat areas within the mission area), $\alpha_2$ (obstacles within the mission area), $\alpha_3$ (search for underwater targets), $\alpha_4$ (the height of obstacles in the task area is considered), and $\alpha_5$ (the number of obstacles in the task area is considered). Furthermore, the relationship condition satisfies $\beta_1$ (detection). The optional action templates remain $\chi_1$ (select the LiDAR sensor), $\chi_2$ (select the underwater target detection sensor), and $\chi_3$ (perform search Rule 1). To navigate through more complex environments, all four algorithms require real-time adjustments of the UAV's attitude. Figure 6 illustrates the paths planned for the UAVs by these algorithms. Additionally, Table 5 presents the initial information of the UAVs in this scenario.
The complexity of the search environment increases with the number of obstacles, as shown in Figure 6. Similar to Scenario 1, all UAVs select action template $\chi_3$ and execute search Rule 1. Due to the increased number of obstacles, the KD-SAC algorithm is employed to learn and adapt to the new environment, resulting in the formation of a new search rule. Ultimately, the paths through the areas with nine obstacles (represented by black circles in the top view of Figure 6) and two threats (indicated by hemispheres in Figure 6) successfully reach the predicted target location from the start point. The peak depicted in Figure 6 represents the actual target location. Furthermore, as evident from the smoothness of the pink lines in Figure 6c,d, KD-SAC demonstrates superior path planning capability compared to SAC.
  • Scenario 3:
The initial information settings of the four UAVs in this scenario are presented in Table 6. Under the event attributes A and relationship conditions of Scenario 2, the UAVs execute actions based on the Markov maneuver model and autonomously select templates $\chi_1$ (LiDAR sensor), $\chi_2$ (underwater target detection sensor), and $\chi_3$ (execute search Rule 1). Four distinct prediction points are assigned to the UAVs according to Rule 1, resulting in a broader search area compared to Scenario 2. Figure 7 illustrates the path-planning outcomes for all four algorithms in this task scenario. It can be seen from Figure 7 that even when multiple UAVs plan trajectories at the same time, the KD-SAC algorithm can plan smoother trajectories; the trajectory of UAV4 is particularly smooth. The obstacle avoidance trajectory of UAV1 is smoother compared to Figure 6d, indicating that KD-SAC continuously adjusts the motion state of the UAV to enhance the search rule. This is because our goal g takes into account both the smoothness of the path and the threat posed by obstacles along the path. As the number of iterations increases, based on the deductive inference return visit mechanism, KD-SAC fully learns and consistently identifies the optimal solution from the action strategy set to optimize the action templates in the knowledge base.

5.2. Algorithm Performance Analysis

The performance of the four algorithms was compared through a comparative experiment on comprehensive cooperative evaluation, average reward, and the success rate of the UAV path, based on Section 5.1. The analysis results of the three comparative experiments are presented below.
The comparison of success rates among the four algorithms in Scenario 2 and Scenario 3 is illustrated in Figure 8. As the UAVs' understanding of the environment deepens, all four algorithms are capable of successfully generating an optimal flight path to the predicted target point. However, as depicted in Figure 8a, after conducting multiple experiments in the complex Scenario 2, the success rate of KD-SAC is 73.63%, while SAC, RI-MAC, and PSO achieve 56.36%, 43.75%, and 40.12%, respectively. The KD-SAC algorithm thus significantly outperforms the SAC, RI-MAC, and PSO algorithms in terms of planning success rate. The results in Figure 8b show that, compared to Scenario 2, the success rate of KD-SAC decreases to 71.83%, SAC decreases to 53.26%, and RI-MAC and PSO decrease to 33.35% and 29.21%, respectively. In the face of complex environments with four UAVs, the proposed KD-SAC algorithm maintains a high planning success rate, confirming its superiority and suitability for the problems outlined. The increase in the learning rate enhances the learning speed of KD-SAC. Moreover, as the number of iterations increases, the action templates within the knowledge base supply optional actions for the UAVs at each moment, which improves the learning efficiency of KD-SAC. Consequently, KD-SAC successfully generates an optimal path for a UAV from its initial position to its target position.
The comprehensive cooperation evaluation serves as a crucial metric for assessing the effectiveness of UAV cooperative search paths. By comparing the comprehensive cooperation evaluation indexes across Scenario 2 and Scenario 3, the performance of the proposed KD-SAC algorithm is validated. Figure 9 illustrates the experimental comparison among the four algorithms, with Figure 9a showing the cooperation evaluation for Scenario 2 (two UAVs) and Figure 9b that for Scenario 3 (four UAVs). In Scenario 2, the four algorithms exhibit distinct differences in cooperation evaluation within a complex task area environment, with the KD-SAC algorithm demonstrating a higher index value than the other three algorithms. In Scenario 3, where four UAVs are engaged in cooperative tasks, the exploration of environmental information accelerates as the number of UAVs increases, and the expansion of the knowledge base accelerates as well. The shared knowledge base of the UAV swarm enables each UAV to access the action templates of the entire cluster, thereby fostering enhanced synergy among the UAVs. Consequently, the KD-SAC algorithm outperforms the other algorithms by achieving a higher index value. This indicates that the algorithm proposed in this paper shows superior cooperation performance in search path planning.
To further demonstrate the effectiveness of the proposed algorithm in search path planning, we assessed the path planning outcomes of the different algorithms based on the average reward value, which serves as a measure of the target search earnings of the UAVs. The initial state of the KD-SAC algorithm differs between Scenario 2 and Scenario 3 because it must comprehend the environmental information and make decisions based on the knowledge base; therefore, the average reward value in the initial state does not affect the subsequent evaluations. As shown in Figure 10, the average reward value of the KD-SAC algorithm gradually increases with the number of experiments, which indicates its faster learning speed.

5.3. Knowledge Correlation Analysis

To validate the impact of the proposed knowledge base on the search results, we randomly placed multiple static and dynamic targets, obstacles, and threat areas in a single scene. We then conducted repeated learning and training using the algorithm presented in this paper. Subsequently, for different numbers of iterations, we compared the deviation between the target position information predicted by the action templates matched with our knowledge base and the actual target position information. These experiments confirm both the effectiveness of our knowledge base and RDIRV's expansion effect on it. The specific error analysis results are shown in Figure 11.
The comparison of errors under different numbers of iterations in the same scenario is illustrated in Figure 11. Figure 11a indicates that after 500 iterations, the maximum error is 0.2382 and the minimum error is 0.0267; it can be observed that, during the initial stage of task execution, by invoking the knowledge-base-matched action template, the UAV can predict the target's location with a small margin of error. Figure 11b demonstrates that after 1000 iterations, the maximum error decreases to 0.1496 while the minimum error reduces to 0.0153. As the iterations proceed, the UAV's understanding of its environment becomes increasingly comprehensive; meanwhile, RDIRV continues to replay optimal decisions, resulting in decreased error values. The results in Figure 11c show that after 1500 iterations, the maximum error is 0.0523 while the minimum error is 0.0074. As the training progresses, the UAV's proficiency in the task area reaches its peak and the knowledge base content is continuously enhanced, resulting in significantly reduced prediction errors. The knowledge base is expanded and shared among multiple UAVs, enabling a collective learning process. As the action templates in the knowledge base grow, the set of actions the UAVs must explore in previously unvisited environments is reduced, thereby enhancing the learning efficiency of KD-SAC. Through this comparative analysis, we can observe the effectiveness and efficiency of utilizing the knowledge base for addressing multi-UAV cooperative path planning problems.

6. Discussion

In this paper, based on an analysis of the factors influencing multi-UAV anti-submarine tasks in a complex, unknown sea environment, a knowledge-driven search path planning system framework is designed. The design of the three-layer information model helps the UAV system form a knowledge base shared among the aircraft groups. Within this framework, the data and knowledge layers synergistically integrate the prior information, detection information, state information, and search rules of the UAV group to ultimately influence the decision-making process at the decision layer. Through the combination of online search and learning, high-quality UAV decision schemes can be transformed into search experience that enriches the knowledge base. To facilitate calling the knowledge base, an event-based call mechanism is designed; the UAV is capable of utilizing search rules from the knowledge base according to the current event attributes and associated rules. The experiments show that the UAV swarm can perform the search task according to the action template in the knowledge base under a specific event. In addition, to continuously improve the knowledge base and the quality of search path planning, the KD-SAC algorithm interacts with the environment to adjust the actions and states of the UAVs in real time and generate optimal decisions; new search rules are then formed by the RDIRV strategy, and the knowledge base is expanded. This strategy can accelerate the learning rate of the KD-SAC algorithm by modifying the punishment function. Simulation results show that KD-SAC is efficient in learning and solving path-planning problems. The superiority of this strategy is demonstrated by comparing KD-SAC with three other algorithms, particularly in terms of planning highly smooth paths and effectively avoiding obstacles. The obstacles in this paper are represented as regular geometric shapes, which simplifies the complexity of the environment; therefore, in future studies, we will consider environments with irregular obstructions.

7. Future Work

The next step involves validating the proposed strategy's path planning and search capabilities in a larger cluster of UAVs and a more complex environment that includes dynamic obstacles and irregular obstacles with overlapping polygons. Then, the impact of the natural environment on the performance of various sensors, including ranging sensors and detection sensors, will be examined and taken into consideration. Furthermore, a more comprehensive initial knowledge base will be established by investigating the influence of submarine formations on the search rules. Finally, the knowledge-driven search path planning system framework will be applied to existing anti-submarine problems.

Author Contributions

Conceptualization, W.Y.; Methodology, W.Y., X.Z. and W.T.; Validation, W.Y.; Investigation, W.Y.; Writing—original draft preparation, W.Y. and W.T.; Writing—review and editing, W.T.; Supervision, W.Y.; Project administration, W.Y.; Funding acquisition, W.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Dalian Science and Technology Innovation Fund [grant number 2019J12GX040] and the Fundamental Research Funds for the Central Universities [grant number 3132017128].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Li, J.Q.; Zhang, G.Q.; Jiang, C.Y. A Survey of Maritime Unmanned Search System: Theory, Applications and Future Directions. Ocean Eng. 2023, 285, 1–12.
2. Mishra, M.; An, W.; Sidoti, D. Context-Aware Decision Support for Anti-Submarine Warfare Mission Planning within a Dynamic Environment. IEEE Trans. Syst. Man Cybern. Syst. 2020, 50, 318–335.
3. Yahia, H.S.; Mohammed, A.S. Path Planning Optimization in Unmanned Aerial Vehicles Using Meta-heuristic Algorithms: A Systematic Review. Environ. Monit. Assess. 2023, 195, 30.
4. Li, F. Technical Research on Scheme Design and Decision of Unmanned Cluster Cooperative Anti-Submarine. In Proceedings of the 2022 IEEE 13th International Conference on Software Engineering and Service Science, Beijing, China, 21–23 October 2022; pp. 110–114.
5. D'Souze, J.M.; Velpula, V.V.; Guruprasad, K.R. Effectiveness of a Camera as a UAV Mounted Search Sensor for Target Detection: An Experimental Investigation. Int. J. Control Autom. Syst. 2021, 19, 2557–2568.
6. Yao, P.; Zhu, Q.; Zhao, R. Gaussian Mixture Model and Self-Organizing Map Neural-Network-Based Coverage for Target Search in Curve-Shape Area. IEEE Trans. Cybern. 2022, 52, 3971–3983.
7. Ding, W.J.; Gao, H.; Guo, H. Investigation on Optimal Path for Submarine Search by an Unmanned Underwater Vehicle. Comput. Electr. Eng. 2019, 79, 106468.
8. Jia, Q.Y.; Xu, H.L.; Feng, X.S. Research on Cooperative Area Search of Multiple Underwater Robots Based on the Prediction of Initial Target Information. Ocean Eng. 2019, 172, 660–670.
9. Chen, J.C.; Ling, F.Y.; Zhang, Y. Coverage Path Planning of Heterogeneous Unmanned Aerial Vehicles Based on Ant Colony System. Swarm Evol. Comput. 2022, 69, 101005.
10. Liu, F.; Zeng, G.Z. An Online Multi-agent Co-operative Learning Algorithm in POMDPs. J. Exp. Theor. Artif. Intell. 2008, 20, 335–344.
11. Chen, P.; Wu, X.F.; Chen, Y. Method of Call-search for Markovian Motion Targets Using UUV Cooperation. Syst. Eng. Electron. 2012, 34, 1630–1634.
12. Yang, T.T.; Jiang, Z.; Sun, R.J. Maritime Search and Rescue Based on Group Mobile Computing for Unmanned Aerial Vehicles and Unmanned Surface Vehicles. IEEE Trans. Ind. Inform. 2020, 16, 7700–7708.
13. Luo, Q.Y.; Luan, T.H.; Shi, W.S. Deep Reinforcement Learning Based Computation Offloading and Trajectory Planning for Multi-UAV Cooperative Target Search. IEEE J. Sel. Areas Commun. 2023, 41, 504–520.
14. Duan, H.B.; Zhao, J.X.; Deng, Y.M. Dynamic Discrete Pigeon-Inspired Optimization for Multi-UAV Cooperative Search-Attack Mission Planning. IEEE Trans. Aerosp. Electron. Syst. 2021, 57, 706–720.
15. Fei, B.W.; Bao, W.D.; Zhu, X.M. Autonomous Cooperative Search Model for Multi-UAV with Limited Communication Network. IEEE Internet Things J. 2022, 9, 19346–19361.
16. Shen, G.Q.; Lei, L.; Zhang, X.T. Multi-UAV Cooperative Search Based on Reinforcement Learning with a Digital Twin Driven Training Framework. IEEE Trans. Veh. Technol. 2023, 72, 8354–8368.
17. Wang, Y.D.; Liu, W.Z.; Liu, J. Cooperative USV–UAV Marine Search and Rescue with Visual Navigation and Reinforcement Learning-based Control. ISA Trans. 2023, 137, 222–235.
18. Cao, X.; Sun, H.B.; Jan, G.E. Multi-AUV Cooperative Target Search and Tracking in Unknown Underwater Environment. Ocean Eng. 2018, 150, 1–11.
19. Ma, X.W.; Chen, Y.L.; Bai, G.Q. Multi-autonomous Underwater Vehicles Collaboratively Search for Intelligent Targets in an Unknown Environment in the Presence of Interception. Proc. Inst. Mech. Eng. Part C J. Mech. Eng. Sci. 2021, 235, 1539–1554.
20. Liu, Y.; Peng, Y.; Wang, M. Multi-USV System Cooperative Underwater Target Search Based on Reinforcement Learning and Probability Map. Math. Probl. Eng. 2020, 2020, 7842768.
21. Ni, J.J.; Yang, L.; Shi, P.F.; Luo, C.M. An Improved DSA-Based Approach for Multi-AUV Cooperative Search. Comput. Intell. Neurosci. 2018, 2018, 2186574.
22. Kyriakakis, N.A.; Marinaki, M.; Matsatsinis, N.; Marinakis, Y. Moving Peak Drone Search Problem: An Online Multi-Swarm Intelligence Approach for UAV Search Operations. Swarm Evol. Comput. 2021, 66, 100956.
23. Yue, W.; Tang, W.B.; Wang, L.Y. Multi-UAV Cooperative Anti-Submarine Search Based on a Rule-Driven MAC Scheme. Appl. Sci. 2022, 12, 5707.
24. Phung, M.D.; Ha, Q.P. Motion-encoded Particle Swarm Optimization for Moving Target Search Using UAVs. Appl. Soft Comput. 2020, 97, 106705.
25. Myoung, H.L.; Jun, M. Deep Reinforcement Learning-based Model-free Path Planning and Collision Avoidance for UAVs: A Soft Actor–Critic with Hindsight Experience Replay Approach. ICT Express 2023, 9, 403–408.
26. Kenett, Y.N.; Faust, M. A Semantic Network Cartography of the Creative Mind. Trends Cogn. Sci. 2019, 23, 271–274.
27. Fan, J.; Kalyanpur, A.; Gondek, D.C. Automatic Knowledge Extraction from Documents. IBM J. Res. Dev. 2012, 56, 1–10.
Figure 1. Multi-UAV cooperative anti-submarine task scenario.
Figure 2. Knowledge-driven cooperative search framework.
Figure 3. Knowledge structure.
Figure 4. Path planning method framework based on KD-SAC.
Figure 5. Two UAVs searching trajectories in a barrier-free environment.
Figure 6. Comparison of optimal path planning generated by four algorithms in Scenario 2.
Figure 7. Comparison of optimal path planning generated by four algorithms in Scenario 3.
Figure 8. Comparison of the success rate of path planning in two scenarios of four algorithms.
Figure 9. Path comprehensive cooperation evaluation of four algorithms in two scenarios.
Figure 10. Comparison of average rewards of four algorithms in three scenarios.
Figure 11. Error comparison under different iterations.
Table 1. Basic search rules.

Rule Name | Rule Description | Remark
Rule 1 | The spiral search is conducted with the submarine as the center of the circle after locating the single target. | The objective is to search for additional submarine targets that might be located close to the designated target.
Rule 2 | The search is reversed in the direction of the submarine, based on Rule 1, if no other target is detected. |
Rule 3 | The search is carried out along the mid-vertical line connecting the two submarines once the second submarine is located. | The mid-vertical line may potentially contain submarine-protected targets.
Rule 4 | The submarines are surrounded by polygons after a thorough search, with the geometric center of the polygon being the focus of the search. | The geometric center of the polygon may have a heavily protected combat platform.
Rule 5 | The UAV will search the midpoint of the connection lines between two detected water targets. |
Table 2. Hyper-parameter setting of KD-SAC algorithm.

Serial Number | Hyper-Parameter | Value
1 | Learning rates λ_Q, λ_π, λ | 5.6 × 10^−4
2 | Discount factor γ | 0.99
3 | Hidden layers | 2
4 | Number of hidden units | 256
5 | Minibatch size | 256
6 | Smoothing factor δ | 0.005
7 | Playback pool capacity | 10^8
8 | Target entropy H̄ | −1
Table 3. Moving target information.

Serial Number | Start Point (m) | Initial Relative Angle (°) | Velocity (km/h)
1 | (180, 420) | 45 | 3
2 | (350, 180) | −35 | 3
3 | (750, 650) | 30 | 3
4 | (850, 350) | −60 | 3
Table 4. Obstacle and threat location information settings.

Serial Number | Attribution | Center Coordinates (m) | Radius Size (m)
1 | Obstacle 1 | (350, 500, 40) | 60
2 | Obstacle 2 | (600, 200, 50) | 70
3 | Obstacle 3 | (500, 350, 50) | 50
4 | Obstacle 4 | (300, 280, 50) | 30
5 | Obstacle 5 | (700, 550, 50) | 50
6 | Obstacle 6 | (650, 750, 50) | 40
7 | Obstacle 7 | (800, 400, 50) | 50
8 | Obstacle 8 | (300, 650, 50) | 50
9 | Obstacle 9 | (500, 600, 50) | 60
10 | Threat area 10 | (450, 650, 70) | 20
11 | Threat area 11 | (580, 400, 80) | 30
Table 5. UAV initial information under Scenario 2.

Number | Start Point | Initial Relative Angle (°) | Velocity (km/h)
UAV1 | (150, 350, 150) | 0 | 90
UAV2 | (400, 100, 150) | 0 | 90
Table 6. UAV initial information under Scenario 3.

Number | Start Point | Initial Relative Angle (°) | Velocity (km/h)
UAV1 | (100, 350, 100) | 0 | 90
UAV2 | (250, 200, 100) | 0 | 90
UAV3 | (200, 500, 100) | 180 | 90
UAV4 | (400, 100, 100) | 180 | 90