1. Introduction
In recent years, with the development of weapons technology, offensive and defensive confrontation scenarios have become increasingly complex, and the traditional one-to-one game formulation struggles to keep pace with the trend toward battlefield intelligence. In recent studies, both sides of the confrontation continuously adopt new game strategies to gain battlefield advantages. Among these, the target-missile-defender (TMD) three-body engagement triggered by active target defense has attracted increasing research interest [1,2,3,4,5,6,7]. A typical three-body confrontation scenario involves three types of vehicles: the target (usually a high-value vehicle such as an aircraft or ballistic missile), an attacking missile that attacks the target, and a defender missile that intercepts the attacking missile. This combat scenario goes beyond the traditional pursuit-evasion model, with greater complexity and more possibilities for battlefield games.
Early classical studies of the three-body confrontation problem mainly started from the spatial-geometric relationship: the target was defended by designing the spatial position of the defender relative to the target and the attacking missile (e.g., placing the defender between the target and the missile). From the line-of-sight (LOS) guidance perspective, a guidance strategy for a defender guarding a target was investigated that enables a defender at a speed and maneuverability disadvantage to intercept the attacking missile [8]. Triangle intercept guidance is another ingenious guidance law based on the idea of LOS command guidance [9]. To avoid degraded system performance, or the need for additional high-resolution radar assistance, caused by reduced angular resolution at long range, a simpler gain form of the LOS angular rate was derived by optimal control, reducing the capability requirements of the sensing equipment [10,11]. Nonlinear control approaches, such as sliding mode control, can also achieve control of the LOS rotation [12].
The more dominant research approach to the three-body problem is based on optimal control or differential games. The difference between the two is that a guidance law based on optimal control theory needs to know the opponent's control strategy in advance. Although the reliance on a priori information in one-sided optimization can be reduced by sharing information between the target and the defender [13], difficulties remain, such as applying numerical optimization algorithms online. In contrast, the differential game has received more widespread attention because it does not require additional assumptions about the opponent's strategy [14,15]. A differential game obtains the strategies of both opponents by finding the saddle-point solution and, under the condition of accurate modeling, guarantees the optimality of the strategy against any maneuver of the opponent [16,17,18]. To address the drawback that the control of the linear quadratic differential game guidance law may exceed its bounds, the bounded differential game was proposed and verified on a two-dimensional plane and in three-dimensional space [19,20]. The differential game approach can also be applied to analyze the capture and escape regions, and the Hamilton–Jacobi–Isaacs equation can be solved to demonstrate the consistency of the geometric approach with the optimal control approach [21,22,23,24]. Based on the analysis of the capture radius, the game can be divided into different stages, corresponding control strategies can be proposed, and the conditions for stage switching can be analyzed [25,26]. In addition, to move closer to the actual battlefield environment, recent studies have considered hard constraints on capability boundaries [27], state estimation under imperfect information through Kalman filtering [28], relative intercept angle constraints on attacking requirements [17,29,30], multi-vehicle cooperation against an actively defended target [17,31], weapon-target-allocation strategies [32], and so on.
Existing studies essentially rely on linearization and order reduction of the model to derive guidance laws that satisfy certain constraints and performance requirements. To simplify the derivation, the vehicles are often assumed to possess ideal dynamics [33,34]. However, as the participating vehicles adopt more advanced game strategies, the battlefield becomes more complex and the linearization suffers from significant distortion under intense maneuvering confrontations.
Deep reinforcement learning (DRL), developed in recent years, adapts well to complex nonlinear scenarios and shows strong potential in the aerospace field [35], with applications to the attitude control of hypersonic vehicles [36], the design of missile guidance laws [37,38], asteroid landing [39,40], vehicle path planning [41], and other problems. In addition, many studies have applied DRL to pursuit-evasion games or the TMD engagement. The cooperative capture of an advanced evader by multiple pursuers was studied in [42] using DRL; such a complex, uncertain environment is difficult to handle with differential game or optimal control methods. In [43], the researchers applied reinforcement learning algorithms to a particle environment in which the attacker was able to evade the defender and eventually capture the target, showing better performance than traditional guidance algorithms. The agents in [42,43] all have ideal dynamics with fewer constraints than real vehicles. In [44], from the perspective of the target, reinforcement learning was applied to study the timing of launching defenders from the target, which has the potential to be solved online. Deep reinforcement learning has also been utilized for ballistic missile maneuvering penetration and attacks on stationary targets, which can likewise be considered a three-body problem [6,45]. In addition, adaptive dynamic programming, which is closely related to DRL, has attracted extensive interest in intelligent adversarial games [46,47,48,49,50]. However, the system models studied so far are relatively simple and few studies are applicable to complex continuous dynamic systems with multiple vehicles [51,52].
Motivated by the preceding discussion, we apply DRL algorithms to the three-body engagement and obtain intelligent game strategies for both offensive and defensive confrontations, so that both the attacking missile and the target/defender can combine evasion and interception performance. The strategy for the attacking missile ensures that the missile avoids the defender and hits the target; the strategy for the target/defender ensures that the defender intercepts the missile before it threatens the target. In addition, the DRL-based approach is highly adaptable to nonlinear scenarios and thus has outstanding advantages for solving more complex multi-body adversarial problems in the future. However, a gap also exists between the simulation environment and the real world when applying DRL approaches. Simulation environments can improve sampling efficiency and alleviate safety issues, but the reality gap creates difficulties when transferring agent policies to real devices. To address this issue, research applying DRL to the aerospace domain should focus on two aspects. On the one hand, sim-to-real (Sim2Real) research aims to close the reality gap and thus achieve more effective strategy transfer; the main methods currently used for Sim2Real transfer in DRL include domain randomization, domain adaptation, imitation learning, meta-learning, and knowledge distillation [53]. On the other hand, in the simulation phase, the robustness and generalization of the proposed methods should be fully verified, and in the practical application phase, hardware-in-the-loop simulation should be conducted to gradually improve the reliability of applying the proposed method to real devices.
To help the DRL algorithm converge more stably, we introduce curriculum learning into the agent training. The concept of curriculum learning was first introduced at the International Conference on Machine Learning (ICML) in 2009 and caused a great sensation in the field of machine learning [54]. In the following decade, numerous studies on curriculum learning and self-paced learning were proposed.
The main contributions of this paper are summarized as follows.
- (1)
Combining the findings of differential games in the traditional three-body game with DRL algorithms gives agent training a clearer direction, avoids the inaccuracies caused by model linearization, and better adapts to highly nonlinear, complex battlefield environments.
- (2)
The three-body adversarial game model is constructed as a Markov Decision Process suitable for reinforcement learning training. Through an analysis of the signs of the action space and the design of an adversarial reward function, the combat requirements of evasion and attack can be balanced in training both the missile and the target/defender.
- (3)
The missile agent and the target/defender agent are trained using a curriculum learning approach to obtain intelligent game strategies for both attack and defense.
- (4)
The intelligent attack strategy enables the missile to avoid the defender and hit the target in various battlefield situations and adapt to the complex environment.
- (5)
The intelligent active defense strategy enables the less capable target/defender to achieve an effect on the missile agent similar to that of an adversarial attack on a neural network; the defender intercepts the attacking missile before it hits the target.
The paper is structured as follows.
Section 2 introduces the TMD three-body engagement model and presents the differential game solutions obtained on the basis of linearization and order reduction. In Section 3, the three-body game is constructed as a Markov Decision Process with training curricula. In Section 4, the intelligent game strategies for the attacking missile and for the target/defender are solved separately using curriculum-based DRL. The simulation results and discussion are provided in Section 5, analyzing the advantages of the proposed approach. Finally, concluding remarks are given in Section 6.
3. Curriculum-Based DRL Algorithm
Applying the deep reinforcement learning algorithm to the TMD engagement scenario consists of the following steps. First, the engagement environment is constructed based on the dynamics model outlined in Section 2. Next, the environment is formulated as a Markov decision process, which includes action selection, reward shaping, and observation selection; this must be carefully designed with full account taken of the dynamics of the missile and the target/defender. Finally, a learning curriculum is designed to ensure training stability.
3.1. Deep Reinforcement Learning and Curriculum Learning
Reinforcement learning, as a branch of machine learning, has received a lot of attention from researchers in various fields in recent years. Classical reinforcement learning is used to solve the Markov Decision Process (MDP) of dynamic interaction between an agent and the environment, which consists of a quintuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state space and action space, $P$ denotes the state transition probability matrix, $r$ denotes the immediate reward, and $\gamma$ denotes the reward discount factor. In an MDP, the immediate reward and the next state depend only on the current state and action, which is called the Markov property. Solving a dynamic system through integration is essentially consistent with the MDP.
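As a purely illustrative sketch of this interaction loop (the class name, numerical values, and simplified planar kinematics are assumptions, not the dynamics model of Section 2), the TMD engagement can be wrapped in an MDP-style reset/step interface:

```python
import numpy as np

# Minimal sketch of the TMD engagement wrapped as an MDP-style environment.
# The class name, initial values, and simplified planar kinematics are illustrative
# assumptions, not the dynamics model of Section 2.
class TmdEnv:
    def __init__(self, dt=0.01, kill_radius=5.0):
        self.dt = dt                    # integration step [s]
        self.kill_radius = kill_radius  # terminal condition threshold [m]
        self.state = None

    def reset(self):
        # State: planar position/velocity of the missile (M), target (T), and defender (D).
        self.state = {
            "M": {"pos": np.array([0.0, 0.0]),     "vel": np.array([400.0, 0.0])},
            "T": {"pos": np.array([10000.0, 0.0]), "vel": np.array([-250.0, 0.0])},
            "D": {"pos": np.array([9000.0, 0.0]),  "vel": np.array([-350.0, 0.0])},
        }
        return self._observe()

    def step(self, accel_cmds):
        # accel_cmds: lateral acceleration commands, e.g. {"M": aM, "T": aT, "D": aD}.
        for name, a_lat in accel_cmds.items():
            v = self.state[name]["vel"]
            n = np.array([-v[1], v[0]]) / (np.linalg.norm(v) + 1e-9)  # unit normal to velocity
            self.state[name]["vel"] = v + a_lat * n * self.dt
            self.state[name]["pos"] = self.state[name]["pos"] + self.state[name]["vel"] * self.dt
        r_mt = np.linalg.norm(self.state["M"]["pos"] - self.state["T"]["pos"])
        r_dm = np.linalg.norm(self.state["D"]["pos"] - self.state["M"]["pos"])
        done = r_mt < self.kill_radius or r_dm < self.kill_radius
        reward = 0.0  # the shaped reward is discussed in Section 3.2
        return self._observe(), reward, done

    def _observe(self):
        # Placeholder observation; the actual observation vector is defined in Section 3.4.
        return np.concatenate([self.state[k]["pos"] for k in ("M", "T", "D")])
```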
Benefiting from the rapid development of deep learning, reinforcement learning has achieved remarkable results in recent years and has developed into deep reinforcement learning. However, DRL training is often plagued by reward sparsity and an excessively large action-state space. In the TMD engagement, we are concerned with the terminal miss distance rather than the intermediate process. Therefore, the terminal reward dominates the reward function, similar to the terminal performance index in an optimal control problem. Thus, the reward function in the guidance problem is typically sparse; otherwise, dense intermediate rewards may lead to speculative strategies that the designer does not expect. Furthermore, despite the clear problem definition and optimization goals, the nearly infinite action-state space and the wide range of random initial conditions still pose obvious difficulties for agent training. In particular, random conditions such as the position, speed, and heading error of each vehicle at the beginning of the engagement add uncertainty to the training.
To solve this problem, we use a curriculum learning approach to ensure steady progress in training. The learning process of humans and animals generally follows a sequence from easy to difficult, and curriculum learning draws on this idea. In contrast to the general paradigm of indiscriminate machine learning, curriculum learning mimics the process of human learning by proposing that models start with easy tasks and gradually progress to complex samples and knowledge [56,57]. Curriculum learning assigns different weights to training samples according to their difficulty. Initially, the highest weights are assigned to the easy samples; as training continues, the weights of the harder samples are gradually increased. Such a process of dynamically assigning weights to samples is called a curriculum. Curriculum learning can accelerate training and reduce the number of training iterations required to achieve the same performance. In addition, curriculum learning gives the model better generalization performance, i.e., it allows the model to be trained to a better local optimum. In our training, we start with simple missions so that the agent can easily obtain the sparse reward at the end of an episode; the random range of the initial conditions is then gradually expanded so that the agent can eventually cope with the complex environment.
In the following, we will construct an MDP framework for the TMD engagement, consisting of action selection, reward shaping, and observation selection. The formulation requires adequate consideration of the dynamic model properties, as these have a significant impact on the results.
3.2. Reward Shaping
For the design of the reward function, the engagement is regarded as a confrontation game between the missile and the target/defender, in which the advantage of one side on the battlefield corresponds to the disadvantage of the other. Therefore, the reward function should reflect the combat intention of both sides, including positive rewards and negative penalties; accordingly, the rewards and penalties of one side mirror the penalties and rewards of the other. We design two forms of reward functions: an intermediate term assigned as a reward or penalty within an episode, and a terminal term assigned as a reward or penalty near the end of an episode. The intermediate term adopts an exponential form that rises sharply as its argument approaches zero, with two parameters regulating the growth rate of the exponential function. The general idea is to obtain continuously varying dense rewards through the exponential function. However, dense rewards result in poor differentiation of the cumulative rewards between different policies and thus hinder policy updates. We eventually set the reward to vary significantly only as the argument approaches zero, meaning that it effectively acts as a sparse reward. The differential game formulation serving as the basis reduces the difficulty of training and ensures that the agent completes training with sparse rewards. For both the missile and the target/defender, the argument of the reward function can be chosen as either the relative distance or the zero-effort miss. Note that using the zero-effort miss in the reward function imposes no additional requirements on the hardware of the guidance system, as it is only used for off-line training. The terminal term adopts a staircase form, with its thresholds chosen as quantities associated with the kill radius.
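As a purely illustrative sketch of these two forms (the scale factors, thresholds, and function names are assumptions, not the paper's actual parameters), the intermediate and terminal terms might be implemented as follows:

```python
import numpy as np

# Hedged sketch of the two reward forms described above.
# `k1`, `k2`, and the staircase thresholds/values are illustrative assumptions.
def intermediate_reward(zem, k1=0.05, k2=1.0, sign=+1.0):
    """Dense exponential term that grows sharply as the zero-effort miss
    (or relative distance) `zem` approaches zero. `sign` is +1 when a small
    miss benefits this agent and -1 when it should be penalized."""
    return sign * k2 * np.exp(-k1 * abs(zem))

def terminal_reward(miss, kill_radius=5.0, sign=+1.0):
    """Staircase term granted near the end of an episode, keyed to the kill radius."""
    if miss < kill_radius:
        return sign * 10.0          # intercept / hit achieved
    elif miss < 2.0 * kill_radius:
        return sign * 2.0           # near miss
    return 0.0
```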
3.3. Action Selection
According to the derived Equation (4), when training the missile agent, the action is chosen as a two-dimensional vector of effective navigation gains; when training the target/defender agent, the action is chosen as a four-dimensional vector of effective navigation gains.
Further analysis of Equation (4) reveals that each term in the control law is precisely in the form of the classical proportional navigation guidance law [58]. Thus, each effective navigation gain has the meaning given in Table 1. Beyond its effective time, that is, after the engagement between the missile and the defender ends, the corresponding gain is set to zero.
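As an illustration of this structure (with hypothetical symbol names standing in for the quantities defined in Equation (4) and Table 1), a PN-style command assembled from the learned gains might look like:

```python
# Hedged sketch: a PN-style command assembled from learned effective navigation gains.
# The names (gain_mt, gain_md, closing speeds vc_*, LOS rates los_rate_*) are
# hypothetical stand-ins for the quantities defined in Equation (4) and Table 1.
def missile_accel_command(gain_mt, vc_mt, los_rate_mt, gain_md, vc_md, los_rate_md):
    """Missile lateral acceleration: one classical PN term on the missile-target
    geometry plus one PN-style term on the missile-defender geometry.
    The target/defender command is assembled analogously from its four gains."""
    return gain_mt * vc_mt * los_rate_mt + gain_md * vc_md * los_rate_md
```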
To further improve the efficiency and stability of training, we analyze the signs of the effective navigation gains. From the control point of view, the proportional navigation guidance law can be considered a feedback control system that regulates the zero-effort miss to zero. Therefore, only a negative feedback system avoids divergence, as shown in Figure 2a. The simplest step maneuver is often utilized to analyze the performance of a guidance system; the conclusion that the miss distance converges to zero with increasing flight time is provided in [58].
We establish the adjoint system of the negative feedback guidance system, as shown in Figure 2b. For convenience, the time variable of the adjoint system is substituted and, from the convolution integral, the miss distance response is obtained (Equation (10)). Converting Equation (10) from the time domain to the frequency domain, integrating the resulting expression, and taking the guidance system to be a single-lag system, we finally obtain the expression for the miss distance of the negative feedback guidance system in the frequency domain. Applying the final value theorem shows that, as the flight time increases, the miss distance tends to zero, which means that the guidance system is stable and controllable. Similarly, the expression for the miss distance of the positive feedback guidance system can be found in the frequency domain; applying the final value theorem again, the miss distance does not converge with increasing flight time but instead diverges to infinity.
This conclusion is obvious from the control point of view, since positive feedback systems are generally avoided because of their divergence characteristics. Therefore, positive feedback is never used in proportional navigation guidance systems and the effective navigation gain is never set to be negative. In the three-body engagement, however, each agent wants to drive some zero-effort miss distances to zero (for interception) while driving others away from zero (for evasion): the missile wants to decrease its miss to the target while increasing its miss to the defender, whereas the target and defender want the opposite. Therefore, combining the properties of negative and positive feedback systems, the navigation-gain actions associated with the interception objectives are constrained to be positive, and those associated with the evasion objectives are constrained to be negative.
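The effect of the gain sign can be illustrated with a toy simulation of the ideal-dynamics zero-effort-miss equation under a PN-style law; this is only a minimal sketch with illustrative numbers, not the engagement model of Section 2:

```python
# Toy illustration of why the sign of the effective navigation gain matters.
# With ideal dynamics, PN gives the zero-effort miss z the dynamics
# z_dot = -k * z / t_go, which drives z toward zero only when k > 0.
# All numbers are illustrative; this is not the engagement model of Section 2.
def simulate(gain, z0=100.0, t_final=10.0, dt=0.01, t_go_min=0.1):
    z, t = z0, 0.0
    while t_final - t > t_go_min:
        t_go = t_final - t
        z += (-gain * z / t_go) * dt   # negative feedback if gain > 0, positive if gain < 0
        t += dt
    return z

print("negative feedback (gain=+3):", simulate(+3.0))  # miss driven toward zero
print("positive feedback (gain=-3):", simulate(-3.0))  # miss grows instead of converging
```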
3.4. Observation Selection
During the flight of a vehicle, not all states are meaningful for the design of the guidance law, nor can all states be accurately obtained by sensors. Redundant observations not only complicate the structure of the network, thereby increasing the training difficulty, but also ignore the prior knowledge of the designer. Through radar and filtering technology, information such as the relative distance, closing speed, line-of-sight angle, and line-of-sight angle rate can be obtained; these quantities are also commonly required by classical guidance laws. Therefore, the observation of the agent is eventually selected as the vector of these relative quantities given in Equation (18). It should be noted that, both in training the missile agent and in training the target/defender agent, the selected observation is that of Equation (18). The observation imposes no additional hardware requirements on the vehicle and is compatible with existing weapon systems.
In addition, although the TMD engagement is divided into two phases (before and after the engagement between the missile and the defender ends), the observations associated with the defender are not set to zero in the second phase, in order to ensure the stability of the network updates.
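A minimal sketch of assembling such an observation from radar-obtainable relative quantities is given below; the pairings, ordering, and helper names are illustrative assumptions, and the actual composition is that of Equation (18):

```python
import numpy as np

# Hedged sketch: build an observation vector from the radar-obtainable quantities
# (relative distance, closing speed, LOS angle, LOS rate) for each vehicle pair.
# The pairings, ordering, and helper names are illustrative assumptions.
def relative_obs(pos_a, vel_a, pos_b, vel_b):
    rel_pos = pos_b - pos_a
    rel_vel = vel_b - vel_a
    r = np.linalg.norm(rel_pos)
    closing_speed = -np.dot(rel_pos, rel_vel) / (r + 1e-9)
    los_angle = np.arctan2(rel_pos[1], rel_pos[0])
    los_rate = (rel_pos[0] * rel_vel[1] - rel_pos[1] * rel_vel[0]) / (r**2 + 1e-9)
    return np.array([r, closing_speed, los_angle, los_rate])

def build_observation(missile, target, defender):
    # missile/target/defender: dicts with "pos" and "vel" 2-D numpy arrays.
    obs_mt = relative_obs(missile["pos"], missile["vel"], target["pos"], target["vel"])
    obs_md = relative_obs(missile["pos"], missile["vel"], defender["pos"], defender["vel"])
    return np.concatenate([obs_mt, obs_md])
```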
3.5. Curricula for Steady Training
Considering the difficulty of training directly on the full problem, the curriculum learning approach is adopted to delineate environments of varying difficulty, allowing the agent to start with simple tasks and gradually adapt to the complex environment. The curricula are defined by different ranges of randomness of the initial conditions. This randomness is reflected in the position of each vehicle (both lateral and longitudinal), its velocity, and its flight path angle including the pointing error. Greater randomness of the initial conditions implies greater uncertainty and complexity of the environment. If the initial conditions were generated from the completely random range from the beginning, it would be difficult to stabilize the training of the agent. The curricula therefore start training from a smaller range of random initial conditions and gradually expand the randomness.
Assuming that an initial-condition variable belongs to a given nominal interval, when the total number of training steps reaches a scheduled value, the random range of the variable is enlarged according to a scheduling variable that controls the curriculum difficulty. The training scheduler is depicted in Figure 3, from which it can be seen that the random range keeps expanding; by the time the final scheduled training step is reached, the random range essentially coincides with the complete environment.
The growth rate of the range of random initial conditions is related to the difficulty of the environment: more difficult environments require the range to be expanded over a larger number of training steps. This involves a trade-off between training stability and training time. For scenarios with difficult initial conditions, the probability distribution of the random numbers can also be designed to adjust the curricula; in the following training, the uniform distribution is chosen for initialization.
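As an illustration of the scheduling idea (the linear schedule, step counts, and ranges below are assumptions, not the paper's settings), a range-expansion sampler might look like:

```python
import numpy as np

# Hedged sketch of a curriculum scheduler that linearly expands the random range
# of an initial-condition variable with the total training step count.
# The nominal center, full half-width, and schedule length are illustrative.
def sample_initial_condition(step, total_steps, center, full_half_width, rng):
    """Draw one initial condition uniformly from a range that grows with `step`."""
    fraction = min(step / total_steps, 1.0)          # curriculum difficulty in [0, 1]
    half_width = fraction * full_half_width
    return rng.uniform(center - half_width, center + half_width)

rng = np.random.default_rng(0)
for step in (0, 250_000, 500_000, 1_000_000):
    # Example: lateral position with nominal value 0 m and final spread of +/- 2000 m.
    y0 = sample_initial_condition(step, total_steps=1_000_000,
                                  center=0.0, full_half_width=2000.0, rng=rng)
    print(step, round(y0, 1))
```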
3.6. Strategy Update Algorithm
With the MDP constructed, the reinforcement learning algorithm applied to train the agents is selected. In recent years, along with the development of deep learning, reinforcement learning has evolved into deep reinforcement learning and has made breakthroughs in a series of interactive decision problems. The algorithms that have received wide attention include the TD3 algorithm (Twin Delayed Deep Deterministic Policy Gradient) [59], the SAC algorithm (Soft Actor Critic) [60], and the PPO algorithm (Proximal Policy Optimization) [61]. In this study, we adopt the PPO algorithm, which is insensitive to hyperparameters, stable in the training process, and suitable for training in dynamic environments with continuous action spaces.
At any moment $t$, the agent performs an action $a_t$ based on the current observation $o_t$ obtained from the sensors and the embedded trained policy $\pi$, driving the dynamic system to the next state $s_{t+1}$ and receiving the corresponding reward $r_t$. This interaction continues until the end of the three-body game, and the whole process is called an episode. Together, the agent and the environment generate a sequence $(s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$, which is defined as a trajectory.
The goal of the agent is to solve for the optimal policy $\pi^{*}$ that maximizes the expected cumulative discounted reward, which is usually formalized by the state-value function $V^{\pi}(s)$ and the state-action value function $Q^{\pi}(s,a)$:
$$V^{\pi}(s)=\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k}\,\middle|\,s_{t}=s\right],\qquad Q^{\pi}(s,a)=\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k}\,\middle|\,s_{t}=s,\,a_{t}=a\right].$$
The advantage function is also calculated to estimate how advantageous an action is relative to the average behavior of the current policy:
$$A^{\pi}(s,a)=Q^{\pi}(s,a)-V^{\pi}(s).$$
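In practice, the advantage must be estimated from sampled trajectories. The paper does not reproduce the estimator here, but a common choice compatible with PPO is generalized advantage estimation (GAE), sketched below with illustrative hyperparameters:

```python
import numpy as np

# Hedged sketch of generalized advantage estimation (GAE) over one episode.
# `gamma` and `lam` are illustrative values; `values` are critic estimates V(s_t).
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)        # length len(rewards) + 1 (bootstrap value)
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

print(gae_advantages(rewards=[0.0, 0.0, 10.0], values=[0.5, 0.6, 0.8, 0.0]))
```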
In the PPO algorithm, the objective function to be maximized is the clipped surrogate
$$L^{CLIP}(\theta)=\mathbb{E}_{t}\!\left[\min\!\left(r_{t}(\theta)\hat{A}_{t},\ \mathrm{clip}\!\left(r_{t}(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_{t}\right)\right],$$
where $\epsilon$ is a hyperparameter that restricts the size of policy updates and the probability ratio is $r_{t}(\theta)=\pi_{\theta}(a_{t}\mid s_{t})/\pi_{\theta_{\mathrm{old}}}(a_{t}\mid s_{t})$. Equation (22) implies that the advantage term is clipped if the probability ratio between the new policy and the old policy falls outside the range $[1-\epsilon,\,1+\epsilon]$. The probability ratio measures how different the two policies are, and the clipped objective avoids excessive policy updates by clipping the estimated advantage function.
To further improve the performance of the algorithm, a value function loss term $L_{t}^{VF}(\theta)$ for the estimation accuracy of the critic network and an entropy bonus $S[\pi_{\theta}](s_{t})$ for encouraging exploration are introduced into the surrogate objective
$$L_{t}^{CLIP+VF+S}(\theta)=\mathbb{E}_{t}\!\left[L_{t}^{CLIP}(\theta)-c_{1}L_{t}^{VF}(\theta)+c_{2}S[\pi_{\theta}](s_{t})\right],$$
where $c_{1}$ and $c_{2}$ are the corresponding coefficients. The purpose of the algorithm is to update the parameters $\theta$ of the neural network to maximize the surrogate objective.
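For concreteness, a minimal sketch of the resulting surrogate loss (clipping term, value loss, and entropy bonus) is given below; the coefficients and tensor names are illustrative assumptions rather than the paper's implementation:

```python
import torch

# Hedged sketch of the PPO surrogate loss described above (clip + value loss +
# entropy bonus). Coefficients and the clipping range are illustrative values;
# `new_log_probs`, `old_log_probs`, `advantages`, `returns`, `values`, and `entropy`
# are assumed to come from the rollout buffer and the current actor-critic.
def ppo_loss(new_log_probs, old_log_probs, advantages, returns, values, entropy,
             clip_eps=0.2, c1=0.5, c2=0.01):
    ratio = torch.exp(new_log_probs - old_log_probs)            # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_objective = torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = torch.nn.functional.mse_loss(values, returns)  # critic accuracy term
    entropy_bonus = entropy.mean()                               # exploration term
    # Return a loss to *minimize*: the negative of the surrogate objective to maximize.
    return -(policy_objective - c1 * value_loss + c2 * entropy_bonus)
```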