1. Introduction
The phenomenon of exploiting atmospheric updrafts for energy harvesting and long-distance flight is widely observed in migrating birds and experienced glider pilots [1,2]. This technique is referred to as soaring, and it has been studied extensively with significant progress. With increasing attention paid to soaring, many researchers have attempted to implement it with unmanned aircraft. The behavior of unmanned aircraft exploiting wind updrafts to increase endurance is called autonomous soaring [3].
Traditional research on the autonomous soaring of UAVs has relied heavily on locating the center of updrafts or on simplified wind field environments and UAV models. Allen [4] proposed a flight method for tracking the updraft center. For a known updraft center, he applied a circling strategy to extract energy from the center by measuring its movement. The simulation results show that UAVs can greatly improve their endurance by utilizing convective lift in the atmosphere. Depenbusch et al. [5,6] proposed a complete algorithmic framework for the autonomous soaring of aircraft. They conducted research on wind estimation, updraft identification, the measurement and dynamics of thermals, and exploration methods. The components of the autonomous algorithms include the mapping of thermals, exploration/exploitation decisions for wind fields, navigation, the calculation of optimal airspeed, and energy state estimation. The feasibility of this framework was proven by actual flight experiments. Edwards et al. [7] developed a new way to locate a rising thermal and keep the glider within it, and applied this method to a 5 kg glider that participated in a UAV flying competition. In competition with manually piloted gliders, the autonomous vehicle performed well, providing a valuable comparison between the effectiveness of manual and autonomous flight.
In recent years, various artificial intelligence algorithms have been applied to research on autonomous soaring. Reddy et al. [1,8] combined digital simulation technology and reinforcement learning to study soaring in a turbulent wind environment. They used a numerical model of turbulent convection to simulate the atmospheric boundary layer and used a classic reinforcement learning algorithm, state–action–reward–state–action (SARSA), to train an unpowered glider with a wingspan of 2 m. They obtained perceptual cues that can effectively control flight in turbulent environments and realized autonomous navigation. In further research, field flight experiments were successfully implemented, and an autonomous flight strategy directly applicable to turbulent environments was proposed. Thomas et al. [9] used a new intelligent algorithm, the artificial lumbered flight algorithm, to conduct autonomous flight research on a powered UAV with a wingspan of 2 m. By setting the reward function and training the UAV in simulated wind fields, the researchers obtained an autonomous trade-off between navigating to a target point and using updrafts to gain energy. Using the fully trained algorithm as the flight strategy, the researchers successfully implemented flight experiments, demonstrating the feasibility of autonomous thermalling for UAVs. Notter et al. [10] proposed a model-free method for flight in a vertical-plane thermal. Based on the Markov decision process (MDP), a policy gradient ascent method was used to optimize stochastic control policies. The simulation results show that the control of the trained agent mimics the optimal behavior and takes reasonable actions in a random environment. In their further research [11], a new hierarchical reinforcement learning method was proposed to provide control strategies for GPS triangle competition tasks for remotely controlled gliders. After training, the agent makes decisions and executes the sub-strategies of tracking the course and utilizing updrafts. Preliminary flight test results verify the feasibility of the hierarchical reinforcement learning method. Chakrabarty et al. [12] proposed a graph-based method for planning energy-efficient flight trajectories over a set of points to achieve long-distance flight of small UAVs. Experimental tests using high-fidelity numerical simulations of real wind fields showed that the energy map method performs well on the long-distance flight problem for unmanned aerial vehicles. Guilliard et al. [13] treated the autonomous balancing of updraft exploitation and environmental mapping as a partially observable MDP, based on trajectories predicted from the actual system state rather than Markov tuples generated offline. The researchers further applied this algorithm to the popular open-source autopilot ArduPlane and compared it with existing alternative algorithms in a series of real-time flight experiments. The results proved that the controller based on the partially observable MDP had significant advantages.
Autonomous soaring logically divides into a structured approach and a "lumbered approach" [9]. In the structured approach, the aircraft identifies updrafts by comparing updraft velocity measurements with knowledge of classified updrafts. In the lumbered approach, the aircraft can only sense nearby updrafts and uses local real-time decision making to navigate the wind field, without needing to classify the wind. Both approaches involve two steps: exploring and exploiting thermal updrafts. Many researchers have studied exploring for the thermal center [2,4,12], but few works have focused on efficiently exploiting a thermal once the glider encounters one. Some experienced pilots use the MacCready speed ring (which determines the optimal speed to maximize the average cross-country speed) to aid their thermal soaring [14]. Walton et al. [15] studied optimal glider trajectories for gaining altitude from thermal updrafts and suggested that off-centered and figure-eight trajectories may increase the options for energy harvesting. Edwards et al. [16] investigated the energy efficiency relationship between the radius of a circular orbit and thermal updraft size, as well as solar input.
As presented above, many works have used reinforcement learning algorithms for autonomous soaring and achieved strong results [1,3,8,10,11]. However, most of these works use the original reinforcement learning algorithms, which are not optimized for autonomous soaring; as a result, they either rely on a complex thermal model [1] or widely varying initial conditions [3,11], or assume that the location of the thermal center is known a priori by the vehicle [3,17].
In this paper, novel strategy optimization methods for the thermal exploitation task in autonomous soaring are proposed. A simple round thermal model [18], which assumes thermals to be approximately round, is used as the thermal environment. A continuous reinforcement learning algorithm, the Twin Delayed Deep Deterministic policy gradient algorithm (TD3) [19], is used to train an unpowered glider.
First, a brief training method is utilized to train a TD3 glider agent and obtain the basic energy-harvesting strategies. Prompt tuning [20] is a competitive technique for adapting frozen pre-trained language models to downstream tasks. Inspired by prompt tuning, an extra correction module is introduced into the basic strategies to address their weak generalization. These generalization problems are caused by partial observability, which often arises in reinforcement learning [21]. Furthermore, based on a unique characteristic of the energy-harvesting task in autonomous soaring, a flight strategy symmetry method is applied to the basic strategies. With these two optimization methods, the basic soaring strategies obtained in the simple bell-shaped thermal updraft with the brief training method become considerably more effective. Simulation tests using a bell-shaped thermal model, the Dryden continuous turbulence model [22], and the Gedeon thermal model [23] are conducted to verify the effectiveness of the proposed training and optimization methods for energy-harvesting tasks in autonomous soaring.
The paper is organized as follows. In Section 2, the simulation glider dynamics model and the round updraft model are established. In Section 3, the TD3 algorithm and the basic elements of reinforcement learning are introduced, the training results of the basic soaring strategies in the round thermal are displayed, and the brief training method is discussed. Section 4 introduces the further optimization methods and compares results from the round updraft, the Dryden continuous turbulence model, and the Gedeon thermal model. Finally, in Section 5, we summarize the current work and its results.
3. Reinforcement Learning Algorithm and Training Results
Reinforcement learning considers the interaction between an agent and its environment with the aim of learning behavior that maximizes a reward function. These problems are typically described in the framework of the Markov decision process (MDP). In reinforcement learning, the agent obtains a state, $s_t$, at each discrete time step, $t$, and selects an action, $a_t$, according to its policy, $\pi$. The agent obtains a reward, $r_t$, and a new state, $s_{t+1}$, after each action. The learning objective is to maximize the sum of discounted future rewards, $R_t = \sum_{i=t}^{T} \gamma^{\,i-t} r_i$, from each state, $s_t$, where $\gamma$ is a discount factor ($0 \le \gamma < 1$) determining the priority of short-term rewards [24].
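The learning objective, the discounted sum of future rewards, can be computed from a finite reward sequence as follows (a minimal sketch; the reward values in the test are hypothetical):

```python
def discounted_return(rewards, gamma):
    """R_0 = sum_i gamma^i * r_i: the sum of discounted future rewards
    for a finite episode, with gamma weighting short-term rewards."""
    total = 0.0
    for i, r in enumerate(rewards):
        total += (gamma ** i) * r
    return total
```

For example, three unit rewards discounted by gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75.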
The majority of earlier studies used discrete control reinforcement learning algorithms [1,3,17], in which the agent's action space comprises a number of preset actions. A discrete control reinforcement learning method can only increase or decrease the bank angle by a fixed increment or leave it unchanged, which limits the flight performance of the glider. This paper makes use of a continuous control reinforcement learning algorithm, TD3, which selects actions within a continuous range rather than from several fixed actions.
3.1. TD3 Algorithm
The Twin Delayed Deep Deterministic policy gradient algorithm (TD3) is an actor–critic algorithm that considers the interplay between function approximation errors in both policy and value updates [19]. TD3 is based on the Deep Deterministic Policy Gradient algorithm (DDPG) [24], an advanced actor–critic method for continuous control, and it addresses DDPG's tendency to overestimate the value of the current policy. There are six networks in TD3: two critic networks, $Q_{\theta_1}$ and $Q_{\theta_2}$; two target critic networks, $Q_{\theta'_1}$ and $Q_{\theta'_2}$; the actor network, $\pi_{\phi}$; and the target actor network, $\pi_{\phi'}$.
At the beginning of training, the critic networks, $Q_{\theta_1}$ and $Q_{\theta_2}$, and the actor network, $\pi_{\phi}$, are initialized with random parameters, $\theta_1$, $\theta_2$, and $\phi$, and the target networks are initialized by $\theta'_1 \leftarrow \theta_1$, $\theta'_2 \leftarrow \theta_2$, and $\phi' \leftarrow \phi$. TD3 also initializes a replay buffer, $\mathcal{B}$, to store historical experiences.
At each time step, $t$, an action, $a_t = \pi_{\phi}(s_t) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma)$, is first selected, and the agent observes a reward, $r_t$, and a new state, $s_{t+1}$; meanwhile, the transition tuple $(s_t, a_t, r_t, s_{t+1})$ is stored in $\mathcal{B}$. A mini-batch of $N$ transition tuples $(s, a, r, s')$ is sampled from $\mathcal{B}$, and the target value is obtained as $y = r + \gamma \min_{i=1,2} Q_{\theta'_i}(s', \tilde{a})$, $\tilde{a} = \pi_{\phi'}(s') + \tilde{\epsilon}$, where $\tilde{\epsilon} \sim \mathrm{clip}(\mathcal{N}(0, \tilde{\sigma}), -c, c)$ is a clipped noise and $\gamma$ is the discount factor determining the priority of short-term rewards. Then, the weights, $\theta_i$, of the critic networks, $Q_{\theta_i}$, are updated by minimizing the following loss function:
$$\theta_i \leftarrow \underset{\theta_i}{\arg\min}\; N^{-1} \sum \left(y - Q_{\theta_i}(s, a)\right)^2$$
After every $d$ time steps, the weights, $\phi$, of the actor network, $\pi_{\phi}$, are updated by the deterministic policy gradient:
$$\nabla_{\phi} J(\phi) = N^{-1} \sum \nabla_{a} Q_{\theta_1}(s, a)\big|_{a = \pi_{\phi}(s)}\, \nabla_{\phi} \pi_{\phi}(s)$$
and the target networks are updated by
$$\theta'_i \leftarrow \tau \theta_i + (1 - \tau)\theta'_i, \qquad \phi' \leftarrow \tau \phi + (1 - \tau)\phi'$$
where $\tau$ is the update rate. When the time step $t$ reaches its maximum, $T$, or other stopping conditions are met, the training process ends.
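TD3's clipped double-Q target and soft target update can be sketched in scalar form (an illustrative sketch, not the networks used in this paper; the critic values in the test are placeholders):

```python
def td3_target(r, gamma, q1_next, q2_next):
    """Target value y = r + gamma * min(Q1'(s', a~), Q2'(s', a~)):
    taking the minimum of the two target critics curbs overestimation."""
    return r + gamma * min(q1_next, q2_next)

def soft_update(target_params, params, tau):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, params)]
```

The minimum over the two critics is what distinguishes TD3 from DDPG, which uses a single critic and therefore tends to propagate overestimated values.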
3.1.1. Action Space
In this research, the pitch angle and roll angle of the glider are continuously controlled. The actions are the commanded pitch angle and roll angle, each bounded within a fixed range.
3.1.2. Reward Function
The purpose of this study is to make an unpowered glider learn to harvest as much energy as possible in the updraft. For this purpose, the reward function, $r_t$, is set as the total energy gain of the glider divided by its mass, including gravitational potential energy and kinetic energy, i.e., the mass-independent energy gain, as shown in the following formula:
$$r_t = g\,(h_t - h_{t-1}) + \tfrac{1}{2}\left(v_t^2 - v_{t-1}^2\right)$$
where $g$ is gravitational acceleration, $r_t$ is the reward at simulation time $t$, and $h$ and $v$ are glider altitude and airspeed.
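The per-step reward, the change in mass-independent total energy g·Δh + ½·Δ(v²), can be sketched as follows (a minimal sketch; g = 9.81 m/s² is assumed):

```python
G = 9.81  # gravitational acceleration, m/s^2 (assumed value)

def specific_energy_reward(h_prev, v_prev, h, v):
    """Mass-independent energy gain over one step:
    g*(h - h_prev) + 0.5*(v^2 - v_prev^2)."""
    return G * (h - h_prev) + 0.5 * (v * v - v_prev * v_prev)
```

Climbing 1 m at constant airspeed therefore yields a reward of 9.81, while trading altitude for airspeed can leave the reward near zero.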
3.1.3. Break Conditions
Consistent with the reward function, the break conditions include restrictions on glider altitude and airspeed, as well as a simulation time limit. The minimum altitude, $h_{\min}$, is set to 20 m to prevent the glider from touching the ground, and no maximum altitude is set, as we encourage the glider to gain as much gravitational potential energy as possible. The minimum airspeed, $v_{\min}$, is set to 5 m/s, and the maximum airspeed, $v_{\max}$, is set to 30 m/s. In practice, the trained glider maintains a flight airspeed of around 10 m/s. The maximum simulation flight time, $t_{\max} = 100$ s, is enough for the glider to fly to the bell-shaped thermal center. When any of the three conditions is met, the current episode ends, and a new episode begins.
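The break conditions can be checked with a single predicate (a sketch using the limits stated above):

```python
H_MIN = 20.0   # minimum altitude, m
V_MIN = 5.0    # minimum airspeed, m/s
V_MAX = 30.0   # maximum airspeed, m/s
T_MAX = 100.0  # maximum simulation time, s

def episode_done(h, v, t):
    """True when any break condition is met and the episode should end."""
    return h < H_MIN or v < V_MIN or v > V_MAX or t >= T_MAX
```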
3.1.4. Observations
For the glider agent, observations are the source of decision making. In this research, to provide enough information for the glider agent while minimizing the electronic sensory devices required for control, the selection of observations follows two principles: (1) observations should be sufficient to allow the glider to make sound energy-harvesting decisions; (2) observations should not contain information that the glider cannot obtain in reality, such as the updraft distribution. Reddy et al. used the pair of vertical wind acceleration and torque as the sensorimotor cues sensed by the glider, namely, the observations of the glider agent, to minimize the electronic sensory devices required for control [1].
In this research, a "torque" cue is also used as an observation to help the glider perceive the possible location of a stronger updraft. It is defined as the wind speed differential $\Delta w = w_l - w_r$, where $w_l$ and $w_r$ are the updraft velocities at the left and right wings.
Instead of vertical wind acceleration, the updraft velocity, $w$, is directly used as the second observation of the glider agent. Together, $w$ and $\Delta w$ provide succinct yet sufficient information for the glider agent to control the pitch and roll angles: they tell the glider how strong the current updraft is and help it locate a probable stronger updraft.
To keep $\Delta w$ at the same level as $w$, the wind speed differential, $\Delta w$, is multiplied by a scaling factor, $K$, which is set to 100. Therefore, the observation vector received by the glider agent is $(w, K\Delta w)$. The change trend of the observations in a flight test after training is shown in Figure 3.
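Putting the two cues together, the observation vector can be assembled as follows (a sketch; whether $w$ is measured at the fuselage or averaged over the wings is an assumption here):

```python
K = 100.0  # scaling factor for the wind-speed differential

def observation(w, w_left, w_right):
    """Observation vector (w, K*dw): the local updraft velocity and the
    scaled left/right wing differential dw = w_l - w_r."""
    dw = w_left - w_right
    return (w, K * dw)
```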
3.2. Training Method
Reinforcement learning is very sensitive to both initialization and the dynamics of the training process, which can lead to failed training [25]. To enable reinforcement learning to be used effectively for autonomous soaring, previous researchers have relied on complex thermal models and widely varying initial conditions. In this research, a simple round thermal [18] is used to reduce the complexity of the thermal model, and brief initial conditions are set. The initial position is set to (−50 m, −50 m), and the initial altitude is set to 100 m. The initial airspeed is set to 8 m/s, and the maximum wind strength is set to 8 m/s. The initial pitch, roll, and yaw angles of the glider are all set to zero. All the initial conditions are fixed without any variance, which greatly reduces training difficulty. During the training process, the time step of the TD3 algorithm is set to 1 s, and the target and policy update frequency is set to 2 s. The learning rate is set to 0.01 at the beginning of training and decreases to 0.001 during the training process. The discount factor, $\gamma$, is set to 0.99, as each episode lasts for 100 s. Five agents are trained with different random seeds, and training stops when the average reward fluctuation over 40 episodes is within 100. The reinforcement learning training and simulation processes are conducted in MATLAB/Simulink. The CPU used for simulation is an Intel Xeon Platinum 8255C.
3.3. Training Results
Training results for the five agents with different random seeds at four different initial positions are shown in Figure 4. All five agents achieved good training results, indicating a high training success rate. The first agent had the best result and is therefore used for subsequent research. Figure 5 displays the flight paths after training, and Table 2 displays the mass-independent energy gain at different initial positions on a circle of fixed radius around the thermal center. It can be observed that, from different initial positions, the glider can successfully locate the center of the round updraft and hover around it. When the initial coordinate is −50 m, as shown in Figure 5a, the glider initially flies toward the stronger updraft, so only a modest heading change is required to reach the updraft center. When the initial coordinate is 50 m, as shown in Figure 5b, the initial flight direction is toward the weaker updraft, necessitating a turn in order to fly toward the updraft center. An interesting phenomenon is that the trained glider performs better in a clockwise hover, which is caused by the unequal sampling of the updraft environment during the training phase.
Figure 6a depicts the trajectories, and Table 3 shows the mass-independent energy gain for different initial coordinates when the other initial coordinate is −50 m. It can be observed that for initial coordinates of −90 m or closer, the glider can successfully detect the center of the round updraft and hover around it. However, at −100 m, the glider is unable to precisely recognize the updraft center and flies in the opposite direction. This indicates that the current glider agent is ineffective at recognizing a weak updraft. Also, across the different initial coordinates, the glider tends to hover clockwise. The trajectories at different maximum thermal velocities when the initial position is (−50 m, −50 m) are shown in Figure 6b, and Table 4 shows the corresponding mass-independent energy gain. The glider can effectively locate the center of the round updraft and hover around it when the maximum thermal velocity is 4 m/s or greater. However, at 2 m/s, the glider is unable to correctly recognize the updraft center.
Figure 7 depicts the learned flight strategies. For the roll angle control strategies shown in Figure 7a, there is a distinct boundary between positive and negative $\Delta w$, and on the boundary $\Delta w = 0$. For $\Delta w > 0$, the commanded roll angle is always positive and reacts significantly to changes in the observations only in a limited region. Conversely, when $\Delta w < 0$, the roll command is very sensitive to changes in the observations, implying that the glider has more precise control over the roll angle there. This explains why the glider always hovers clockwise: when the glider hovers clockwise, the updraft velocity at the left wing, $w_l$, is smaller than $w_r$, resulting in $\Delta w < 0$, allowing the glider to select the optimal roll angle to maximize energy gain. As illustrated in Figure 7b, the glider agent constantly tries to keep a high pitch angle, because a higher pitch angle corresponds to a higher lift-to-drag ratio and a lower sink rate.
4. Further Optimized Algorithm
According to the previous presentation, the glider agent trained by the brief method has good perception in round updrafts and can effectively enhance its ability to harvest energy. However, when the updraft is weak, for example, at a maximum thermal velocity of 2 m/s or an initial coordinate of −100 m, the glider cannot identify the updraft center. This is because the initial position during training lies where the updraft is relatively strong, so the agent gathers few samples from weak updrafts. One alternative is to move the initial position farther away, but this approach may degrade performance when the initial position is closer to the updraft center. As demonstrated in Figure 6a and Table 3, the mass-independent energy gain is less at an initial coordinate of −40 m than at −50 m or even −70 m. Another solution is to vary the initial distance between the initial position and the updraft center. However, the maximum energy gain corresponding to the various initial distances is inconsistent, making it challenging to gauge training quality, and it is also difficult to determine the range over which the distance should vary.
Furthermore, as shown in Figure 7, the learned flight strategies exhibit a significant asymmetry about $\Delta w = 0$ for both the roll and pitch commands. $\Delta w$ is the observation that represents the strength difference between the left and right updrafts, which guides the glider to fly toward the left or right. For the same $w$ value, the roll command should be an odd function of $\Delta w$. During the training process, after the glider agent has explored a successful hovering flight strategy (clockwise or counterclockwise) for the first time, the glider prefers the same hovering strategy for all future training. As a consequence, a large proportion of similar hovering strategies is gathered into the TD3 experience replay buffer. This causes subsequent training to concentrate on optimizing the same hovering strategy, resulting in the asymmetry.
Prompt tuning is a simple yet effective mechanism for learning "soft prompts" that condition frozen language models to perform specific downstream tasks [20]. Inspired by the prompt-tuning mechanism, and having developed a basic soaring strategy, we propose novel strategy optimization methods that "prompt tune" the basic strategies. The optimization methods target the energy-harvesting task in autonomous soaring, are additional to the reinforcement learning algorithm, and are also applicable to algorithms other than TD3. Comparisons of the simulated flight results of TD3 and the optimized TD3 are conducted in round updrafts and the Gedeon thermal model [23]. Furthermore, flight results in the Dryden continuous turbulence model [22,26], a gust model used to simulate turbulence, are compared.
4.1. Optimized Methods
As shown in Figure 7, the roll command reacts significantly to changes in $\Delta w$, while its reaction to changes in $w$ is rather minor. Furthermore, $\Delta w$ is the observation that perceives the potential location of stronger updrafts. As a result, before inputting the observations into the glider agent, we can add an extra correction module to improve the agent's capacity to recognize weak updrafts.
Firstly, to correct the scaled differential $K\Delta w$, a threshold value is determined to divide strong and weak updrafts. In the correction module, two artificial neural networks are employed to correct the observations in strong and weak updrafts, respectively, and the corrected observation is input into the glider agent. As mentioned earlier, $K$ is set to 100.
Once the training of the TD3 glider agent is completed, none of the parameters of the trained glider agent change. For the trained glider agent, different scaling factors, $K$, result in different mass-independent energy gains (total reward) at varied initial positions, as shown in Figure 8. We can then derive the $K$ that corresponds to the maximum total reward at each initial position. Based on this, we can further obtain a mapping between the updraft velocity $w$ and the best scaling factor $K$. As a result, the two artificial neural networks can be trained to correct $K\Delta w$, making the input to the glider agent more reasonable. The correction networks can be considered an upgraded, state-dependent version of the fixed scaling factor $K$. In the round updraft, the threshold is set to 0.1 m/s.
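The correction module can be sketched as a threshold dispatch between two learned maps from updraft strength to scaling factor (the constant stand-ins below replace the trained neural networks and are purely illustrative):

```python
W_TH = 0.1  # threshold (m/s) dividing weak and strong updrafts

def make_correction(scale_weak, scale_strong, w_th=W_TH):
    """Build a correction module: replace the fixed scaling factor K
    with a state-dependent factor chosen by the updraft threshold."""
    def correct(w, dw):
        k = scale_weak(w) if w < w_th else scale_strong(w)
        return (w, k * dw)
    return correct

# Toy stand-ins for the trained networks: amplify dw in weak updrafts.
correct = make_correction(lambda w: 400.0, lambda w: 100.0)
```

In a weak updraft the module amplifies the wing differential, so the tiny left/right asymmetry far from the thermal center becomes visible to the frozen policy.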
Additionally, in order to solve the problem of flight strategy asymmetry, a strategy symmetry operation is applied to the trained glider so that it has equally good control over both clockwise and counterclockwise flight. In Figure 7a, when $\Delta w < 0$, the roll command is very sensitive to changes in the observations; thus, the flight strategies for $\Delta w < 0$ are used as the basic strategies. Specifically, $\Delta w$ is replaced by its absolute value multiplied by −1 before being input into the glider agent. For the roll command output by the glider agent, no operation is performed when the true $\Delta w$ is negative, and the roll command is multiplied by −1 when the true $\Delta w$ is positive. No operation is performed on $w$ or the pitch command, because a higher pitch angle always corresponds to a higher lift-to-drag ratio and a lower sink rate.
Figure 9 depicts a schematic of the optimized reinforcement learning system with the extra correction module and the strategy symmetry method.
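The strategy symmetry operation can be sketched as a wrapper around the trained policy (a sketch; the sign conventions for the differential and the roll command follow the description in Section 4.1 and are otherwise assumptions):

```python
def symmetric_policy(base_policy):
    """Wrap a trained policy so its well-trained dw < 0 (clockwise) half
    also serves the dw > 0 case by mirroring input and output."""
    def policy(w, dw):
        # Always present the agent with the trained half-plane.
        pitch, roll = base_policy(w, -abs(dw))
        if dw > 0:
            roll = -roll  # mirror the roll command back
        return pitch, roll
    return policy
```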
4.2. Results for the Round Thermal
Figure 10 compares the simulated flight paths of TD3 and the optimized TD3, and Table 2 displays the mass-independent energy gain at different initial positions on a circle of fixed radius. The performance of the optimized TD3 glider agent is superior to the original one. When the initial coordinate is −50 m, both gliders initially fly toward the stronger updraft; however, with the optimized algorithm, the glider flies more directly to the updraft center than the original one. When the initial coordinate is 50 m, both initial flight directions are toward the weaker updraft, and the optimized glider turns toward the updraft center faster.
Figure 11 depicts the trajectories of the two agents, and Table 3 illustrates the mass-independent energy gain at an initial coordinate of −100 m, where the original glider could not accurately recognize the updraft center. At both of the tested initial values, the original glider is unable to precisely recognize the updraft center. The optimized glider, however, readily locates the updraft center and flies directly to it from −100 m (yellow line in Figure 11). Even from the very distant initial positions of −500 m and −780 m (green and red lines in Figure 11), the optimized glider is able to effectively detect the updraft center, demonstrating the effectiveness of the optimized algorithm. According to Figure 6b and Table 4, when the maximum thermal velocity is 2 m/s, the original glider is unable to properly recognize the updraft center. With the help of the correction module, the trained glider agent is able to accurately recognize the updraft center. Comparisons of the results between TD3 and the optimized TD3 at a maximum thermal velocity of 2 m/s are shown in Figure 12 and Table 4. The optimized glider agent can successfully exploit even a rather weak updraft to harvest energy.
Figure 13 depicts a flight strategy comparison between TD3 and the optimized TD3 in a very narrow region of $w$ and $\Delta w$. According to Figure 13a,c, the boundary around $\Delta w = 0$ is thinner in the optimized TD3. In this boundary region, the roll command of the original TD3 always has a large value, because the glider rarely encounters cases where $\Delta w$ is very small during training, preventing the strategies there from being trained effectively. This results in incomplete and unsatisfactory flight strategies when $\Delta w$ is very small. It also explains the oscillation of the flight paths during long-distance flights, as shown in Figure 11: when the glider is far from the updraft center, $\Delta w$ is very small, leading to a large roll command and a rapidly rolling glider, which in turn produces an opposite $\Delta w$ and an opposite roll command. The threshold is set to 0.1 m/s; thus, there is a distinct boundary for the optimized TD3 at $w = 0.1$ m/s. As shown in Figure 13b,d, owing to the application of strategy symmetry, the pitch command always has a high value in the optimized TD3, yielding a higher lift-to-drag ratio and a lower sink rate; here, too, there is a distinct boundary for the optimized TD3 at $w = 0.1$ m/s.
4.3. Results for the Gedeon Thermal
To test the generalization of the trained glider agent, we extend the original round updraft into a more complex Gedeon thermal environment [23]. The Gedeon thermal model is described as
$$w = w_{\max}\, e^{-(r/R)^2}\left[1 - (r/R)^2\right]$$
where $r$ is the horizontal distance from the glider to the thermal center, $R$ is the thermal radius, and $w_{\max}$ is the maximum thermal velocity at the thermal center.
Updraft velocity in the round updraft model is always positive, whereas the Gedeon model includes a downdraft ring outside the thermal core, which increases the difficulty for the glider agent in identifying the updraft center. We tested the initial positions (−50 m, −50 m), where the updraft velocity is at its lowest, and (−60 m, −50 m).
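The Gedeon profile, w = w_max·exp(−(r/R)²)·(1 − (r/R)²) in its common form, can be evaluated as follows (a sketch; the maximum velocity, radius, and center location used here are illustrative assumptions):

```python
import math

def gedeon_updraft(x, y, w_max=8.0, radius=50.0, xc=0.0, yc=0.0):
    """Gedeon thermal: w = w_max * exp(-(r/R)^2) * (1 - (r/R)^2).
    Positive inside r < R, zero at r = R, downdraft outside."""
    r2 = ((x - xc) ** 2 + (y - yc) ** 2) / radius ** 2
    return w_max * math.exp(-r2) * (1.0 - r2)
```

The factor (1 − (r/R)²) is what turns the bell-shaped core into a profile with a surrounding downdraft ring.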
Figure 14 shows a comparison of the simulated flight paths, and the mass-independent energy gains are shown in Table 5. Starting from −50 m, the original TD3 agent cannot effectively harvest energy, while the optimized TD3 agent can still fly to the updraft center and hover to harvest energy. Starting from −60 m, neither the TD3 nor the optimized TD3 agent achieves an energy gain, implying that the current training method is not sufficient for the agent to recognize the updraft center from regions where the updraft velocity is below zero.
4.4. Results for Dryden Continuous Turbulence
The comparisons in the preceding sections highlight the effectiveness of the optimization methods in helping the glider accurately recognize and harvest energy in weaker round updrafts and the Gedeon thermal. To further verify the generalization of the correction module, we test the same TD3 glider agent and the optimized TD3 agent from the preceding sections in Dryden continuous turbulence. The specification MIL-F-8785C is used [26]. The turbulence is generated for a moderate turbulence level at low altitude (less than 304.8 m), with a mean wind speed of 15 m/s at 6.1 m altitude [22,27].
Both the round updraft model and the Gedeon model have a thermal center, so the most important task for the glider agent is to locate it. In Dryden continuous turbulence, there is no obvious thermal center, and the wind velocity changes are time-varying. Therefore, in the turbulence environment, the difficulty lies in how to fly toward regions of larger updraft velocity using only the observations. In addition, we test an extra agent with fixed pitch and roll angles during the flight simulation; the fixed values are chosen so that the glider harvests the most energy without extra control commands.
Figure 15 shows a comparison of the position parameters across the three different control strategies, and the mass-independent energy gains are shown in Table 6. The original TD3 glider agent cannot harvest energy from the turbulence environment, whereas the optimized TD3 glider agent succeeds and surpasses the other two in energy gain, implying that the optimized TD3 has a greater potential to explore unknown turbulence fields. The optimized TD3 agent outperforms the other two controllers in terms of altitude gain, which means that it has a higher gravitational potential energy gain. Both the TD3 and the optimized TD3 agents fly upward where the updraft is positive and downward where it is negative, as shown in Figure 15a,b. For the flight velocity, the original TD3 agent has a peak flight velocity of about 19.2 m/s, and its flight velocity remains at about 15.4 m/s at the end of the simulation. The optimized TD3 agent has a peak flight velocity of about 14.5 m/s, and its flight velocity remains at about 7.8 m/s at the end of the simulation at 100 s. The original TD3 agent has a larger kinetic energy gain, but it loses too much altitude, resulting in a low total energy gain.
5. Conclusions
This paper studied the flight strategy for maximizing the harvested energy of an unpowered glider in a round updraft using a continuous control reinforcement learning algorithm, TD3. A PID controller was used to control the pitch and roll angles in a 6-DOF glider dynamics model. The testing results show that the trained glider is able to successfully recognize the updraft center and hover around it to harvest more energy. In order to improve the trained glider agent's perception of weak updrafts without increasing training difficulty, and to solve the asymmetric strategy problem, further optimization methods were proposed: an extra correction module was added to the trained glider agent, and a strategy symmetry method was applied. The comparison results confirm that the optimized TD3 agent can fly from a long distance to the round updraft center and can gain energy in a relatively weak updraft environment. Furthermore, flight simulations in the Gedeon thermal model and Dryden continuous turbulence demonstrate that the optimization methods significantly improve the energy-harvesting ability of the glider agent.
Compared with the TD3 policy networks, the proposed correction networks have fewer parameters. Therefore, we may be able to add various correction modules for different autonomous soaring tasks or for different weather conditions rather than training multiple reinforcement learning policy networks with a large number of parameters.
The research in this paper demonstrates the potential of using reinforcement learning algorithms to maximize the energy gain of unpowered gliders in updrafts. However, the energy-harvesting strategies learned by TD3 are not comprehensive, and situations with updraft velocities below zero have not been considered. In future work, more comprehensive energy-harvesting strategies and methods for maximizing harvested energy from various types of updrafts should be studied. Furthermore, training the complex reinforcement learning algorithm requires substantial computing resources, which limits the possibility of training the glider during real-world flight. This means that only fixed flight strategies trained in simulation can be used to achieve real-world autonomous soaring, thereby weakening the generalizability to real updrafts.