Article

Energy-Harvesting Strategy Investigation for Glider Autonomous Soaring Using Reinforcement Learning

Jiachi Zhao, Jun Li and Lifang Zeng
1 Huanjiang Laboratory, Shaoxing 311800, China
2 School of Aeronautics and Astronautics, Zhejiang University, Hangzhou 310027, China
* Author to whom correspondence should be addressed.
Aerospace 2023, 10(10), 895; https://doi.org/10.3390/aerospace10100895
Submission received: 15 September 2023 / Revised: 12 October 2023 / Accepted: 17 October 2023 / Published: 19 October 2023
(This article belongs to the Special Issue UAV Path Planning and Navigation)

Abstract: Birds and experienced glider pilots frequently use atmospheric updrafts for long-distance flight and energy conservation, with the energy harvested from updrafts serving as the foundation. Inspired by their common characteristics in autonomous soaring, a reinforcement learning algorithm, the Twin Delayed Deep Deterministic policy gradient, is used to investigate the optimal strategy for an unpowered glider to harvest energy from thermal updrafts. A round updraft model is utilized to characterize updrafts of varied strengths, and a high-fidelity six-degree-of-freedom model is used for the dynamic modeling of the glider. The results for various initial flight positions and updraft strengths demonstrate the effectiveness of the strategy learned via reinforcement learning. To enhance the updraft perception ability and expand the applicability of the trained glider agent, an extra wind velocity differential correction module is introduced into the algorithm, and a strategy symmetry method is applied. Comparison experiments on the round updraft, the Gedeon thermal model, and Dryden continuous turbulence indicate the crucial role of the further optimized methods in improving the updraft-sensing ability of the reinforcement learning glider agent. With the optimized methods, a glider trained in a simplified thermal updraft with a simple training method can execute more effective flight strategies.

1. Introduction

The phenomenon of exploiting atmospheric updrafts for energy harvesting and long-distance flight is widely observed in migrating birds and experienced glider pilots [1,2]. This use of atmospheric updrafts is referred to as soaring, and it has attracted considerable research attention and seen significant progress. With increasing attention paid to soaring, many researchers have attempted to implement it with unmanned aircraft. The behavior of unmanned aircraft exploiting wind updrafts to increase endurance is called autonomous soaring [3].
Traditional research on the autonomous soaring of UAVs has relied heavily on locating the center of updrafts or on simplified wind field environments and UAV models. Allen [4] proposed a flight method for tracking the updraft center: for a known updraft center, a circling strategy was applied to extract energy from the center while measuring its movement. The simulation results show that UAVs can greatly improve their endurance by utilizing convective lift in the atmosphere. Depenbusch et al. [5,6] proposed a complete algorithmic system for the autonomous soaring of aircraft, covering wind estimation, updraft identification, the measurement and dynamics of thermals, and exploration methods. The components of the autonomy algorithms include thermal mapping, exploration/exploitation decisions for wind fields, navigation, optimal airspeed calculation, and energy state estimation. The feasibility of this algorithm was proven by actual flight experiments. Edwards et al. [7] developed a new way to locate and keep a glider in a rising thermal and applied this method to a 5 kg glider participating in a UAV flying competition. Competing against manually piloted gliders, the autonomous vehicle performed well, providing a valuable comparison between the effectiveness of piloted and autonomous flight.
In recent years, various artificial intelligence algorithms have been applied to research on autonomous soaring. Reddy et al. [1,8] combined digital simulation technology and reinforcement learning methods to study the problem of soaring in a turbulent wind environment. They used a numerical model of turbulent convection to simulate the atmospheric boundary layer and used a classic reinforcement learning algorithm, state–action–reward–state–action (SARSA), to train an unpowered glider with a wingspan of 2 m. They obtained perceptual cues that can effectively control flight in turbulent environments and realized autonomous navigation. In further research, field flight experiments were successfully implemented, and an autonomous flight strategy directly applicable to turbulent environments was proposed. Thomas et al. [9] used a new intelligent algorithm, the artificial lumbered flight algorithm, to conduct autonomous flight research on a powered UAV with a wingspan of 2 m. By setting a reward function and training the UAV in simulated wind fields, the researchers enabled it to autonomously trade off between navigating to a target point and using updrafts to gain energy. Using the fully trained intelligent algorithm as the flight strategy, the researchers successfully implemented flight experiments, demonstrating the feasibility of the autonomous hovering of UAVs. Notter et al. [10] proposed a model-free method for flight in a vertical-plane thermal. Based on the Markov decision process (MDP), a policy gradient ascent method was used to optimize stochastic control policies. The simulation results show that the control of the trained agent mimics the optimal behavior and can take reasonable actions in a random environment. In further research [11], a new hierarchical reinforcement learning method was proposed to provide control strategies for GPS triangle competition tasks for remotely controlled gliders. After training, the agent makes decisions and executes the sub-strategies of tracking the track and utilizing updrafts. The preliminary flight test results verify the feasibility of the hierarchical reinforcement learning method. Chakrabarty et al. [12] proposed a graph-based method for planning energy flight trajectories over a set of points to achieve the long-distance flight of small UAVs. Experimental tests using high-fidelity numerical simulations of real wind fields showed that the energy map method achieved good results in solving the problem of long-distance flight for unmanned aerial vehicles. Guilliard et al. [13] treated the problem of balancing updraft exploitation and environmental mapping as a partially observable MDP, basing decisions on trajectories predicted from the actual system state rather than on Markov tuples generated offline. The researchers further applied this algorithm to the popular open-source autopilot ArduPlane and compared it with existing alternative algorithms in a series of real-time flight experiments. The results proved that the controller based on the partially observable MDP had significant advantages.
Autonomous soaring logically divides into a structured approach and "a lumbered approach" [9]. In the structured approach, the aircraft identifies updrafts by comparing updraft velocity measurements with knowledge of classified updrafts. In the lumbered approach, the aircraft can only sense nearby updrafts and uses local real-time decision making to navigate the wind field, without the need to classify the wind. Both approaches involve two steps: exploring and exploiting thermal updrafts. Many studies have addressed exploring for the thermal center [2,4,12], but few works have focused on efficiently exploiting a thermal once the glider encounters one. Some experienced pilots use the MacCready speed ring (which determines the optimal airspeed to maximize the average cross-country speed) to aid their thermal soaring [14]. Walton et al. [15] studied optimal glider trajectories for gaining altitude from thermal updrafts and suggested that off-centered and figure-eight trajectories may increase the options for energy harvesting. Edwards et al. [16] investigated the energy efficiency of circular orbits as a function of orbit radius, thermal updraft size, and solar input.
As presented above, many works have used reinforcement learning algorithms for autonomous soaring and achieved good results [1,3,8,10,11]. However, most of them use original reinforcement learning algorithms that are not tailored to autonomous soaring, which forces them to use a complex thermal model [1], widely varying initial conditions [3,11], or the assumption that the location of the thermal center is known a priori by the vehicle [3,17].
In this paper, novel strategy optimization methods for the thermal-exploitation task in autonomous soaring are proposed. A simple round thermal model [18], which assumes thermals to be approximately round, is used as the thermal environment. A continuous reinforcement learning algorithm, the Twin Delayed Deep Deterministic policy gradient (TD3) [19], is used to train an unpowered glider.
First, a brief training method is used to train a TD3 glider agent and obtain basic energy-harvesting strategies. Prompt tuning [20] is a competitive technique for adapting frozen pre-trained language models to downstream tasks. Inspired by prompt tuning, an extra correction module is introduced into the basic strategies to address their weak generalization, which is caused by the partial observability problem that often arises in reinforcement learning [21]. Furthermore, based on a unique characteristic of the energy-harvesting task in autonomous soaring, a flight strategy symmetry method is applied to the basic strategies. With these two optimization methods, the basic soaring strategies obtained in the simple bell-shaped thermal updraft with a brief training method become substantially more effective. Simulation tests using a bell-shaped thermal model, the Dryden continuous turbulence model [22], and the Gedeon thermal model [23] are conducted to verify the effectiveness of the proposed training and optimization methods for energy-harvesting tasks in autonomous soaring.
The paper is organized as follows. In Section 2, the simulation glider dynamics model and the round updraft model are established. In Section 3, the TD3 algorithm and the basic elements of reinforcement learning are introduced, the training results of the basic soaring strategies in the round thermal are presented, and a brief training method is discussed. Section 4 introduces the further optimized methods and compares the results in round updrafts, the Dryden continuous turbulence model, and the Gedeon thermal model. Finally, in Section 5, we summarize the current work and discuss the results of our research.

2. Glider and Updraft Model

2.1. Glider Dynamics Model

A six-degree-of-freedom (6-DOF) glider dynamics model is used in this research, which is governed by the 6-DOF motion equations in the body axis system:
$$
\begin{aligned}
F_{x_b} &= m(\dot{u} + qw - rv)\\
F_{y_b} &= m(\dot{v} + ru - pw)\\
F_{z_b} &= m(\dot{w} + pv - qu)
\end{aligned}
\tag{1}
$$
$$
\begin{aligned}
\bar{L} &= I_{xx}\dot{p} - I_{zx}(\dot{r} + pq) - (I_{yy} - I_{zz})qr\\
M &= I_{yy}\dot{q} - I_{zx}(r^2 - p^2) - (I_{zz} - I_{xx})rp\\
N &= I_{zz}\dot{r} - I_{zx}(\dot{p} - qr) - (I_{xx} - I_{yy})pq
\end{aligned}
\tag{2}
$$
In Equation (1), F_xb, F_yb, F_zb denote the components of the resultant force on the x_b, y_b, z_b axes of the body axis system; u, v, w and p, q, r are the components of the airspeed and angular velocity vectors on those axes; and m denotes the mass of the glider. In Equation (2), L̄, M, N denote the components of the resultant moment on the x_b, y_b, z_b axes of the body axis system; I_xx, I_yy, I_zz are the moments of inertia, and I_zx is the product of inertia. The main geometric and aerodynamic features used are provided in Table 1.
In this study, the glider is treated as a rigid body, and the impact of excessive maneuvering is ignored. Many simulation studies on autonomous soaring have allowed a large range of flight parameters for their glider models [9,15,18]. Referring to these, we impose the following bounds on the state values:
$$-60^\circ \le \gamma \le 60^\circ,\quad -10^\circ \le \theta \le 10^\circ,\quad h \ge 20\ \mathrm{m},\quad 5\ \mathrm{m/s} \le V_a \le 30\ \mathrm{m/s},\quad \alpha \le 14^\circ,$$
where γ denotes the roll angle, θ denotes the pitch angle, h is the flight altitude, V_a is the airspeed, and α is the angle of attack. The definition of the coordinate system used in this research is illustrated in Figure 1, together with the roll angle, γ, and pitch angle, θ, which are used as the glider actions.
The Proportional–Integral–Derivative (PID) controller is a model-free controller based on error feedback that requires carefully designed parameters and has been widely used in industrial control. Here, a PID controller is used to control the pitch angle, θ, and roll angle, γ. The PID control law is
$$\tilde{u}(t) = K_p e(t) + K_i \int_0^t e(t)\,dt + K_d \frac{de(t)}{dt},$$
where ũ(t) is the output signal of the PID controller; K_p, K_i, K_d are the proportional, integral, and derivative coefficients; and e(t) is the error between the target angle and the actual angle.
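A minimal sketch of such a PID loop is given below; the gains, time step, and setpoint are illustrative assumptions, not the controller parameters used in this work.

```python
# Sketch of the PID attitude-control law above; gains and time step are
# illustrative assumptions, not the values used in the paper.
class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, target, actual):
        error = target - actual
        self.integral += error * self.dt  # running integral of e(t)
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example: track a commanded pitch angle of 5 degrees (hypothetical gains).
pitch_pid = PID(kp=2.0, ki=0.1, kd=0.5, dt=0.01)
command = pitch_pid.step(target=5.0, actual=3.2)
```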

2.2. Updraft Model

The agent training approach in this work uses a simplified round thermal updraft model [17] in which the wind velocity is defined as
$$V_w(x, y) = w_0 \exp\!\left(-\frac{(x - x_c)^2 + (y - y_c)^2}{2500}\right)\ \mathrm{m/s},$$
where V_w denotes the thermal updraft velocity; w_0 is the maximum updraft velocity; and x, y represent the position of the glider in the ground axis system. The thermal center, (x_c, y_c), is set to (0, 0), where the vertical air velocity reaches its maximum of w_0. Figure 2 illustrates an example of a round updraft with w_0 = 8 m/s. The glider dynamics model and updraft model were developed and simulated in MATLAB/Simulink.
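As a quick reference, the round updraft model above can be sketched as follows; the default parameters mirror the example values quoted in the text.

```python
import numpy as np

# Sketch of the round (bell-shaped) updraft model defined above.
def round_updraft(x, y, w0=8.0, xc=0.0, yc=0.0):
    """Vertical wind velocity (m/s) at ground-plane position (x, y)."""
    r2 = (x - xc) ** 2 + (y - yc) ** 2
    return w0 * np.exp(-r2 / 2500.0)

# Example: updraft strength at the training initial position (-50 m, -50 m).
print(round_updraft(-50.0, -50.0))  # ~1.08 m/s for w0 = 8
```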
In this research, to focus on the utilization of updrafts, only the vertical airflow is considered, and the horizontal airflow is ignored. In the round and Gedeon thermal updrafts, none of the updraft model parameters change over time within a simulation episode, whereas in the Dryden continuous turbulence the airflow velocity varies over time.

3. Reinforcement Learning Algorithm and Training Results

Reinforcement learning considers the interaction between an agent and its environment, with the aim of learning behavior that maximizes a reward function. These problems are typically described in the framework of the Markov decision process (MDP). In reinforcement learning, the agent obtains a state, s ∈ S, at each discrete time step, t, and selects an action, a ∈ A, according to its policy, π. The agent obtains a reward, r, and a new state, s′, after each action. The learning objective is to maximize the sum of discounted future rewards, $R_t = \sum_{i=t}^{t_{max}} \mu^{\,i-t} r(s_i, a_i)$, for each state, s, where μ is a discount factor (0 ≤ μ ≤ 1) determining the priority of short-term rewards [24].
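As a tiny numerical illustration of this discounted return, the reward sequence below is made up for the example.

```python
# Discounted return R_t = sum_i mu^(i-t) * r_i used as the learning objective;
# the reward values here are purely illustrative.
def discounted_return(rewards, mu=0.99):
    return sum(mu ** i * r for i, r in enumerate(rewards))

print(discounted_return([1.0, 0.5, 2.0]))  # 1 + 0.99*0.5 + 0.99**2*2 = 3.4552
```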
The majority of earlier studies used discrete-control reinforcement learning algorithms [1,3,17], in which the agent's action space comprises a number of preset actions. A discrete-control reinforcement learning method can only increase or decrease the bank angle by a fixed increment or leave it unaltered, which limits the flight performance of the glider. This paper makes use of a continuous-control reinforcement learning algorithm, the TD3, which takes actions within a continuous range rather than from several fixed actions.

3.1. TD3 Algorithm

The Twin Delayed Deep Deterministic policy gradient algorithm (TD3) is an actor–critic algorithm that accounts for the interplay between function approximation errors in both policy and value updates [19]. The TD3 builds on the Deep Deterministic Policy Gradient algorithm (DDPG) [24], an advanced actor–critic method for continuous control, and it mitigates DDPG's tendency to overestimate the value of the current policy. There are six networks in the TD3: the critic networks, Q_ϑ1 and Q_ϑ2; the target critic networks, Q_ϑ1′ and Q_ϑ2′; the actor network, π_ϕ; and the target actor network, π_ϕ′.

At the beginning of training, the critic networks, Q_ϑ1 and Q_ϑ2, and the actor network, π_ϕ, are initialized with random parameters ϑ1, ϑ2, and ϕ, and the target networks are initialized by ϑ1′ ← ϑ1, ϑ2′ ← ϑ2, and ϕ′ ← ϕ. The TD3 also initializes a replay buffer, B, to store historical experiences.

At each time step, t, an action, a, is selected, and the agent observes a reward, r, and a new state, s′; the transition tuple (s, a, r, s′) is stored in B. A mini-batch of N transition tuples is then sampled from B to compute ã = π_ϕ′(s′) + ϵ and y = r + μ min_{i=1,2} Q_ϑi′(s′, ã), where ϵ is a clipped noise term and μ is the discount factor determining the priority of short-term rewards. Then, the weights, ϑ1 and ϑ2, of the critic networks, Q_ϑ1 and Q_ϑ2, are updated by minimizing the loss function
$$J(\vartheta_i) = \frac{1}{N}\sum_{j=1}^{N}\left(y - Q_{\vartheta_i}(s, a)\right)^2.$$
Every d time steps, the weights, ϕ, of the actor network, π_ϕ, are updated by the deterministic policy gradient
$$\nabla_{\phi} J(\phi) = \frac{1}{N}\sum \nabla_{a} Q_{\vartheta_1}(s, a)\Big|_{a=\pi_{\phi}(s)}\, \nabla_{\phi}\pi_{\phi}(s),$$
and the target networks are updated by
$$\vartheta_i' \leftarrow \tau_r \vartheta_i + (1 - \tau_r)\,\vartheta_i', \qquad \phi' \leftarrow \tau_r \phi + (1 - \tau_r)\,\phi',$$
where τ_r is the update rate. When the time step, t, reaches its maximum, T, or another break condition is met, the training process stops.
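The following is a minimal, library-agnostic sketch of the two operations above, the clipped-noise target value y and the Polyak (soft) target update; the batch layout, the callable network objects, and the noise and update-rate values are illustrative assumptions rather than the exact implementation used in this work.

```python
import numpy as np

def td3_target(batch, target_actor, target_critic1, target_critic2,
               mu=0.99, noise_std=0.2, noise_clip=0.5):
    """y = r + mu * min_i Q'_i(s', a~), with clipped noise added to the target action."""
    s_next, rewards = batch["s_next"], batch["r"]
    noise = np.clip(np.random.normal(0.0, noise_std, size=rewards.shape),
                    -noise_clip, noise_clip)
    a_tilde = target_actor(s_next) + noise
    q_min = np.minimum(target_critic1(s_next, a_tilde),
                       target_critic2(s_next, a_tilde))
    return rewards + mu * q_min

def soft_update(target_params, params, tau_r=0.005):
    """Polyak averaging: theta' <- tau_r * theta + (1 - tau_r) * theta'."""
    return [tau_r * p + (1.0 - tau_r) * tp for tp, p in zip(target_params, params)]
```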

3.1.1. Action Space

In this research, the pitch angle, θ, and roll angle, γ, of the glider are continuously controlled. The actions are the pitch angle within the range θ ∈ [−10°, 10°] and the roll angle within the range γ ∈ [−60°, 60°].

3.1.2. Reward Function

The purpose of this study is to make an unpowered glider learn to harvest as much energy as possible from the updraft. For this purpose, the reward function, R, is set as the total energy gain of the glider divided by its mass, including both gravitational potential energy and kinetic energy, i.e., the energy gain independent of mass:
$$R_t = g\left(h_t - h_{t-1}\right) + \frac{1}{2}\left(V_t^2 - V_{t-1}^2\right),$$
where g is the gravitational acceleration, R_t is the reward at simulation time t, and h_t and V_t are the glider altitude and airspeed.
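A one-line implementation of this reward is shown below; the sample values are purely illustrative.

```python
G = 9.81  # gravitational acceleration, m/s^2

def energy_reward(h_t, h_prev, v_t, v_prev, g=G):
    """Mass-independent energy gain (J/kg) between two consecutive time steps."""
    return g * (h_t - h_prev) + 0.5 * (v_t ** 2 - v_prev ** 2)

# Example: the glider climbs 1.5 m while slowing from 10 m/s to 9.5 m/s.
print(energy_reward(101.5, 100.0, 9.5, 10.0))  # 9.81 * 1.5 - 4.875 = 9.84 J/kg
```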

3.1.3. Break Conditions

Consistent with the reward function, the break conditions include restrictions on the glider altitude and airspeed, as well as a simulation time limit. The minimum altitude, h_min, is set to 20 m to prevent the glider from touching the ground, and no maximum altitude is set, since we encourage the glider to gain as much gravitational potential energy as possible. The minimum airspeed, V_min, is set to 5 m/s, and the maximum airspeed, V_max, is set to 30 m/s. In practice, the trained glider maintains a flight airspeed of around 10 m/s. The maximum simulation flight time, t_max = 100 s, is sufficient for the glider to fly to the bell-shaped thermal center. When any of the three conditions is met, the current episode ends and a new episode begins.
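As a minimal illustration, the three break conditions above can be checked together; the thresholds below simply restate the values given in the text.

```python
# Break conditions: minimum altitude, airspeed envelope, and simulation time limit.
H_MIN, V_MIN, V_MAX, T_MAX = 20.0, 5.0, 30.0, 100.0  # m, m/s, m/s, s

def episode_done(h, v_air, t):
    """Return True when any break condition ends the current episode."""
    return h < H_MIN or v_air < V_MIN or v_air > V_MAX or t >= T_MAX

print(episode_done(h=150.0, v_air=10.2, t=42.0))  # False: flight continues
```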

3.1.4. Observations

For the glider agent, observations are the source of decision making. In this research, the selection of observations follows two principles. (1) Observations should be sufficient to allow the glider to make sound energy-harvesting decisions. (2) Observations should not contain information that the glider cannot obtain in reality, such as the updraft distribution. Reddy et al. used the pair of vertical wind acceleration and torque as the sensorimotor cue sensed by the glider, namely, the observation of the glider agent, in order to minimize the electronic sensory devices required for control [1].

In this research, "torque" is also used as an observation to help the glider perceive the possible location of a stronger updraft; it is defined as
$$\tau = V_{wl} - V_{wr},$$
where V_wl and V_wr are the updraft velocities at the left and right wings.

Instead of vertical wind acceleration, the updraft velocity, V_w, is used directly as the second observation of the glider agent. Together, τ and V_w provide succinct yet sufficient information for the glider agent to control θ and γ: they tell the glider how strong the present updraft is and help it locate a probable stronger updraft.

To keep τ on the same scale as V_w, the wind speed differential, τ, is multiplied by a scaling factor, K, which is set to 100. Therefore, the observation vector received by the glider agent is o_t = [Kτ, V_w]. The variation of τ and V_w in a flight test after training is shown in Figure 3.
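A sketch of how this observation vector could be assembled is shown below; the wind_field callable (e.g., the round-updraft sketch from Section 2.2) and the wing-tip geometry are illustrative assumptions, not the authors' sensor model.

```python
import math

K = 100.0  # scaling factor so K*tau is on the same order as V_w

def observation(x, y, heading_rad, wind_field, half_span=2.475):
    """Return [K*tau, V_w] at glider position (x, y) for a given heading (rad)."""
    # Ground-plane offsets of the left wing tip, 90 degrees left of the heading.
    dx = -half_span * math.sin(heading_rad)
    dy = half_span * math.cos(heading_rad)
    v_left = wind_field(x + dx, y + dy)    # updraft at the left wing, V_wl
    v_right = wind_field(x - dx, y - dy)   # updraft at the right wing, V_wr
    tau = v_left - v_right                 # wind speed differential
    return [K * tau, wind_field(x, y)]
```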

3.2. Training Method

Reinforcement learning is very sensitive to both initialization and the dynamics of the training process, which can lead to failed training [25]. To enable reinforcement learning to be used effectively for autonomous soaring, previous researchers have typically used complex thermal models and widely varying initial conditions. In this research, a simple round thermal [18] is used to reduce the complexity of the thermal model, and brief initial conditions are set. The initial position (x0, y0) is set to (−50, −50) m, and the initial altitude, h0, is set to 100 m. The initial airspeed is set to 8 m/s, and the maximum updraft strength is w0 = 8 m/s. The initial θ, γ, and yaw angle, φ, of the glider are all set to zero. All the initial conditions are fixed without any variance, which greatly reduces the training difficulty. During the training process, the time step of the TD3 algorithm is set to 1 s, and the target and policy update period is set to 2 s. The learning rate is set to 0.01 at the beginning of training and decreases to 0.001 during the training process. The discount factor, μ, is set to 0.99, as each episode lasts for 100 s. Five agents are trained with different random seeds, and training stops when the average reward fluctuation over 40 episodes is within 100. The reinforcement learning training and simulation processes are conducted in MATLAB/Simulink. The CPU used for simulation is an Intel Xeon Platinum 8255C.
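For reference, the training setup quoted above can be summarized in a single configuration structure; this merely mirrors the stated values and is not the authors' actual training script.

```python
# Training configuration summarized from the text above (values as stated).
TRAIN_CONFIG = {
    "initial_position_m": (-50.0, -50.0),
    "initial_altitude_m": 100.0,
    "initial_airspeed_mps": 8.0,
    "max_updraft_mps": 8.0,
    "agent_time_step_s": 1.0,
    "policy_update_period_s": 2.0,
    "learning_rate": (0.01, 0.001),   # initial value, decayed during training
    "discount_factor": 0.99,
    "num_seeds": 5,
    "stop_reward_fluctuation": 100,   # over a 40-episode window
}
```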

3.3. Training Results

Training results for the five agents with different random seeds at four different initial positions are shown in Figure 4. All five agents achieved good training results, indicating a high training success rate. The first agent had the best result and, therefore, will be used for subsequent research.
Figure 5 displays the training results as flight paths, and Table 2 lists the energy gain independent of mass at different initial positions on a circle with a radius of 50√2 m. It can be observed that, from the different initial positions, the glider successfully locates the center of the round updraft and hovers around it. When x0 = −50 m, as shown in Figure 5a, the glider initially flies toward the stronger updraft, so only a modest heading change is required to reach the updraft center. When x0 = 50 m, as shown in Figure 5b, the glider initially flies toward the weaker updraft, necessitating a turn in order to fly toward the updraft center. An interesting phenomenon is that the trained glider performs better in a clockwise hover, which is caused by the unequal sampling of the updraft environment during the training phase.
Figure 6a depicts the trajectories, and Table 3 shows the energy gain independent of mass at different x0 values when y0 = −50 m. It can be observed that when x0 ≥ −90 m, the glider successfully detects the center of the round updraft and hovers around it. However, when x0 ≤ −100 m, the glider is unable to precisely recognize the updraft center and flies in the opposite direction. This indicates that the current glider agent is ineffective at recognizing a weak updraft. Also, at the different x0 values, the glider tends to hover clockwise. The trajectories at different maximum thermal velocities, w0, when the initial position is (x0, y0) = (−50 m, −50 m) are shown in Figure 6b, and Table 4 shows the corresponding energy gain independent of mass. The glider can effectively locate the center of the round updraft and hover around it when w0 ≥ 4 m/s. However, when w0 ≤ 2 m/s, the glider is unable to correctly recognize the updraft center.
Figure 7 depicts the learned flight strategies. For the roll angle control strategy, γ, shown in Figure 7a, there is a distinct boundary between τ > 0 and τ < 0, and on this boundary γ ≈ 0. For τ > 0, γ is always positive, and it reacts significantly to changes in the observations only when V_w < 1. Conversely, when τ < 0, γ is very sensitive to changes in the observations, implying that the glider has more precise control over the roll angle, γ. This explains why the glider always hovers clockwise: when the glider hovers clockwise, the updraft velocity at the left wing, V_wl, is smaller than V_wr, resulting in τ < 0, allowing the glider to select the optimal γ to maximize energy gain. As illustrated in Figure 7b, the glider agent constantly tries to keep a high pitch angle, θ, because a higher θ corresponds to a higher lift-to-drag ratio and a lower sink rate.

4. Further Optimized Algorithm

According to the previous presentation, the glider agent trained with a brief method perceives round updrafts well and can effectively enhance its ability to harvest energy. However, it also shows that, when the updraft is weak, for example, when w0 ≤ 2 m/s or x0 ≤ −100 m, the glider cannot identify the updraft center. This is because the initial position during training is (−50, −50) m, where the updraft is relatively strong, so the agent rarely samples data in weak updrafts. One alternative is to move the initial position further away, but this approach may degrade performance when the initial position is closer to the updraft center. As demonstrated in Figure 6a and Table 3, the energy gain independent of mass is lower when x0 = −40 m than when x0 = −50 m or even −70 m. Another solution is to vary the distance between the initial position and the updraft center. However, the maximum energy gain corresponding to different initial distances is inconsistent, making it challenging to gauge training quality, and it is also difficult to determine the appropriate range of distances.
Furthermore, as shown in Figure 7, the learned flight strategies exhibit a significant asymmetry about τ for both γ and θ. τ is the observation that represents the strength difference between the left and right updrafts, which guides the glider to fly toward the left or right; for the same V_w value, γ should therefore be an odd function of τ. During the training process, once the glider agent has explored a successful hovering flight strategy (clockwise or counterclockwise) for the first time, it prefers the same hovering strategy in all subsequent training. As a consequence, a large proportion of similar hovering experiences accumulate in the TD3 experience replay buffer, causing later training to concentrate on optimizing the same hovering strategy and resulting in asymmetry.
Prompt tuning is a simple yet effective mechanism for learning "soft prompts" that condition frozen language models to perform specific downstream tasks [20]. Inspired by the prompt-tuning mechanism, and given that a basic soaring strategy has already been developed, novel strategy optimization methods are proposed to "prompt tune" the basic strategies. The optimized methods target the energy-harvesting task in autonomous soaring; they are add-ons to the reinforcement learning algorithm and are also applicable to algorithms other than the TD3. Comparisons of the simulated flight results of the TD3 and the optimized TD3 are conducted in round updrafts and the Gedeon thermal model [23]. Furthermore, flight results in the Dryden continuous turbulence model [22,26], a gust model used to simulate turbulence, are compared.

4.1. Optimized Methods

As shown in Figure 7, γ reacts significantly to changes in τ, while its reaction to changes in V_w is rather minor. Furthermore, τ is the observation that perceives the potential location of stronger updrafts. As a result, before inputting the observation τ into the glider agent, we can add an extra correction module, N_τ, to improve the agent's capacity to recognize weak updrafts.

Firstly, to correct τ, a threshold value, V_wt, is determined to divide strong from weak updrafts. In the correction module, two artificial neural networks, N_τ1 and N_τ2, are employed to correct the strong and weak updrafts, respectively, and the corrected N_τ(Kτ) is input into the glider agent. As mentioned earlier, K is set to 100.

Once the training of the TD3 glider agent is completed, none of the parameters of the trained glider agent change. For the trained glider agent, different scaling factors, K, result in different energy gains independent of mass (total reward, R) at different initial positions, as shown in Figure 8. We then derive the K_m that corresponds to the maximum R at each initial position. Based on this, we obtain a mapping from τ to K_m·τ. As a result, the artificial neural networks N_τ1 and N_τ2 can be trained to correct τ, making the input to the glider agent more reasonable. N_τ1 and N_τ2 can be considered an upgraded version of K. In the round updraft, V_wt is set to 0.1 m/s.
Additionally, in order to solve the problem of flight strategy asymmetry, the trained glider is operated with strategy symmetry so that it has better control over both clockwise and counterclockwise flight. In Figure 7a, when τ < 0, γ is very sensitive to changes in the observations; thus, the flight strategies for τ < 0 are used as the basic strategies. Specifically, τ is taken as an absolute value and multiplied by −1 before being input into the glider agent. For the γ value output by the glider agent, no operation is performed when τ < 0, and γ is multiplied by −1 when τ ≥ 0. No operation is performed on θ in either case because a higher θ always corresponds to a higher lift-to-drag ratio and a lower sink rate. Figure 9 depicts a schematic of the optimized reinforcement learning system with the extra correction module and the strategy symmetry method.
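A compact sketch of how the correction module and the strategy symmetry wrap around the frozen TD3 policy is shown below; the callables correction_net_strong, correction_net_weak, and base_agent stand in for the trained networks N_τ1, N_τ2, and the TD3 policy, and are assumptions of this sketch rather than the authors' implementation.

```python
V_W_THRESHOLD = 0.1   # m/s, splits strong and weak updrafts in the round thermal
K = 100.0             # original scaling factor that the correction nets replace

def corrected_symmetric_action(tau, v_w, base_agent,
                               correction_net_strong, correction_net_weak):
    """Apply the tau correction and strategy symmetry around the frozen agent."""
    # 1) Correct the scaled torque observation with the appropriate network.
    net = correction_net_strong if v_w >= V_W_THRESHOLD else correction_net_weak
    tau_corrected = net(K * tau)
    # 2) Strategy symmetry: always query the tau < 0 side of the learned policy.
    theta, gamma = base_agent(-abs(tau_corrected), v_w)
    # 3) Mirror the roll command back when the true tau lies on the other side;
    #    the pitch command is left unchanged.
    if tau >= 0:
        gamma = -gamma
    return theta, gamma
```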

4.2. Results for the Round Thermal

Figure 10 compares the simulated flight paths of the TD3 and the optimized TD3, and Table 2 displays the energy gain independent of mass at different initial positions on a circle with a radius of 50√2 m. The performance of the optimized TD3 glider agent is superior to that of the original one. When x0 = −50 m, both gliders initially fly toward the stronger updraft; however, with the optimized algorithm, the glider flies more directly to the updraft center than the original one. When x0 = 50 m, both gliders initially fly toward the weaker updraft, and the optimized glider turns faster toward the updraft center. Figure 11 depicts the trajectories of the two agents, and Table 3 lists the energy gain independent of mass at x0 = −100 m, where the original glider could not exactly recognize the updraft center. At all of these initial x0 values, the original glider is unable to precisely recognize the updraft center. The optimized glider, however, readily locates the updraft center and flies directly to it when x0 = −100 m (yellow line in Figure 11). Even at the very distant initial positions of x0 = −500 m and −780 m (green and red lines in Figure 11), the optimized glider effectively detects the updraft center, demonstrating the effectiveness of the optimized algorithm. According to Figure 6b and Table 4, when w0 ≤ 2 m/s, the original glider is unable to properly recognize the updraft center; with the help of the correction module, the trained glider agent is able to recognize it accurately. Comparisons between the TD3 and the optimized TD3 when w0 = 2 m/s are shown in Figure 12 and Table 4. The optimized glider agent can successfully exploit the updraft to harvest energy even in a rather weak updraft.
Figure 13 depicts a flight strategy comparison between the TD3 and the optimized TD3 in a very narrow region of τ and V_w. According to Figure 13a,c, the boundary around τ = 0 is thinner in the optimized TD3. Within this boundary, γ always takes a large value because the glider rarely encounters cases where τ is very small during training, preventing the strategies from being trained effectively; this results in incomplete and unsatisfactory flight strategies when τ is very small. It also explains the oscillation of the flight paths during long-distance flights shown in Figure 11: when the glider is far from the updraft center, τ is very small, leading to a large γ and a rapidly rolling glider, which in turn produces an opposite τ and an opposite γ. Since V_wt is set to 0.1 m/s, there is a distinct boundary for the optimized TD3 at V_w = 0.1. As shown in Figure 13b,d, thanks to the application of strategy symmetry, θ always takes a high value in the optimized TD3, yielding a higher lift-to-drag ratio and a lower sink rate.

4.3. Results for the Gedeon Thermal

To test the generalization of the trained glider agent, we extend the original round updraft to the more complex Gedeon thermal environment [23]. The Gedeon thermal model is described as
$$V_w(x, y) = w_0\left(1 - \frac{(x - x_c)^2 + (y - y_c)^2}{2500}\right)\exp\!\left(-\frac{(x - x_c)^2 + (y - y_c)^2}{2500}\right)\ \mathrm{m/s},$$
where the thermal center (x_c, y_c) is set to (0, 0).
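A sketch of this model, mirroring the round-updraft function given earlier, is shown below; the parameter defaults are again illustrative.

```python
import numpy as np

# Sketch of the Gedeon thermal model above, which adds a downdraft ring
# outside the thermal core.
def gedeon_updraft(x, y, w0=8.0, xc=0.0, yc=0.0):
    """Vertical wind velocity (m/s); negative values indicate downdraft."""
    r2_norm = ((x - xc) ** 2 + (y - yc) ** 2) / 2500.0
    return w0 * (1.0 - r2_norm) * np.exp(-r2_norm)

# Example: at (-50 m, -50 m) the normalized squared radius is 2, giving a downdraft.
print(gedeon_updraft(-50.0, -50.0))  # 8 * (1 - 2) * exp(-2) ≈ -1.08 m/s
```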
The updraft velocity in the round updraft model is always positive, while the Gedeon model contains a downdraft away from the thermal center, which makes it harder for the glider agent to identify the updraft center. We tested the initial positions (−50 m, −50 m), where the updraft velocity is exactly at its minimum (the strongest downdraft), and (−60 m, −50 m). Figure 14 compares the simulated flight paths, and the energy gains independent of mass are shown in Table 5. When x0 = −50 m, the original TD3 agent cannot effectively harvest energy, while the optimized TD3 agent can still fly to the updraft center and hover to harvest energy. When x0 = −60 m, neither the TD3 nor the optimized TD3 agent achieves an energy gain, implying that the current training method is not sufficient for the agent to recognize the updraft center when V_w < 0.

4.4. Results for Dryden Continuous Turbulence

The comparisons in the preceding sections highlight the effectiveness of the optimized methods in helping the glider accurately recognize and harvest energy from weaker round updrafts and the Gedeon thermal. To further verify the generalization of the correction module, we test the same TD3 glider agent and the optimized TD3 agent from the preceding sections in Dryden continuous turbulence. The specification MIL-F-8785C is used [26]. The turbulence is generated for a moderate level of turbulence at low altitude (less than 304.8 m), with a mean wind speed of 15 m/s at 6.1 m altitude [22,27].

As the thermals in both the round updraft model and the Gedeon model have a center, the most important task for the glider agent there is to locate the thermal center. In Dryden continuous turbulence, there is no obvious thermal center, and the wind velocity is time-varying. Therefore, in the turbulence environment, the difficulty lies in flying toward regions of larger updraft velocity based on the observations. In addition, we test an extra controller with fixed θ and γ values during the flight simulation; the fixed θ and γ parameters allow the glider to harvest the most energy achievable without extra control commands.

Figure 15 compares the position parameters across the three different control strategies, and the energy gains independent of mass are shown in Table 6. The original TD3 glider agent cannot harvest energy from the turbulence environment, whereas the optimized TD3 glider agent succeeds and surpasses the other two in energy gain, implying that the optimized TD3 has greater potential to explore unknown turbulence fields. The optimized TD3 agent outperforms the other two controllers in terms of altitude gain, which means that it has a higher gravitational potential energy gain. Both the TD3 and the optimized TD3 agents fly upward when V_w > 0 and downward when V_w < 0, as shown in Figure 15a,b. Regarding flight velocity, the original TD3 agent has a peak flight velocity of about 19.2 m/s, which remains at about 15.4 m/s at the end of the simulation. The optimized TD3 agent has a peak flight velocity of about 14.5 m/s, which remains at about 7.8 m/s at the end of the simulation, t = 100 s. The original TD3 agent has a larger kinetic energy gain, but it loses too much altitude, resulting in a low overall energy gain.

5. Conclusions

This paper studied the flight strategy for maximizing the harvested energy of an unpowered glider in a round updraft using a continuous control reinforcement learning algorithm, the TD3. A PID controller was used to control the pitch angle, θ, and roll angle, γ, of a 6-DOF glider dynamics model. The testing results show that the trained glider is able to recognize the updraft center and hover around it to harvest more energy. To improve the trained glider agent's perception of weak updrafts without increasing the training difficulty, and to solve the asymmetric strategy problem, further optimized methods were proposed: an extra correction module is added to the trained glider agent, and a strategy symmetry method is applied. The comparison results confirm that the optimized TD3 agent can fly from a long distance to the round updraft center and can gain energy in a relatively weak updraft environment. Furthermore, flight simulations in the Gedeon thermal model and Dryden continuous turbulence demonstrate that the further optimized methods significantly improve the energy-harvesting ability of the glider agent.
Compared with the TD3 policy networks, the proposed correction networks have fewer parameters. Therefore, we may be able to add various correction modules for different autonomous soaring tasks or for different weather conditions rather than training multiple reinforcement learning policy networks with a large number of parameters.
The research in this paper demonstrates the potential of using reinforcement learning algorithms to maximize the energy gain of unpowered gliders in updrafts. However, the energy-harvesting strategies learned by the TD3 are not comprehensive, and situations with updraft velocities below zero have not been considered. In future work, more comprehensive energy-harvesting strategies and methods for maximizing harvested energy from various types of updrafts should be studied. Furthermore, training the complex reinforcement learning algorithm requires considerable computing resources, which limits the possibility of training the glider during real-world flight. This means that only fixed flight strategies trained in simulation can be used to achieve real-world autonomous soaring, which weakens generalizability to real updrafts.

Author Contributions

Conceptualization, J.Z.; formal analysis, J.Z.; funding acquisition, J.L.; investigation, J.Z. and L.Z.; methodology, J.Z., J.L. and L.Z.; project administration, J.L.; resources, J.L.; software, L.Z.; supervision, J.L. and L.Z.; validation, J.Z.; writing—original draft, J.Z.; writing—review and editing, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by specialized research projects of Huanjiang Laboratory. The APC was funded by Zhejiang University.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

This research was supported by the specialized research projects of Huanjiang Laboratory.

Conflicts of Interest

The authors declare no conflict of interest.

Nomenclature

  AR      aspect ratio
  b       wingspan, m
  C_L0    lift coefficient at zero AOA
  C_La    lift coefficient due to AOA
  C_D0    zero-lift drag coefficient
  g       gravitational acceleration, m/s²
  K       scaling factor
  m       glider mass, kg
  R       reward function
  S       wing reference area, m²
  V       airspeed, m/s
  V_w     updraft velocity, m/s
  w_0     maximum updraft velocity, m/s
  x, y, h glider positions, m
  x_c, y_c updraft center, m
  α       angle of attack (AOA), deg
  θ       pitch angle, deg
  γ       roll angle, deg
  τ       wind speed differential, m/s

References

  1. Reddy, G.; Celani, A.; Sejnowski, T.J.; Vergassola, M. Learning to soar in turbulent environments. Proc. Natl. Acad. Sci. USA 2016, 113, E4877–E4884. [Google Scholar] [CrossRef] [PubMed]
  2. Kahn, A.D. Atmospheric Thermal Location Estimation. J. Guid. Control Dyn. 2017, 40, 2363–2369. [Google Scholar] [CrossRef]
  3. Woodbury, T.; Dunn, C.; Valasek, J. Autonomous Soaring Using Reinforcement Learning for Trajectory Generation. In Proceedings of the 52nd Aerospace Sciences Meeting, AIAA SciTech Forum, National Harbor, MD, USA, 13–17 January 2014. AIAA Paper 2014-0990. [Google Scholar]
  4. Allen, M. Autonomous Soaring for Improved Endurance of a Small Uninhabited Air Vehicle. In Proceedings of the 43rd AIAA Aerospace Sciences Meeting and Exhibit, Reno, NV, USA, 10–13 January 2005. AIAA Paper 2005-1025. [Google Scholar] [CrossRef]
  5. Depenbusch, N.T.; Bird, J.J.; Langelaan, J.W. The AutoSOAR Autonomous Soaring Aircraft, Part 1: Autonomy Algorithms. J. Field Robot. 2018, 35, 868–889. [Google Scholar] [CrossRef]
  6. Depenbusch, N.T.; Bird, J.J.; Langelaan, J.W. The AutoSOAR Autonomous Soaring Aircraft Part 2: Hardware Implementation and Flight Results. J. Field Robot. 2018, 35, 435–458. [Google Scholar] [CrossRef]
  7. Edwards, D.J.; Silverberg, L.M. Autonomous Soaring: The Montague Cross-Country Challenge. J. Aircr. 2010, 47, 1763–1769. [Google Scholar] [CrossRef]
  8. Reddy, G.; Wong-Ng, J.; Celani, A.; Sejnowski, T.J.; Vergassola, M. Glider soaring via reinforcement learning in the field. Nature 2018, 562, 236–239. [Google Scholar] [CrossRef] [PubMed]
  9. Thomas, C.P.; Larry, M.S.; Ashok, G. Artificial Lumbered Flight for Autonomous Soaring. J. Guid. Control Dyn. 2020, 43, 553–566. [Google Scholar]
  10. Notter, S.; Zürn, M.; Groß, P.; Fichter, W. Reinforced Learning to Cross-Country Soar in the Vertical Plane of Motion. In Proceedings of the AIAA Scitech 2019 Forum, San Diego, CA, USA, 7–11 January 2019. AIAA 2019-1420. [Google Scholar]
  11. Notter, S.; Schimpf, F.; Müller, G.; Fichter, W. Hierarchical Reinforcement Learning Approach for Autonomous Cross-Country Soaring. J. Guid. Control Dyn. 2023, 46, 114–126. [Google Scholar] [CrossRef]
  12. Chakrabarty, A.; Langelaan, J.W. Energy-Based Long-Range Path Planning for Soaring-Capable Unmanned Aerial Vehicles. J. Guid. Control Dyn. 2011, 34, 1002–1015. [Google Scholar] [CrossRef]
  13. Guilliard, I.; Rogahn, R.; Piavi, J.; Kolobov, A. Autonomous Thermalling as a Partially Observable Markov Decision Process. In Proceedings of the Robotics: Science and Systems XIV, Robotics: Science and Systems Foundation, Pittsburgh, PA, USA, 26–30 June 2018. [Google Scholar] [CrossRef]
  14. MacCready, P.B.J. Optimum airspeed selector. Soaring 1958, 10–11. Available online: https://soaringweb.org/Soaring_Index/1958/PDF/1958_Jan-Feb_10.html (accessed on 16 October 2023).
  15. Walton, C.; Kaminer, I.; Dobrokhodov, V.; Jones, K.D. Alternate Strategies for Optimal Unmanned Aerial Vehicle Thermaling. J. Aircr. 2018, 55, 2347–2356. [Google Scholar] [CrossRef]
  16. Edwards, D.J.; Kahn, A.D.; Kelly, M.; Heinzen, S.; Scheiman, D.A.; Jenkins, P.P.; Walters, R.; Hoheisel, R. Maximizing Net Power in Circular Turns for Solar and Autonomous Soaring Aircraft. J. Aircr. 2016, 53, 1237–1247. [Google Scholar] [CrossRef]
  17. Chung, J.J.; Lawrance, N.R.J.; Sukkarieh, S. Learning to Soar: Resource-Constrained Exploration in Reinforcement Learning. Int. J. Robot. Res. 2014, 34, 158–172. [Google Scholar] [CrossRef]
  18. Wharington, J.; Herszberg, I. Control of a high endurance unmanned air vehicle. In Proceedings of the 21st Congress of International Council of the Aeronautical Sciences, Melbourne, Australia, 13–18 September 1998. [Google Scholar]
  19. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596. [Google Scholar]
  20. Lester, B.; Al-Rfou, R.; Constant, N. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 3045–3059. [Google Scholar]
  21. Kaelbling, L.P.; Littman, M.L.; Cassandra, A.R. Planning and acting in partially observable stochastic domains. Artif. Intell. 1998, 101, 99–134. [Google Scholar] [CrossRef]
  22. Gage, S. Creating a Unified Graphical Wind Turbulence Model from Multiple Specifications. In Proceedings of the AIAA Modeling and Simulation Technologies Conference and Exhibit, Austin, TX, USA, 11–14 August 2003. AIAA Paper 2003-5529. [Google Scholar]
  23. Gedeon, J. Dynamic Analysis of Dolphin-Style Thermal Cross-Country Flight. Tech. Soar. 1973, 3, 9–19. [Google Scholar]
  24. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR) 2016, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  25. Irpan, A. Deep Reinforcement Learning Doesn’t Work Yet. 2018. Available online: https://www.alexirpan.com/2018/02/14/rl-hard.html (accessed on 16 October 2023).
  26. MIL-F-8785C: Military Specification: Flying Qualities of Piloted Airplanes; Naval Publications and Form Center: Philadelphia, PA, USA, 1980.
  27. Lawrance, N.R.J.; Sukkarieh, S. Autonomous Exploration of a Wind Field with a Gliding Aircraft. J. Guid. Control Dyn. 2011, 34, 719–733. [Google Scholar] [CrossRef]
Figure 1. Definitions of the coordinate system: roll angle,  γ , and pitch angle,   θ .
Figure 2. Velocity distribution of a round updraft model ( w 0 = 8 ).
Figure 3. The change trend of  τ ,   V w .
Figure 4. Training results for 5 agents with different random seeds at different initial positions ( x 0 , y 0 ).
Figure 5. Flight paths of the glider at different initial positions ( x 0 , y 0 ).
Figure 6. Flight paths of the glider at (a) different  x 0  values when  y 0  = −50 m and (b) different maximum thermal velocities,  w 0 , when ( x 0 , y 0 ) = (−50 m, −50 m).
Figure 7. Learned flight strategies: the change trend of (a γ  and (b θ  corresponding to various observations, that is,  τ  and  V w .
Figure 8. Comparisons of total reward,  R , between different scaling factors,  K , at different initial positions.
Figure 9. The optimized reinforcement learning algorithm diagram.
Figure 10. Comparisons of two-dimensional flight paths between TD3 and optimized TD3 at different initial positions.
Figure 11. Comparison of flight paths between the TD3 and the optimized TD3 at  x 0 =  [−100 m, −500 m, −780 m].
Figure 12. Comparison of flight paths between the TD3 and the optimized TD3 when  w 0  = 2 m/s.
Figure 13. Comparison of flight strategies between the TD3 and the optimized TD3: (a γ  of TD3, (b θ  of TD3, (c γ  of optimized TD3, (d θ  of optimized TD3.
Figure 14. Comparison of flight paths between the TD3 and the optimized TD3 at  x 0 =  [−50 m, −60 m] in the Gedeon thermal model.
Figure 15. Comparison of position parameters between the TD3, the optimized TD3, and fixed θ, γ in Dryden continuous turbulence.
Table 1. Geometric and aerodynamic features of the glider model.

Feature                          Parameter   Value
Mass                             m           5.5 kg
Wingspan                         b           4.95 m
Wing reference area              S           0.9225 m²
Aspect ratio                     AR          26.56
Lift coefficient at zero AOA     C_L0        0.1284
Lift coefficient due to AOA      C_La        0.1132
Zero-lift drag coefficient       C_D0        0.023
Table 2. Energy gain independent of mass at different initial positions.

Initial Position, m   (−50, −50)   (50, −50)   (−50, 50)   (50, 50)
TD3                   6669         6536        6635        6511
Optimized TD3         6692         6543        6692        6543
Table 3. Energy gain independent of mass at different x0 values when y0 = −50 m.

x0, m           −40     −70     −90     −100    −500    −780
TD3             6614    6633    6516    −607    −604    −604
Optimized TD3   6723    6610    6472    6449    4041    2344
Table 4. Energy gain independent of mass at different maximum thermal velocities, w0, when the initial position is (x0, y0) = (−50 m, −50 m).

w0, m/s         10      6       4       2       1
TD3             8459    4870    2910    −469    −580
Optimized TD3   8501    4747    2889    1192    343
Table 5. Energy gain independent of mass at x0 = [−50 m, −60 m] in the Gedeon thermal model.

x0, m           −50     −60
TD3             −761    −766
Optimized TD3   5548    −733
Table 6. Energy gain independent of mass in Dryden continuous turbulence.

Algorithm        Energy Gain Independent of Mass
TD3              −344
Optimized TD3    435
Fixed θ and γ    245