Article

Hybrid Deep Reinforcement Learning Considering Discrete-Continuous Action Spaces for Real-Time Energy Management in More Electric Aircraft

1 College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
2 College of Control Science and Engineering, Zhejiang University, Hangzhou 310027, China
3 Green Rooftop Inc., Hangzhou 310032, China
* Author to whom correspondence should be addressed.
Energies 2022, 15(17), 6323; https://doi.org/10.3390/en15176323
Submission received: 28 July 2022 / Revised: 23 August 2022 / Accepted: 27 August 2022 / Published: 30 August 2022

Abstract

The increasing number and functional complexity of power electronic devices in more electric aircraft (MEA) power systems make modelling and computation highly complex, so real-time energy management is a formidable challenge; moreover, the discrete-continuous action space of the MEA system under consideration also poses a challenge to existing DRL algorithms. Therefore, this paper proposes a real-time energy management optimisation strategy based on hybrid deep reinforcement learning (HDRL). An energy management model of the MEA power system is constructed from an analysis of the generator, bus, load and energy storage system (ESS) characteristics, and the problem is formulated as a multi-objective optimisation problem with both integer and continuous variables. The problem is solved by combining a duelling double deep Q network (D3QN) algorithm with a deep deterministic policy gradient (DDPG) algorithm, where the D3QN algorithm handles the discrete action space and the DDPG algorithm handles the continuous action space. The two algorithms are trained alternately and interact with each other to maximize the long-term payoff of the MEA. Finally, simulation results verify the effectiveness of the method under different generator operating conditions. For different time lengths T, the method always obtains smaller objective function values than previous DRL algorithms and, despite a slight loss in solution accuracy, is several orders of magnitude faster than a commercial solver, with solution times always below 0.2 s. In addition, the method has been validated on a hardware-in-the-loop simulation platform.

1. Introduction

In the face of continuing energy depletion, improving energy efficiency and reducing energy consumption have become topics of worldwide discussion. In recent years, similar to the effort to move towards electric vehicles [1], much research has focused on the idea of more electric aircraft (MEA) [2]. The mechanical, hydraulic and pneumatic power systems of traditional aircraft are gradually being replaced by electric power systems (EPS), an inevitable trend in future aircraft development [3]. Compared with conventional aircraft, the more electric aircraft not only improves efficiency and reliability, but also reduces the weight, energy consumption and operating costs of the aircraft [4].
Due to the increased size of its power management system, the MEA faces significant challenges in energy management, and extensive research has been carried out on the energy management control of MEA. In [5], a decentralised energy management strategy based on improved droop control is proposed for a fuel cell/supercapacitor-based MEA auxiliary power unit. However, it is difficult to achieve accurate power sharing between supercapacitor units of arbitrary capacity owing to the effect of the parallel virtual resistors used for state of charge (SoC) recovery. Therefore, [6] proposes a decentralised energy management strategy using a virtual inductor droop strategy for the fuel cell units and a virtual resistance droop strategy for the supercapacitor units, which not only minimises the effect of line impedance on dynamic power distribution but also reduces the order of the system. A novel EPS of MEA containing multiple batteries, PV generators and fuel cells is proposed in [7]. An autonomous energy management strategy with a composite droop scheme is used to achieve dynamic coordination of the three, but the approach does not consider the impact of different scenarios on the energy management of the MEA. To ensure that the MEA can operate in different scenarios, [8] uses the optimal droop gain method as a power-sharing method to minimise the losses of the EPS of the MEA and a finite state machine to implement the reconfiguration operation.
Although the energy management strategy of MEA is well established in terms of control, less work has been done on its optimization. An adaptive online power management algorithm for hybrid energy storage systems based on a combination of a genetic algorithm and the Lyapunov optimization method is proposed in [9], aiming at minimizing the power fluctuations of the generator. In [10], the problem is formulated as a multi-objective optimization problem in order to improve the system-level efficiency of the generators and converters. A multi-objective optimization problem similar to [10] was proposed in [11], with the difference that the lifetime of the energy storage system (ESS) was considered. For the complex MEA power management system, ref. [12] proposed a new distributed framework to reduce the complexity of problem-solving. However, the aforementioned works [9,10,11,12] address only the optimization strategy on the generation side of the MEA and do not manage the load side in order to prevent overloads. In [13], a mixed integer linear programming formulation is used with load curtailment and the number of generators used as objectives, which effectively reduces the amount of load curtailment and the number of generators used while solving the power distribution, contactor switching and battery charging/discharging problems; however, this formulation cannot meet the practical requirements of MEA operation, because generators may operate abnormally. In [14], the energy management problem of MEA is described as a mixed integer quadratic programming (MIQP) problem with the objective of minimizing load shedding and optimizing generator power output, with the generator output power schedule, load connection, and battery charging and discharging schedule as decision variables; it is solved using the commercial solver CPLEX, and a power allocation and load management optimization strategy of MEA is implemented considering generator and bus damage. In summary, although all of the above methods can achieve high performance, increasing system complexity makes the decision-making and control process computationally more difficult and creates more uncertainty for real-time energy management. Achieving real-time energy management therefore remains a very challenging topic, considering that most current studies on energy management for MEA pay little attention to the real-time nature of their methods.
With the development of artificial intelligence techniques, reinforcement learning (RL) is gaining attention from researchers and is beginning to be applied in many areas of EPS. Reinforcement learning was introduced in [15] to build the corresponding model for smart island management, and decent results were achieved in terms of computational speed. However, as traditional reinforcement learning methods store Q-values in tabular form, they suffer from the curse of dimensionality in complex environments and are difficult to apply and promote in practice. As a result, researchers have fused deep learning (DL), with its perceptual capabilities, and RL, with its decision-making capabilities, to form deep reinforcement learning (DRL) [16], which provides solutions for practical energy management scenarios. In [17], a deep Q-learning (DQL) algorithm is developed for real-time energy management in microgrids considering the uncertainty of load demand, renewable energy sources and electricity prices. A new DQL-based optimization method for updating neural networks is designed in [18], and its rapidity and optimality are verified in the energy management of hybrid electric vehicles. However, the DQL algorithm is only suitable for handling discrete actions, and continuous actions must be discretized. When the number of controllable devices increases or the granularity of discretization becomes fine, the size of the action space grows exponentially, making the problem difficult to solve. For the continuous control strategy of a plug-in hybrid electric bus, a deep deterministic policy gradient (DDPG) algorithm was developed in [19], which achieves the optimal energy allocation of the bus over a continuous action space. In [20], the energy cost minimization problem for smart homes was studied using the DDPG algorithm. In addition to the above applications, the DDPG algorithm has also been applied to the energy management of microgrids [21,22] and smart buildings [23,24]. However, current DRL-based methods mainly consider either a discrete or a continuous action space [25]. For the discrete-continuous hybrid action space, a DDPG-based energy management strategy for residential multi-energy systems is proposed in [26]; when dealing with discrete actions, the continuous outputs are discretized and mapped to discrete actions. However, treating discrete actions as continuous actions may significantly increase the complexity of the action space. Therefore, it is particularly important to design a DRL algorithm that can handle the discrete-continuous hybrid action space.
In this paper, we propose an energy management strategy for MEA based on hybrid deep reinforcement learning (HDRL). The objective function and constraints of the energy management are described based on modelling the characteristics of the generators, buses, loads, and ESS, and the problem is transformed into a Markov decision process (MDP) solution. In this process, the integer and continuous variables of the objective function are considered as discrete action space and continuous action space, respectively. For the hybrid action space, we use the duelling double deep Q network (D3QN) algorithm [27] and the DDPG algorithm to solve the problem and train these two algorithms alternately. The main contributions of this paper are summarized as follows:
(1)
A new MEA energy management model is developed, taking into account the operating state of the generators, the connection priority relationship between the generators and the buses, and the shedding priority relationship of the loads.
(2)
For the MEA model, a hybrid deep reinforcement learning (HDRL) algorithm incorporating D3QN and DDPG is proposed to solve the energy management problem of MEA, taking the generator-bus connection relationship and the load shedding relationship as the discrete action space, and the generator output power and ESS charging and discharging power as the continuous action space. The HDRL algorithm inherits the advantages of DQL in dealing with a discrete action space and exploits the advantages of DDPG in dealing with a continuous action space. In the training process of the HDRL algorithm, we first train on discrete actions and then on continuous actions, alternating between the two until full convergence.
(3)
Simulation studies based on data from [12], together with numerical analysis of the generators under different operating conditions, demonstrate the effectiveness of the proposed HDRL method. Its real-time capability is verified by applying it to different time lengths T.
The rest of the paper is organized as follows. Section 2 presents the mathematical formulation of the MEA energy management problem. Section 3 constructs the MDP formulation and proposes an energy management strategy for HDRL. Section 4 describes the case study and simulation results. Section 5 presents the conclusions and future research directions.

2. MEA System Description

2.1. EPS Model

A conventional constant-frequency AC EPS requires a large number of AC and DC rectifiers to meet power quality requirements, which significantly increases the weight of the MEA. In [28], a high voltage DC (HVDC) EPS is used, which greatly reduces the number of power conversion units because AC/DC rectifiers are no longer needed, thus achieving a weight reduction. In this paper, the EPS structure of the MEA consists of an engine, a main generator, an APU, a 270 V HVDC bus, an ESS, various types of loads and power converters, as shown in Figure 1 [11]. The power generated by the main generator and the APU in AC form is fed into the HVDC bus via AC/DC rectifiers and supplied directly to the HVDC loads or, via power converters, to the low voltage DC (LVDC) and AC loads, while an ESS such as a battery is connected through a bidirectional DC/DC converter to meet peak loads during operation and prevent non-essential load shedding.

2.2. Mathematical Formulations

2.2.1. Load and Bus Priority Constraints

Two priority constraints are involved in the considered power system. The first is the load connection priority constraint: the connection priority of critical loads is always higher than that of non-critical loads. During the flight, critical loads must not be shed in any emergency, while the shedding of non-critical loads follows their priority rules. The second is the connection priority constraint between each generator and the buses, whose predefined priority relationship is shown in Table 1.

2.2.2. Power Balance Constraint

In the EPS of the MEA, the generators are the only power sources supplying the loads, while the buses collect, distribute and transmit the electrical energy. Taking the power losses into account, the power balance constraints can be expressed as
$$P_{G_k}(t) = \sum_{i \in D} \frac{P_{B_{ki}}(t)}{\eta_{B_{ki}}}$$
$$P_{B_i}(t) = \sum_{k \in G} P_{B_{ki}}(t)$$
$$P_{B_i}(t) = P_{L_i}(t) + P_{ESS_i}(t)$$
$$P_{L_i}(t) = \sum_{j \in S_{C_i}} y_{C_{ij}}(t)\,P_{C_{ij}}(t) + \sum_{j \in S_{NC_i}} y_{NC_{ij}}(t)\,P_{NC_{ij}}(t), \quad y_{C_{ij}}(t) \in \{1\},\; y_{NC_{ij}}(t) \in \{0,1\}$$
where $P_{G_k}(t)$ is the output power of generator k at time t; $P_{B_i}(t)$ is the power of bus i at time t; $P_{B_{ki}}(t)$ is the transmission power between generator k and bus i at time t; $\eta_{B_{ki}}$ is their transmission efficiency; $P_{ESS_i}(t)$ is the charging/discharging power of the energy storage system connected to bus i at time t; $P_{L_i}(t)$ is the total load demand on bus i at time t; $P_{C_{ij}}(t)$ and $P_{NC_{ij}}(t)$ are the critical and non-critical loads; $y_{C_{ij}}(t)$ and $y_{NC_{ij}}(t)$ are their connection states; $G$ and $D$ are the sets of generators and buses, respectively; $S_{C_i}$ and $S_{NC_i}$ are the sets of critical and non-critical loads, respectively.
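As a concrete illustration of the power balance relations above, the following minimal Python sketch checks whether a candidate dispatch satisfies them for a single time step. The function name, array layout and toy dimensions are illustrative assumptions, and the treatment of transmission efficiency follows the reconstructed first relation.

```python
import numpy as np

def bus_power_balance(P_Bki, eta_Bki, P_L, P_ESS):
    """Evaluate the per-step power balance relations.

    P_Bki   : (G, D) array, transmission power from generator k to bus i
    eta_Bki : (G, D) array, transmission efficiencies
    P_L     : (D,)  array, total load demand on each bus
    P_ESS   : (D,)  array, ESS power term of each bus (sign convention as in the text)
    """
    # Generator output must cover the transmitted power plus losses (assumed form).
    P_G = (P_Bki / eta_Bki).sum(axis=1)
    # Power collected by each bus is the sum over all connected generators.
    P_B = P_Bki.sum(axis=0)
    # Each bus supplies its load demand plus the ESS power term.
    balanced = np.allclose(P_B, P_L + P_ESS)
    return P_G, P_B, balanced
```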

2.2.3. Bus Connection and Generator State Constraints

Since the power sources differ in frequency and phase, only one generator can be connected to each bus at time t. This can be expressed as:
$$\sum_{k \in G} \xi_{ki}(t) = 1$$
where $\xi_{ki}(t) \in \{0,1\}$ is the connection state between generator k and bus i at time t.
The current state of the generator directly determines the state of the connection between the generator and the bus, which can be described as
$$U_{G_k}(t) = \min\Big\{ 1, \sum_{i \in D} \xi_{ki}(t) \Big\}$$
where $U_{G_k}(t)$ is the operating state of generator k at time t; the operating state of the main generators also indirectly affects the connection of the non-critical loads.
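A short sketch of how the connection matrix maps to the generator operating states, assuming a NumPy array representation of $\xi_{ki}(t)$ (the function name and representation are illustrative):

```python
import numpy as np

def generator_states(xi):
    """Derive generator operating states from the generator-bus connection matrix.

    xi : (G, D) binary array, xi[k, i] = 1 if generator k is connected to bus i.
    Returns the per-generator operating state U_Gk and whether the
    one-generator-per-bus constraint holds.
    """
    # Exactly one generator may feed each bus at any time t.
    one_per_bus = bool(np.all(xi.sum(axis=0) == 1))
    # A generator is "on" if it is connected to at least one bus.
    U_G = np.minimum(1, xi.sum(axis=1))
    return U_G, one_per_bus
```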

2.2.4. Generator and Bus Power Capacity Constraints

The output power of each generator should be limited to its power capacity, and the transmission power of each bus should also be no greater than its upper capacity limit.
$$U_{G_k}(t)\,\underline{P_{G_k}} \le P_{G_k}(t) \le U_{G_k}(t)\,\overline{P_{G_k}}, \quad \forall k \in G$$
$$0 \le P_{B_i}(t) \le \overline{P_{B_i}}, \quad \forall i \in D$$
$$0 \le P_{B_{ki}}(t) \le \xi_{ki}(t)\,\overline{P_{B_{ki}}}$$
where $\overline{P_{G_k}}$ and $\underline{P_{G_k}}$ are the upper and lower limits of the output power of generator k, $\overline{P_{B_i}}$ is the upper limit of the power capacity of bus i, and $\overline{P_{B_{ki}}}$ is the upper limit of the transmission power.

2.2.5. Generator Optimum Operating Range Constraint

In order to save engine fuel during each phase of flight, the generator needs to be kept in its optimal range, which can be expressed as
$$\underline{P_{G_k,opt}(t)} \le P_{G_k,opt}(t) \le \overline{P_{G_k,opt}(t)}$$
where $P_{G_k,opt}(t)$ is the optimal operating point of generator k at time t, and $\overline{P_{G_k,opt}(t)}$ and $\underline{P_{G_k,opt}(t)}$ are the upper and lower limits of the optimal generator power range determined at time t based on the engine state.

2.2.6. ESS Constraints

In this paper, the ESS is modelled as batteries without internal dynamics, i.e., as a voltage source that depends on the SoC of the battery itself. In the energy management of the MEA, the ESS can be modelled as a generator that provides positive and negative power to the EPS. Assume that the ESS connected to bus i is charging at time t when $P_{ESS_i}(t) < 0$ and discharging when $P_{ESS_i}(t) \ge 0$; the charging and discharging processes cannot take place simultaneously. The change in SoC ($\Delta SoC$) reflects the charging and discharging process of the ESS, and the relationship is as follows.
$$\Delta SoC_{ESS_i}(t) = -\frac{\varepsilon_c\,P_{ESS_i}(t)\,\tau}{E_{ESS_i}}, \quad P_{ESS_i}(t) < 0$$
$$\Delta SoC_{ESS_i}(t) = -\frac{P_{ESS_i}(t)\,\tau}{\varepsilon_{disc}\,E_{ESS_i}}, \quad P_{ESS_i}(t) \ge 0$$
where $E_{ESS_i}$ is the total capacity of the ESS, $\tau$ denotes the time interval, and $\varepsilon_c$ and $\varepsilon_{disc}$ are the efficiency factors of the charging and discharging processes. The dynamic process of the SoC can be described as follows.
$$SoC_{ESS_i}(t+1) = SoC_{ESS_i}(t) + \Delta SoC_{ESS_i}(t)$$
To avoid overcharging and overdischarging the energy storage system, $P_{ESS_i}(t)$ and $SoC_{ESS_i}(t)$ need to be limited to predetermined upper and lower limits, i.e.,
$$\underline{P_{ESS_i}} \le P_{ESS_i}(t) \le \overline{P_{ESS_i}}$$
$$\underline{SoC_{ESS_i}} \le SoC_{ESS_i}(t) \le \overline{SoC_{ESS_i}}$$
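The charge/discharge dynamics above can be summarized in a short Python helper. The sign convention and the explicit minus signs follow the reconstruction above, and the values in the usage comment are taken from Table 2; the function name is illustrative.

```python
def soc_update(soc, p_ess, tau, e_ess, eps_c, eps_disc):
    """One-step SoC update of the ESS (sign convention: p_ess < 0 charges).

    The minus signs ensure that charging increases the SoC and
    discharging decreases it, consistent with SoC(t+1) = SoC(t) + dSoC(t).
    """
    if p_ess < 0:                                  # charging
        d_soc = -eps_c * p_ess * tau / e_ess
    else:                                          # discharging
        d_soc = -p_ess * tau / (eps_disc * e_ess)
    return soc + d_soc

# Example with the Table 2 values (tau = 1, E = 400, eps = 0.95):
# soc_update(0.5, -50, 1, 400, 0.95, 0.95) -> 0.5 + 0.11875
```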

2.2.7. MEA Energy Management Optimization Model

In this paper, the main objectives of energy management are to stabilize the generators in the optimal operating range as much as possible; to enforce predefined connection priority rules between each generator and the bus; to minimize non-critical load shedding in case of overload, and to optimize the lifetime of the batteries in the ESS. Thus, the energy management problem can be described as a multi-objective optimization problem.
(1)
The cost function that limits the generator to operate in the optimal range is [29]
$$F_1(t) = \sum_{k \in G} \big( P_{G_k}(t) - U_{G_k}(t)\,P_{G_k,opt}(t) \big)^2$$
(2)
The cost function for implementing the predefined connection rules between the generator and the bus is
$$F_2(t) = \sum_{k \in G} \sum_{i \in D} \omega_{ki}\,\xi_{ki}(t)$$
where $\omega_{ki}$ is the penalty factor; the connection priority decreases as the penalty factor increases.
(3)
The cost function for reducing non-critical load shedding is
$$F_3(t) = \sum_{i \in D} \sum_{j \in S_{NC_i}} \omega_{ij}\,\big(1 - y_{NC_{ij}}(t)\big)$$
where $\omega_{ij}$ is the penalty factor; the load shedding priority decreases as the penalty factor increases.
(4)
The cost function considering the lifetime of the battery in the ESS is
$$F_4(t) = \sum_{i \in D} \frac{P_{ESS_i}(t)}{P_{ESS_i,cap}}$$
where $P_{ESS_i,cap}$ is the maximum charging and discharging power of the ESS.
Based on the above cost function and constraints, the energy management problem of MEA can be formulated as
$$\min_{\substack{P_{G_k}(t),\,P_{ESS_i}(t) \\ \xi_{ki}(t),\,y_{NC_{ij}}(t)}} \; J = \sum_{t=1}^{T} \big[ \alpha_1 F_1(t) + \alpha_2 F_2(t) + \alpha_3 F_3(t) + \alpha_4 F_4(t) \big]$$
$$\text{s.t.} \quad \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 = 1, \quad \text{Constraints (1)–(15)}$$
where T is the total time horizon of the MEA energy management and $\alpha_1, \alpha_2, \alpha_3, \alpha_4$ are the weight coefficients.
Solving this problem yields the optimal output power of the generators during the flight of the MEA and effectively extends the life of the batteries, which is economically beneficial to the power system of the MEA. It is easy to see that the constructed energy management problem is an MIQP problem, which can be solved by many existing mathematical programming solvers such as CPLEX or Gurobi. Although the optimal solution can be found by such methods under certain conditions, the computation time increases exponentially with the number of integer variables in this NP-hard problem. Therefore, we propose a DRL-based optimization method for real-time energy management with little loss of optimality.
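For reference, a minimal sketch of how the weighted objective J would be evaluated for a candidate schedule; the function name is illustrative, and the default weights are the values later listed in Table 2.

```python
import numpy as np

def total_cost(F1, F2, F3, F4, alphas=(0.001, 0.2495, 0.2495, 0.5)):
    """Weighted multi-objective cost J over a horizon of T time slots.

    F1..F4 : (T,) arrays of the four per-slot cost terms defined above.
    alphas : weight coefficients (they must sum to 1).
    """
    a1, a2, a3, a4 = alphas
    assert abs(a1 + a2 + a3 + a4 - 1.0) < 1e-9
    return float(np.sum(a1 * F1 + a2 * F2 + a3 * F3 + a4 * F4))
```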

3. HDRL-Based Solution

3.1. MDP Formulation

In DRL, the agent constantly interacts with and learns from the external environment. In short, at each time t the agent chooses an appropriate action $a_t$ based on the state $s_t$ of the external environment; the action acts on the environment, whereupon the environment returns a reward $r_t$ and transitions to a new state $s_{t+1}$. By repeating this process, the agent gains "trial and error" experience from interacting with the environment and finds within this experience the optimal strategy for solving the problem. In essence, this process is an MDP, as shown in Figure 2.
In this paper, we need to construct the energy management problem of MEA as an MDP model with state space, action space and reward function.
(1) State space: For the MEA model, the state information provided to the agent by the external environment at time t contains three categories: (i) the operating state $U_{G_k}(t)$ of the generators; (ii) the load demand $P_{L_i}(t)$ of the MEA system, which needs to be normalized; (iii) the state of charge $SoC_{ESS_i}(t)$ of the ESS. Therefore, the state space of the MEA is defined as
$$s_t = \{ U_{G_k}(t),\, P_{L_i}(t),\, SoC_{ESS_i}(t) \}, \quad k \in G,\; i \in D$$
(2) Action space: From the energy management objective of the MEA, the control actions generated by the agent at time t are (i) the output power $P_{G_k}(t)$ of the generators; (ii) the charging and discharging power $P_{ESS_i}(t)$ of the ESS; (iii) the connection strategy $\xi_{ki}(t)$ of the generators to the buses; (iv) the shedding strategy $y_{NC_{ij}}(t)$ of the non-critical loads. Therefore, the action space of the MEA is as follows.
$$a_t = \big\{ a_t^c,\; a_t^d \big\}, \quad a_t^c = \{ P_{G_k}(t), P_{ESS_i}(t) \}, \quad a_t^d = \{ \xi_{ki}(t), y_{NC_{ij}}(t) \}, \quad k \in G,\; i \in D,\; j \in S_{NC_i}$$
where $a_t^c$ and $a_t^d$ are the continuous action space and the discrete action space, respectively.
(3) Reward function: For the discrete-continuous action space, we design two different reward mechanisms. Combining Equation (20), given the current state $s_t$ and action $a_t$, the reward functions from the current state $s_t$ to the next state $s_{t+1}$ are defined as
$$r_t^d(s_t, a_t^d) = -\big( \alpha_2 F_2(t) + \alpha_3 F_3(t) + \rho_d(t) \big)$$
$$r_t^c(s_t, a_t^c) = -\big( \alpha_1 F_1(t) + \alpha_4 F_4(t) + \rho_c(t) \big)$$
Since the objective of RL is to maximize the reward while the energy management of the MEA in this paper is a minimization problem, the cost terms are negated. Equation (23) contains the connection cost of the generators to the buses, the shedding cost of the non-critical loads and the penalty term $\rho_d(t)$. Equation (24) includes the quadratic cost of the generator output power with respect to its optimal operating point, the battery lifetime cost and the penalty term $\rho_c(t)$. When constraints (1)–(15) are satisfied, $\rho_c(t) = \rho_d(t) = 0$; otherwise a sufficiently large constant is assigned to $\rho_c(t)$ and $\rho_d(t)$. If this value is too small, the agent may tolerate behaviour that violates the constraints.
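A compact sketch of the two reward signals, assuming the per-step cost terms have already been evaluated. The penalty constant 1e3 is an assumed placeholder for the "sufficiently large constant" mentioned above, and the function name is illustrative.

```python
def hybrid_rewards(F1, F2, F3, F4, feasible,
                   alphas=(0.001, 0.2495, 0.2495, 0.5), penalty=1e3):
    """Per-step rewards for the discrete (D3QN) and continuous (DDPG) agents.

    F1..F4   : the four cost terms evaluated at the current step
    feasible : True if all operating constraints (1)-(15) are satisfied
    penalty  : assumed magnitude of the constraint-violation penalty
    """
    a1, a2, a3, a4 = alphas
    rho = 0.0 if feasible else penalty
    r_d = -(a2 * F2 + a3 * F3 + rho)   # reward for the discrete agent
    r_c = -(a1 * F1 + a4 * F4 + rho)   # reward for the continuous agent
    return r_d, r_c
```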

3.2. HDRL

In the above MDP formulation, the energy management problem of MEA can be viewed as a control problem with discrete-continuous mixed action. However, existing DRL algorithms deal with such problems by either discretizing the continuous space or converting the discrete space into a continuous space, and these methods can significantly increase the complexity of the action space. Therefore, an HDRL algorithm combining D3QN and DDPG is proposed in this paper to solve the MEA energy management problem of hybrid control.

3.2.1. D3QN

General Q-learning is a widely used RL algorithm for solving problems with small state-action spaces. In this paper, the agent determines the action $a_t^d$ by a policy based on the current state $s_t$. The Q-value update can be expressed as
$$Q(s_t, a_t^d) \leftarrow Q(s_t, a_t^d) + \gamma \Big[ r(s_t, a_t^d) + \mu \max_{a_{t+1}^d} Q(s_{t+1}, a_{t+1}^d) - Q(s_t, a_t^d) \Big]$$
where $\gamma$ is the learning rate and $\mu$ is the discount factor. Since the algorithm stores state-action values in tabular form, it is only applicable to low-dimensional state-action spaces.
To deal with high-dimensional state-action spaces, the DQN algorithm uses a neural network to approximate the Q-value function. In DQN, $Q(s_t, a_t^d; \theta)$ is used to approximate $Q(s_t, a_t^d)$, where $\theta$ are the training weights of the network. DQN has two networks with the same structure: a target network, which is used to calculate the target Q-value, and an estimation network, which is used to estimate the Q-value of the current state. The goal of DQN is to train suitable weights that minimize the loss between the target Q-value and the current Q-value:
$$L(\theta) = \mathbb{E}\big[ \big( y_t^d - Q(s_t, a_t^d; \theta) \big)^2 \big]$$
where $y_t^d$ is the Q-value of the target network, which can be expressed as
$$y_t^d = r(s_t, a_t^d) + \mu \max_{a_{t+1}^d} Q(s_{t+1}, a_{t+1}^d; \theta')$$
where $\theta'$ are the weights of the target network, which are copied from the estimation network at regular intervals.
The DQN algorithm selects and evaluates the Q-value function in terms of the maximum action value, which may lead to an overestimation problem. To solve this problem, the Double DQN algorithm was proposed in [30] to decouple action selection from the evaluation of the target network; that is, instead of finding the maximum Q-value directly in the target network, Double DQN selects the action with the maximum Q-value from the estimation network:
$$y_t^d = r(s_t, a_t^d) + \mu\, Q\Big( s_{t+1}, \arg\max_{a_{t+1}^d} Q(s_{t+1}, a_{t+1}^d; \theta);\, \theta' \Big)$$
To better evaluate the advantage of each action in a given state, the Dueling DQN algorithm was proposed in [31]. The structure of the Dueling network is shown in Figure 3. The main difference from the DQN network structure is that the hidden layer does not output Q-values directly, but outputs the state value function $V(s_t)$ and the action advantage function $A(s_t, a_t^d)$, which are then combined into the Q-value of each action:
$$Q(s_t, a_t^d) = V(s_t) + A(s_t, a_t^d)$$
where the state value function $V(s_t)$ measures the value of the state, and the action advantage function $A(s_t, a_t^d)$ measures the value of a given action relative to the other actions in that state.
In the constructed network structure, it is difficult to uniquely determine $V(s_t)$ and $A(s_t, a_t^d)$ by learning $Q(s_t, a_t^d)$, which leads to poor training of the network, so Equation (29) can be rewritten as
$$Q(s_t, a_t^d) = V(s_t) + \Big( A(s_t, a_t^d) - \frac{1}{|\mathcal{A}|} \sum_{a_t^{d\prime} \in \mathcal{A}} A(s_t, a_t^{d\prime}) \Big)$$
where $|\mathcal{A}|$ denotes the number of executable actions and $\mathcal{A}$ denotes the set of all executable actions.
In this paper, the D3QN algorithm is constructed by applying the Dueling DQN framework on top of Double DQN to deal with the discrete action space. It combines the features of Double DQN and Dueling DQN, mitigating the Q-value overestimation problem while fitting more accurate Q-values.
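To make the two ingredients of D3QN concrete, the following NumPy sketch shows the dueling aggregation and the Double-DQN target computation for a batch of transitions. Array shapes and function names are assumptions; in the paper these operations are realized inside the TensorFlow networks.

```python
import numpy as np

def dueling_q(v, a):
    """Dueling aggregation: combine the state value V(s) (shape: batch x 1) and
    the advantages A(s, a) (shape: batch x |A|) into Q-values, subtracting the
    mean advantage so that V and A are identifiable."""
    return v + (a - a.mean(axis=-1, keepdims=True))

def double_dqn_target(r, q_next_online, q_next_target, mu):
    """Double-DQN target: the estimation (online) network selects the next
    action, the target network evaluates it."""
    a_star = np.argmax(q_next_online, axis=-1)                        # selection
    q_eval = np.take_along_axis(q_next_target, a_star[:, None], axis=-1)[:, 0]
    return r + mu * q_eval                                            # target y_t^d
```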

3.2.2. DDPG

For continuous action control, we use the DDPG algorithm based on the actor-critic architecture proposed by DeepMind. In the DDPG network structure, the actor estimation network $\beta$ approximates the policy function: its input is the state, its output is a deterministic action, and its parameters are $\theta^{\beta}$. The critic estimation network $\psi$ evaluates the quality of the current state-action pair: its inputs are the state and the action, its output is the Q-value of the state-action pair, and its parameters are $\theta^{\psi}$. The critic estimation network is essentially a fit to the Q-function, similar to Equation (25).
To improve the stability and convergence of training, the DDPG structure also introduces an actor target network $\beta'$ and a critic target network $\psi'$. They have the same structure as the corresponding estimation networks but different parameters, $\theta^{\beta'}$ and $\theta^{\psi'}$, respectively. The way the parameters of each network in DDPG are updated is shown below.
The update of the parameters $\theta^{\beta}$ of the $\beta$ network depends on the gradient of its own parameters and the gradient of the action-value function produced by the $\psi$ network. It is given by
$$\nabla_{\theta^{\beta}} J \approx \mathbb{E}\Big[ \nabla_{a^c} Q(s, a^c; \theta^{\psi})\big|_{s=s_t,\,a^c=\beta(s_t)} \, \nabla_{\theta^{\beta}} \beta(s; \theta^{\beta})\big|_{s=s_t} \Big]$$
Equation (31) shows that the $\beta$ network modifies its parameters $\theta^{\beta}$ along the direction that yields a larger Q-value.
The $\psi$ network updates its parameters $\theta^{\psi}$ by minimizing the loss function
$$L(\theta^{\psi}) = \mathbb{E}\big[ \big( y_t^c - Q(s_t, a_t^c; \theta^{\psi}) \big)^2 \big]$$
where $y_t^c = r_t^c + \mu\, \psi'\big( s_{t+1}, \beta'(s_{t+1}; \theta^{\beta'}); \theta^{\psi'} \big)$ is the target Q-value.
The parameters $\theta^{\beta'}$ and $\theta^{\psi'}$ of the $\beta'$ and $\psi'$ networks are updated with the soft update method:
$$\theta^{\beta'} \leftarrow \sigma \theta^{\beta} + (1-\sigma)\,\theta^{\beta'}, \qquad \theta^{\psi'} \leftarrow \sigma \theta^{\psi} + (1-\sigma)\,\theta^{\psi'}$$
where $\sigma \ll 1$ ensures that the target network parameters change slowly, making training more stable and easier to converge.
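The target computation and the soft update can be written compactly as below. This is a sketch over plain lists of weight arrays; the default values of mu and sigma are the discount factor and soft update factor reported later in Section 4.2, and the function names are illustrative.

```python
def critic_target(r_c, q_target_next, mu=0.9):
    """Target value y_t^c for the critic loss, using the target actor/critic
    pair evaluated at the next state."""
    return r_c + mu * q_target_next

def soft_update(online_weights, target_weights, sigma=0.01):
    """Soft (Polyak) update of the target-network parameters:
    theta' <- sigma * theta + (1 - sigma) * theta'."""
    return [sigma * w + (1.0 - sigma) * w_t
            for w, w_t in zip(online_weights, target_weights)]
```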

3.2.3. HDRL Algorithm Process

The HDRL algorithm proposed in this paper uses D3QN to control the discrete actions and DDPG to control the continuous actions, achieving hybrid control among the generators, buses, loads and ESS. The specific HDRL algorithm mechanism is shown in Figure 4. D3QN and DDPG interact with the environment independently: they obtain the same state from the environment and move to the next state through their respective actions and rewards. In our design, the two algorithms are trained alternately and interact with each other; in simple terms, while one is training, the other is fixed and acts as part of the environment.
Considering that the power distribution of the MEA depends on the connection state of the generators, buses and loads, we first train the D3QN and then the DDPG; the complete algorithm procedure is shown in Figure 5. An experience replay pool is introduced to store the data and to make effective use of historical data by sampling small batches for training. To balance exploration and exploitation, an $\varepsilon$-greedy strategy is used: the agent randomly selects an action $a_t$ with probability $\varepsilon$ and otherwise selects $a_t^d = \arg\max_{a_t^d} Q(s_t, a_t^d; \theta)$ and $a_t^c = \beta(s_t; \theta^{\beta}) + N_t$, where $N_t$ is zero-mean Gaussian noise.
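A minimal sketch of the joint action selection rule just described, assuming callables for the Q-network and the actor; the noise standard deviation is an assumed value, since the paper only states that the noise is zero-mean Gaussian.

```python
import numpy as np

rng = np.random.default_rng()

def select_hybrid_action(s_t, q_net, actor, epsilon, n_discrete, noise_std=0.1):
    """Joint action selection for the hybrid agent.

    q_net : callable returning the Q-values over the discrete action space
    actor : callable returning the deterministic continuous action beta(s_t)
    """
    if rng.random() < epsilon:
        a_d = int(rng.integers(n_discrete))        # random exploration
    else:
        a_d = int(np.argmax(q_net(s_t)))           # greedy w.r.t. Q(s_t, .; theta)
    a_c_det = np.asarray(actor(s_t))
    a_c = a_c_det + rng.normal(0.0, noise_std, size=a_c_det.shape)   # N_t
    return a_d, a_c
```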

4. Case Study

The simulation is based on the TensorFlow deep learning framework and Python 3.6, running on an Intel(R) Core(TM) i5-1035G1 CPU with 16 GB RAM.

4.1. System Setup

The tested MEA system consists of two main generators and one APU, and two 270 V HVDC buses; each bus serves three critical loads, three non-critical loads and an ESS, as shown in Figure 1. The parameter settings of the system are summarized in Table 2.

4.2. Model Training

In this paper, the load data are obtained from [11]. In the training process, for D3QN, the initial exploration constant is set to 1 and reduced to 0.05 over the iterations, the learning rate is 0.001, the discount factor is 0.9, the experience replay pool capacity is 500, the mini-batch size is 32, and the target network parameters are copied from the estimation network every 300 steps. The number of neurons in the input layer equals the state space dimension; three hidden layers with 200, 100 and 50 neurons are followed by the Dueling network structure and the output layer, whose number of neurons equals the discrete action space dimension. In DDPG, the actor network and the critic network each contain two hidden layers with 256 and 128 neurons, respectively; the hidden layers use the ReLU activation function, while the output layer of the actor network uses the tanh function. The learning rate is set to 0.001, the discount factor is 0.9, the soft update factor is 0.01, the experience replay pool capacity is 10,000, and the mini-batch size is 32.
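As an illustration of the network topologies just described, a minimal TensorFlow/Keras sketch is given below. The hidden-layer sizes and activations follow the values reported above, while the function names and the use of the Keras functional/Sequential APIs are our own assumptions; the critic network (not shown) would mirror the actor with a state-action input and a linear output.

```python
import tensorflow as tf

def build_d3qn(state_dim, n_discrete_actions):
    """Dueling Q-network with the reported hidden sizes (200, 100, 50)."""
    inp = tf.keras.Input(shape=(state_dim,))
    x = inp
    for units in (200, 100, 50):
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    v = tf.keras.layers.Dense(1)(x)                      # state value V(s)
    a = tf.keras.layers.Dense(n_discrete_actions)(x)     # advantages A(s, a)
    # Dueling aggregation, subtracting the mean advantage.
    q = tf.keras.layers.Lambda(
        lambda t: t[0] + (t[1] - tf.reduce_mean(t[1], axis=-1, keepdims=True))
    )([v, a])
    return tf.keras.Model(inp, q)

def build_ddpg_actor(state_dim, cont_action_dim):
    """Actor network with 256/128 ReLU hidden layers and a tanh output layer."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(cont_action_dim, activation="tanh"),
    ])
```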
As shown in Figure 6, the training process indicates that the proposed algorithm converges faster than DDPG and is more stable than DQN + DDPG. In the initial phase of training, the agent explores the environment, and the reward values obtained in this phase are relatively small and fluctuate. After a certain number of training episodes, the algorithm gradually converges and the rewards eventually stabilize, because the agent has learned the optimal strategy and can choose appropriate actions based on the experience accumulated in the early stages.

4.3. Simulation Results

The neural network model of the well-trained HDRL agent was saved and invoked to test the model under three different MEA cruising conditions. The test results are compared with DQN + DDPG, DDPG and Gurobi, respectively.
Condition 1: Main generator 1 and main generator 2 operate normally and the APU is not involved in the EPS of the MEA, because the main generators always have priority over the APU in connecting to the buses during normal cruise operation. The variation of the charging/discharging power and SoC of each battery shows that the battery is charging when the charging/discharging power is less than zero and discharging otherwise. As can be seen from the battery lifetime cost function, the life of the ESS is related to the charging/discharging power. From Figure 7 it is easy to see that the output power of both main generator 1 and main generator 2 fluctuates around the optimal operating point and that the variation in charging and discharging power stays well within the charge/discharge power limits, thus effectively extending the life of the battery.
Condition 2: Main generator 1 is damaged (t = 8 s) and main generator 2 operates normally, as shown in Figure 8. When main generator 1 fails, according to the load shedding principle and the generator-bus priorities, main generator 2 should provide all the power and loads would then be shed by priority. However, to minimize load shedding, the APU starts to work to compensate for the loss of supply caused by the damage to main generator 1.
Condition 3: Main generator 1 is damaged (t = 8 s) and main generator 2 is damaged (t = 13 s). According to Figure 9, at t = 13 s both main generators have failed and only the APU is available in the MEA power system. Due to its output power range limitation, the APU cannot supply all the loads. Following the load shedding priority principle, part of the load is shed, as shown in Figure 10.
In summary, the method proposed in this paper, DQN + DDPG, DDPG and Gurobi all adapt well to the different environments. To further examine the differences between the proposed HDRL-based method and the others, we vary the management length T = 20, 40, 60 and 80, as shown in Table 3. The proposed method obtains better objective values than DQN + DDPG and DDPG, although slightly larger values than the commercial solver Gurobi. The solution time required by Gurobi increases dramatically as the management length T grows; at T = 80, Gurobi takes 7169.95 s to solve the energy management problem, which can hardly meet the practical requirements of the MEA. In contrast, the proposed method never exceeds 0.2 s for any of the four management lengths, which satisfies the real-time requirement of the MEA.

4.4. On-Line Hardware-in-the-Loop Test

To further validate the effectiveness of the proposed method experimentally, hardware-in-the-loop (HIL) tests were carried out based on OPAL-RT. For energy management, the OPAL-RT testbed was used to implement the power HIL simulation by interfacing the HDRL algorithm part with the energy management part, as shown in Figure 11. To demonstrate that the online experiments can also effectively implement the power distribution of the MEA, the results of the offline tests and the online experiments were compared under different load demands, as shown in Figure 12.
To assess the adaptive capability of the proposed method under different operating conditions of the main generators, two cases were designed in which $G_1$ becomes unavailable at t = 7 s and $G_2$ becomes unavailable at t = 13 s. The real-time power dispatch results, recorded in the laboratory with an oscilloscope, are shown in Figure 13. We can observe that the output power of all three generators is reasonably dispatched and the capacity of the ESS is fully utilized.

5. Conclusions

In this work, we propose a real-time energy management strategy for MEA based on HDRL, which exploits the ability of the D3QN algorithm to handle the discrete action space and of the DDPG algorithm to handle the continuous action space. The energy management problem is constructed as an MDP, solved, and verified through simulation. The simulation results show that the HDRL algorithm obtains an effective energy management strategy under different generator conditions, which not only stabilizes the generator output power around the optimal operating point but also optimizes the charging and discharging power of the ESS and extends its service life. Compared with previous DRL algorithms and the conventional commercial solver Gurobi, the algorithm achieves an orders-of-magnitude improvement in decision time while maintaining a good degree of solution accuracy, which meets the practical operational requirements of MEA. In future research, a distributed deep reinforcement learning strategy will be considered to solve the real-time energy management problem of complex MEA power systems from the perspective of a distributed framework.

Author Contributions

Conceptualization, B.L., B.X. and F.G.; methodology, B.L.; software, B.L. and T.H.; validation, B.L. and T.H.; formal analysis, B.L., B.X., T.H., W.Y. and F.G.; investigation, B.L. and B.X.; data curation, W.Y. and F.G.; writing—original draft preparation, B.L.; writing—review and editing, B.L., B.X. and F.G.; visualization, B.L.; supervision, B.X., W.Y. and F.G.; project administration, B.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Mohamed, M.A.; Abdullah, H.M.; El-Meligy, M.A.; Sharaf, M.; Soliman, A.T.; Hajjiah, A. A novel fuzzy cloud stochastic framework for energy management of renewable microgrids based on maximum deployment of electric vehicles. Int. J. Electr. Power Energy Syst. 2021, 129, 106845. [Google Scholar] [CrossRef]
  2. Sarlioglu, B.; Morris, C.T. More electric aircraft: Review, challenges, and opportunities for commercial transport aircraft. IEEE Trans. Transp. Electrif. 2015, 1, 54–64. [Google Scholar] [CrossRef]
  3. Wheeler, P.; Bozhko, S. The more electric aircraft: Technology and challenges. IEEE Electrif. Mag. 2014, 2, 6–12. [Google Scholar] [CrossRef]
  4. Rosero, J.A.; Ortega, J.A.; Aldabas, E.; Romeral, L.A.R.L. Moving towards a more electric aircraft. IEEE Aerosp. Electron. Syst. Mag. 2007, 22, 3–9. [Google Scholar] [CrossRef]
  5. Chen, J.; Song, Q. A decentralized energy management strategy for a fuel cell/supercapacitor-based auxiliary power unit of a more electric aircraft. IEEE Trans. Ind. Electron. 2019, 66, 5736–5747. [Google Scholar] [CrossRef]
  6. Chen, J.; Song, Q.; Yin, S.; Chen, J. On the decentralized energy management strategy for the all-electric APU of future more electric aircraft composed of multiple fuel cells and supercapacitors. IEEE Trans. Ind. Electron. 2019, 67, 6183–6194. [Google Scholar] [CrossRef]
  7. Li, X.; Wu, X. Autonomous energy management strategy for a hybrid power system of more-electric aircraft based on composite droop schemes. IEEE Trans. Ind. Electron. 2021, 129, 106828. [Google Scholar] [CrossRef]
  8. Mohamed, M.A.; Yeoh, S.S.; Atkin, J.; Hussaini, H.; Bozhko, S. Efficiency focused energy management strategy based on optimal droop gain design for more electric aircraft. IEEE Trans. Transport. Electrific. 2022. Available online: https://ieeexplore.ieee.org/abstract/document/9734043 (accessed on 14 March 2022).
  9. Wang, Y.; Xu, F.; Mao, S.; Yang, S.; Shen, Y. Adaptive online power management for more electric aircraft with hybrid energy storage systems. IEEE Trans. Transp. Electrif. 2020, 6, 1780–1790. [Google Scholar] [CrossRef]
  10. Zhang, Y.; Peng, G.O.H.; Banda, J.K.; Dasgupta, S.; Husband, M.; Su, R.; Wen, C. An energy efficient power management solution for a fault-tolerant more electric engine/aircraft. IEEE Trans. Ind. Electron. 2018, 66, 5663–5675. [Google Scholar] [CrossRef]
  11. Zhang, Y.; Yu, Y.; Su, R.; Chen, J. Power scheduling in more electric aircraft based on an optimal adaptive control strategy. IEEE Trans. Ind. Electron. 2019, 67, 10911–10921. [Google Scholar] [CrossRef]
  12. Zhang, Y.; Chen, J.; Yu, Y. Distributed power management with adaptive scheduling horizons for more electric aircraft. Int. J. Electr. Power Energy Syst. 2021, 126, 106581. [Google Scholar] [CrossRef]
  13. Maasoumy, M.; Nuzzo, P.; Iandola, F.; Kamgarpour, M.; Sangiovanni-Vincentelli, A.; Tomlin, C. Optimal load management system for aircraft electric power distribution. In Proceedings of the 52nd IEEE Conference on Decision and Control, Florence, Italy, 10–13 December 2013. [Google Scholar]
  14. Barzegar, A.; Su, R.; Wen, C.; Rajabpour, L.; Zhang, Y.; Gupta, A.; Lee, M.Y. Intelligent power allocation and load management of more electric aircraft. In Proceedings of the 11th IEEE International Conference on Power Electronics and Drive Systems, Sydney, NSW, Australia, 9–12 June 2015. [Google Scholar]
  15. Zou, H.; Tao, J.; Elsayed, S.K.; Elattar, E.E.; Almalaq, A.; Mohamed, M.A. Stochastic multi-carrier energy management in the smart islands using reinforcement learning and unscented transform. Int. J. Electr. Power Energy Syst. 2021, 130, 106988. [Google Scholar] [CrossRef]
  16. François-Lavet, V.; Henderson, P.; Islam, R.; Bellemare, M.G.; Pineau, J. An introduction to deep reinforcement learning. Found. Trends® Mach. Learn. 2018, 11, 219–354. [Google Scholar] [CrossRef]
  17. Ji, Y.; Wang, J.; Xu, J.; Fang, X.; Zhang, H. Real-time energy management of a microgrid using deep reinforcement learning. Energies 2019, 12, 2291. [Google Scholar] [CrossRef]
  18. Du, G.; Zou, Y.; Zhang, X.; Liu, T.; Wu, J.; He, D. Deep reinforcement learning based energy management for a hybrid electric vehicle. Energy 2020, 201, 117591. [Google Scholar] [CrossRef]
  19. Wu, Y.; Tan, H.; Peng, J.; Zhang, H.; He, H. Deep reinforcement learning of energy management with continuous control strategy and traffic information for a series-parallel plug-in hybrid electric bus. Appl. Energy 2019, 247, 454–466. [Google Scholar] [CrossRef]
  20. Yu, L.; Xie, W.; Xie, D.; Zou, Y.; Zhang, D.; Sun, Z.; Jiang, T. Deep reinforcement learning for smart home energy management. IEEE Internet Things J. 2019, 7, 2751–2762. [Google Scholar] [CrossRef]
  21. Fan, L.; Zhang, J.; He, Y.; Liu, Y.; Hu, T.; Zhang, H. Optimal scheduling of microgrid based on deep deterministic policy gradient and transfer learning. Energies 2021, 14, 584. [Google Scholar] [CrossRef]
  22. Zhu, Z.; Weng, Z.; Zheng, H. Optimal Operation of a Microgrid with Hydrogen Storage Based on Deep Reinforcement Learning. Electronics 2022, 11, 196. [Google Scholar] [CrossRef]
  23. Yu, L.; Qin, S.; Zhang, M.; Shen, C.; Jiang, T.; Guan, X. A review of deep reinforcement learning for smart building energy management. IEEE Internet Things J. 2021, 8, 12046–12063. [Google Scholar] [CrossRef]
  24. Gao, G.; Li, J.; Wen, Y. DeepComfort: Energy-efficient thermal comfort control in buildings via reinforcement learning. IEEE Internet Things J. 2020, 7, 8472–8484. [Google Scholar] [CrossRef]
  25. Huang, C.; Zhang, H.; Wang, L.; Luo, X.; Song, Y. Mixed Deep Reinforcement Learning Considering Discrete-continuous Hybrid Action Space for Smart Home Energy Management. J. Mod. Power Syst. Clean Energy 2022, 10, 743–754. [Google Scholar] [CrossRef]
  26. Ye, Y.; Qiu, D.; Wu, X.; Strbac, G.; Ward, J. Model-free real-time autonomous control for a residential multi-energy system using deep reinforcement learning. IEEE Trans. Smart Grid 2020, 11, 3068–3082. [Google Scholar] [CrossRef]
  27. Huang, Y.; Wei, G.; Wang, Y. Vd d3qn: The variant of double deep q-learning network with dueling architecture. In Proceedings of the 2018 37th Chinese Control Conference (CCC), Wuhan, China, 25–27 July 2018; pp. 9130–9135. [Google Scholar]
  28. Chen, J.; Wang, C.; Chen, J. Investigation on the selection of electric power system architecture for future more electric aircraft. IEEE Trans. Transp. Electrif. 2018, 4, 563–576. [Google Scholar] [CrossRef]
  29. Xu, B.; Guo, F.; Xing, L.; Wang, Y.; Zhang, W.A. Accelerated and Adaptive Power Scheduling for More Electric Aircraft via Hybrid Learning. IEEE Trans. Ind. Electron. 2022. Available online: https://ieeexplore.ieee.org/abstract/document/9714232 (accessed on 15 February 2022).
  30. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016. [Google Scholar]
  31. Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.; Lanctot, M.; Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 1995–2003. [Google Scholar]
Figure 1. Power system architecture of MEA.
Figure 2. Markov decision process.
Figure 3. Dueling network structure.
Figure 4. HDRL algorithm design mechanism.
Figure 5. HDRL algorithm procedure.
Figure 6. HDRL training process.
Figure 7. Condition 1 (main generators are normal).
Figure 8. Condition 2 (main generator 1 damaged).
Figure 9. Condition 3 (main generators are damaged).
Figure 10. Non-critical loads connection status.
Figure 11. Hardware-in-the-loop test for the proposed approach.
Figure 12. Comparison of offline and online simulation.
Figure 13. Online energy management results obtained by the proposed method.
Table 1. Priority of connection between generator and bus.

Bus     1st Priority    2nd Priority    3rd Priority
Bus1    Main Gen1       Main Gen2       APU
Bus2    Main Gen2       Main Gen1       APU
Table 2. System parameters.

Parameter: Value
$\underline{P_{G_k}}, \overline{P_{G_k}}$: 100, 600
$P_{G_1,opt}, P_{G_2,opt}, P_{G_3,opt}$: 480, 480, 480
$\overline{P_{B_i}}, \overline{P_{B_{ki}}}$: 600
$\underline{P_{ESS_i}}, \overline{P_{ESS_i}}$: −100, 100
$\underline{SoC_{ESS_i}}, \overline{SoC_{ESS_i}}$: 0.2, 1
$\tau, T$: 1, 20
$\eta_{B_{ki}}$: 0.98
$\varepsilon_c, \varepsilon_{disc}$: 0.95
$SoC_{ESS_i}(0)$: 0.5
$E_{ESS_i}$: 400
$P_{ESS_i,cap}$: $10^9$
$\alpha_1, \alpha_2, \alpha_3, \alpha_4$: 0.001, 0.2495, 0.2495, 0.5
Table 3. Comparison of the performance of the proposed method with DQN + DDPG, DDPG and Gurobi.

Management length T (time slots)      20            40            60            80
Number of integer variables           420           840           1260          1680
Number of continuous variables        400           800           1200          1600
Optimal objective value
  DDPG                                11.90         21.69         31.24         38.36
  DQN + DDPG                          13.36         22.45         31.56         40.87
  Gurobi                              8.32          16.66         24.64         31.34
  Proposed                            8.65          18.71         26.41         35.87
Solution time (s)
  DDPG                                1.73 × 10^-1  1.81 × 10^-1  1.83 × 10^-1  1.90 × 10^-1
  DQN + DDPG                          1.75 × 10^-1  1.78 × 10^-1  1.85 × 10^-1  1.95 × 10^-1
  Gurobi                              13.42         109.42        2270.09       7169.95
  Proposed                            1.67 × 10^-1  1.72 × 10^-1  1.86 × 10^-1  1.88 × 10^-1
Times (Gurobi/Proposed)               80.36         636.16        12,204.78     38,138.03
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
