Power Battery Scheduling Optimization Based on Double DQN Algorithm with Constraints

Xiong, Haijun; Chen, Jingjing; Rong, Song; Zhang, Aiwen

doi:10.3390/app13137702

Open AccessArticle

Power Battery Scheduling Optimization Based on Double DQN Algorithm with Constraints

by

Haijun Xiong

^1,2,

Jingjing Chen

^1,*,

Song Rong

³ and

Aiwen Zhang

¹

Department of Computer Science, North China Electric Power University, Baoding 071051, China

²

Hebei Key Laboratory of Knowledge Computing for Energy & Power, Baoding 071066, China

³

Shenke Technology Group Co., Ltd., Shijiazhuang 052360, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(13), 7702; https://doi.org/10.3390/app13137702

Submission received: 21 May 2023 / Revised: 9 June 2023 / Accepted: 25 June 2023 / Published: 29 June 2023

(This article belongs to the Topic Battery Design and Management)

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

Power battery scheduling optimization can improve the service life of the battery, but the existing heuristic algorithm has poor adaptability, and the capacity fluctuates significantly in the cycle aging process, which makes it easy to fall into the local optimal. To overcome these problems, we take the battery cycle life maximization as the goal, propose a reinforcement learning scheduling optimization model with temperature and internal resistance difference constraints, so as to determine whether to charge or discharge during battery cycle aging. We do this using the deep−learning−based battery capacity estimation model as the learning environment for the agent, using the Double DQN algorithm to train the agent, and proposing the principal component analysis method to reduce the dimension of the state space. These experiments, using multiple publicly available battery aging data sets, show that the principal component analysis method and the constraint functions reduce the computational time to find the optimal solution, providing the possibility of obtaining larger reward values. Meanwhile, the trained model effectively extends the cycle life of the battery, and has good adaptivity. It can automatically adjust parameters with the battery aging process to develop optimal charging and discharging protocols for power batteries with different chemical compositions.

Keywords:

double DQN; scheduling optimization; constraints; cycle life; adaptability

1. Introduction

The power battery has played an essential role in many intelligent systems. Taking battery electric vehicles as an example, power batteries provide driving power for the whole vehicle, which is the power source of electric vehicles. The driving range, economy, and safety of electric vehicles are all determined by the performance of power batteries. The development of power batteries is also a key constraint for electric vehicles’ large−scale and rapid development.

As the core component of various power systems, power batteries must function stably to ensure the reliability and security of the entire system. However, with recycling, battery degradation and battery failure are very common. The ability of a battery to provide energy gradually decreases as its lifespan increases. Once the power battery degrades below the necessary operating threshold, it no longer performs its expected functions and is prone to failure; this will not only affect the regular operation, resulting in high maintenance costs, but also may pose significant security risks, resulting in disastrous consequences. Therefore, it is exceedingly essential to accurately and reliably evaluate the aging situation of batteries and create timely scheduling management strategies [1].

Many studies have been conducted to improve battery performance by proposing various equalization means to solve battery inconsistencies or to monitor the battery aging state. Battery equalization hopes to bring the energy of the battery to a balanced state and guarantee that the difference between the batteries in the operating condition is in a safe range, with a total of three methods of battery selection, passive equalization, and active balancing, which are achieved: by designing the circuit [2]; from the perspective of a single battery; the battery state of capacity (SOC) estimation [3]; the state of health (SOH) representation [4]; and the remaining useful life (RUL) [5] are monitored and predicted from various perspectives. However, the battery will only be replaced once it reaches the level of obsolescence. The aging process will not process the battery, so dynamic planning cannot be achieved.

Therefore, from the perspective of decision−making, in the context of battery inconsistency, the optimization of power battery management in different scenarios is automatically achieved through a method with simple models, rapid feedback, and uncomplicated dynamic adjustment using relevant data. It can effectively prolong battery life, reduce system complexity, and improve computing efficiency. At present, the literature related to power battery scheduling optimization can be divided into three categories: hardware circuit design [6], heuristic algorithm [7], and reinforcement learning algorithm [8].

The battery management system is vital to the battery power supply, and its technology is relatively complex. Some scholars design circuits and optimize communication protocols from the perspective of hardware. Xu et al. [9] proposed a distributed battery management system, created the central control module and sampling module, analyzed the health status, developed a test platform, and verified the compatibility and stability of the design. Hedra et al. [10] proposed a novel power management strategy to manipulate the power control unit (PCU) of the battery by fuzzy logic control (FLC) and validated and experimented with a simulation using MATLAB software. The proposed strategy can reduce energy loss and improve battery life. Jan et al. [11] designed and simulated a two−stage hybrid structure of a battery supercapacitor (SC). An easy−to−implement control algorithm based on the decision was adopted to share the power delivery capacity of the hybrid battery supercapacitor in proportion, effectively dealing with light load changes and realizing economic advantages in the industry. Although the previous work in the literature has proposed battery scheduling optimization schemes in terms of hardware, it is still a challenge to realize these schemes in real life, owing to the ever−popular new technology and the development of the general trend of data exchange.

Since non−heuristic algorithms blindly search for information and do not consider the characteristics of the problem, many studies are based on heuristic algorithms. Heuristic algorithms can provide feasible solutions for each instance of the combinatorial optimization problem to be solved at an acceptable expense and are more economical and efficient in algorithm design compared to hardware design [12]. Currently, heuristic algorithms are more widely applied in battery scheduling optimization. Navin et al. [13] proposed a day−ahead dispatch optimization model based on power cells, using an integrated grid containing wind farms, solar PV, and battery storage for problem simulation, and an improved artificial bee colony algorithm (ABC), to study the optimal dispatch of battery storage with operational constraints in high−voltage power systems, improving the performance of the battery. Quan et al. [14] proposed a set of optimal battery pack charging control strategies based on the leader−follower framework, obtained the optimal average charging state trajectory based on the nominal battery model through multi−objective optimization, established online model deviation compensation, and effectively improved the robustness. Liu et al. [15] proposed a synergistic optimization method for the battery’s entire life cycle and life maximization, established a scheduling model consisting of a host of nonconvex nonlinear programming problems, designed equivalent problem transformation and acceleration algorithms, promoted the lifetime profit. Traditional optimization methods and heuristic algorithms, such as linear programming and genetic algorithms, have high convergence speed stability and solving effect. However, they cannot effectively deal with high−dimensional space, and the algorithms are too much influenced by their parameters and easily fall into the local optimum.

A robust battery scheduling technique should be able to respond to different applications and different underlying architectures and determine the optimal management policy on time. A promising approach is dynamic battery scheduling optimization based on reinforcement learning (RL); by estimating the time−out feedback from the external environment, the agent learns to augment itself from environmental transactions. RL has shown its potential in scheduling optimization and has been successfully applied to power battery scheduling management [16,17].

Mbuwir et al. [18] formulated the problem of using the battery to provide flexibility as a sequential decision problem under uncertainty, while the article proposed a batch reinforcement learning algorithm based on Q−fitting iterations to help solve the battery supply flexibility problem for electricity demand. Sui et al. [19] took different battery types and application scenarios as environmental models. They proposed a battery scheduling framework based on deep reinforcement learning (DRL) to solve the problem of battery scheduling and prolong the service life of the battery as much as possible. Huang et al. [20] coordinated the security control algorithm using the serial strategy with the PPO algorithm in deep reinforcement learning. PPO agent safely implemented the capacity scheduling strategy of the photovoltaic cell system and increased the cumulative net revenue. Gao et al. [8] implemented the deep deterministic policy gradient algorithm (DDPG) to design a model to determine the optimal real−time charging and discharging power of the battery to achieve scheduling control, which brought lower operating costs while providing sufficient exchangeable battery.

Data and model−driven deep reinforcement learning (DRL) algorithms can correctly identify complex timeless environments and provide optimal decisions. Applying DRL to battery scheduling optimization strategies and decision−making, through well−trained DNNs to form a complete sensing and decision system, can effectively drive optimal scheduling toward reliability and accuracy.

In this study, we first construct the environment of reinforcement learning, namely the battery aging model. We apply a long short−term memory neural network (LSTM) to extract temporal features of continuous data and develop health indicators closely related to battery capacity. The Gaussian regression process (GPR) is proposed to mine the dependency between battery SOC and health indicators, and model verification is performed to ensure that the environment responds to the agent promptly. In the reinforcement learning algorithm section, we propose the double DQN method to find the optimal solution for battery charging and discharging scheduling, expecting the number of battery cycle life to reach the maximum. To alleviate the problem of the “dimensional curse”, principal component analysis is proposed to reduce the dimensional state space. It saves computation time by projecting the high−dimensional continuous state set into a lower−dimensional feature space. We add the temperature and internal resistance difference constraints to the reward function of reinforcement learning, which provides direction for the agent. The results indicate that the model extends the number of aging battery cycle lives, and that it can adjust the parameters automatically through the reward and punishment mechanism to adapt to different batteries.

The main contributions of the method are as follows:

(1) An optimization model of a battery aging strategy based on reinforcement learning is proposed, which can dynamically provide an optimization strategy for subsequent stages at any aging stage and enhance the model’s adaptability.

(2) A new method is proposed to optimize the power battery scheduling problem using principal component analysis and double DQN, enhancing the battery’s charging and discharging cycles in the training set.

(3) Adding temperature and internal resistance difference constraints in the reward function of the reinforcement learning model, strengthening the training efficiency.

This paper is structured as follows: Section 2 introduces a mathematical description of the optimization problem. In Section 3, we describe the battery aging environment model and conduct experiments to evaluate the accuracy of the model in describing battery aging behavior. The double DQN algorithm for solving the optimization problem is presented in Section 4. Section 5 explains the experimental results and discussion and, finally, Section 6 presents the conclusion.

2. Mathematical Description of Optimization Problem

The optimization problem of power battery scheduling needs to consider multiple aspects comprehensively, including current, voltage, battery internal resistance, battery capacity, battery temperature, etc. The objective to be optimized is battery cycle life, and other factors to be considered can serve as constraint conditions in the process of battery charging and discharging scheduling. The mathematical model we constructed is as follows:

m a x f (m, x) s . t . \{\begin{matrix} I_{m i n} \leq I_{i} \leq I_{m a x} \\ 0 \leq V_{i} \leq V_{m a x} \\ {I R}_{m i n} \leq {I R}_{i} \leq {I R}_{m a x} \\ {∆ I R}_{m i n} \leq {∆ I R}_{i} \leq {∆ I R}_{m a x} \\ {T e m}_{m i n} \leq {T e m}_{i} \leq {T e m}_{m a x} \end{matrix}

(1)

where,

m = [{p o i n t}_{d a t a}, {t i m e}_{t e s t}, {t i m e}_{d a t a}, {t i m e}_{s t e p}, {i n d e x}_{s t e p}, {i n d e x}_{c y c l e}, I, V, {e n e r g y}_{c h a r g e}, {e n e r g y}_{d i s c h a r g e}, \frac{d V}{d t}, I R, T e m]

(2)

x = g (m)

(3)

where

x

represents the battery capacity. This study solves the optimization problem on the basis of a public battery aging data set, so the optimization variables are the column attributes of the data file. The optimization goal is to maximize the number of charging and discharging cycles. In this paper, we use reinforcement learning to solve the optimization problem. After the double DQN algorithm training is finished, the objective function is also obtained. We input the variables

m

and

x

to obtain the maximum cycle life and a series of action decisions in this case. Therefore, the whole training process of the double DQN algorithm is the process of finding the objective function

f (m, x)

, which will be introduced in Section 4.1. Meanwhile, we must build an environment model when using reinforcement learning algorithms. Battery aging is the result of many factors. In this area of research, there is no universal battery aging model. In the column attributes of the data file, there is an unknown functional relationship between battery capacity and other columns. Therefore, we established the deep−learning−based battery capacity estimation model. The trained battery aging model is the function

g (m)

, which will be described in Section 3. In addition, the constraints provide the basis for the reward function and action space of reinforcement learning, and the optimization variables provide the direction for the state space. Section 4.3 describes this relationship.

The charging and discharging protocols of battery aging experiments limit the breadth of the dataset, further limiting the constraint values of the optimization model. The given constraints’ included current interval is

I_{m i n} = - 4.4 A, I_{m a x} = 8.8 A

, where: positive current means charging, and negative current means discharging; the maximum permissible voltage is

V_{m a x} = 3.6 V;

the internal resistance interval is

{I R}_{m i n} = 0 Ω, {I R}_{m a x} = 0.1 Ω;

the internal resistance difference interval is

{△ I R}_{m i n} = 0 Ω, {△ I R}_{m a x} = 1 \times 10^{- 4} Ω

; and the temperature interval is

{T e m}_{m i n} = 30 ° C, {T e m}_{m a x} = 40 ° C

.

3. Battery Model and Verification

Next, we review the data−driven battery model used in this study. The data−driven approach can model without understanding the complex electrochemical properties of the battery and by relying on only a large number of aging data. Hence, the data−driven approach is emerging as another practical approach in battery health state estimation. This section will introduce the data set and the battery aging model, and then verifies the model.

3.1. Dataset

The data set adopted in this study is the substantial and easy−to−use battery cycle aging data set, released in 2019 by the Toyota Research Institute in cooperation with MIT and Stanford University [21]. Cyclic aging experiments were conducted on 124 commercial lithium–ion batteries. Cyclic aging occurs in the process of battery charging or discharging. It directly results from various factors such as service mode, temperature condition, and battery current. The battery consists of a LiFePO4 (LFP) positive electrode and a negative graphite electrode, with a nominal capacity of 1.1 Ah and a nominal voltage of 3.3 V.

All cells in this dataset use a one−step or two−step fast−charging strategy, as shown in Figure 1. The format of the policy is “C1 (Q1) − C2 (80%) −1 C (3.6 V) − 3.6 V (1/50 C)”, this means that the SOC is first charged from 0% to Q1% of the nominal capacity of the battery at a C1 multiplier, then from Q1% to 80% of the SOC at a C2 multiplier. At a 1 C multiplier to a voltage of 3.6 V, then at a constant voltage of 3.6 V to a current of 1/50 C, the battery capacity reaches 100% of the SOC, called a whole charge process. After the whole charging process is completed, the constant current and constant voltage discharge process starts at an interval of 60 s. Discharging at a 4 C rate to a voltage of 2 V, and then discharging at 2 V constant voltage to a current of 1/50 C, is known as a whole discharge process. One full charge and one full discharge is called a small cycle. At the end of each small cycle, measure its SOH, SOH = current discharge capacity/nominal capacity, and end this cycle aging process when SOH = 80%.

The capacity degradation curves of the same batteries under different charging and discharging protocols are shown in Figure 2. Under different aging conditions, the same battery exhibits different capacity degradation patterns. This paper uses data ranging from new cells to 20% degradation.

3.2. Regression Model Based on Deep Learning

Since early battery degradation does not immediately lead to capacity degradation but is reflected in the discharge voltage curve, incremental capacity (

I C

) curve analysis is considered an effective method to study battery degradation mechanisms [22], as shown in Figure 3. The graph reveals that there is no complete way to describe the

I C

curve using simple linear functions. Therefore, it is advisable for us to select health indicators (HIs) with actual physical correlation to the aging state as information in deep learning, which can calculate capacity Q more accurately.

For data−driven capacity prediction, the extraction of HI is an important step in determining the estimation accuracy and reliability [23]. Select CH5, CH11, CH17, CH22, and CH32 files in the original data set. The

d Q / d V

corresponding to voltage is calculated according to Equation (4) to obtain the

I C

curve of the loop. Calculate the variance(var), skewness(ske), and kurtosis(kur) as battery health indicators to assess the health status of the battery [24,25]. Furthermore, in order to explore the relationship between the proposed HIs and the actual capacity, the Pearson correlation coefficient (PCC) and Spearman correlation coefficient (SCC) are used for correlation analysis [26].

I C = \frac{d Q}{d V} = \frac{△ Q}{△ V} = \frac{Q_{k + 1} - Q_{k}}{V_{k + 1} - V_{k}}

(4)

The linear correlations between the extracted health indicators var, ske, kur, and capacity are calculated using the equations and battery data files, as shown in Table 1. PCC analysis is a better method to reflect the linear correlation between HIs and capacity, while SCC analysis is better to assess the monotonicity between HIs and capacity. The results indicate that the variance, skewness, and kurtosis of the

I C

curve have high correlations with capacity (both PCC and SCC are more significant than 0.9), which can reflect the attenuation characteristics of battery capacity well.

The neural network layer of deep learning covers a wide range by transmitting the known information in the learning dataset, which can theoretically map to arbitrary functions and can solve very complex problems. This paper passes a pre−trained degradation model of HIs sequences to a test battery using LSTM. Figure 4a shows the structure of the LSTM network model used in this article, including the input layer, hidden layer, fully connected layer, and output layer. The apparent advantage of the LSTM chosen is that it can extract the backward and forward dependencies between samples and better handle time series problems, alleviating the “gradient disappearance” disadvantage caused by backpropagation during training [27,28].

Figure 4b shows the detailed structure of the LSTM used in this paper. As shown in the figure, the forget gate retains the information that the previous memory is beneficial for this output and discards the useless information; the update gate is responsible for generating new important information, which is related to the past memory and the current memory, whilst the output gate generates the final information to be output. The network parameters are continuously adjusted in order to make the output optimal. The output HI is expressed as follows:

y (t) = W_{y} y_{d} (t) + b_{y}

(5)

We define some parameters of LSTM as follows: dropout is 0.05, the learning rate of model training is 0.0008 and the epoch is 300, the number of cells in the layer is 15, each cell has 64 units, each unit has 4 fully connected layers, and the input sequence is 5.

3.3. Capacity Estimation Model Based on GPR

Gaussian process regression is a nonparametric probability estimation model that extends multivariate Gaussian distribution and has advantages in solving regression and classification problems due to Bayesian−based optimization [29,30]. GPR has many applications in battery SOH and charging state estimation, all of which have yielded satisfactory results [31,32]. Therefore, this paper adopts Gaussian process regression to build a capacity estimation model. In general, the relationship can be written as follows:

y = f (x) + ε, ε ~ N (0, σ_{n}^{2})

(6)

where

x

is input,

y

is output,

ε

is the noise variable with variance

σ_{n}^{2}

,

f (x)

is a linear function, and its probability distribution is:

f (x) ~ G P (μ (x), κ (x, x_{*}))

(7)

where

μ (x)

is the mean function

E [f (x)]

and

κ (x, x_{*})

is the kernel function. The expression is as follows:

κ (x, x_{*}) = σ_{f}^{2} \exp (- \frac{{||x - x_{*}||}_{2}^{2}}{2 l^{2}})

(8)

where

l

is the characteristic length scale. The kernel matrix of the training set is calculated according to the kernel function. For the test set

X_{*}

, the prediction vector

f_{*} (x)

is expected to be obtained. The model assumption for Gaussian process regression is:

[\begin{matrix} y \\ f_{*} (x) \end{matrix}] ~ G P (0, (\begin{matrix} κ (x, x) + σ_{n}^{2} I_{n}, κ (x, x_{*}) \\ κ (x_{*}, x), κ (x_{*}, x_{*}) \end{matrix}))

(9)

where

I_{n}

is an n−dimensional unit matrix. According to the Bayesian regression method, the posterior probability is found and the expression for the prediction point is determined. The right side of Equation (9) is simplified as

G P (0, [\begin{array}{l} κ_{y} & κ_{*} \\ κ_{*}^{T} & κ_{* *} \end{array}])

:

P (f_{*} (x)| x_{*}, x, y) = G P (μ_{*}, \sum *)

(10)

where

μ_{*} = κ_{*}^{T} κ_{y}^{- 1} y; \sum * = κ_{* *} - κ_{y}^{- 1} κ_{*}

The predicted mean

\bar{y^{*}}

and predicted covariance values

cov (y^{*})

are expressed as follows:

\bar{y^{*}} = {κ_{*}^{T} [κ_{y} + σ_{n}^{2} I_{n})]}^{- 1} y

(11)

cov (y^{*}) = κ_{* *} - κ_{*}^{T} {{[κ}_{y} + σ_{n}^{2} I_{n})]}^{- 1} κ_{*}

(12)

Finally, the final predicted value can be obtained by calculating

\bar{y^{*}}

and 95% confidence interval. Based on this, the GPR model of capacity estimation is established. The input of this model is the HI of each cycle in the battery aging data, and the output is the corresponding battery capacity.

In establishing the battery aging model, the HI is first sought and the correlation coefficient and experiment validated the selected HI. Then, construct the HI degradation prediction model based on LSTM, with the original dataset as input and the corresponding HIs as output. Build the battery capacity estimation model based on GPR, input the predicted HIs, and output the battery capacity. We use this battery aging model as a reinforcement learning environment. The process of constantly training the battery aging model is the process of constantly finding the Equation (3) function

g (m)

.

3.4. Model Verification

An appropriate experimental validation phase is commonly required to assess the model’s accuracy in describing the battery behavior. Experimental data evaluate the results of battery life prediction and degradation prediction. The prediction accuracy of the LSTM is first verified, followed by an analysis of the feasibility of the HI degradation model prediction and, finally, the capacity estimation model is evaluated. The root mean square error (RMSE) and the mean absolute error (MAE) [33] are used as measures of the evaluation accuracy of the proposed method.

Figure 5a shows the ability of the LSTM to predict health indicators. The LSTM can mine past input information and better learn the time−domain characteristics of the data. Experimental results show that MAE and RMSE are less than 10%, which is accurate for predicting health indicators. The statistical results for the battery input of three HIs and input of two HIs, and a single HI, are shown in Figure 5b. The goal of capacity estimation with a single HI is to evaluate the performance of that HI. The results indicate that MAE and RMSE gradually decrease with increased health indicators. The combination of multiple HI can compensate for the shortcomings of a single HI, resulting in a narrower error distribution. Overall, RMSE and MAE are less than 5%, indicating that they strongly correlate with capacity and can be used for capacity estimation. Combined with the HI degradation prediction model and capacity estimation model, the test set is used for verification. Figure 5c shows the statistical results of partial dispersion points utilizing this strategy, indicating that the capacity prediction strategy is effective. Through confirmation, we can learn that the output prediction capacity error is within the acceptable range and can guarantee a good accuracy to predict the capacity by the trained model.

4. Methodology

4.1. Deep Reinforcement Learning Based on Double DQN

Reinforcement learning is a branch within machine learning that emphasizes exploring action learning based on the environment [34]. As shown in Figure 6, the Markov decision process is a classical mathematical framework for describing reinforcement learning problems. Firstly, the agent perceives the state S_t at the current moment t, according to strategy π, chooses an action from the action space, and executes it. The environment will feedback the action selected by the agent and judge the correctness of the action choice of time t with the reward function. The agent enters a new state S_t+1 and continues to select actions. In order to weaken the vision limitations of agent, considering the impact of future rewards on existing choices, the reward at time

t

can be written as:

R_{t} = r_{t + 1} + {γ r}_{t + 2} + \dots = \sum_{k = 1}^{\infty} γ^{k} r_{t + k + 1}

(13)

The goal of reinforcement learning is that the agent finds the optimal strategy

π_{*}

that brings the maximum reward through continuous learning. The optimal strategy consists of the actions chosen at each step. We assign a value to each action, writing them as

Q (s, a)

, which represents the Q value obtained by the agent when selecting action a in state s. The larger the Q value, the greater the probability that the action will be selected by the agent, and the greater the probability that the agent will bring a greater reward in the future after selecting this action.

Q^{*} (s_{t}, a_{t}) = {a r g m a x}_{π} Q^{π} (s_{t}, a_{t})

(14)

The reward function in reinforcement learning is essentially a long−term feedback. The Q value is easier to calculate compared to the reward function in each step. Initially, the Q−learning algorithm used the exhaustive method to fit the Q value, but it could only represent limited state action space. There is a risk of falling into the “curse of dimensionality”. In 2015, the DeepMind team combined deep neural networks with reinforcement learning to form the DQN algorithm [35]. Reinforcement learning has entered a new world.

The overall idea of the double DQN algorithm adopted in this paper is that, starting from the initial state, the agent always selects the action with the highest Q value until the target state. The model outputs the reward value and, finally, the optimal strategy.

To calculate the Q value, the double DQN algorithm uses the neural network as function approximator to fit the Q value, allowing for continuous state spaces. The input is the current state of the agent, and the output is the Q value of each action. The parameters are trained until convergence.

Q^{^} (s, a | θ) \approx Q (s, a)

(15)

where

θ

is the neural network parameter. Neural network training should have sufficient samples, so there is an experience pool in the double DQN algorithm. During initialization, the agent randomly selects the action and stores results in the form of quads (s, a, r, s′) in the experience pool. When training the neural network, randomly sampling a batch of data from the empirical pool, which is called experience replay, alleviates the correlation between the data. In addition, a target network with stable training effect is added to prevent the original network overfitting. Therefore, there are two networks with identical structures and different parameters in the double DQN algorithm, which are called policy network and target network, respectively. A fixed goal can ensure the stability of the network obtained by learning.

Set the policy network and target network parameters to

θ

and

θ^{'}

, respectively. Randomly take a quadruple

(s_{t}, a_{t}, {r_{t}, s}_{t + 1})

from the experience pool; the policy network selects an action that maximizes the output of the neural network according to Equations (16) and (17).

y = Q (s_{t}, a_{t}; θ_{t})

(16)

a^{*} = {a r g m a x}_{a \in A} Q (s_{t + 1}, a; θ_{t})

(17)

Neural network parameters are iterated by the temporal difference method. Through forward propagation of the neural network, target network calculates the value of the agent’s selection of action

a^{*}

under the state

s_{t + 1}

, and obtains the target and error of the temporal difference iteration method.

y^{^} = R_{t} + γ Q (s_{t + 1}, a^{*}; θ_{t}^{'})

(18)

{t d}_{e r r o r} = y^{^} - y

(19)

The loss function is defined as Equation (20), and the neural network is backpropagated to obtain gradient

\nabla_{θ} Q (s_{t}, a_{t}; θ_{t})

.

L (θ) = \frac{1}{2} {[y^{^} - y]}^{2}

(20)

At this point, the parameters of policy network are updated by gradient descent, and the parameters of target network are updated by weighted average.

θ_{t + 1} \leftarrow θ_{t} - α \cdot {t d}_{e r r o r} \cdot \nabla_{θ} Q (s_{t}, a_{t}; θ_{t})

(21)

θ_{t + 1}^{'} \leftarrow τ \cdot θ_{t}^{'} + (1 - τ) \cdot θ_{t + 1}

(22)

where

τ \in (0,1)

is a hyperparameter. After the agent starts to learn, every time the state changes, the parameters of the neural network will be updated until the agent reaches the final state. Figure 7 shows the double DQN structure. The policy network parameter parameters are updated with the step size, and the target network parameters are updated every N step, copying the policy network parameters at that time and reducing the model complexity. Section 4.4 shows the detailed CNN structure used in this study. The structure consists of two convolutional layers, two pooling layers, two fully connected layers and, finally, a softmax function to output the result, following the model step update. The Double DQN algorithm alleviates the overestimation problem of DQN by changing the calculation method of the target value [36]. The underlying principle is that using another function to choose the action that cannot reach the level of the original function max, can effectively limit the growth of overestimation. Different networks perform different things and the effect may be improved.

4.2. Principal Component Analysis Algorithm

Although deep reinforcement learning has the apparent advantage of avoiding the Bellman dimension curse, it has also been found, in various studies, that appropriate feature selection procedures can distinctly improve performance. Hence, it is necessary to downscale the dimensionality of the high−dimensional continuous state space with principal component analysis (PCA), and transform it into a simplified observation space.

Suppose that our time series data have m rows of samples with a number of features n, denoted as the matrix

A_{m \times n}

. The original data are averaged and decentered to obtain the matrix

D a t a A d j u s t_{m \times n}

. Then, calculate the

D a t a A d j u s t_{m \times n}

covariance matrix among multiple variables.

C = (\begin{matrix} c o v (x_{1}, x_{1}) c o v (x_{1}, x_{2}) c o v (x_{1}, x_{3}) \dots c o v (x_{1}, x_{15}) \\ c o v (x_{2}, x_{1}) c o v (x_{2}, x_{2}) c o v (x_{2}, x_{3}) \dots c o v (x_{2}, x_{15}) \\ c o v (x_{3}, x_{1}) c o v (x_{3}, x_{2}) c o v (x_{3}, x_{3}) \dots c o v (x_{3}, x_{15}) \\ \dots \\ c o v (x_{15}, x_{1}) c o v (x_{15}, x_{2}) c o v (x_{15}, x_{3}) \dots c o v (x_{15}, x_{15}) \end{matrix})

(23)

where

cov (X, Y) = E X Y - E X * E Y

, and

E X

and

E Y

refer to the expectation of

X

and

Y

, respectively.

Solve the eigenvalues and eigenvectors of the covariance matrix, arrange and output the eigenvalues in descending order, and obtain the number of principal components to be extracted by calculating the contribution degree and cumulative contribution degree of the eigenvalues:

|λ E - C| = 0

(24)

μ_{i} = \frac{λ_{i}}{\sum_{i = 1}^{n} λ_{i}}

(25)

where

E

is the unit matrix,

μ_{i} (i = 1,2, \dots, n)

is the contribution degree, and

λ_{i} (i = 1,2, \dots, n)

is the non−negative eigenvalue of the covariance matrix

C_{m \times n}

. The calculated contribution degree and cumulative contribution degree are shown in Figure 8. The incremental contribution degree of the first three principal components exceeds 80% (84.28%), so it can be considered that the first three principal components contain more than 80% of the original data information. The multiple characteristic parameters can be replaced by eliminating correlation [37].

PCA has the benefit of reducing the input size of the policy network in the double DQN algorithm, which can accelerate the training process. It is clear that some of the constraints associated with aging, such as resistance variation, may not be held in the model since the factor varies with too little accuracy throughout the cycle for the computational performance to reach that level [38]. Figure 8 depicts the results of the principal component analysis on the battery model, where the general state evolution can be compressed into three main components that explain 84.28% of the state information. In the overall results graph of the principal component model on the right, the elliptical confidence interval indicates the PCA score plot, which shows that the closer the sample points, the higher the similarity; the arrows correspond to the PCA loadings diagram, and the values of the original variables, corresponding to the arrows in the direction of projection to the corresponding axes, can reflect the correlation between the variable and PC1, PC2, and PC3, respectively. For example, the current enormously contributes to PC2, while dV/dt positively correlates with PC3.

4.3. The Relationship between the Optimization Problem and Double DQN

The double DQN algorithm is used to determine the function

f (m, x)

, which is a method for collaborative tasks such as target detection [39] and internet scheduling [40]. We define the state space according to Equations (1) and (2), as follows:

S = [{p o i n t}_{d a t a}, {t i m e}_{t e s t}, {t i m e}_{d a t a}, {{t i m e}_{s t e p}, i n d e x}_{s t e p}, {i n d e x}_{c y c l e}, c u r r e n t, v o l t a g e, {c a p a c i t y}_{c h a r g e}, {c a p a c i t y}_{d i c c h a r g e}, {e n e r g y}_{c h a r g e}, {e n e r g y}_{d i a c h a r g e}, \frac{d V}{d t}, I R, T e m p e r a t u r e]

(26)

It is a high dimensional continuous space. Therefore, principal component analysis is chosen for dimensionality reduction, which is transformed into three principal components and then input into the double DQN algorithm. Since whether to charge the battery is highly operational in reality, and the number of charging protocols in the dataset limits our decision accuracy, we chose to schedule charging, discharging, and rest during the battery cycle aging process. The action space is written as:

A = [0, 1, 2]

(27)

where 0 is for resting, 1 is for discharging, and 2 is for charging. The current action space is set for constant current. Suppose the directly available capacity of the battery is below 80% of the nominal capacity. In that case, the battery is considered severely discharged, at which point this episode ends, and the state of the battery is reinitialized in the next episode. In this study, the duration of each action step is 5 min.

During the learning process of the agent, the environment accepts the action of the battery agent, processes the battery, and generates rewards. The reward function is designed based on the optimization problem in the mathematical description to maximize the battery’s cycle life.

It is well known that both the internal resistance and temperature of the battery are essential factors affecting battery aging. However, the battery aging dataset used in this paper mainly affects the cyclic aging process by controlling charging and discharging current. Temperature and internal resistance are secondary factors in this dataset, and the variation accuracy is too small. In order to prevent the reinforcement learning agent from deviating from the search space during the training process, we add temperature and internal resistance constraints to the reward function according to the constraints in Equation (3). The reward for each step is written as follows:

R = α R_{s o c} + β R_{c o n}

(28)

where:

R_{s o c} = λ_{i} |x - 80|, i = 1,2, 3

(29)

R_{c o n} = ω_{1} R_{i n} + ω_{2} R_{t e m}

(30)

R_{i n} = \{\begin{matrix} 0 & i f ∆ {I R}_{m i n} \leq ∆ I R \leq ∆ {I R}_{m a x} \\ - m a x \{|∆ {I R}_{m i n} - ∆ I R|, |∆ I R - ∆ {I R}_{m a x}|\} & o t h e r w i s e \end{matrix}

(31)

R_{t e m} = \{\begin{matrix} 0 & i f T_{m i n} \leq T \leq T_{m a x} \\ - m a x \{|T_{m i n} - T|, |T - T_{m a x}|\} & o t h e r w i s e \end{matrix}

(32)

where

α, β, λ_{i} (i = 1,2, 3), ω_{i} (i = 1,2)

are weight parameters describing the different objectives and

x

represents the battery health state. In each episode, the result of each state change makes the current SOH greater than 80%, then the reward value is updated to positive and

λ

is taken as a real number greater than 0. If the action selected in this step causes the battery SOH to be 80% or even less than 80%, then the reward value is negative and

λ

takes a real number less than 0. When the internal resistance and temperature of the battery are within the constraint range, the agent will not be penalized. When it exceeds this range, the agent will be punished, thus directing the battery to be correct during the overall cyclic aging process. The agent managing the battery for a longer lifetime is rewarded at a higher level.

4.4. Overall Framework

The overall framework structure of this paper is shown in Figure 9. Based on the public battery cycle aging data set, the battery aging model is formed by combining health indicators, LSTM and GPR, which serves as the environment model for reinforcement learning. Then, the double DQN algorithm is used to train the agent, and the principal component analysis method is used to reduce the dimension of the state space, which reduces the computing time of the model. We add temperature and internal resistance difference constraints to the reward function, which provide policy guidance for the agent through the reward and punishment mechanism. In each decision, the environment sends the current feedback to the agent. The agent receives the feedback and makes action selection according to the optimal solution provided by the double DQN algorithm, and reacts to the environment model. Then, the agent enters the next state. When the battery capacity is 80% of the nominal capacity, end this learning and output the cycle life of this schedule. The more training times of reinforcement learning, the more stable the model, and the more accurate the optimal solution. However, due to the limitation of computational ability and time, we cannot continuously train the model. Therefore, the optimal solution found in this study tends to be local. The method of obtaining the global optimal solution can be discussed in future research.

5. Experimental Evaluation and Discussion

This section mainly introduces the experimental results. The training hyperparameters are shown in Table 2. We compare the proposed method with heuristic scheduling algorithms, such as PSO and GA algorithms. In addition, we compare PCA−DDQN with reinforcement learning algorithms, such as the DDQN and Q−learning algorithm. We consider evaluating the performance of the proposed method, including cycle life and cumulative reward during learning, SOC curve variation during convergence, and the generalizability of the algorithm.

5.1. Cycle Life and Cumulative Rewards

In the first part, the performance of the proposed methods is compared from the perspective of battery cycle life. In this experiment, six different battery states are input into each algorithm, and the experimental results are shown in Table 3. We find that the PCA−DDQN agent has better results under all input settings. At the same time, other heuristic scheduling algorithms have achieved good results in only one or several cases because the state space is too ample and easy to fall into the local optimum. DDQN and Q−learning have too large a search space. They may fall into the risk of the “curse of dimensionality” without principal component analysis for dimensionality reduction, and the performance is sometimes even worse than heuristic scheduling algorithms. The overall inspection shows that the proposed method can provide appropriate charging and discharging methods with a number of cycles comparable to the optimal heuristic scheduling method.

The same reward function is set for the DDQN algorithm and the Q−learning algorithm to compare the reward and convergence of the proposed method. Figure 10 shows the experimental results. The average reward during training is an essential and decisive factor reflecting the convergence process, training efficiency, and learning ability [41], as shown in Figure 10a. Maintaining a stable average return in the training process implies reliable convergence. The results show that PCA−DDQN converges much faster than Q−learning and DDQN due to the former’s additional PCA dimensionality reduction capability. Because DRL adds neural networks and the experience pool, DDQN converges faster than Q−learning and can handle continuous problems and stabilize the training effect to some extent. In addition, combining Equation (28) and Figure 10b reveals that the PCA−DDQN algorithm has a higher reward value than the other two algorithms. This may be due to the excellent convergence speed and more opportunities for the agent to try. With the agent’s continuous learning, the experience accumulation process makes the agent constantly adjust the control strategy quickly. It can be explained that the Q−learning algorithm exhibits the worst training efficiency and learning ability due to the inherent defects of the Q−table output. The comparison in this section shows that the proposed PCA−DDQN method in this paper has an excellent performance in terms of cycle life and cumulative reward, which provides ideas for the following experiments.

According to Equation (1), it can be seen that the temperature and internal resistance differences inside the battery are assumed to be within a specific range. The comparison of the operating points of different reinforcement learning algorithms’ violating constraints is shown in Figure 10c,d. An observation of the original dataset shows that the battery’s temperature range, after the beginning of cyclic aging, is 30–40 °C, and that the internal resistance increases continuously in the aging process. The change accuracy of internal resistance is too small, which cannot sufficiently reflect the strengths and weaknesses of violating the constraint. The internal resistance difference

Δ I R

is used, and its range,

0 ~ 1 \times 1 0^{- 4} Ω

, is represented by an orange solid line. Suppose the point is not within the range of two orange lines, meaning the agent violates the constraint. We can observe that, under the same cycle setting, there are not so many points violating the constraint in the training process because the constraint range of temperature is more extensive than that of

Δ I R

. In both images, the method proposed in this paper is less likely to violate the policy constraint than other algorithms, which can mitigate some internal resistance violations and help the agent in the broad search space.

5.2. SOC Curve Change during Convergence

The SOC and difference change curves in the training process of different algorithms are shown in Figure 11. As shown in Figure 11a, the SOC variation trends of the machine learning algorithm and physical aging experiment are generally consistent, indicating that our algorithm follows the basic battery aging rules. Meanwhile, although all algorithms end training at 80% of battery capacity, the SOC of the PCA−DDQN algorithm is slightly higher than the results of most algorithms in many cycles, which can be explained by the fact that the agent, in reinforcement learning, modifies the policy based on a series of state−action−reward processes toward an enormous reward value. In contrast, the traditional heuristic scheduling algorithms take the error difference as an indicator for updating parameters and do not update them in the corresponding range, with high model fluctuations. The PSO algorithm is better at dealing with continuously varying variables, and the GA algorithm has more advantages in discontinuously varying problems [42]. In other cases, the RL agent performs slightly lower than that of the heuristic algorithm due to the

ε

greedy strategy of the agent in selecting action. The

ε

value set in this study is 0.9, which means that, in the action selection process, there is a 90% probability of using the previous optimal solution to select the action with the maximum Q value. There is a 10% probability of random selection, which allows the agent to search a larger space and avoid the agent falling into the local optimum. As can be seen in Figure 11b, there are capacity changes during battery aging, and the SOC trajectories of the proposed method, with the partial heuristic scheduling algorithms, fluctuate in a relatively narrow range, while the other algorithms fluctuate within a more extensive range, which is not conducive to the overall cycle life of the battery. In addition, the results in Figure 11 show that none of the overall changes in SOC are evident in the initial stage of the cyclic aging experimental process, which is because the early battery physical and chemical degradation does not immediately lead to capacity decay, and the capacity decay process is accelerated when a certain critical point is reached.

5.3. Adaptive Capability of the Algorithm

One of the significant challenges in the development of battery management systems is the adaptability of the controller to changes in battery cycling behavior. The proposed deep RL method demonstrates the advantages of the automatic interactive learning of optimal strategies [19]. Specifically, it can adapt to changing input data. We analyze the adaptability of RL by comparing the cycle life between various scheduling algorithms and the proposed PCA−DDQN under different battery datasets. The proposed method in this paper gradually converges with continuous iterations of network parameters. The results of the number of cycles in the adaptive testing study are depicted in Figure 12. The change in the battery chemistry affects the electrochemical modeling.

We download the dataset mentioned in Zhu’s article [24] from the Zendo data platform, which contains three commercial 18,650 batteries with different chemical compositions, namely NCA batteries, NCM batteries, and NCM+NCA batteries. The batteries are cycled in the temperature−controlled chamber at different charging current rates. The selected temperatures are 25 °C, 35 °C, and 45 °C, and the current rates range from 0.25 C to 4 C. Data satisfying the temperature constraints proposed in this paper are selected to verify the generalizability of the proposed method for evaluation.

The experimental results indicate that, in Figure 12, the traditional heuristic scheduling strategy has good and bad cycle life counts in batteries with different chemical compositions. The heuristic scheduling algorithm has no significant advantage in the cycling process of more complex chemical composition batteries. The traditional heuristic scheduling strategy and PCA−DDQN have little difference in SOC in the first 100 cycles because the battery cells are still in a fresh state, and the chemical reaction is not apparent at this time. After a certain number of cycles, the battery quickly reaches the parameters limit due to aging, and the heuristic scheduling algorithm fluctuates significantly with the inherent physical degradation of the battery model, while PCA−DDQN can update the action sequence of the selected agent from the reward (punishment) and performs better than the heuristic scheduling strategy. Thus, PCA−DDQN shows a clear advantage in achieving adaptive scheduling optimization with different battery types.

5.4. Results and Discussion

The experimental results show that a trained model can set the optimal scheduling algorithm in diverse battery model environments without requiring precise physicochemical information about the battery or solving complex mathematical equations. The feature of RL scheduling allows the algorithms to be modified for specific user needs without changing the overall framework.

We demonstrate that the proposed PCA−DDQN can be used for power battery scheduling optimization problems. Firstly, the availability of the complete data set reduces the algorithmic difficulty by avoiding learning battery degradation mechanisms and building complex electrochemical models. Secondly, deep RL is usually associated with artificial intelligence. This work shows that RL is comparable to other methods, and its adaptability to aging is well−suited for battery charging and discharging control. Thirdly, DDQN is a very typical algorithm of continuous state space and discrete action space. How to deal with the continuous action space and reduce the error of the data−driven model is a problem that needs to be continued to be studied. Finally, temperature is a critical influencing factor in the battery aging process. However, in the publicly available dataset, the experimental conditions are often conducted at a fixed ambient temperature, lacking primary data. The reward value of the temperature factor cannot be clearly given, which should be experimented upon in future research.

6. Conclusions

This work develops a lithium−ion battery scheduling architecture based on deep learning and reinforcement learning, demonstrating the prospect of using a data−based capacity estimation environment model and DRL algorithm to solve the power battery scheduling problem. The SOC estimation model is trained using a deep learning algorithm as the reinforcement learning agent environment. We combine principal component analysis and double DQN for training the agent. Several publicly available power battery cycle aging datasets are used for validation. The experimental results show that the SOC estimation model trained by the neural network interacts well with the agent. Compared with PSO and GA algorithms, the proposed method has a slight advantage in the battery cycle life index and has strong adaptability, which does not need to develop specific schemes for batteries with specific chemical compositions. Compared with DDQN and Q−learning algorithms, this method outperforms other methods regarding convergence time and cumulative rewards. However, we cannot guarantee that the DDQN algorithm, based on principal component analysis is optimal. In future work, more accurate power battery models can be used for simulations, adding specific temperature−influencing factors and considering the varying application requirements in realistic scenarios.

Author Contributions

Conceptualization, H.X. and J.C.; methodology, J.C.; software, S.R.; validation, S.R., A.Z. and J.C.; formal analysis, H.X.; data curation, J.C.; writing—original draft preparation, J.C.; writing—review and editing, A.Z.; visualization, S.R.; supervision, H.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Key Research and Development Program “Key Technology for Collaboration and Interoperability of Distribution Grid Business Resources (2021YFB2401300)”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

Chen, S.; Gao, Z.; Sun, T. Safety challenges and safety measures of Li-ion batteries. Energy Sci. Eng. 2021, 9, 1647–1672. [Google Scholar] [CrossRef]
Gallardo-Lozano, J.; Romero-Cadaval, E.; Milanes-Montero, M.I.; Guerrero-Martinez, M.A. Battery equalization active methods. J. Power Sources 2014, 246, 934–949. [Google Scholar] [CrossRef]
Hu, C.; Youn, B.D.; Chung, J. A multiscale framework with extended Kalman filter for lithium-ion battery SOC and capacity estimation. Appl. Energy 2012, 92, 694–704. [Google Scholar] [CrossRef]
Li, J.; Adewuyi, K.; Lotfi, N.; Landers, R.G.; Park, J. A single particle model with chemical/mechanical degradation physics for lithium ion battery State of Health (SOH) estimation. Appl. Energy 2018, 212, 1178–1190. [Google Scholar] [CrossRef]
Li, Y.; Liu, K.L.; Foley, A.M.; Zulke, A.; Berecibar, M.; Nanini-Maury, E.; Van Mierlo, J.; Hoster, H.E. Data-driven health estimation and lifetime prediction of lithium-ion batteries: A review. Renew. Sustain. Energy Rev. 2019, 113, 109254. [Google Scholar] [CrossRef]
Lee, S.J.; Kee, M.; Park, G.H. Sensor data compression and power management scheme for low power sensor hub. IEICE Electron. Express 2017, 14, 20170974. [Google Scholar] [CrossRef] [Green Version]
Fang, X.; Hodge, B.M.; Bai, L.Q.; Cui, H.T.; Li, F.X. Mean-Variance Optimization-Based Energy Storage Scheduling Considering Day-Ahead and Real-Time LMP Uncertainties. IEEE Trans. Power Syst. 2018, 33, 7292–7295. [Google Scholar] [CrossRef]
Gao, Y.; Yang, J.J.; Yang, M.; Li, Z.S. Deep Reinforcement Learning Based Optimal Schedule for a Battery Swapping Station Considering Uncertainties. IEEE Trans. Ind. Inform. 2020, 56, 5775–5784. [Google Scholar] [CrossRef]
Xu, G.N.; Du, X.W.; Li, Z.J.; Zhang, X.J.; Zheng, M.X.; Miao, Y.; Gao, Y.; Liu, Q.S. Reliability design of battery management system for power battery. Microelectron. Reliab. 2018, 88–90, 1286–1292. [Google Scholar] [CrossRef]
Saleeb, H.; Sayed, K.; Kassem, A.; Mostafa, R. Power Management Strategy for Battery Electric Vehicles. IET Electr. Syst. Transp. 2019, 9, 65–74. [Google Scholar] [CrossRef]
Jan, K.U.; Dubois, A.M.; Diallo, D. Hybrid Battery-SC and Battery-Battery Multistage Design and Energy Management for Power Sharing. In Proceedings of the IECON 2021—47th Annual Conference of the IEEE Industrial Electronics Society, Toronto, ON, Canada, 13–16 October 2021. [Google Scholar] [CrossRef]
Shaheen, A.M.; Spea, S.R.; Farrag, S.M.; Abido, M.A. A review of meta-heuristic algorithms for reactive power planning problem. Ain Shams Eng. J. 2018, 9, 215–231. [Google Scholar] [CrossRef] [Green Version]
Paliwal, N.K. A day-ahead Optimal Scheduling Operation of Battery Energy Storage with Constraints in Hybrid Power System. Procedia Comput. Sci. 2020, 167, 2140–2152. [Google Scholar] [CrossRef]
Ouyang, Q.; Wang, Z.S.; Liu, K.L.; Xu, G.T.; Li, Y. Optimal Charging Control for Lithium-Ion Battery Packs: A Distributed Average Tracking Approach. IEEE Trans. Ind. Inform. 2020, 16, 3430–3438. [Google Scholar] [CrossRef]
Liu, Z.L.; Wang, X.; Zhang, F. Full Life-Cycle Optimal Battery Scheduling for Maximal Lifetime Value Considering Degradation. IEEE Trans. Energy Convers. 2022, 37, 1379–1393. [Google Scholar] [CrossRef]
Waschneck, B.; Reichstaller, A.; Belzner, L.; Altenmuller, T.; Bauernhansl, T.; Knapp, A.; Kyek, A. Optimization of global production scheduling with deep reinforcement learning. Procedia CIRP 2018, 72, 1264–1269. [Google Scholar] [CrossRef]
Rugwiro, U.; Gu, C.H.; Ding, W.C. Task Scheduling and Resource Allocation Based on Ant-Colony Optimization and Deep Reinforcement Learning. J. Internet Technol. 2019, 20, 1463–1475. [Google Scholar] [CrossRef]
Mbuwir, B.V.; Kaffash, M.; Deconinck, G. Battery Scheduling in a Residential Multi-Carrier Energy System Using Reinforcement Learning. In Proceedings of the 2018 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (SmartGridComm), Aalborg, Denmark, 29–31 October 2018. [Google Scholar] [CrossRef]
Sui, Y.; Song, S.M. A Multi-Agent Reinforcement Learning Framework for Lithium-ion Battery Scheduling Problems. Energies 2020, 13, 1982. [Google Scholar] [CrossRef] [Green Version]
Huang, B.; Wang, J.H. Deep-Reinforcement-Learning-Based Capacity Scheduling for PV-Battery Storage System. IEEE Trans. Smart Grid 2021, 12, 2272–2283. [Google Scholar] [CrossRef]
Severson, K.A.; Attia, P.M.; Jin, N.; Perkins, N.; Jiang, B.; Yang, Z.; Chen, M.H.; Aykol, M.; Herring, P.K.; Fraggedakis, D.; et al. Data-driven prediction of battery cycle life before capacity degradation. Nat. Energy 2019, 4, 383–391. [Google Scholar] [CrossRef] [Green Version]
Li, X.Y.; Wang, Z.P.; Zhang, L.; Zou, C.F.; Dorrell, D.D. State-of-health estimation for Li-ion batteries by combing the incremental capacity analysis method with grey relational analysis. J. Power Sources 2019, 410–411, 106–114. [Google Scholar] [CrossRef]
Che, Y.H.; Deng, Z.W.; Tang, X.L.; Lin, X.K.; Nie, X.H.; Hu, X.S. Lifetime and Aging Degradation Prognostics for Lithium-ion Battery Packs Based on a Cell to Pack Method. Chin. J. Mech. Eng. 2022, 35, 4. [Google Scholar] [CrossRef]
Roman, D.; Saxena, S.; Robu, V.; Pecht, M.; Flynn, D. Machine learning pipeline for battery state-of-health estimation. Nat. Mach. Intell. 2021, 3, 447–456. [Google Scholar] [CrossRef]
Zhu, J.G.; Wang, Y.X.; Huang, Y.; Gopaluni, R.B.; Cao, Y.K.; Heere, M.; Muhlbauer, M.J.; Mereacre, L.; Dai, H.F.; Liu, X.H.; et al. Data-driven capacity estimation of commercial lithium-ion batteries from voltage relaxation. Nat. Commun. 2022, 13, 2261. [Google Scholar] [CrossRef] [PubMed]
de Winter, J.C.F.; Gosling, S.D.; Potter, J. Comparing the Pearson and Spearman correlation coefficients across distributions and sample sizes: A tutorial using simulations and empirical data. Psychol Methods. 2016, 21, 273–290. [Google Scholar] [CrossRef] [PubMed]
Che, Y.H.; Deng, Z.W.; Lin, X.K.; Hu, L.; Hu, X.S. Predictive battery health management with transfer learning and online model correction. IEEE Trans. Veh. Technol. 2021, 70, 1269–1277. [Google Scholar] [CrossRef]
Zhang, Y.Z.; Xiong, R.; He, H.W.; Pecht, M.G. Long short-term memory recurrent neural network for remaining useful life prediction of lithium-ion batteries. IEEE Trans. Veh. Technol. 2018, 67, 5695–5705. [Google Scholar] [CrossRef]
Hu, X.S.; Che, Y.H.; Lin, X.K.; Onori, S. Battery health prediction using fusion-based feature selection and machine learning. IEEE Trans. Transp. Electrif. 2021, 7, 382–398. [Google Scholar] [CrossRef]
Deng, Z.W.; Hu, X.S.; Lin, X.K.; Xu, L.; Che, Y.H.; Hu, L. General discharge voltage information enabled health evaluation for lithium-ion batteries. IEEE/ASME Trans. Mechatron. 2021, 26, 1295–1306. [Google Scholar] [CrossRef]
Hu, X.S.; Che, Y.H.; Lin, X.K.; Deng, Z.W. Health prognosis for electric vehicle battery packs: A data-driven approach. IEEE-ASME Trans. Mechatron. 2020, 25, 2622–2632. [Google Scholar] [CrossRef]
Deng, Z.W.; Hu, X.S.; Lin, X.K.; Che, Y.H.; Xu, L.; Guo, W.C. Data-driven state of charge estimation for lithium-ion battery packs based on Gaussian process regression. Energy 2020, 205, 118000. [Google Scholar] [CrossRef]
Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82. [Google Scholar] [CrossRef]
Vazquez-Canteli, J.R.; Nagy, Z. Reinforcement learning for demand response: A review of algorithms and modeling techniques. Appl. Energy 2019, 235, 1072–1089. [Google Scholar] [CrossRef]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
van Hasselt, H.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the Association-for-the-Advancement-of-Artificial-Intelligence (AAAI) Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016. [Google Scholar]
Ait-Sahalia, Y.; Xiu, D.C. Principal Component Analysis of High-Frequency Data. J. Am. Stat. Assoc. 2019, 114, 287–303. [Google Scholar] [CrossRef] [Green Version]
Park, S.; Pozzi, A.; Whitmeyer, M.; Perez, H.; Kandel, A.; Kim, G.; Choi, Y.; Joe, W.T.; Raimondo, D.M.; Moura, S. A Deep Reinforcement Learning Framework for Fast Charging of Li-Ion Batteries. IEEE Trans. Transp. Electrif. 2022, 8, 2770–2784. [Google Scholar] [CrossRef]
Zuo, G.Y.; Du, T.T.; Lu, J.H. Double DQN Method For Object Detection. In Proceedings of the Chinese Automation Congress (CAC), Jinan, China, 20–22 October 2017. [Google Scholar]
Han, B.A.; Yang, J.J. Research on Adaptive Job Shop Scheduling Problems Based on Dueling Double DQN. IEEE Access 2020, 8, 186474–186495. [Google Scholar] [CrossRef]
Huang, R.C.; He, H.W.; Zhao, X.Y.; Wang, Y.L.; Li, M.L. Battery health-aware and naturalistic data-driven energy management for hybrid electric bus based on TD3 deep reinforcement learning algorithm. Appl. Energy 2022, 321, 119353. [Google Scholar] [CrossRef]
Ramdania, D.R.; Irfan, M.; Alfarisi, F.; Nuraiman, D. Comparison of genetic algorithms and Particle Swarm Optimization (PSO) algorithms in course scheduling. J. Phys. Conf. Ser. 2019, 1402, 022079. [Google Scholar] [CrossRef]

Figure 1. A charging protocol during cyclic aging of battery (taking 3C-60%-4C as an example): R represents relaxation, CC1V means constant current charging by C1 (Q1)-C2 (80%)-1C, CVC represents constant voltage charging by 3.6 V, and CCD means constant current discharging by 4C, CVD means constant voltage discharging by 2 V constant voltage discharge.

Figure 2. SOC curves of the same batteries under different discharge protocols.

Figure 3.

I C

curves under different cycle periods.

Figure 3.

I C

curves under different cycle periods.

Figure 4. (a) LSTM network structure; (b) LSTM memory cell.

Figure 5. (a) Accuracy of LSTM prediction of HI; (b) error of predicted capacity for different quantities of HI; (c) LSTM+GPR prediction accuracy scatter plot.

Figure 6. Markov decision process.

Figure 7. Double DQN neural network structure.

Figure 8. (a) Contribution rate and cumulative contribution rate results; (b) overall results of PCA.

Figure 9. Overall scheduling framework.

Figure 10. (a) Comparison of average rewards of different reinforcement learning algorithms; (b) rewards during training of different reinforcement learning algorithms; (c) performance of different algorithms under temperature constraint; (d) performance of different algorithms under

∆ I R

constraint.

Figure 10. (a) Comparison of average rewards of different reinforcement learning algorithms; (b) rewards during training of different reinforcement learning algorithms; (c) performance of different algorithms under temperature constraint; (d) performance of different algorithms under

∆ I R

constraint.

Figure 11. (a) SOC variation curve under different algorithm comparison; (b) variation curve of ΔSOC under different algorithm comparison.

Figure 12. Cycle life on LEP and NCA, NCM, NCA+NCM dataset.

Table 1. Numerical results of correlation coefficients between health indicators and capacity.

Battery	Var		Ske		Kur
Battery	PCC	SCC	PCC	SCC	PCC	SCC
CH5	0.9867	0.9789	0.9998	0.9878	0.9764	0.9777
CH11	0.9986	0.9686	0.9999	0.9799	0.9915	0.9837
CH17	0.9976	0.9966	0.9954	0.9733	0.9824	0.9968
CH22	0.9971	0.9974	0.9877	0.9843	0.9962	0.9695
CH32	0.9846	0.9732	0.9865	0.9812	0.9921	0.9999

Table 2. Double DQN hyperparameters.

Variables	Description	Value
$γ$	Discount factor	0.99
$η_{π}$	The learning rate of neural network	0.001
$ε$	$ε$ greed strategy, exploration, and exploitation	0.9

Table 3. Battery cycle life under different scheduling algorithms.

Experimental Input	Sequential	PSO	GA	Q−Learning	DDQN	PCA−DDQN
4.4 A, 2.01 V, 1.06 Ah (cycle = 10) 4 C−2 V	898	903	874	840	879	907
7.7 A, 3.5 V, 0.19 Ah (cycle = 5) C1−Q1	515	494	526	499	512	506
3.3 A, 3.5 V, 0.86 Ah (cycle = 6) C2−80	733	716	682	712	701	719
0 A, 3.3 V, 0.88 Ah (cycle = 7) rest	561	502	573	502	549	564
1.1 A, 3.43 V, 0.94 Ah (cycle = 9) 1 C−3.6 V	408	459	372	410	438	453
0.00016 A, 3.29 V, 0.88 Ah (cycle = 8) start	631	662	597	581	612	659

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Xiong, H.; Chen, J.; Rong, S.; Zhang, A. Power Battery Scheduling Optimization Based on Double DQN Algorithm with Constraints. Appl. Sci. 2023, 13, 7702. https://doi.org/10.3390/app13137702

AMA Style

Xiong H, Chen J, Rong S, Zhang A. Power Battery Scheduling Optimization Based on Double DQN Algorithm with Constraints. Applied Sciences. 2023; 13(13):7702. https://doi.org/10.3390/app13137702

Chicago/Turabian Style

Xiong, Haijun, Jingjing Chen, Song Rong, and Aiwen Zhang. 2023. "Power Battery Scheduling Optimization Based on Double DQN Algorithm with Constraints" Applied Sciences 13, no. 13: 7702. https://doi.org/10.3390/app13137702

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Power Battery Scheduling Optimization Based on Double DQN Algorithm with Constraints

Abstract

1. Introduction

2. Mathematical Description of Optimization Problem

3. Battery Model and Verification

3.1. Dataset

3.2. Regression Model Based on Deep Learning

3.3. Capacity Estimation Model Based on GPR

3.4. Model Verification

4. Methodology

4.1. Deep Reinforcement Learning Based on Double DQN

4.2. Principal Component Analysis Algorithm

4.3. The Relationship between the Optimization Problem and Double DQN

4.4. Overall Framework

5. Experimental Evaluation and Discussion

5.1. Cycle Life and Cumulative Rewards

5.2. SOC Curve Change during Convergence

5.3. Adaptive Capability of the Algorithm

5.4. Results and Discussion

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI