3.2.1. Basic Concepts
In this section, the Q-learning algorithm is adopted to optimize the overall efficiency and obtain an effective energy management strategy for the WLTP driving cycle. A series of basic reinforcement learning concepts needs to be introduced step by step to define the Q-learning algorithm.
Reinforcement learning solves Markov decision problems and improves their control performance. Its main architecture revolves around a so-called learning agent, which senses the environment state and takes actions that in turn affect the controlled environment. To improve control performance, a reward signal is defined that guides the agent toward higher cumulative values through a trial-and-error mechanism. Reinforcement learning can be divided into two categories: model-based learning and model-free learning. In model-based learning, for a multi-step reinforcement learning task, the machine models the environment so that it can internally simulate situations identical or similar to those in the real environment. In a realistic reinforcement learning task, however, it is often difficult to know the environment's transition probabilities and reward function, or even how many states the environment has. If the learning algorithm does not depend on a model of the environment, it is called model-free learning, which is considerably more difficult than model-based learning.
The biggest advantage of model-based learning is that the agent can plan ahead: at each step it can try out possible future choices in advance and then select explicitly among these candidates. The biggest disadvantage is that the agent often cannot obtain the true model of the environment. If the agent wants to use a model in a given scenario, it must learn it entirely from experience, which brings many challenges. The biggest challenge is the error between the model explored by the agent and the true model, which can cause the agent to perform well in the learned model but poorly in the real environment. To obtain an energy management strategy that copes well with the real environment, the model-free learning method is used here. There are two kinds of model-free reinforcement learning methods: Monte-Carlo updates and temporal-difference updates. Under actual working conditions, the required power changes every second. Based on this characteristic, Q-learning is selected to explore energy management. At the same time, a rule-based method is used to provide another integrated energy management system, which is compared with Q-learning to verify its effectiveness.
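The contrast between the two model-free update styles mentioned above can be sketched briefly. This is a minimal illustration, not the paper's implementation; the step size, discount factor, and numbers are made up.

```python
# Sketch: Monte-Carlo vs. temporal-difference value updates for a single state.
# All numeric values are illustrative only.

alpha = 0.5   # step size (assumed)
gamma = 0.9   # discount factor (assumed)

def mc_update(v, full_return):
    # Monte-Carlo: wait until the episode ends, then move toward the full return.
    return v + alpha * (full_return - v)

def td_update(v, reward, v_next):
    # Temporal-difference: update every step toward the bootstrapped target.
    return v + alpha * (reward + gamma * v_next - v)

v = 0.0
print(mc_update(v, 10.0))       # moves halfway toward the observed return: 5.0
print(td_update(v, 1.0, 4.0))   # moves toward r + gamma * V(s'): 2.3
```

Because the required power changes every second, the per-step temporal-difference style fits this application better than waiting for a full episode.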
3.2.2. Power Splitting Based on Q-Learning
Q-learning was proposed for solving Markov decision problems. As one of the most popular off-policy RL methods, Q-learning aims to maximize the total reward $\sum R$. Consequently, the optimal value function that guides the decision process of the policy can be defined over the given current state S(t) and control action A(t).
In the integrated energy management system, there are three time-varying states: the SOC representing the battery state, the state of voltage (SOV) representing the ultracapacitor state, and P_{dem} representing the system output. The constraints of the state variable S(t) = {SOC(t), SOV(t), P_{dem}(t)} can be defined as:
where P_{dem} is the required power (unit: kW).
The constraints of the control variable A(t) = {I_{c}(t), I_{v}(t)} are defined as:
where I_{c} is the battery current and I_{v} is the ultracapacitor current (unit: A).
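The state and action constraints above can be enforced by clipping each component to its admissible range. The numeric bounds below are placeholders for illustration only, since the paper's limit values are not given in this section.

```python
# Sketch of enforcing the state and action constraints by clipping.
# All numeric bounds below are assumed placeholders, not the paper's values.

SOC_MIN, SOC_MAX = 0.3, 0.9      # battery SOC limits (assumed)
SOV_MIN, SOV_MAX = 0.5, 1.0      # ultracapacitor SOV limits (assumed)
I_C_MAX, I_V_MAX = 100.0, 200.0  # current magnitude limits in A (assumed)

def clip(x, lo, hi):
    return max(lo, min(hi, x))

def constrain_state(soc, sov, p_dem, p_dem_max=50.0):
    """Return S(t) = (SOC, SOV, P_dem) with each component clipped to its range."""
    return (clip(soc, SOC_MIN, SOC_MAX),
            clip(sov, SOV_MIN, SOV_MAX),
            clip(p_dem, 0.0, p_dem_max))

def constrain_action(i_c, i_v):
    """Return A(t) = (I_c, I_v) with each current clipped to its range."""
    return (clip(i_c, -I_C_MAX, I_C_MAX),
            clip(i_v, -I_V_MAX, I_V_MAX))
```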
The reward function is:
where $\eta$ is a variable that is randomly selected within an interval determined by the magnitude of the total loss at each second of the working condition. When the total loss is less than 20% of the required power, $\gamma$ is 1; otherwise, $\gamma$ is −1. $\Delta SOC = SOC - SOC_{pre}$ is used to limit the SOC range of the battery pack.
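The reward logic described above can be sketched as follows. Since the exact functional form and the sampling interval for η are not given in this section, the combination of terms below is an assumption for illustration only.

```python
# Illustrative sketch of the reward logic; the exact formula and the interval
# from which eta is drawn are assumptions, not the paper's definition.
import random

def reward(total_loss, p_dem, soc, soc_pre):
    # Sign term: +1 when the total loss is under 20% of the required power.
    sign = 1.0 if total_loss < 0.2 * p_dem else -1.0
    # eta: drawn from an interval scaled by the total loss (assumed form).
    eta = random.uniform(0.0, total_loss)
    # Delta-SOC term keeps the battery SOC within its range (assumed penalty).
    delta_soc = soc - soc_pre
    return sign - eta - abs(delta_soc)
```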
R_{t} is the reward at a single time step t. To estimate the long-term return, the return G_{t} is used to represent the cumulative value of the rewards R after time t, and its recursive form is:
G_{t} = R_{t+1} + γG_{t+1},
where γ ∈ (0, 1) is the discount factor.
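Over a finite episode, this recursion can be evaluated backwards from the final step. A minimal sketch (the reward sequence and discount factor are made-up numbers):

```python
# Compute the returns G_t backwards via the recursion G_t = R_{t+1} + gamma * G_{t+1}.
def returns(rewards, gamma=0.9):
    g = 0.0
    out = []
    for r in reversed(rewards):   # work from the last step back to the first
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))

print(returns([1.0, 0.0, 2.0]))   # one return value per step
```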
Strategy b is a mapping from the state to the probability of selecting each action. The state value function v_{b}(s) is defined as the expected return starting from state s and following strategy b, expressed as:
v_{b}(s) = E_{b}[G_{t} | S(t) = s],
where S(t) is the state at time t.
Meanwhile, the action value function q_{b}(s, a) is defined as the expected return starting from state s, taking action a, and following strategy b:
q_{b}(s, a) = E_{b}[G_{t} | S(t) = s, A(t) = a],
where A(t) is the action at time t. Then the recursive form can be derived:
q_{b}(s(t), a(t)) = E_{b}[R_{t+1} + γq_{b}(s(t + 1), a(t + 1))],
where s(t) and s(t + 1) represent the specific states at times t and t + 1, and a(t) and a(t + 1) represent the specific actions at times t and t + 1.
The optimal action value function q*(s, a) is defined as the maximum action value function over all strategies, and its recursive form can be expressed as:
q*(s(t), a(t)) = E[R_{t+1} + γ max_{a} q*(s(t + 1), a)].
If q*(s, a) is known, the optimal strategy b* can be obtained by maximizing q*(s, a).
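Extracting b* from an estimated q* is a simple arg-max over actions in each state. The states and action names below are invented for illustration:

```python
# Given an estimate of q*(s, a) as a dict, the optimal strategy b* picks
# the maximizing action in each state.
def greedy_policy(q, states, actions):
    return {s: max(actions, key=lambda a: q[(s, a)]) for s in states}

# Hypothetical q* estimate over two states and two actions.
q = {(0, 'hold'): 0.2, (0, 'boost'): 0.7,
     (1, 'hold'): 0.9, (1, 'boost'): 0.1}
print(greedy_policy(q, [0, 1], ['hold', 'boost']))  # {0: 'boost', 1: 'hold'}
```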
As the true value of the optimal action value function is difficult to obtain, its estimate Q(S(t), A(t)) is used in place of q*(S(t), A(t)). In a temporal-difference method such as Q-learning, the difference between the current estimate Q(S(t), A(t)) and the improved estimate R(t) + γ max_{a} Q(S(t + 1), a) is used to update Q(S(t), A(t)):
Q(S(t), A(t)) ← Q(S(t), A(t)) + α[R(t) + γ max_{a} Q(S(t + 1), a) − Q(S(t), A(t))],
where α is the learning rate.
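A single such update step can be sketched with the Q table stored as a dictionary. This is an illustrative fragment, not the paper's implementation; the state and action labels are invented.

```python
# One temporal-difference Q-learning update step.
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Improved estimate: R(t) + gamma * max_a Q(S(t+1), a).
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_target = r + gamma * best_next
    # Move the current estimate toward the target by the learning rate alpha.
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
    return Q[(s, a)]

Q = {}
print(q_update(Q, 's0', 'a0', 1.0, 's1', ['a0', 'a1']))  # 0.1 * (1.0 + 0) = 0.1
```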
The algorithm block diagram is shown in Figure 3, which illustrates the basic flow of the algorithm, including the use of previous work. The exact procedure of the Q-learning algorithm in this article is shown in Algorithm 1.
Algorithm 1: Q-Learning
Initialization: determine the algorithm parameter boundaries α ∈ (0, 1) and γ ∈ (0, 1), the number of episodes N, and the working-condition duration T; initialize the action value function Q(s, a) with random values and the experience pool D with capacity N
for episode = 1 : N do
  for t = 1 : T do
    With probability ε, select a random action A(t)
    Otherwise, select A(t) = arg max_{a} Q(S(t), a)
    Execute action A(t) and observe reward R(t) and next state S(t + 1)
    Update Q as follows: Q(S(t), A(t)) = Q(S(t), A(t)) + α[R(t) + γ max_{a} Q(S(t + 1), a) − Q(S(t), A(t))]
    Update S(t) and A(t)
  end for
end for
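The loop structure of Algorithm 1 can be sketched end to end on a toy problem. The two-state environment below is invented for illustration; it is not the paper's battery/ultracapacitor model, and all hyperparameters are assumed values.

```python
# Minimal runnable sketch of the Algorithm 1 loop on a toy two-state environment.
import random

N_EPISODES, T = 200, 20
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
STATES, ACTIONS = [0, 1], [0, 1]

def step(s, a):
    """Toy dynamics: action 1 taken in state 1 pays off; everything else does not."""
    r = 1.0 if (s == 1 and a == 1) else 0.0
    s_next = a  # the chosen action becomes the next state
    return r, s_next

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
random.seed(0)
for _ in range(N_EPISODES):
    s = random.choice(STATES)
    for _ in range(T):
        if random.random() < EPSILON:                       # explore
            a = random.choice(ACTIONS)
        else:                                               # exploit
            a = max(ACTIONS, key=lambda a2: Q[(s, a2)])
        r, s_next = step(s, a)
        best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)  # max_a Q(S(t+1), a)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next

# After training, the greedy policy should pick action 1 in state 1.
print(max(ACTIONS, key=lambda a2: Q[(1, a2)]))
```

The same loop shape applies to the energy management problem, with the toy `step` function replaced by the battery/ultracapacitor model and the reward defined in the previous subsection.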
