Article

Optimal Management for EV Charging Stations: A Win–Win Strategy for Different Stakeholders Using Constrained Deep Q-Learning

by Athanasios Paraskevas 1, Dimitrios Aletras 1, Antonios Chrysopoulos 1,2, Antonios Marinopoulos 3 and Dimitrios I. Doukas 1,*

1 NET2GRID BV, Krystalli 4, 54630 Thessaloniki, Greece
2 School of Electrical and Computer Engineering, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
3 European Climate, Infrastructure and Environment Executive Agency (CINEA), European Commission, B-1049 Brussels, Belgium
* Author to whom correspondence should be addressed.
Energies 2022, 15(7), 2323; https://doi.org/10.3390/en15072323
Submission received: 2 March 2022 / Revised: 13 March 2022 / Accepted: 15 March 2022 / Published: 23 March 2022

Abstract: Given the increasing awareness of the growing energy demand and the effects of gas emissions, the decarbonization of the transportation sector is of great significance. In particular, the adoption of electric vehicles (EVs) seems a promising option, under the condition that public charging infrastructure is available. However, devising a pricing and scheduling strategy for public EV charging stations is a non-trivial albeit important task, since a sub-optimal decision could lead to long waiting times or extreme changes to the power load profile. In addition, in the context of optimal pricing and scheduling for EV charging stations, the interests of different stakeholders ought to be taken into account, such as those of the station owner and the EV owners. This work proposes a deep reinforcement learning (DRL) agent that can optimize pricing and charging control in a public EV charging station under a real-time varying electricity price. The primary goal is to maximize the station's profit while simultaneously ensuring that the customers' charging demands are also satisfied. Moreover, the DRL approach is data-driven; it can operate under uncertainties without requiring explicit models of the environment. Variants of scheduling and DRL training algorithms from the literature are also proposed to ensure that both conflicting objectives are achieved. Experimental results validate the effectiveness of the proposed approach.

1. Introduction

There has been increasing concern about global warming and climate change due to gas emissions [1]; at the same time, the energy demand is rapidly increasing [2,3], and for the most part it is satisfied through fossil-fuel energy sources [1]. Fossil fuel combustion and carbon dioxide (CO2) emissions contribute significantly to environmental pollution and global warming [1,4]. Therefore, the decarbonization of the transportation sector has naturally arisen as a potential partial solution. In particular, the adoption of electric vehicles (EVs) is a promising option because of their benefits over standard fossil-fuel vehicles and their sustainable qualities [5,6]. The report of the International Energy Agency (IEA) [7] mentions that EVs are developing at a rapid pace, indicating that the global EV fleet exceeded 5.1 million in 2018 and that there may be 250 million units by 2030.
To that end, establishing public charging infrastructure is a critical task that could lead to widespread EV adoption [8,9]. However, for public EV charging stations, there is a need to develop new business models and tackle additional challenges. For example, sub-optimal scheduling decisions when charging multiple EVs could result in high waiting times, while also significantly changing the demand profile of the utility by causing increased electricity demand, particularly at peak times [10,11]. Therefore, it is crucial to investigate different optimization strategies for EV charging stations while at the same time taking into consideration the perspectives of different stakeholders.
A literature survey revealed that most published research focuses on single-charger settings. According to [12], solving the global optimization problem is impractical in the absence of the distributions of future EV arrivals, charging duration, and base loads. Traditional approaches formulate the EV charging scheduling problem as a sequential decision-making task [13,14] and solve it using dynamic programming [15]. However, these conventional approaches require modeling the uncertainties, which is not necessarily feasible under real-life conditions.
On the other hand, reinforcement learning (RL) can be leveraged to solve problems formulated as Markov Decision Processes (MDP) [16]. In particular, deep reinforcement learning (DRL) methods have proven to be highly effective at complex tasks, outperforming human experts [17]. The main advantage of model-free DRL methods is that they are data-driven; i.e., the agents learn directly from experience without requiring explicit models of their environment. Note that different RL reward definitions lead to different optimization objectives, such as maximizing the EV owners’ profits, focusing on the EV charging stations’ profits, prioritizing the distribution system operator’s needs, or reducing waiting times [18].
In [19], kernel density estimation was used to model the joint probability distribution of the arrival times and charging durations of EVs at a public charger. Then, a deep Q network (DQN) agent was trained to decide the charging/discharging rate in each time slot by choosing from a discrete number of levels. The observation space consists of the 24 h electricity price history, the remaining energy until the EV is fully charged, and the remaining time until departure. At the same time, the optimization objective takes into account minimizing charging costs and satisfying charging demands. In [20], arrival and departure times, and charging demand at a single charger, were modeled as truncated normal distributions. The observation space was similar to that of [19], and a long short-term memory network was used to predict future electricity prices based on historical data. A modified deep deterministic policy gradient algorithm, called control deep deterministic policy gradient, allows the agent to choose charging/discharging rates from a continuous interval, aiming at maximizing the EV owner's profit and satisfying the charging demand.
The proposed solution in [21] uses a combination of two networks: one for extracting representative features from the electricity price time series, and a DQN agent to control the EV's real-time charging/discharging actions. The reward definition considers both the charging costs and a penalty component proportional to the amount of uncharged energy, representing the "range anxiety" factor. The state information comprises the presence of the EV at home or not, the remaining battery energy, and the price time series for the past 24 h. Reference [22] introduces an EV charging station environment model and an admission system, where different types of EVs are modeled (simulating different customer profiles) and are presented with charging prices according to demand. The RL agent decides the amount of energy to purchase (which will be used to charge some of the parked EVs) and the price to announce to new EVs that arrive in each time slot. The models are trained using a variant of the well-known state–action–reward–state–action (SARSA) algorithm [16], called Hyperopia SARSA. The state information includes residual charging demands and parking times, and the reward is modeled towards optimizing the profit of the charging station.
The scope of this paper is to present an intelligent agent that optimally decides, in real-time and under uncertainties (such as the distribution of future EV arrivals and the electricity price), the pricing and scheduling actions needed to maximize a particular EV charging station’s profit. Simultaneously, the EV owners’ expectations and needs are taken into account. The main contributions of this paper are:
  • In contrast to prior strategies [19,20,21,22], the proposed strategy is a win–win for both stakeholders, i.e., the EV owners and the EV charging station operators. Fulfilling charging demands under agreed conditions is prioritized, and profit maximization from the charging station operator’s perspective follows.
  • Although direct benchmarking against previously published literature is difficult because of the different operating conditions and data used, the financial benefit achieved for the charging station herein is considerable and comparable to the profit achieved in the literature [22].
  • A new training scheme is proposed for the Q-learning algorithm. The imposed constraints guarantee customer satisfaction, which can therefore be removed from the optimization objective, allowing the RL agent to focus on maximizing the EV charging station's profit.
  • The proposed strategy is easy to adjust, and a different balance/prioritization between the stakeholders' needs can be selected (see Equation (13)).
  • The strategy takes into account real-time conditions and data. In contrast, some of the implementations already proposed in the literature [18,23] do not do so, and they mainly focus on the day-ahead time window.
The remainder of the paper is structured as follows: Section 2 presents the environment that was developed to represent the operations of an EV charging station, with regard to the pricing and scheduling decisions that are made. The problem is formulated as an MDP. Section 3 describes the proposed solution that is able to both decide the optimal sequence of actions and ensure that customers’ demands are being fulfilled. Furthermore, the architecture of the DRL agent is detailed in that section, along with the training algorithm used. Section 4 details the datasets on which the proposed approach was trained, and the settings of the experiments carried out. Section 5 presents the results of the training experiments that validate the effectiveness of the proposed approach. The agent’s decision-making ability is analyzed, and implications are discussed in the context of two case studies. Finally, Section 6 concludes the paper by stating the primary findings of this work.

2. System Model

2.1. EV Charging Station Environment

We use a DRL-based approach to tackle the problem, since it is data-driven and does not require explicit modeling of the uncertainties, as mentioned previously.
The entity of interest and the basis of the proposed model is an EV charging station environment. The environment is observed in discrete time slots (indexed by $t$). The length (duration) of each time slot in minutes is denoted by $t_{\mathrm{len}}$. The notation and formulation that follow are based on [22].
At the beginning of each time slot $t$ (meaning during the entire slot $t-1$), a set of EVs $I_t$ arrives at the station. We denote by $J_t$ the set of EVs that are already parked in the station before time slot $t$ and have not yet finished charging. Thus, the EVs that require charging at time slot $t$ are denoted by $K_t := I_t \cup J_t$.
Each EV $i \in I_t$ that arrives at the beginning of time slot $t$ is presented with the price rate $r_t$ (determined by the charging station and measured in currency/kWh) and accordingly responds with its charging demand $d_i$ and maximum desired parking time $p_i$. The following assumptions are made:
  • EVs are price-sensitive; i.e., they adjust their charging demands based on the value of $r_t$ provided by the station. Thus, $d_i = D_i(r_t)$, where $D_i(\cdot): \$/\mathrm{kWh} \to \mathrm{kWh}$ is the demand–response function of EV $i$. Obviously, if EV $i$ decides not to accept the presented rate, then $d_i = 0$. Additionally, note that the demand–response function is EV-specific in the general case.
  • The price rate $r_t$ presented to $I_t$ will be constant for each EV in $I_t$ during its parking time.
  • There is a fixed and finite number of individual chargers at the station, $N$. Thus, for all time slots $t$, $|K_t| \le N$, which means that at any given time, at most $N$ EVs are parked at the station. If the number of EVs in $I_t$ arriving at the station exceeds the available chargers, a subset of $I_t$ is selected, in a first-come-first-served manner, to meet the parking capacity of the station.
It directly follows from the above that if $t_i^a$, $p_i$, and $d_i$ denote the arrival time, parking time, and charging demand of EV $i \in I_t$, then $d_i$ must be fulfilled before the departure of the EV at time $t_i^a + p_i$.
In time slot $t$, the station also determines the charging rate $x_{i,t}$ at which each EV $i \in K_t$ will be charged during the time slot.
Let $x_{\max}$ be the maximum individual charging rate (limited by the specifications of every single charger) and $e_{\max}$ be the maximum total charging rate for the charging station. (It is assumed that the charging rate limit of the EV itself is always higher than the charger limit.)
The following constraints hold:
$$0 \le x_{i,t} \le x_{\max}, \quad t = 1, 2, \ldots, \quad i \in K_t \tag{1}$$
$$\sum_{i \in K_t} x_{i,t} \le e_{\max}, \quad t = 1, 2, \ldots \tag{2}$$
$$\alpha \sum_{t = t_i^a}^{t_i^a + p_i} x_{i,t} \ge d_i, \quad \forall i \tag{3}$$
where the coefficient $\alpha := t_{\mathrm{len}} / 60$ converts the charging rate $x_{i,t}$ (kW) assigned to each EV for the current time slot $t$ to the total amount of energy (kWh) that it will have received by the end of the time slot.
Equations (1) and (2) follow by definition. Equation (3) ensures that the charging demand of each EV i is fulfilled by the time it is set to leave the station. In general, the optimal pricing and scheduling policy might result in an EV not being charged at all for several time slots (when the electricity price is expected to be increased, for example), though of course, still being charged in the end. As explained in [24], these idle times could potentially negatively affect the charging infrastructure in terms of its availability, sizing, and cost. That aspect is not studied in the context of this work.
Finally, for each time slot $t$, the set of newly arrived EVs $I_t$ pay a total of
$$\sum_{i \in I_t} r_t \, D_i(r_t) \tag{4}$$
to the charging station, according to the price rate $r_t$ and the requested charge $d_i = D_i(r_t)$ of each EV $i$ (of course, this is not valid unless the charging demand is actually satisfied by the departure time).
At the same time, in order to charge EVs in each time slot $t$, the charging station pays an electricity bill of
$$c_t \sum_{i \in K_t} \alpha \, x_{i,t} \tag{5}$$
where $c_t$ is the electricity price ($/kWh). It is assumed that $c_t$ varies under the real-time pricing scheme [25].
The interactions between the different components of the charging station environment can be seen in Figure 1.

2.2. Problem Formulation Using the MDP Framework

The MDP definition [16] provides the basic framework on which RL agents are formally developed.
In particular, at each time step $t$, the environment is at state $S_t$; the agent interacts with the environment by selecting an action $A_t$; the environment responds by transitioning to the next state $S_{t+1}$, which is returned to the agent along with the reward $R_{t+1}$. The latter is a scalar signal that depends on the environment and the selected action $A_t$. In turn, the agent uses the information of $S_{t+1}$ and $R_{t+1}$ to decide the next action $A_{t+1}$, so the above steps are repeated. This process is illustrated in Figure 2.
The optimization objective of an RL algorithm is to train an agent that selects a series of actions that maximize the total expected return. Equivalently stated, the optimization criterion is:
$$\max \; \mathbb{E}\left[ \sum_t \gamma^t R_t \right] \tag{6}$$
where $\gamma \in [0, 1]$ is the discount rate, which is used to decrease the importance of distant future rewards compared to immediate ones. We proceed to formulate the problem of optimal real-time scheduling and pricing in EV charging stations using the MDP framework.
  • State/Observation Space
The system state at time slot t is defined by:
$$S_t = \left( J_t, \; \{\tilde{d}_j^t\}_{j \in J_t}, \; \{\tilde{p}_j^t\}_{j \in J_t}, \; I_t, \; \{ c_{t'} : t - 24\,\mathrm{h} \le t' \le t \} \right) \tag{7}$$
and includes:
  • The EVs that are parked at the station, $J_t$, along with the residual charging demand $\tilde{d}_j^t$ and residual parking time $\tilde{p}_j^t$ of each EV $j \in J_t$
  • The newly arrived EVs, $I_t$
  • The last 24 h of values of the electricity price time series. Under the assumption that the electricity price changes every $\Delta t$ slots, the 24 h of historical values can be represented by:
    $$c_t, \; c_{t-\Delta t}, \; c_{t-2\Delta t}, \; \ldots, \; c_{t-M\Delta t} \tag{8}$$
    where $M \Delta t = \frac{24 \cdot 60}{t_{\mathrm{len}}}$. Equivalently, the number of samples $M$ is given by:
    $$M = \frac{24 \cdot 60}{t_{\mathrm{len}} \, \Delta t} \tag{9}$$
  • Action Space
At each time slot $t$, the action to be determined by the agent is the tuple
$$A_t = (r_t, e_t); \tag{10}$$
that is, the price rate for new EVs that arrive at the station, and the total charging rate $e_t := \sum_{i \in K_t} x_{i,t}$ to be distributed among parked EVs.
As proved in [22], under certain conditions it is sufficient to determine, at each time slot $t$, the value of $e_t$ instead of the individual charge amounts $x_{i,t}$. In turn, those can be found by applying the least laxity first (LLF) algorithm.
The laxity $l_{i,t}$ of EV $i$ at time slot $t$ is defined as:
$$l_{i,t} := \tilde{p}_i^t - \frac{\tilde{d}_i^t \cdot 60}{x_{\max}} \tag{11}$$
$\tilde{d}_i^t$ is multiplied by 60 so as to convert the energy measured in kWh to kW·min, which in turn is divided by the maximum individual charging rate $x_{\max}$, measured in kW. Intuitively, $l_{i,t}$ represents the "headroom" between the remaining parking time and the minimum charging time required to fulfill the remaining demand.
Having determined the value of $e_t$, LLF schedules the values of $x_{i,t}$ by assigning higher priority to those EVs presenting the least laxity. In other words, according to LLF, the station should first charge those EVs that are most urgent to finish charging. For more details on the LLF algorithm, the reader is referred to [22]. An improved implementation of the LLF algorithm, called constrained LLF, is described in Section 3.1.
  • Reward Modeling
The definition of the reward function is related to the optimization objective of the desired solution. In this work, the problem is studied from the points of view of the EV charging station and the EV owners; thus, the first objective is to maximize the station's profit. Taking into account Equations (4)–(6), the reward at each time slot $t$ is defined as the total payment the station collects from new EVs minus the cost for charging all parked EVs:
$$R_t := \sum_{i \in I_t} r_t \, D_i(r_t) - \alpha \, c_t \, e_t' \tag{12}$$
Equation (12) is valid only as long as each EV $i \in I_t$ is indeed fully charged with its required demand. Otherwise, the difference between the requested charge $D_i(r_t)$ and the actual charge provided should be introduced in the calculations. It should be noted that the use of the constrained total charging rate $e_t'$, which is obtained via the constrained LLF algorithm presented in the next section, ensures that the laxity of each EV remains positive, leading to a successful charge.
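To make the bookkeeping of Equations (4), (5), and (12) concrete, the following minimal Python sketch computes the per-slot reward; the argument names (e.g., new_ev_demand_functions) are illustrative and not taken from the original implementation.

```python
def slot_reward(new_ev_demand_functions, r_t, c_t, e_t_constrained, t_len=5):
    """Per-slot reward of Equation (12).

    new_ev_demand_functions: demand-response callables D_i for EVs arriving in slot t
    r_t: price rate announced to new arrivals ($/kWh)
    c_t: electricity price paid by the station ($/kWh)
    e_t_constrained: constrained total charging rate e_t' (kW)
    t_len: slot length in minutes
    """
    alpha = t_len / 60.0                                              # kW -> kWh per slot, see Eq. (3)
    revenue = sum(r_t * D_i(r_t) for D_i in new_ev_demand_functions)  # Eq. (4)
    cost = alpha * c_t * e_t_constrained                              # Eq. (5) with the constrained rate
    return revenue - cost                                             # Eq. (12)
```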

3. Proposed Solution

3.1. Constrained Least Laxity First

As the agent can freely choose the total charging rate $e_t$, a situation could arise in which the EVs have not been adequately charged. Equivalently, the residual charging demand $\tilde{d}_i^t$ of some EVs would not reach zero by the time they are set to leave the station ($\tilde{p}_i^t = 0$).
In that case, the constraint mentioned in Equation (3) is not satisfied, and EV owners could be discontent with the amount of energy they received during a charging session, thereby violating the second objective set in the previous section. Note that this case is not explicitly handled in [22].
It can be observed that if, at any time slot $t$, the laxity $l_{i,t}$ of an EV $i$ is negative, that EV can no longer be satisfied in its initial energy demand.
Constraining the total charging rate $e_t$ could prevent such an event from occurring. With that in mind, a lower bound for $e_t$ is introduced when applying the LLF algorithm, as described in Algorithm 1. Essentially, the constrained LLF algorithm first charges the EVs with the least laxity at the maximum individual charging rate ($x_{\max}$) until the total charging rate $e_t$ is distributed. The algorithm then constrains the total charging rate, if needed, to prevent any negative laxities from occurring in the next time slot. It should be noted that the use of constrained LLF requires that the charging station is capable of charging all EVs at the maximum charging rate concurrently; i.e., $e_{\max} = N \cdot x_{\max}$.
It should also be mentioned that the constraint of the LLF algorithm could be relaxed, allowing the agent to slightly undercharge EVs, with the aim of improving the charging station's profit. Specifically, each laxity could be allowed to reach sub-zero levels, meaning that the residual demand of some EVs might not be met by the end of a charging session. The corresponding condition in Algorithm 1 would then be written as:
$$l_{i,t+1} < \xi \tag{13}$$
where $\xi < 0$ is the relaxation coefficient. The maximum amount of residual demand that could potentially be left unfulfilled is proportional to $|\xi|$.
Algorithm 1: Constrained least laxity first.
Require: Total charging rate $e_t$
Require: Total number of chargers $N$
Require: Residual demand $\tilde{d}_i^t$, $i \in K_t$
Require: Residual parking time $\tilde{p}_i^t$, $i \in K_t$
Initialize remaining total charging rate $\tilde{e}_t \leftarrow e_t$
for i = 1, N do
    Initialize $x_{i,t} \leftarrow 0$
    Calculate laxity $l_{i,t} \leftarrow \tilde{p}_i^t - \tilde{d}_i^t \cdot 60 / x_{\max}$
    Initialize $l_{i,t+1} \leftarrow l_{i,t}$
end for
while $\tilde{e}_t > 0$ do
    Find the EV $\hat{i}$ with the least laxity that has $x_{\hat{i},t} = 0$
    Update the charging rate of EV $\hat{i}$: $x_{\hat{i},t} \leftarrow \min\left( \tilde{e}_t, \; x_{\max}, \; \tilde{d}_{\hat{i}}^t \cdot \frac{1}{\alpha} \right)$
    Calculate the laxity of EV $\hat{i}$ for the next time slot $t+1$: $l_{\hat{i},t+1} \leftarrow l_{\hat{i},t} + \frac{x_{\hat{i},t} \cdot t_{\mathrm{len}}}{x_{\max}} - t_{\mathrm{len}}$
    Update the remaining total charging rate $\tilde{e}_t \leftarrow \tilde{e}_t - x_{\hat{i},t}$
end while
for i = 1, N do
    if $l_{i,t+1} < 0$ then
        Constrain the charging rate of EV $i$: $x_{i,t} \leftarrow \min\left( x_{\max}, \; \tilde{d}_i^t \cdot \frac{1}{\alpha} \right)$
    end if
end for
Calculate the constrained total charging rate $e_t' \leftarrow \sum_{i=1}^{N} x_{i,t}$
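For readers who prefer code to pseudocode, a possible Python rendering of Algorithm 1 is sketched below. It assumes, as stated above, that $e_{\max} = N \cdot x_{\max}$; the container names (residual_demand, residual_parking) and the default parameter values are illustrative and not taken from the paper.

```python
def constrained_llf(e_t, residual_demand, residual_parking, x_max=30.0, t_len=5.0):
    """Sketch of Algorithm 1 (constrained least laxity first).

    e_t: total charging rate selected by the agent (kW)
    residual_demand[i]: energy still owed to the EV at charger i (kWh)
    residual_parking[i]: parking time left for the EV at charger i (min)
    Returns the individual rates x_{i,t} and the constrained total rate e_t'.
    """
    n = len(residual_demand)
    alpha = t_len / 60.0                       # kW -> kWh over one slot
    x = [0.0] * n
    laxity = [residual_parking[i] - residual_demand[i] * 60.0 / x_max for i in range(n)]
    laxity_next = list(laxity)                 # initialized to l_{i,t}, as in Algorithm 1
    remaining = e_t

    # Serve the EVs with the least laxity first, up to the selected total rate.
    while remaining > 1e-9:
        # Skip chargers that are unassigned but have no remaining demand, to avoid a stall.
        candidates = [i for i in range(n) if x[i] == 0.0 and residual_demand[i] > 0.0]
        if not candidates:
            break
        i_hat = min(candidates, key=lambda i: laxity[i])
        x[i_hat] = min(remaining, x_max, residual_demand[i_hat] / alpha)
        laxity_next[i_hat] = laxity[i_hat] + x[i_hat] * t_len / x_max - t_len
        remaining -= x[i_hat]

    # Constrain: any EV whose laxity would turn negative is charged as fast as needed
    # (replacing the 0 below with a relaxation coefficient xi < 0 allows undercharging, Eq. (13)).
    for i in range(n):
        if laxity_next[i] < 0.0:
            x[i] = min(x_max, residual_demand[i] / alpha)

    e_t_constrained = sum(x)
    return x, e_t_constrained
```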

3.2. Agent Architecture

The agent is modeled as a deep neural network, whose architecture is shown in Figure 3. The state information (Equation (7)) is provided as input to the agent. In particular, the network has:
  • $N$ input nodes, each of which is the laxity of the EV at charger $i$, $l_{i,t}$.
  • M input nodes corresponding to the values of the electricity price over the last 24 h, according to Equations (8) and (9).
  • One node corresponding to the number of EV arrivals observed at the admission zone of the station.
The network’s output approximates the total expected return for each action that the agent can choose in the current state. Since the total expected return per action provides information about the value of each action, it is called the action value. As will be described in Section 3.3, the training objective of the deep neural network is based on the Q-learning algorithm [26].
Most straightforward DRL algorithms, including standard deep Q-learning [17], operate on discrete action spaces; i.e., the set of all available actions $\mathcal{A}_t = \{A_t\}$, $\forall t$, is countable and finite. However, by definition (Equation (10)), the action space in this case is continuous. Therefore, actions are discretized as shown below:
  • Let $w_r = \{w_{r,1}, w_{r,2}, \ldots, w_{r,L}\}$ be the $L$ discrete price rate levels.
  • Let $w_e = \{w_{e,1}, w_{e,2}, \ldots, w_{e,K}\}$ be the $K$ discrete charging rate levels.
  • Then, the action space is:
    $$\mathcal{A}_t = w_r \times w_e = \left\{ (w_{r,1}, w_{e,1}), \ldots, (w_{r,L}, w_{e,1}), (w_{r,1}, w_{e,2}), \ldots, (w_{r,L}, w_{e,K}) \right\},$$
    i.e., the Cartesian product of the discrete level sets, with cardinality $|\mathcal{A}_t| = L \cdot K$.
A limitation of discretizing continuous action spaces is that the number of discrete actions could potentially explode. Therefore, the exploration phase of the algorithm and evaluating all individual actions become impractical [27]. Proper discrete levels should be selected that reflect the solution boundaries for the selected datasets/parameters.
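For illustration, a minimal PyTorch sketch of the discretized action space and the Q-network described above follows. The level values are those listed in Section 4.2; the hidden-layer sizes are assumptions, since the paper does not report the exact architecture.

```python
import itertools

import torch
import torch.nn as nn

# Discrete levels from Section 4.2: L = 6 price levels ($/kWh), K = 11 charging levels (kW).
PRICE_LEVELS = [1, 2, 3, 4, 5, 6]
CHARGE_LEVELS = [0, 60, 120, 180, 240, 300, 360, 420, 480, 540, 600]
ACTIONS = list(itertools.product(PRICE_LEVELS, CHARGE_LEVELS))   # |A| = L * K = 66


class QNetwork(nn.Module):
    """Maps the observation (N laxities, M price samples, arrival count) to one
    action value per discrete (price rate, charging rate) pair."""

    def __init__(self, n_chargers=20, m_price_samples=24, hidden=128):
        # With hourly prices and 5-min slots, Eq. (9) gives M = 24 samples over 24 h.
        super().__init__()
        in_dim = n_chargers + m_price_samples + 1
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, len(ACTIONS)),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)                 # shape: (batch, 66) action values
```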

3.3. Training Approach

During training, the agent consists of two identical deep neural networks: a policy and a target network, as explained in [28]. The target network copies the weights of the policy network every few updates of the latter, lagging behind a few episodes, to improve stability. The agent plays through episodes while storing experiences (observations, actions taken, rewards gained, and new observations) in a replay buffer. This buffer is then sampled at every step, and a batch is used for (continuously) training the networks.
A behavior policy is used to explore the environment while collecting data, to prevent the agent from adhering to a sub-optimal policy due to local minima of the loss function. The $\epsilon$-greedy policy is commonly used to achieve such goals. According to the $\epsilon$-greedy policy, the agent selects the greedy action that maximizes the estimated action value with probability $1 - \epsilon$ and a random action with probability $\epsilon$. As training progresses, the probability $\epsilon$ decays to ensure convergence.
As mentioned in Section 2.2, the charging rate $e_t$ selected by the agent should be above a lower bound in order to satisfy the problem formulation constraints. However, the action space is discretized, as described in Section 3.2. Thus, if the constrained charging rate $e_t'$ obtained by the constrained LLF algorithm is higher than the selected $e_t$, then the agent is forced to select the discrete charging level $w_{e,i}$ which is closest to $e_t'$ and satisfies the inequality $w_{e,i} \ge e_t'$. On the other hand, the selected price rate $r_t$ does not have any constraints, so it is left unchanged.
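This snapping of the selected charging level can be written as a one-line rule; the sketch below assumes the discrete levels of Section 4.2 and is not code from the original implementation.

```python
CHARGE_LEVELS = [0, 60, 120, 180, 240, 300, 360, 420, 480, 540, 600]  # kW, Section 4.2


def constrain_selected_action(r_t, e_t_selected, e_t_constrained, levels=CHARGE_LEVELS):
    """If the constrained LLF rate e_t' exceeds the selected level, replace the
    selected level with the smallest discrete level w_{e,i} >= e_t'.
    The price rate r_t is left unchanged."""
    if e_t_constrained > e_t_selected:
        # e_t' <= e_max = max(levels), so the minimum below is always well-defined.
        e_t_selected = min(w for w in levels if w >= e_t_constrained)
    return r_t, e_t_selected
```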
Furthermore, it was experimentally found that gradually increasing the episode duration helps the agent grasp the EVs’ charging cycle. Specifically, the small initial episode duration provides the agent with data of the environmental state when the station is not yet busy. As the agent learns to charge a small number of EVs at the start of the episode, the duration increases, allowing the agent to apply the knowledge gained to charge many EVs and schedule charging concurrently.
The training approach is summarized in Algorithm 2.
Algorithm 2: Constrained deep Q-learning.
Require: Episode length schedule function, $h$
Require: Exploration rate schedule, $l$
Initialize replay memory $D$ to capacity $N$
Initialize the action-value function $Q(s, a; \theta)$ with random weights $\theta$
Initialize the target action-value function $\hat{Q}$ with weights $\theta^- = \theta$
for episode = 1, E do
    Initialize state $s_1$
    Get the current episode duration $T = h(\mathrm{episode})$
    for t = 1, T do
        Get the exploration rate $\epsilon = l(\mathrm{episode}, t)$
        With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \arg\max_a Q(s_t, a; \theta)$
        Constrain $a_t$ using the constrained LLF algorithm
        Execute $a_t$ and observe the reward $R_t$ and the next state $s_{t+1}$
        Store the transition $(s_t, a_t, R_t, s_{t+1})$ in $D$
        Sample a random minibatch of transitions $(s_j, a_j, R_j, s_{j+1})$ from $D$
        Set the target
        $$y_j = \begin{cases} R_j, & \text{if } s_{j+1} \text{ is a final state} \\ R_j + \gamma \max_{a'} \hat{Q}(s_{j+1}, a'; \theta^-), & \text{otherwise} \end{cases}$$
        Perform a gradient descent step on $\left( y_j - Q(s_j, a_j; \theta) \right)^2$
        Every C steps copy the policy network weights to the target network: $\theta^- \leftarrow \theta$
    end for
end for
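A compact Python/PyTorch sketch of Algorithm 2 is shown below. The environment interface (env.reset, env.step, env.constrain), the replay-buffer handling, and the optimizer hyperparameters (learning rate, batch size, buffer capacity, sync period) are assumptions made for illustration and are not specified in the paper; the schedule functions h and l_eps correspond to the episode-duration and ϵ-decay schemes of Section 4.2.

```python
import random
from collections import deque

import torch
import torch.nn.functional as F


def train(env, policy_net, target_net, h, l_eps, n_actions,
          n_episodes=1200, gamma=0.99, batch_size=64,
          buffer_capacity=10_000, target_sync=100, lr=1e-3):
    """Constrained deep Q-learning (Algorithm 2). h(episode) returns the episode
    length in slots; l_eps(episode, t) returns the exploration rate."""
    optimizer = torch.optim.Adam(policy_net.parameters(), lr=lr)
    target_net.load_state_dict(policy_net.state_dict())     # theta^- = theta
    replay = deque(maxlen=buffer_capacity)
    steps = 0

    for episode in range(n_episodes):
        state = env.reset()                                  # observation as a 1-D tensor
        for t in range(h(episode)):
            eps = l_eps(episode, t)
            if random.random() < eps:                        # explore
                action = random.randrange(n_actions)
            else:                                            # exploit
                with torch.no_grad():
                    action = policy_net(state.unsqueeze(0)).argmax(dim=1).item()
            action = env.constrain(action)                   # constrained-LLF snapping
            next_state, reward, done = env.step(action)
            replay.append((state, action, reward, next_state, done))
            state = next_state

            if len(replay) >= batch_size:
                s, a, r, s2, d = zip(*random.sample(replay, batch_size))
                s, s2 = torch.stack(s), torch.stack(s2)
                a = torch.tensor(a, dtype=torch.int64)
                r = torch.tensor(r, dtype=torch.float32)
                d = torch.tensor(d, dtype=torch.float32)
                q = policy_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
                with torch.no_grad():                        # target y_j of Algorithm 2
                    y = r + gamma * (1.0 - d) * target_net(s2).max(dim=1).values
                loss = F.mse_loss(q, y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            steps += 1
            if steps % target_sync == 0:                     # every C steps: theta^- <- theta
                target_net.load_state_dict(policy_net.state_dict())
```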

4. Evaluation Methodology

4.1. Datasets

Two datasets were used to model the EV charging station environment: one for the EV arrivals at the station and one for the hourly electricity price the station pays to the utility company for the energy purchased during each hour.
The dataset for the EV arrivals was provided through the open-source code repository of the authors of [22]. It contains vehicle arrivals per 30 s for the Richards Ave station near downtown Davis. The EVs are divided into three types, namely, (a) emergent, (b) normal, and (c) residential. Each type has different demand preferences and available parking time, which are described in Section 4.2. The following preprocessing steps were performed on the data points to match them to a realistic charging station scenario:
  • They were aggregated to 60 min intervals.
  • They were scaled by a factor of 1/100 and rounded to the closest integer.
  • They were then redistributed to 1 min intervals, by randomly assigning each 1 h total to intermediate minutes using a uniform distribution (see the sketch after this list).
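A rough numpy sketch of the above steps, assuming the raw data are per-30-s arrival counts for a single charging profile (array and function names are illustrative):

```python
import numpy as np


def preprocess_arrivals(counts_30s, seed=0):
    """counts_30s: 1-D array of EV arrival counts per 30 s for one profile."""
    rng = np.random.default_rng(seed)
    counts_30s = np.asarray(counts_30s)
    # 1. Aggregate the 30-s counts into hourly totals (120 samples of 30 s per hour).
    n_hours = len(counts_30s) // 120
    hourly = counts_30s[: n_hours * 120].reshape(n_hours, 120).sum(axis=1)
    # 2. Scale by 1/100 and round to the closest integer.
    hourly = np.rint(hourly / 100).astype(int)
    # 3. Redistribute each hourly total uniformly at random over its 60 minutes.
    per_minute = np.zeros(n_hours * 60, dtype=int)
    for h, n in enumerate(hourly):
        minutes = rng.integers(0, 60, size=n)         # a uniform minute within hour h
        np.add.at(per_minute, h * 60 + minutes, 1)
    return per_minute
```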
An overview of the average number of EV arrivals per hour of the day for the different charging profiles can be observed in Figure 4. The averaging was performed for every hour separately, for all the days that are included in the dataset.
For the electricity price, a publicly available dataset from the Korean grid [29] was utilized. It contains hourly prices per kWh of energy purchased from the grid. The dates of the observations range from 1 July 2021 to 31 July 2021, matching the month of the EV arrivals dataset mentioned above. Initially, a currency conversion was performed, and the price was scaled to achieve greater variance during a day and challenge the agent to adapt to intraday fluctuations. The conversion function is $f(x) = 0.00084 \, x^2 / 3$, where $x$ is the price in Korean ₩ and $f(x)$ is the price in US $. An overview of the average electricity price per hour of the day can be observed in Figure 5. The averaging was performed in a similar manner as explained for the EV arrivals dataset.

4.2. Experimental Setup

Before conducting the experiments, a set of hyperparameters was selected. These include the following:
  • Chargers of the station: $N = 20$.
  • Maximum charging rate per charger: According to the U.S. Department of Energy (https://afdc.energy.gov/fuels/electricity_infrastructure.html, accessed on 15 February 2022), most EVs on the road today are not capable of charging at rates higher than 50 kW. Thus, a more conservative value of $x_{\max} = 30$ kW was selected. Note that 22 kW is the closest standard charging rate (i.e., Level 2 EV charging), but the purpose of this work is to present a more general approach.
  • Maximum total charging rate: $e_{\max} = N \cdot x_{\max} = 600$ kW.
  • Time slot length: $t_{\mathrm{len}} = 5$ min.
  • Episode duration: 1 day, or 1440 min, or 288 time slots.
  • Discrete price rate levels: $\{1, 2, 3, 4, 5, 6\}$ $/kWh.
  • Discrete charging rate levels: $\{0, 60, 120, 180, 240, 300, 360, 420, 480, 540, 600\}$ kW.
  • Cardinality of the action space: $|\mathcal{A}_t| = L \cdot K = 66$.
  • Demand–Response Function
The demand–response function is modeled as a linear equation of the form:
$$D_i(r) = \beta_1 r + \beta_2 + \mathcal{N}(0, \sigma^2) \tag{14}$$
where $\beta_1$, $\beta_2$, and $\sigma$ are the parameters of each EV $i$, and $\mathcal{N}(0, \sigma^2)$ is Gaussian noise with mean $\mu = 0$ and standard deviation $\sigma$. Following the type division of the EV arrivals dataset, EVs are grouped into three different types, each with specific parameters, which are presented in Table 1. These parameters were adopted from the related work in [22]. The respective plot of the demand–response functions is illustrated in Figure 6. As can be seen in [30], the potential charging demands are in line with the battery capacities of some of the latest EV models. The dotted lines show the spread of the Gaussian noise by adding one standard deviation $\sigma$ to each demand–response function. These also provide an approximate limit to the maximum price that the customers of each type are willing to pay to the station.
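As an illustration, the per-type demand response can be sampled as follows; the parameters are those of Table 1 (as reconstructed here, so treat them as indicative), and a negative draw is clipped to zero, corresponding to an EV that declines the offered rate.

```python
import numpy as np

# (sigma, beta1 [kWh/$], beta2 [kWh]) per EV type, following Table 1.
EV_TYPES = {
    "emergent":    (4.47,  -1.0,   6.0),
    "normal":      (3.96,  -4.0,  15.0),
    "residential": (2.63, -25.0, 100.0),
}


def demand_response(ev_type, price, rng=None):
    """Sample D_i(r) = beta1 * r + beta2 + N(0, sigma^2) for the given EV type."""
    rng = rng or np.random.default_rng()
    sigma, beta1, beta2 = EV_TYPES[ev_type]
    demand = beta1 * price + beta2 + rng.normal(0.0, sigma)
    return max(demand, 0.0)    # d_i = 0 if the EV declines the presented rate
```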
  • ϵ -Greedy Policy
The decaying probability $\epsilon$ of the $\epsilon$-greedy policy is calculated by the equation:
$$\epsilon = \epsilon_{\mathrm{end}} + \left( \epsilon_{\mathrm{start}} - \epsilon_{\mathrm{end}} \right) \cdot \exp\left( -\frac{x}{\epsilon_{\mathrm{decay}}} \right) \tag{15}$$
where $x$ is the episode number; $\epsilon_{\mathrm{start}} = 0.9$ and $\epsilon_{\mathrm{end}} = 0.05$ are the initial and final probabilities of a random action (for $x = 0$ and $x \to \infty$, respectively); and $\epsilon_{\mathrm{decay}} = 200$ is the rate of decay for $\epsilon$. A plot of the above equation can be observed in Figure 7. Essentially, the probability $\epsilon$ converges to its final value after 800 episodes.
  • Episode Duration
The episode duration starts from 10 timeslots and increases by one timeslot every two episodes, up to 288 timeslots (a complete day cycle). Figure 8 shows the plot of the episode duration for each episode during training.
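The two schedules above can be written compactly as follows; these functions play the roles of h and l in Algorithm 2 and are a straightforward transcription of Equation (15) and the episode-duration rule.

```python
import math


def epsilon(episode, eps_start=0.9, eps_end=0.05, eps_decay=200):
    """Exploration schedule (Equation (15))."""
    return eps_end + (eps_start - eps_end) * math.exp(-episode / eps_decay)


def episode_duration(episode, start=10, max_slots=288):
    """Episode-length schedule: start at 10 slots, add one slot every two episodes."""
    return min(start + episode // 2, max_slots)
```

When plugged into the training-loop sketch of Section 3.3, the unused time-slot argument of the exploration schedule can simply be ignored, e.g., l_eps = lambda episode, t: epsilon(episode).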

5. Results

5.1. Training Results

Training is performed over 1200 episodes and is repeated five times to ensure consistency. For each episode, a day is selected randomly from the EV arrivals and electricity price datasets, which include 31 days in total. The training curves are presented in Figure 9, where the accumulated reward and the invalid actions per episode are plotted. An invalid action refers to a selected charging rate $e_t$ that had to be constrained to a higher charging level because it would have left some EVs insufficiently charged. The best training run is highlighted, and in Figure 10, the results from that run are averaged over a moving window of 50 episodes.
It can be observed that the model gradually achieves better total reward per episode, but it also increases invalid actions taken up to a certain point. During the algorithm’s exploration phase, the agent is mostly choosing random actions and observes the rewards accumulated. Then, during the exploitation phase, it minimizes invalid actions and further increases reward. At that point, the agent mostly takes deterministic actions based on the values calculated for each observation–action pair and tries to find the sequence of actions that yields the best reward.
The maximum reward achieved over the five training runs was 5403 $. The mean value of the maximum reward per run was 4692 $, which indicates that training reaches a high accumulated reward consistently.

5.2. Policy Analysis

In this subsection, a trained model is examined for its policy during the day. The agent’s actions are monitored in response to electricity price and residual demand. From Figure 11, it can be concluded that the optimal policy that the agent follows dictates keeping prices provided to customers at a constant level for most of the time slots. Furthermore, Figure 12 indicates that the EVs are charged at maximum charging rates most of the time, since an increasing total residual demand increases the charging rate. Hence, the optimal policy could be summarized as, “Keep the price stable at 3 $ and charge as much as possible”.
Figure 13 and Figure 14 present an overview of the residual demand per charger. The agent keeps demands under a threshold and satisfies them as soon as possible. This is shown by the very few flat segments, each of which implies that an EV is not being charged for some time slots.

5.3. Case Study: Increasing Episode Time Horizon

It could be argued that, due to the smaller number of arrivals and the lower electricity prices during night hours, the agent should charge more conservatively towards the end of the day to maximize profit. A three-day episode of training and testing was conducted with that in mind. The episode duration was incremented in the same manner as for the one-day maximum duration and is illustrated in Figure 15. Due to the higher number of episodes needed to reach the maximum episode duration (i.e., 864 timeslots), the $\epsilon_{\mathrm{decay}}$ of the $\epsilon$-greedy policy was adjusted to 600 to compensate for the more extensive exploration phase and to enable the agent to choose actions randomly while the episode duration is still increasing. The plot of the adjusted probability of a random action can be observed in Figure 16.
Figure 17, Figure 18, Figure 19, Figure 20, Figure 21 and Figure 22 show the results in a similar manner as in the one-day experiment. Regarding the price, the agent seems to follow a similar optimal policy, which is to keep it constant at $3, according to Figure 19. There are also some $2 actions when the price is dropping, suggesting that the agent attempts to attract extra energy demand to fulfill during low-price time slots. On the other hand, the charging rates do not exceed 240 kW during peak demand times, as seen in Figure 20, contrary to the 540 kW maximum charging rate for the one-day episode duration, as illustrated in Figure 12. This means that the agent adapts to the expanded episode duration and attempts to delay charging EVs when close to a spike in the electricity price. Another indication of this is evident in Figure 21 and Figure 22, since flat lines can be observed for EVs with high demands during time slots 100 to 300.
The behavior mentioned above negatively impacts the actual reward for the selected price parameters: the accumulated reward for three-day episodes is only a little over double the accumulated reward for one-day ones, as can be deduced by comparing Figure 17 with Figure 9 for the final episodes of training. However, it should be noted that one-day episodes may avoid charging costs at the end of each day, since not all EVs are fully charged when an episode ends. Three-day episodes include those costs in the accumulated reward for the two nights between the three days.

5.4. Case Study: Removing Constraints

An experiment with no constraints was conducted to test the efficiency of the constraining mechanism and provide a way of comparing the proposed method with models of the respective literature, such as [22]. The method used for the experiment is similar to the proposed one, with the following key difference:
The LLF algorithm was used with no constraints. Specifically, EVs with the least laxity were charged with the maximum individual charging rate x max until the total charging rate e t selected by the agent was distributed. No further checks concerning negative laxities were performed, introducing the possibility of an EV reaching its departure time with unfulfilled demand. Whenever this occurred during an episode, the unfulfilled demand amount (kWh) was monitored, and the accumulated unfulfilled demand is presented at the end.
The training curves of the unconstrained model are presented in Figure 23 and Figure 24. The unconstrained model achieved a maximum reward of 4044 $; however, this reward was achieved with most EVs leaving the station with unfulfilled energy demands (a total of 1177 kWh). Furthermore, as the accumulated reward increased, the accumulated unfulfilled demand increased proportionately. This observation stems from the fact that the agent was unconstrained and had no measure of customer dissatisfaction included in its reward, leaving the single optimization objective of maximizing the charging station's profit. Thus, the agent learned to charge customers for electricity that it never provided: the total charging rate $e_t$ that it selected for each time slot $t$ was almost always zero, while the announced price rate $r_t$ was non-zero.
In [22], a constraining method is proposed and mathematically proven to provide a feasible solution that satisfies all demands. However, this method is non-trivial and cannot be implemented with an online agent, as it requires future information about the EV arrivals at the station, which is not available. In the present context, the constraining mechanism ensures customer satisfaction while optimizing the charging station's profit.

6. Conclusions

In this paper, a DRL-based approach was developed to solve optimal pricing and scheduling in an EV charging station under a dynamic, varying electricity price scheme. The proposed approach is model-free, which means that the DRL agent can operate under uncertainties, such as EV arrivals, their charging demands, and their stated parking times, without explicit knowledge of the underlying randomness. Instead, it can learn directly from patterns present in real-world data. In addition, a charging scheduling algorithm was proposed, and the standard deep Q-learning algorithm was modified to ensure that EVs are adequately charged. Experimental results validated the effectiveness of the proposed solution in two ways: on the one hand, the trained agent managed to follow a policy that maximizes the profit of the charging station; at the same time, the EV owners' charging demands were successfully fulfilled. Finally, it directly follows from the above analysis that the proposed system can make online decisions in real-time or near real-time by setting appropriate values for the duration of each slot.
The work presented in this study can be extended in many different directions. Some of them are listed below:
  • As a first step, the technique of constraining the estimated charging rate could be incorporated into different DRL training algorithms that would operate on continuous action spaces, thereby lifting the need for discretizing scheduling and pricing actions.
  • This work could serve as the basis for different formulations that consider more stakeholders, e.g., the grid operators and the corresponding constraints.
  • Furthermore, the assumption was made that the total charging rate requested by the charging station is constrained only by the number of individual chargers. Consequently, potentially all parked EVs can be scheduled to charge during each slot; respecting additional constraints placed by the grid operator is an aspect that naturally arises as a potential future extension.
  • In addition, more financial tools can be considered in modeling the relationship between different stakeholders.
  • Finally, a more automated version of such a system can also be tailor-made for real-time EV detection, on a non-intrusive load monitoring (NILM) basis [31].

Author Contributions

Conceptualization, methodology, software, visualization, formal analysis, and writing—original draft preparation, D.A. and A.P.; validation, investigation, and writing—review and editing, D.A., A.P., A.C. and D.I.D.; resources, D.A., A.P. and A.C.; data curation, D.A. and A.P.; supervision, A.M. and D.I.D.; project administration, funding acquisition, A.M. and D.I.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been co–financed by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH–CREATE–INNOVATE (project: T2EDK-03898).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The publicly available datasets for EV arrivals and electricity prices have been used in this study.

Conflicts of Interest

The information and views set out in this article are those of the authors and do not necessarily reflect the official opinion of the European Commission.

Nomenclature

t: The time slot index
$t_{\mathrm{len}}$: The length (duration) of each time slot
$I_t$: The set of EVs that have arrived at the station at the beginning of time slot t
$J_t$: The set of EVs that are already parked in the station before time slot t
$K_t$: The set of EVs that require charging at time slot t
$r_t$: The price rate announced to the customers at time slot t
$t_i^a$: The arrival time of EV i
$d_i$: The charging demand of EV i
$p_i$: The maximum desired parking time of EV i
$D_i(\cdot)$: The demand–response function of EV i
$\beta_1, \beta_2, \sigma$: The parameters of the demand–response function
N: The total number of chargers in the station
$x_{i,t}$: The charging rate at which EV i will be charged during time slot t
$x_{\max}$: The maximum individual charging rate for every charger
$e_t$: The total charging rate at time slot t
$e_t'$: The constrained total charging rate at time slot t
$e_{\max}$: The maximum total charging rate for the charging station
$\alpha$: The charging rate to energy conversion coefficient
$c_t$: The electricity price that the charging station pays to the utility company
$S_t, A_t, R_{t+1}, S_{t+1}$: The 4-tuple of elements of the Markov decision process
$\gamma$: The discount rate
$\tilde{d}_i^t$: The residual charging demand for EV i at time slot t
$\tilde{p}_i^t$: The residual parking time for EV i at time slot t
$l_{i,t}$: The laxity of EV i at time slot t
$\xi$: The relaxation coefficient
$\mathcal{A}_t$: The set of all available actions
$w_r$: The set of discrete price rate levels
L: The number of discrete price rate levels
$w_e$: The set of discrete charging rate levels
K: The number of discrete charging rate levels
$\epsilon$: The probability of a random action of the $\epsilon$-greedy policy
$\epsilon_{\mathrm{start}}, \epsilon_{\mathrm{end}}, \epsilon_{\mathrm{decay}}$: The parameters of the $\epsilon$-greedy policy

References

  1. Azam, A.; Rafiq, M.; Shafique, M.; Yuan, J. Towards Achieving Environmental Sustainability: The Role of Nuclear Energy, Renewable Energy, and ICT in the Top-Five Carbon Emitting Countries. Front. Energy Res. 2021, 9, 804706.
  2. Shafique, M.; Azam, A.; Rafiq, M.; Luo, X. Evaluating the Relationship between Freight Transport, Economic Prosperity, Urbanization, and CO2 Emissions: Evidence from Hong Kong, Singapore, and South Korea. Sustainability 2020, 12, 664.
  3. Shafique, M.; Azam, A.; Rafiq, M.; Luo, X. Investigating the nexus among transport, economic growth and environmental degradation: Evidence from panel ARDL approach. Transp. Policy 2021, 109, 61–71.
  4. Shafique, M.; Luo, X. Environmental life cycle assessment of battery electric vehicles from the current and future energy mix perspective. J. Environ. Manag. 2022, 303, 114050.
  5. Yilmaz, M.; Krein, P.T. Review of the Impact of Vehicle-to-Grid Technologies on Distribution Systems and Utility Interfaces. IEEE Trans. Power Electron. 2013, 28, 5673–5689.
  6. Shafique, M.; Azam, A.; Rafiq, M.; Luo, X. Life cycle assessment of electric vehicles and internal combustion engine vehicles: A case study of Hong Kong. Res. Transp. Econ. 2021, 101112.
  7. International Energy Agency. Global EV Outlook. In Scaling-Up the Transition to Electric Mobility; IEA: London, UK, 2019.
  8. Statharas, S.; Moysoglou, Y.; Siskos, P.; Capros, P. Simulating the Evolution of Business Models for Electricity Recharging Infrastructure Development by 2030: A Case Study for Greece. Energies 2021, 14, 2345.
  9. Almaghrebi, A.; Aljuheshi, F.; Rafaie, M.; James, K.; Alahmad, M. Data-Driven Charging Demand Prediction at Public Charging Stations Using Supervised Machine Learning Regression Methods. Energies 2020, 13, 4231.
  10. Moghaddam, V.; Yazdani, A.; Wang, H.; Parlevliet, D.; Shahnia, F. An Online Reinforcement Learning Approach for Dynamic Pricing of Electric Vehicle Charging Stations. IEEE Access 2020, 8, 130305–130313.
  11. Ghotge, R.; Snow, Y.; Farahani, S.; Lukszo, Z.; van Wijk, A. Optimized Scheduling of EV Charging in Solar Parking Lots for Local Peak Reduction under EV Demand Uncertainty. Energies 2020, 13, 1275.
  12. He, Y.; Venkatesh, B.; Guan, L. Optimal Scheduling for Charging and Discharging of Electric Vehicles. IEEE Trans. Smart Grid 2012, 3, 1095–1105.
  13. Tang, W.; Zhang, Y.J. A Model Predictive Control Approach for Low-Complexity Electric Vehicle Charging Scheduling: Optimality and Scalability. IEEE Trans. Power Syst. 2017, 32, 1050–1063.
  14. Zhang, L.; Li, Y. Optimal Management for Parking-Lot Electric Vehicle Charging by Two-Stage Approximate Dynamic Programming. IEEE Trans. Smart Grid 2017, 8, 1722–1730.
  15. Bellman, R. Dynamic Programming. Science 1966, 153, 34–37.
  16. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; The MIT Press: Cambridge, MA, USA, 2018.
  17. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M.A. Playing Atari with Deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602.
  18. Abdullah, H.M.; Gastli, A.; Ben-Brahim, L. Reinforcement Learning Based EV Charging Management Systems—A Review. IEEE Access 2021, 9, 41506–41531.
  19. Lee, J.; Lee, E.; Kim, J. Electric Vehicle Charging and Discharging Algorithm Based on Reinforcement Learning with Data-Driven Approach in Dynamic Pricing Scheme. Energies 2020, 13, 1950.
  20. Zhang, F.; Yang, Q.; An, D. CDDPG: A Deep-Reinforcement-Learning-Based Approach for Electric Vehicle Charging Control. IEEE Internet Things J. 2021, 8, 3075–3087.
  21. Wan, Z.; Li, H.; He, H.; Prokhorov, D. Model-Free Real-Time EV Charging Scheduling Based on Deep Reinforcement Learning. IEEE Trans. Smart Grid 2019, 10, 5246–5257.
  22. Wang, S.; Bi, S.; Zhang, Y.A. Reinforcement Learning for Real-Time Pricing and Scheduling Control in EV Charging Stations. IEEE Trans. Ind. Inform. 2021, 17, 849–859.
  23. Chis, A.; Lunden, J.; Koivunen, V. Reinforcement Learning-Based Plug-in Electric Vehicle Charging with Forecasted Price. IEEE Trans. Veh. Technol. 2016, 66, 3674–3684.
  24. Lucas, A.; Barranco, R.; Refa, N. EV Idle Time Estimation on Charging Infrastructure, Comparing Supervised Machine Learning Regressions. Energies 2019, 12, 269.
  25. Deng, R.; Yang, Z.; Chow, M.Y.; Chen, J. A Survey on Demand Response in Smart Grids: Mathematical Models and Approaches. IEEE Trans. Ind. Inform. 2015, 11, 570–582.
  26. Watkins, C.J.C.H. Learning from Delayed Rewards. Ph.D. Thesis, King's College, Cambridge, UK, 1989.
  27. Pazis, J.; Lagoudakis, M.G. Reinforcement learning in multidimensional continuous action spaces. In Proceedings of the 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), Paris, France, 11–15 April 2011; pp. 97–104.
  28. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.A.; Fidjeland, A.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
  29. Korea Power Exchange. System Marginal Price. Data Retrieved from the Electric Power Statistics Information System. 2022. Available online: http://epsis.kpx.or.kr/epsisnew/selectEkmaSmpShdGrid.do?menuId=040202&locale=eng (accessed on 8 February 2022).
  30. Al-Saadi, M.; Olmos, J.; Saez-de Ibarra, A.; Van Mierlo, J.; Berecibar, M. Fast Charging Impact on the Lithium-Ion Batteries' Lifetime and Cost-Effective Battery Sizing in Heavy-Duty Electric Vehicles Applications. Energies 2022, 15, 1278.
  31. Athanasiadis, C.L.; Papadopoulos, T.A.; Doukas, D.I. Real-time non-intrusive load monitoring: A light-weight and scalable approach. Energy Build. 2021, 253, 111523.
Figure 1. The RL environment for an EV charging station.
Figure 2. Interactions between an agent and its environment in an RL setting.
Figure 3. DQN agent architecture.
Figure 4. Average EV arrivals per hour of day per charging profile.
Figure 5. Average electricity price per hour of day.
Figure 6. Plot of demand–response function for each EV type.
Figure 7. Plot of random probability for ϵ-greedy policy during training.
Figure 8. Plot of incremental episode duration during training.
Figure 9. Training curves of the proposed model, with the best of five runs being highlighted.
Figure 10. Training curves of the best run of the proposed model, averaged over a moving window of 50 episodes.
Figure 11. Price announced to customers vs. electricity price paid to the utility company.
Figure 12. Total residual demand vs. total charging rate.
Figure 13. Residual demand per charger (20 in total).
Figure 14. Residual demand for a single charger.
Figure 15. Plot of incremental episode duration during training (three-day episode duration).
Figure 16. Plot of random probability for ϵ-greedy policy during training (three-day episode duration).
Figure 17. Training curves of the proposed model (three-day episode duration).
Figure 18. Training curves of the proposed model, averaged over a moving window of 50 episodes (three-day episode duration).
Figure 19. Price announced to customers vs. electricity price paid to utility company (three-day episode duration).
Figure 20. Total residual demand vs. total charging rate (three-day episode duration).
Figure 21. Residual demand per charger for the first 350 time slots (three-day episode duration).
Figure 22. Residual demand for a single charger for the first 350 time slots (three-day episode duration).
Figure 23. Training curves of the unconstrained model.
Figure 24. Training curves of the unconstrained model, averaged over a moving window of 50 episodes.
Table 1. Demand–response function parameters and parking time for each EV type.

EV Type     | Standard Deviation σ | β1 [kWh/$] | β2 [kWh] | Parking Time
Emergent    | 4.47                 | −1         | 6        | 30
Normal      | 3.96                 | −4         | 15       | 120
Residential | 2.63                 | −25        | 100      | 720
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
