Article

Deep Reinforcement Learning for Hybrid Energy Storage Systems: Balancing Lead and Hydrogen Storage

Equipes Traitement de l’Information et Systèmes, UMR 8051, National Center for Scientific Research, ENSEA, CY Cergy Paris University, 95000 Cergy-Pontoise, France
* Author to whom correspondence should be addressed.
Current address: 6 avenue du Ponceau, 95000 Cergy-Pontoise, France.
Energies 2021, 14(15), 4706; https://doi.org/10.3390/en14154706
Submission received: 1 July 2021 / Revised: 20 July 2021 / Accepted: 26 July 2021 / Published: 3 August 2021
(This article belongs to the Special Issue Machine Learning and Deep Learning for Energy Systems)

Abstract

We address the control of a hybrid energy storage system composed of a lead battery and hydrogen storage. Powered by photovoltaic panels, it feeds a partially islanded building. We aim to minimize the building's carbon emissions over a long-term period while ensuring that 35% of the building consumption is powered using energy produced on site. To achieve this long-term goal, we propose to learn a control policy, as a function of the building and storage states, using a Deep Reinforcement Learning approach. We reformulate the problem to reduce the action space dimension to one, which greatly improves the performance of the proposed approach. Given this reformulation, we propose a new algorithm, DDPG α rep, which uses a Deep Deterministic Policy Gradient (DDPG) to learn the policy. Once learned, this policy is used to control the storage. Simulations show that the higher the hydrogen storage efficiency, the more effective the learning.

1. Introduction

Energy storage is a crucial question for the usage of photovoltaic (PV) energy because of its time-varying behavior. In the ÉcoBioH2 project [1], we consider a building whose solar panels supply several usages. The building includes a datacenter that is constrained to be powered by solar energy only. This low-carbon-footprint building has lead and hydrogen storage capabilities. Our goal is to control this hybrid energy storage system so as to minimize the carbon impact.
The building [1] is partially islanded, with a datacenter that can only be powered by the energy produced by the building's solar panels. The proportion of the energy consumed by the building, including the datacenter, that is produced by the PV defines the self-consumption. The ÉcoBioH2 project requires the self-consumption to be at least 35%. Demand flexibility, where the load is adjusted to meet production, is not an option in this building, so energy storage is needed to power the datacenter. Daily variations of the energy production can be mitigated using lead or lithium batteries. However, due to their low capacity density, such technologies cannot be used for interseasonal storage. Hydrogen energy storage, on the other hand, is a promising solution to this problem, enabling yearly low-volume, high-capacity, low-carbon-emission energy storage. Unfortunately, it is plagued by its low storage efficiency. Combining hydrogen storage with lead batteries in a hybrid energy storage system enables us to leverage the advantages of both energy storages [2]. Hybrid storage has been shown to perform well in islanded emergency situations [3]. Lead batteries can deliver a large load, but not for long. Hydrogen storage, on the other hand, only supports a small load but has a higher capacity than lead or lithium batteries, allowing a longer discharge. The question becomes: how should we control the charge and discharge of each storage and balance between the short-term battery and the long-term hydrogen storage?
We therefore face several opposing short-term and long-term goals and constraints, summarized in Table 1. Minimizing the carbon impact discourages using batteries, as batteries emit carbon during their lifecycle. It also encourages using H$_2$ storage when needed, as less carbon is emitted per kW·h than with battery storage. The less energy is stored, the less energy is lost to storage efficiency. This leaves more energy available to the building, so, in the short term, self-consumption increases. However, the datacenter is then not guaranteed to have enough energy available in the long term. Keeping the datacenter powered by solar energy requires storing as much energy as possible. Nevertheless, some energy is lost during charge and discharge, leading to a lower self-consumption. This energy should be stored in the battery first, since less energy is lost to efficiency, but this results in higher emissions. Keeping the datacenter powered is a long-term objective, as previous decisions impact the current state, which constrains our capacity to power the datacenter in the future. Moreover, because of their capacities, our two energy storage systems play opposite roles. Battery storage has a limited capacity: it withstands short-term production variations. Hydrogen storage has an enormous capacity: it helps with long-term, interseasonal variations.
Managing a long-term storage system means that the control system needs to choose actions (charge or discharge, and storage type) depending on their long-term consequences. We consider a duration of several months. We want to minimize the carbon impact while having enough energy for at least a complete year, under the constraint that the datacenter is powered by solar energy. Using convex optimization to solve this problem would require precise forecasts of the energy production and consumption for the whole year; one cannot have months of such forecasts in advance [4,5]. In [6], the authors try to minimize the cost and limit their study to 3 days only. Methods based on genetic algorithms, such as [7], require a detailed model of the building usages and energy production, which is not realistic in our case since not all usages are known in advance. We also want to allow flexible usages. Therefore, we propose to adopt a solution that can cope with light domain expertise. If the input and output data of the problem are accessible, supervised learning and deep learning can be considered [8]. Since we have conflicting goals with different horizons, reinforcement learning is an interesting approach [9]. The solution we are looking for should provide a suitable control policy for our hybrid storage system. Most reinforcement learning methods quantize the action space to avoid having interdependent action space bounds [10]. However, such a solution comes with a loss of precision in the action selection, and it requires more data for learning.
Taking these aspects into account, we address in the sequel a problem formulation allowing the deployment of non-quantized Deep Reinforcement Learning (DRL) [11] to learn the storage decision policy. DRL learns a long-term evaluation of actions and uses it to train an actor that, for each state of the building, gives the best action. In our case, the action is the charge or discharge of the lead and hydrogen storages. Learning the policy can even improve the short-term efficiency of the control [12]. Existing works focus on non-islanded settings [13] where no state causes a failure. Since our building is partially islanded, such an approach could lead to a failure state where the islanded portion is no longer powered. Existing DRL work for hybrid energy storage systems focuses on minimizing the energy cost [14]; it does not consider the minimization of carbon emissions in a partially islanded building.
In this paper, we formulate the carbon impact minimization of the partially islanded building to learn a hybrid storage policy using DRL. We will reformulate this problem to reduce the action space dimension and therefore improve the DRL performance.
The contributions of this paper are as follows:
  • We redefine the action space so that the action bounds are not interdependent.
  • We use this reformulation to reduce the action space to a single dimension.
  • From this analysis, we deduce a repartition policy between the lead and hydrogen storages that is fixed up to a projection (but not learned).
  • We propose an actor–critic approach to control the partially islanded hybrid energy storage of the building, to be named DDPG α rep .
Simulations will show the importance of the hydrogen efficiency and carbon impact normalization in the reward, for the learned policy to be effective.

2. Problem Statement

In this section, we describe the model used to simulate our building. This model is sketched in Figure 1 and explained next. Action variables are noted in red.

2.1. Storages

We use a simplified model of the energy storage elements, as it is sufficient to validate the learning approach for our hybrid storage problem. However, the proposed learning approach can use any battery model or data, since the proposed reformulations and learning do not depend on the battery model. As long as the action is limited to how much we should charge or discharge, any storage model can be used instead. Since we propose a learning approach, the learned policy could be further improved using real data. Both energy storages (lead battery and H$_2$) use the same equations:
$$E_{H_2}(t) = E_{H_2}(t-1) + \eta_{H_2}\, E_{H_2}^{in}(t) - E_{H_2}^{out}(t) \qquad (1)$$
with $E_{H_2}(t)$ the state of charge of the H$_2$ storage at instant t, $\eta_{H_2}$ the global (charging electrolyser and discharging proton-exchange membrane fuel cells) efficiency of the H$_2$ storage, $E_{H_2}^{in}(t)$ the charged energy and $E_{H_2}^{out}(t)$ the energy discharged at instant t. Equation (1) must satisfy the following constraints:
$$0 \le E_{H_2}(t) \le E_{H_2}^{\max} \qquad (2)$$
$$0 \le E_{H_2}^{in}(t) \le E_{H_2}^{in\,\max} \qquad (3)$$
$$0 \le E_{H_2}^{out}(t) \le E_{H_2}^{out\,\max} \qquad (4)$$
with $E_{H_2}^{\max}$, $E_{H_2}^{in\,\max}$ and $E_{H_2}^{out\,\max}$ the respective upper bounds for $E_{H_2}(t)$, $E_{H_2}^{in}(t)$ and $E_{H_2}^{out}(t)$. To obtain the lead battery equations, replace $H_2$ by $batt$ in Equations (1)–(4). The lead battery efficiency $\eta_{batt}$ covers the whole battery cycle: charge and discharge.
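A minimal sketch of this storage model, Equations (1)–(4), is given below; the function and variable names are illustrative assumptions, not the ÉcoBioH2 project code.

```python
# Minimal sketch of the storage model (1)-(4); names are illustrative
# assumptions, not the ÉcoBioH2 project code.
def storage_step(e_prev, e_in, e_out, eta, e_max, e_in_max, e_out_max):
    """Return the stored energy at t from the level at t-1 and the action."""
    # Clip the charge and discharge to their bounds (3)-(4).
    e_in = min(max(e_in, 0.0), e_in_max)
    e_out = min(max(e_out, 0.0), e_out_max)
    # State update (1): only the charge term is multiplied by the efficiency.
    e_new = e_prev + eta * e_in - e_out
    # Enforce the capacity bound (2).
    return min(max(e_new, 0.0), e_max)

# Example with the hydrogen storage parameters of Table 3.
e_h2 = storage_step(e_prev=500.0, e_in=10.0, e_out=0.0,
                    eta=0.35, e_max=1000.0, e_in_max=20.0, e_out_max=10.0)
```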

2.2. Solar Circuit

The solar circuit connects elements that manage the solar energy only. The production is provided by the solar panels, $E_{solar}(t)$. Part of this energy will be stored in short-term (lead battery) or long-term (hydrogen) storage. Part of this energy will be consumed directly by a small datacenter, $E_{DC}(t)$. The solar circuit is not allowed to handle grid electricity. We define $E_{surplus}(t)$ as:
$$E_{surplus}(t) = E_{solar}(t) - E_{DC}(t) + E_{batt}^{out}(t) - E_{batt}^{in}(t) + E_{H_2}^{out}(t) - E_{H_2}^{in}(t) \qquad (5)$$
Please note that this equation does not prevent one energy storage from being charged by the other. The solar circuit can only give energy to the general circuit, so that:
$$E_{surplus}(t) \ge 0 \qquad (6)$$
This constraint (6) ensures that the datacenter can only be supplied with solar energy, as required by our project [1]. $E_{solar}(t)$ values are computed using irradiance values from [15] and the physical properties of our solar panels.

2.3. General Circuit

The building consumption $E_{building}(t)$ values come from the ÉcoBioH$_2$ technical office study [16]. They take into account the power consumption of the housing, the restaurant, … and the other usages hosted by the building. We define $\delta E_{regul}(t)$ as the difference between $E_{building}(t)$ and $E_{surplus}(t)$:
$$\delta E_{regul}(t) = E_{building}(t) - E_{surplus}(t) \qquad (7)$$
When $\delta E_{regul}(t) > 0$, we define it as the consumption from the electric grid:
$$E_{grid}(t) = \max(0, \delta E_{regul}(t)) \qquad (8)$$
When $\delta E_{regul}(t) < 0$, we define its opposite as the energy discarded, since this building is not allowed to give energy back to the grid:
$$E_{waste}(t) = \max(0, -\delta E_{regul}(t)) \qquad (9)$$
In practice, the discarded energy will simply not be produced; this is achieved by temporarily disconnecting the solar panels.
We define $E_{grid}(t)$ and $E_{waste}(t)$ in Equations (8) and (9) because they are used in the simulation metrics of Section 5.2. The variables defined previously and in the remainder of this paper are listed in Table 2; the parameters are in Table 3.
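The power-flow bookkeeping of Sections 2.2 and 2.3 can be sketched as follows; the variable names are illustrative assumptions.

```python
# Sketch of the solar and general circuit balance, Equations (5)-(9);
# variable names are illustrative assumptions.
def circuit_balance(e_solar, e_dc, e_batt_in, e_batt_out,
                    e_h2_in, e_h2_out, e_building):
    # Solar circuit surplus (5); constraint (6) requires it to be non-negative
    # so that the datacenter is powered by solar energy only.
    e_surplus = (e_solar - e_dc + e_batt_out - e_batt_in
                 + e_h2_out - e_h2_in)
    assert e_surplus >= 0, "constraint (6) violated"
    # General circuit regulation (7)-(9).
    delta_e_regul = e_building - e_surplus
    e_grid = max(0.0, delta_e_regul)    # energy bought from the grid (8)
    e_waste = max(0.0, -delta_e_regul)  # PV curtailed by disconnection (9)
    return e_surplus, e_grid, e_waste
```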

2.4. Long-Term Carbon Impact Minimization Problem

We gather the building consumption and solar panel production at instant t, and the stored energy states at t−1, into a so-called state defined as:
$$s_t = [E_{building}(t),\; E_{solar}(t),\; E_{batt}(t-1),\; E_{H_2}(t-1)] \qquad (10)$$
We define the action variables in
$$a_t = [E_{batt}^{in}(t),\; E_{batt}^{out}(t),\; E_{H_2}^{in}(t),\; E_{H_2}^{out}(t)] \qquad (11)$$
to control the energy storage at the current hour t. We define in Equation (12) the instantaneous carbon impact at state $s_t$ when performing action $a_t$ as $f(s_t, a_t)$:
$$f(s_t, a_t) = C_{solar} E_{solar}(t) + C_{batt}^{out} E_{batt}^{out}(t) + C_{batt}^{in} E_{batt}^{in}(t) + C_{H_2}^{out} E_{H_2}^{out}(t) + C_{H_2}^{in} E_{H_2}^{in}(t) + C_{grid}(t)\,\max\!\big(0,\; E_{building}(t) + E_{DC}(t) - E_{solar}(t) - E_{batt}^{out}(t) + E_{batt}^{in}(t) - E_{H_2}^{out}(t) + E_{H_2}^{in}(t)\big) \qquad (12)$$
with $C_{solar}$ the carbon intensity per kW·h of the complete lifecycle of PV usage. $C_{batt}^{in}$, $C_{batt}^{out}$, $C_{H_2}^{in}$, $C_{H_2}^{out}$ are the complete-lifecycle carbon intensities per kW·h of, respectively, lead battery charge, lead battery discharge, hydrogen storage charge and hydrogen storage discharge. $C_{grid}(t)$ quantifies the carbon emissions per kW·h associated with energy from the grid. Their values for the simulations are provided in Table 3. Our goal is to minimize the long-term carbon impact, taking into account the carbon emissions at the current and future states $s_t, \dots, s_{t+H}$ as induced by the current and future actions $a_t, \dots, a_{t+H}$:
$$a_t = \arg\min_{a_t} \sum_{h=0}^{H} f(s_{t+h}, a_{t+h}) \qquad (13)$$
under the constraints (2), (3), (4) and (6). We call this initial formulation TwoBatts .
The challenge comes from our ignorance of the actions that will be taken in the future, $a_{t+1}, \dots, a_{t+H}$. Yet, we need to account for their impact. DRL approaches are meant for this kind of challenge.

3. Problem Reformulations

In this section, we reformulate our problem (13) to simplify its resolution. We consider in particular the reduction of the action space to reduce the complexity and improve the convergence of learning.

3.1. Battery Charge or Discharge

The current formulation of our problem, TwoBatts, allows the policy to charge and discharge a battery simultaneously. We note that the cost function to be minimized (12) is increasing with the different components of $a_t$. This leads to multiple actions that, in the same state $s_t$, lead to the same $s_{t+1}$ while having different costs. To avoid having to deal with such cases, we impose that each energy storage system can only be charged or discharged at a given instant t:
$$E_{batt}^{in}(t) \times E_{batt}^{out}(t) = 0 \qquad (14)$$
Therefore, we express the charge and discharge of each battery in a single dimension:
$$\delta E_{batt}(t) := E_{batt}^{out}(t) - E_{batt}^{in}(t) \qquad (15)$$
$$\delta E_{H_2}(t) := E_{H_2}^{out}(t) - E_{H_2}^{in}(t) \qquad (16)$$
We propose to use these new variables as the action space:
$$a_t = [\delta E_{batt}(t),\; \delta E_{H_2}(t)] \qquad (17)$$
To obtain the new model equations, we replace the following variables in Equations (1)–(12):
$$E_{batt}^{out}(t) := \max(\delta E_{batt}(t), 0) \qquad (18)$$
$$E_{batt}^{in}(t) := \max(-\delta E_{batt}(t), 0) \qquad (19)$$
$$E_{H_2}^{out}(t) := \max(\delta E_{H_2}(t), 0) \qquad (20)$$
$$E_{H_2}^{in}(t) := \max(-\delta E_{H_2}(t), 0) \qquad (21)$$
Thus, we obtain the formulation 2Dbatt of (13) with
$$f(s_t, a_t) = C_{solar} E_{solar}(t) + C_{batt}^{out}\max(\delta E_{batt}(t), 0) + C_{batt}^{in}\max(-\delta E_{batt}(t), 0) + C_{H_2}^{out}\max(\delta E_{H_2}(t), 0) + C_{H_2}^{in}\max(-\delta E_{H_2}(t), 0) + C_{grid}(t)\max\!\big(E_{building}(t) + E_{DC}(t) - E_{solar}(t) - \delta E_{batt}(t) - \delta E_{H_2}(t),\; 0\big) \qquad (22)$$
Next, we revisit the constraints with this new action space. When we only charge ($\delta E_{H_2}(t) = -E_{H_2}^{in}(t)$), straightforward calculations show that (2) is equivalent to
$$-\frac{E_{H_2}^{\max} - E_{H_2}(t-1)}{\eta_{H_2}} \le \delta E_{H_2}(t) \qquad (23)$$
and (3) turns into:
$$-E_{H_2}^{in\,\max} \le \delta E_{H_2}(t) \qquad (24)$$
When we only discharge ($\delta E_{H_2}(t) = E_{H_2}^{out}(t)$), (2) becomes:
$$\delta E_{H_2}(t) \le E_{H_2}(t-1) \qquad (25)$$
Accordingly, (4) is equivalent to:
$$\delta E_{H_2}(t) \le E_{H_2}^{out\,\max} \qquad (26)$$
The battery is constrained by the battery variants of (23)–(26). Both storages are constrained by Equation (6), which turns into:
$$0 \le E_{solar}(t) - E_{DC}(t) + \delta E_{batt}(t) + \delta E_{H_2}(t) \qquad (27)$$
The 2Dbatt formulation is the minimization of (13) over (17) constrained by Equations (23)–(26), their battery variant and (27).
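The change of variables (15)–(21) amounts to a simple split into positive and negative parts; a sketch, with illustrative names:

```python
# Sketch of the 2Dbatt change of variables: one signed contribution per storage
# replaces the (charge, discharge) pair, Equations (15)-(21). Names are
# illustrative assumptions.
def split_contribution(delta_e):
    """Map a signed contribution to (discharge, charge), as in (18)-(21)."""
    e_out = max(delta_e, 0.0)    # positive part: discharge
    e_in = max(-delta_e, 0.0)    # negative part: charge
    return e_out, e_in

# At most one of the two parts is non-zero, so (14) holds by construction.
e_batt_out, e_batt_in = split_contribution(-15.0)  # charge the battery by 15 kW·h
e_h2_out, e_h2_in = split_contribution(4.0)        # discharge the H2 storage by 4 kW·h
```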

3.2. Batteries Storage Repartition

In the 2Dbatt formulation, one storage can discharge while the other is charging, which results in a loss of energy. Moreover, the action bound (27) depends not only on the state but also on the action itself; the bounds are therefore interdependent. If we select an action outside the action bounds, we need to project it back inside the bounds, which is non-trivial because of this interdependence.
To alleviate this problem, we propose to rotate the action space frame. We merge the two action dimensions into the energy storage systems contribution and the contribution repartition defined as:
$$\delta E_{storage}(t) := \delta E_{batt}(t) + \delta E_{H_2}(t) \qquad (28)$$
$$\alpha_{rep}(t) := \frac{\delta E_{H_2}(t)}{\delta E_{storage}(t)} \qquad (29)$$
so that the action becomes $a_t = [\delta E_{storage}(t), \alpha_{rep}(t)]$. $\alpha_{rep}(t)$ is the proportion of hydrogen in the storage contribution. It is equal to 0 when only the battery storage is used and to 1 when only the hydrogen storage is used. $\alpha_{rep}(t)$ is bounded between 0 and 1 by definition, so that one energy storage cannot charge the other. Furthermore, we only convert from Repartition to 2Dbatt and not the other way around. This is illustrated in Figure 2. To insert the new variables in the 2Dbatt formulation, we use the following equations:
$$\delta E_{batt}(t) = (1 - \alpha_{rep}(t)) \times \delta E_{storage}(t) \qquad (30)$$
$$\delta E_{H_2}(t) = \alpha_{rep}(t) \times \delta E_{storage}(t) \qquad (31)$$
We transform Equations (23)–(26) using (31):
$$-\frac{E_{H_2}^{\max} - E_{H_2}(t-1)}{\eta_{H_2}} \le \alpha_{rep}(t) \times \delta E_{storage}(t) \qquad (32)$$
$$-E_{H_2}^{in\,\max} \le \alpha_{rep}(t) \times \delta E_{storage}(t) \qquad (33)$$
$$\alpha_{rep}(t) \times \delta E_{storage}(t) \le E_{H_2}(t-1) \qquad (34)$$
$$\alpha_{rep}(t) \times \delta E_{storage}(t) \le E_{H_2}^{out\,\max} \qquad (35)$$
We obtain the battery variant of those equations using (30):
$$-\frac{E_{batt}^{\max} - E_{batt}(t-1)}{\eta_{batt}} \le (1 - \alpha_{rep}(t)) \times \delta E_{storage}(t) \qquad (36)$$
$$-E_{batt}^{in\,\max} \le (1 - \alpha_{rep}(t)) \times \delta E_{storage}(t) \qquad (37)$$
$$(1 - \alpha_{rep}(t)) \times \delta E_{storage}(t) \le E_{batt}(t-1) \qquad (38)$$
$$(1 - \alpha_{rep}(t)) \times \delta E_{storage}(t) \le E_{batt}^{out\,\max} \qquad (39)$$
Moreover, using (28), (27) becomes:
$$E_{DC}(t) - E_{solar}(t) \le \delta E_{storage}(t) \qquad (40)$$
Equation (40) depends on only one action variable, $\delta E_{storage}(t)$. With this change of variable, we have removed the interdependency of the constraint (27).
Next, we propose bounds on δ E s t o r a g e ( t ) and α r e p ( t ) that will be critical in the sequel.
Proposition 1.
$\delta E_{storage}(t)$ is constrained by $\delta E_{storage}^{\min}(t) \le \delta E_{storage}(t) \le \delta E_{storage}^{\max}(t)$, with the bounds defined by:
$$\delta E_{storage}^{\min}(t) = \max\!\Big( E_{DC}(t) - E_{solar}(t),\; -E_{batt}^{in\,\max} - E_{H_2}^{in\,\max},\; -\tfrac{E_{batt}^{\max} - E_{batt}(t-1)}{\eta_{batt}} - \tfrac{E_{H_2}^{\max} - E_{H_2}(t-1)}{\eta_{H_2}},\; -E_{batt}^{in\,\max} - \tfrac{E_{H_2}^{\max} - E_{H_2}(t-1)}{\eta_{H_2}},\; -\tfrac{E_{batt}^{\max} - E_{batt}(t-1)}{\eta_{batt}} - E_{H_2}^{in\,\max} \Big) \qquad (41)$$
$$\delta E_{storage}^{\max}(t) = \min\!\Big( E_{batt}(t-1) + E_{H_2}(t-1),\; E_{batt}^{out\,\max} + E_{H_2}(t-1),\; E_{batt}(t-1) + E_{H_2}^{out\,\max},\; E_{batt}^{out\,\max} + E_{H_2}^{out\,\max} \Big) \qquad (42)$$
Proof of Proposition 1 is in Appendix A.
Proposition 2.
$\alpha_{rep}(t)$ is constrained by $\alpha_{rep}^{\min}(t) \le \alpha_{rep}(t) \le \alpha_{rep}^{\max}(t)$, with the bounds defined by:
$$\alpha_{rep}^{\min}(t) = \begin{cases} \max\!\Big(1 + \frac{E_{batt}^{in\,\max}}{\delta E_{storage}(t)},\; 1 + \frac{E_{batt}^{\max} - E_{batt}(t-1)}{\eta_{batt} \times \delta E_{storage}(t)}\Big) & \text{if } \delta E_{storage}(t) < 0 \\[4pt] \max\!\Big(1 - \frac{E_{batt}(t-1)}{\delta E_{storage}(t)},\; 1 - \frac{E_{batt}^{out\,\max}}{\delta E_{storage}(t)}\Big) & \text{if } \delta E_{storage}(t) > 0 \end{cases} \qquad (43)$$
$$\alpha_{rep}^{\max}(t) = \begin{cases} \min\!\Big(\frac{-E_{H_2}^{in\,\max}}{\delta E_{storage}(t)},\; \frac{-(E_{H_2}^{\max} - E_{H_2}(t-1))}{\eta_{H_2} \times \delta E_{storage}(t)}\Big) & \text{if } \delta E_{storage}(t) < 0 \\[4pt] \min\!\Big(\frac{E_{H_2}(t-1)}{\delta E_{storage}(t)},\; \frac{E_{H_2}^{out\,\max}}{\delta E_{storage}(t)}\Big) & \text{if } \delta E_{storage}(t) > 0 \end{cases} \qquad (44)$$
Proof of Proposition 2 is in Appendix B. Please note that when $\delta E_{storage}(t) = 0$, $\alpha_{rep}(t)$ does not matter; we set it to $\alpha_{rep}(t) = 0.5$ by convention.
The interest of the bounds (43) and (44) is that they depend on $\delta E_{storage}(t)$ only, whereas the bounds on $\delta E_{storage}(t)$ do not depend on $\alpha_{rep}(t)$. Thus, given $\delta E_{storage}(t)$, we only need to decide the repartition $\alpha_{rep}(t)$ of the contribution. The interdependence has been completely removed.
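A sketch of how these bounds can be computed in practice is given below; the parameter dictionary p and its keys are illustrative assumptions holding the Table 3 values, and e_batt, e_h2 denote the levels at t−1.

```python
# Sketch of the bounds of Propositions 1 and 2; p, its keys and the function
# names are illustrative assumptions.
def storage_bounds(e_dc, e_solar, e_batt, e_h2, p):
    """Bounds (41)-(42) on delta_E_storage(t)."""
    room_batt = (p['e_batt_max'] - e_batt) / p['eta_batt']
    room_h2 = (p['e_h2_max'] - e_h2) / p['eta_h2']
    lo = max(e_dc - e_solar,
             -p['e_batt_in_max'] - p['e_h2_in_max'],
             -room_batt - room_h2,
             -p['e_batt_in_max'] - room_h2,
             -room_batt - p['e_h2_in_max'])
    hi = min(e_batt + e_h2,
             p['e_batt_out_max'] + e_h2,
             e_batt + p['e_h2_out_max'],
             p['e_batt_out_max'] + p['e_h2_out_max'])
    return lo, hi

def alpha_bounds(d, e_batt, e_h2, p):
    """Bounds (43)-(44) on alpha_rep(t), given d = delta_E_storage(t)."""
    if d < 0:    # charging
        lo = max(1 + p['e_batt_in_max'] / d,
                 1 + (p['e_batt_max'] - e_batt) / (p['eta_batt'] * d))
        hi = min(-p['e_h2_in_max'] / d,
                 -(p['e_h2_max'] - e_h2) / (p['eta_h2'] * d))
    elif d > 0:  # discharging
        lo = max(1 - e_batt / d, 1 - p['e_batt_out_max'] / d)
        hi = min(e_h2 / d, p['e_h2_out_max'] / d)
    else:        # nothing stored or retrieved: alpha_rep set to 0.5 by convention
        lo = hi = 0.5
    # alpha_rep(t) lies in [0, 1] by definition (Section 3.2).
    return max(lo, 0.0), min(hi, 1.0)
```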
Moreover, we use (30) and (31) to obtain the expression of the modified carbon impact function (12):
$$f(s_t, a_t) = C_{solar} E_{solar}(t) + C_{grid}(t)\max\!\big(0,\; E_{building}(t) + E_{DC}(t) - E_{solar}(t) - \delta E_{storage}(t)\big) + \begin{cases} C_{batt}^{out}\,(1 - \alpha_{rep}(t))\,\delta E_{storage}(t) + C_{H_2}^{out}\,\alpha_{rep}(t)\,\delta E_{storage}(t) & \text{if } \delta E_{storage}(t) > 0 \\[4pt] -C_{batt}^{in}\,(1 - \alpha_{rep}(t))\,\delta E_{storage}(t) - C_{H_2}^{in}\,\alpha_{rep}(t)\,\delta E_{storage}(t) & \text{if } \delta E_{storage}(t) < 0 \end{cases} \qquad (45)$$
The problem (13) with action $a_t = [\delta E_{storage}(t), \alpha_{rep}(t)]$, the carbon impact (45) and the constraints (41)–(44) is called the Repartition formulation.

3.3. Repartition Parameter Only

We have noticed that $\delta E_{storage}(t)$ can be seen as a single global storage. To provide energy for as long a duration as possible, i.e., to respect (40), we want to charge as much as possible and discharge only when needed. We call this the frugal policy. It corresponds to $\delta E_{storage}(t)$ being equal to its lower bound:
$$\delta E_{storage}(t) = \delta E_{storage}^{\min}(t) \qquad (46)$$
To reduce the action space dimensionality even further, we propose to use the frugal policy and to focus on learning only $\alpha_{rep}(t)$, the repartition of the contribution between the lead and hydrogen energy storage systems.
Using this remark, we propose the α rep reformulation with the goal (13): find the single action $a_t = \alpha_{rep}(t)$, given the state $s_t$, using the carbon impact (45), under the constraints (43) and (44), with $\delta E_{storage}(t)$ given by (46). Unless specified otherwise, this is the formulation we use in the sequel of this paper.

3.4. Fixed Repartition Policy

In Section 4, we will propose a learning algorithm for the different formulations. To show the interest of learning, we want to compare the learned policies to a frugal policy (46) where $\alpha_{rep}(t)$ is preselected and fixed to a value v. At each instant, we only verify that $v \in [\alpha_{rep}^{\min}(t), \alpha_{rep}^{\max}(t)]$ and project it into this interval otherwise. We call α rep = v the policy where $\alpha_{rep}$ is preset to the value v, so that:
$$\alpha_{rep}(t) = \mathrm{projection}_{[\alpha_{rep}^{\min}(t),\, \alpha_{rep}^{\max}(t)]}(v) \qquad (47)$$
In Section 1, we explained that the battery is intended for short-term storage and that the H$_2$ storage is intended for the long term. Our intuition therefore suggests charging or discharging the lead battery first. This corresponds to a preset value of α rep = 0, so that $\alpha_{rep}(t) = \alpha_{rep}^{\min}(t)$. A sketch of this fixed repartition policy follows.
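The sketch below combines the frugal choice (46) with the projection (47); it reuses the illustrative bound helpers sketched in Section 3.2, and p['e_dc'] is an assumed entry for the datacenter consumption.

```python
# Sketch of the frugal policy (46) with a preset repartition v projected as in
# (47). Relies on the illustrative helpers storage_bounds / alpha_bounds.
def fixed_repartition_policy(state, v, p):
    e_building, e_solar, e_batt, e_h2 = state   # state s_t as in (10)
    # Frugal choice: charge as much as possible, discharge only when needed.
    delta_e_storage, _ = storage_bounds(p['e_dc'], e_solar, e_batt, e_h2, p)
    # Project the preset value v onto the admissible interval, Equation (47).
    a_lo, a_hi = alpha_bounds(delta_e_storage, e_batt, e_h2, p)
    alpha_rep = min(max(v, a_lo), a_hi)
    return delta_e_storage, alpha_rep
```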
One may wonder: what is the best preselected $\alpha_{rep}$? To find it, we simulated 100 different values of $\alpha_{rep}$ between 0 and 1. For each value v, we run one simulation per day of 2006, each starting at midnight, with the year looping. A detailed description of these data is available in Section 5.1. We use the parameters in Table 3 and the PV production computed from irradiance data [17,18] using (48):
$$E_{solar}(t) = P_{solar}(t)\, \eta_{solar}\, \eta_{solar}^{opacity}\, S_{panels} \qquad (48)$$
If the simulation does not last the whole year, we reject it (hatched area in Figure 3). Otherwise, we compute the hourly carbon impact:
$$\frac{\sum_{t=0}^{T} f(s_t, a_t)}{T} \qquad (49)$$
with T the number of hours in 2006. This hourly impact is averaged over 365 different runs, one starting at midnight of each day of 2006. Figure 3 shows the carbon impact versus $\alpha_{rep}$. The $\alpha_{rep}$ value that minimizes the average hourly impact while lasting the whole year is therefore α rep = 0.2. It will be used for comparison.

4. Learning the Policy with DDPG

In the α rep reformulation, we want to select $a_t$ given the state $s_t$. The function that provides $a_t$ given $s_t$ is referred to as the policy. We want to learn the policy using DRL with an actor–critic, policy-based approach: the Deep Deterministic Policy Gradient (DDPG) [19]. Experts may want to skip Section 4.2 and Section 4.3.

4.1. Actor–Critic Approach

We call env, the environment, the set of equations (1) and its battery variant, which yields $s_{t+1}$ from $a_t$ and $s_t$: $s_{t+1} = env.step(s_t, a_t)$. Its corresponding reward, the short-term evaluation function, is defined as a function of $s_t$ and $a_t$: $r_t = R(s_t, a_t)$. We use [19], an actor–critic approach, where the estimated best policy for a given environment $s_{t+1} = env.step(s_t, a_t)$ is learned through a critic, as in Figure 4. The critic transforms this short-term evaluation into a long-term evaluation, the Q-values $Q(s_t, a_t)$, through learning. It is detailed in Section 4.2. The actor $\pi_\theta: s_t \mapsto a_t$ is the function that selects the best possible action $a_t$. It uses the critic to know which action is best in a given state (as detailed in Section 4.3).
In Section 2.4, we set our objective to minimize the long-term carbon impact (13). However, in reinforcement learning we try to maximize a score, defined as the sum of all rewards:
$$\sum_{t=t_0}^{T} r_t \qquad (50)$$
To remove this difference, we maximize the negative carbon impact $-f(s_t, a_t)$. However, the more negative terms are added, the lower the sum becomes. This leads to a policy trying to stop the simulation as fast as possible, in contradiction with our goal of always supplying the datacenter with energy. To counter this, inspired by [20], we propose to add a living incentive of 1 at each instant. Therefore, we propose to define the reward as:
$$r_t = R(s_t, a_t) = 1 - \frac{f(s_t, a_t)}{\max_a f(s_t, a)} \qquad (51)$$
The carbon impact term of the reward is now normalized between 0 and 1, so that the reward is always positive. Still, in this reward the normalization depends on the state $s_t$. When the normalization depends on the state, two identical actions can have different rewards associated with them. Therefore, the reward is not proportional to the carbon impact (45), which makes it harder to interpret. To alleviate this problem, we propose to use the global maximum instead of the worst case for the current state:
$$r_t = R(s_t, a_t) = 1 - \frac{f(s_t, a_t)}{\max_{s,a} f(s, a)} \qquad (52)$$
By convention r t is set to zero after the simulation ends.
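A minimal sketch of the globally normalized reward (52) and of this end-of-simulation convention; f_value and f_max are illustrative names, with f_max standing for the maximum of the carbon impact over all states and actions, assumed precomputed.

```python
# Sketch of the globally normalized reward (52); names are illustrative.
def reward(f_value, f_max, done):
    if done:                 # by convention, zero once the simulation has ended
        return 0.0
    # Living incentive of 1 plus the normalized negative carbon impact.
    return 1.0 - f_value / f_max
```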
The actor and critic are parameterized using artificial neural networks, with weights respectively denoted $\theta$ and $\phi$. They are learned alternately and iteratively. Two stabilization networks are also used for the critic supervision, with weights $\theta_{old}$ and $\phi_{old}$.

4.2. Critic Learning

Now that we have defined a reward, we can use the critic to transform it into a long-term metric. As time goes on, we trust the future less and less. Therefore, we discount the future rewards using a discount factor $0 < \gamma < 1$. We define the critic $Q: s_t, a_t \mapsto \sum_{k=0}^{+\infty} \gamma^k r_{t+k}$. It estimates the weighted long-term return of taking an action $a_t$ in a given state $s_t$. This weighted version of (50) also bounds the infinite sum so that it can be learned. Q can be expressed recursively:
$$Q(s_t, a_t) = \sum_{k=0}^{+\infty} \gamma^k r_{t+k} = r_t + \gamma \sum_{k=0}^{+\infty} \gamma^k r_{t+1+k} = r_t + \gamma\, Q(s_{t+1}, a_{t+1}) \qquad (53)$$
We learn the Q-function using an artificial neural network with weights $\phi$. At the i-th iteration of our learning algorithm and for given values of $\phi_{old}^i$ and $\theta_{old}^i$, we define a reference value $y_t$ from the recursive expression (53). Since we do not know $a_{t+1}$, we need to select the best possible action at t+1. The best estimator of this action is provided by the policy $\pi_{\theta_{old}^i}$, so that we define the reference as:
$$y_t = r_t + \gamma\, Q_{\phi_{old}^i}\big(s_{t+1}, \pi_{\theta_{old}^i}(s_{t+1})\big) \qquad (54)$$
where $a_{t+1}$ has been estimated by $\pi_{\theta_{old}^i}(s_{t+1})$.
The squared difference between the estimated value $Q_\phi(s_t, a_t)$ and the reference value $y_t$ [21] is defined as:
$$J(\phi^i) = \sum_{(s_t, a_t, r_t, s_{t+1}) \in D} \big( Q_{\phi^i}(s_t, a_t) - y_t \big)^2 \qquad (55)$$
To update $\phi^i$, we minimize $J(\phi^i)$ in (55) using a simple gradient descent:
$$\phi^{i+1} = \phi^i - \mu\, \nabla J(\phi^i) \qquad (56)$$
where $\nabla J(\phi^i)$ is the gradient of $J(\phi)$ in (55) with respect to $\phi$, taken at the value $\phi^i$, and $\mu$ is a small positive step-size. To stabilize the learning, [19] suggests updating the reference network $\phi_{old}$ more slowly, so that:
$$\phi_{old}^{i+1} = \tau\, \phi^i + (1 - \tau)\, \phi_{old}^i \quad \text{with } 0 < \tau \ll 1 \qquad (57)$$
with $\phi_{old}^0 = \phi^0$ at weight initialization.
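A sketch of one critic update, Equations (54)–(57), is given below. PyTorch is an assumed framework (the paper does not name one), and the actor and critic are assumed to be modules where the critic takes (state, action) batches; this is an illustration, not the authors' implementation.

```python
# Sketch of one critic update, Equations (54)-(57), under the assumptions above.
import torch

def critic_update(batch, critic, critic_old, actor_old, optimizer,
                  gamma=0.9979, tau=0.001):
    s, a, r, s_next = batch                     # tensors from the replay memory
    with torch.no_grad():
        a_next = actor_old(s_next)              # best action estimate at t+1
        y = r + gamma * critic_old(s_next, a_next)   # reference value (54)
    loss = ((critic(s, a) - y) ** 2).mean()     # squared error (55), batch mean
    optimizer.zero_grad()
    loss.backward()                             # gradient step (56)
    optimizer.step()
    # Slow update of the stabilization network (57).
    for p_old, p in zip(critic_old.parameters(), critic.parameters()):
        p_old.data.copy_(tau * p.data + (1 - tau) * p_old.data)
```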

4.3. Actor Learning

Since we alternate the updates of the critic and of the actor, we now address the learning of the actor. To learn which action is best to select, we need a loss function that grades different actions $a_t$. Using the reward function (52) as a loss function, the policy would select the best short-term, instantaneous action. Since the critic $Q(s_t, a_t)$ depends on the action $a_t$, we replace $a_t$ by $\pi_\theta(s_t)$. At iteration i, to update the actor network $\theta^i$, we use a gradient ascent of the average $Q_{\phi^i}(s_t, \pi_\theta(s_t))$ taken at $\theta = \theta^i$. This can be expressed as:
$$\theta^{i+1} = \theta^i + \lambda\, \nabla_\theta \sum_{(s_t, a_t, r_t, s_{t+1}) \in D} Q_{\phi^i}\big(s_t, \pi_{\theta}(s_t)\big) \Big|_{\theta = \theta^i} \qquad (58)$$
where λ is a small positive step-size.
To learn the critic, a stabilized actor is used. Like the stabilized critic, $\pi_{\theta_{old}}$ is updated by:
$$\theta_{old}^{i+1} = \tau\, \theta^i + (1 - \tau)\, \theta_{old}^i \quad \text{with } 0 < \tau \ll 1 \qquad (59)$$
with $\theta_{old}^0 = \theta^0$ at the beginning.
During learning, an Ornstein–Uhlenbeck noise [22], n, is added to the policy decision to make sure we explore the action space:
$$a_t = \pi_{\theta^i}(s_t) + n \qquad (60)$$
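The actor update (58)–(59) and the exploration noise (60) can be sketched as follows, in the same assumed PyTorch setting as the critic sketch; the theta and sigma values of the noise are common defaults, not taken from the paper.

```python
# Sketch of the actor update (58)-(59) and Ornstein-Uhlenbeck exploration (60).
import torch

def actor_update(batch, actor, actor_old, critic, optimizer, tau=0.001):
    s, _, _, _ = batch
    # Gradient ascent on the critic's evaluation of the actor's actions (58).
    loss = -critic(s, actor(s)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Slow update of the stabilized actor (59).
    for p_old, p in zip(actor_old.parameters(), actor.parameters()):
        p_old.data.copy_(tau * p.data + (1 - tau) * p_old.data)

class OUNoise:
    """Ornstein-Uhlenbeck process [22] added to the action during learning (60)."""
    def __init__(self, theta=0.15, sigma=0.2, mu=0.0):  # assumed default values
        self.theta, self.sigma, self.mu, self.n = theta, sigma, mu, 0.0
    def sample(self):
        self.n += self.theta * (self.mu - self.n) + self.sigma * torch.randn(1).item()
        return self.n
```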

4.4. Proposition: DDPG  α rep Algorithm to Learn the Policy

From the previous sections, we propose the DDPG α rep algorithm, summarized in Algorithm 1. This algorithm alternates the learning of the actor and critic networks. We select the initial instant t randomly to avoid learning time patterns. We start each run with full energy storages.
Once learned, we use the last weights $\theta^i$ of the neural network parameterizing the actor to select the action directly using $\pi_{\theta^i}: s_t \mapsto a_t$.
To learn well, an artificial neural network needs the different samples of learning data to be uncorrelated. In reinforcement learning, two consecutive states tend to be close, i.e., correlated. To overcome this problem, we store all experiences $(s_t, a_t, r_t, s_{t+1})$ in a memory and use a small random subset as the learning batch [23]. The random selection of a batch from the memory is called sample.
Algorithm 1: DDPG α rep
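A simplified sketch of the training loop described in Sections 4.1–4.4 follows; it is not a verbatim transcription of Algorithm 1. The env object and its reset/step interface, as well as critic_update, actor_update and OUNoise from the previous sketches, are illustrative assumptions.

```python
# Simplified sketch of the DDPG alpha_rep training loop (Sections 4.1-4.4).
import random

class ReplayMemory:
    """Store experiences and sample uncorrelated learning batches [23]."""
    def __init__(self, capacity):
        self.capacity, self.data = capacity, []
    def push(self, experience):
        self.data.append(experience)
        if len(self.data) > self.capacity:
            self.data.pop(0)
    def sample(self, batch_size):
        return random.sample(self.data, batch_size)
    def __len__(self):
        return len(self.data)

def train(env, actor, actor_old, critic, critic_old, actor_opt, critic_opt,
          episodes=1000, batch_size=64, memory_size=int(1e6)):
    memory = ReplayMemory(memory_size)
    for _ in range(episodes):
        t0 = random.randrange(8760)              # random starting hour
        s = env.reset(t0, full_storage=True)     # start with full storages
        noise, done = OUNoise(), False
        while not done:
            a = actor(s) + noise.sample()        # exploration, Equation (60)
            s_next, r, done = env.step(s, a)
            memory.push((s, a, r, s_next))
            if len(memory) >= batch_size:
                # Collation of the sampled transitions into batched tensors
                # is omitted for brevity.
                batch = memory.sample(batch_size)
                critic_update(batch, critic, critic_old, actor_old, critic_opt)
                actor_update(batch, actor, actor_old, critic, actor_opt)
            s = s_next
```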

5. Simulation

We have just proposed DDPG α rep to learn how to choose $\alpha_{rep}(t)$ with respect to the environment. In this section, we present the simulation settings and results.

5.1. Simulation Settings

Production data are computed using (48) from real irradiance data [17,18] measured at the building location in Avignon, France. The building has $S_{panels} = 1000$ m² of solar panels with $\eta_{solar}^{opacity} = 60\%$ opacity and an efficiency of $\eta_{solar} = 21\%$. These solar panels can produce a maximum of $E_{solar}^{\max} = 185$ kW·h per hour.
Consumption data come from projections of the engineering office [16]. They consist of powering housing units with an electricity demand fluctuating daily between 30 kW·h (1 a.m. to 6 a.m.) and 90 kW·h. During awake hours, the consumption varies between workdays and the weekend by a factor between 1 and 1.4. There is little interseasonal variation (standard deviation of 0.6 kW·h, 0.01% of the yearly mean, between seasons), as heating uses wood pellets. In these simulations, the datacenter consumes a fixed $E_{DC}^{\max} = 10$ kW·h. The datacenter consumption adds up to 87.6 MW·h per year, around 17% of the 496 MW·h that the entire building consumes in a year. To power this datacenter, our building's solar panels produce an average of 53.8 kW·h per hour during the 12.7 sunny hours of an average day, for a yearly total of 249 MW·h/year. This covers at most 2.8 times the consumption of our datacenter, but drops to 99% of it if all the energy goes through the hydrogen storage. The same solar production covers at most 50% of the building's yearly consumption, and at most 17% when accounting for the hydrogen efficiency.
We only use half of the lead battery capacity to preserve the battery health longer: $E_{batt}^{\max} = 650/2 = 325$ kW·h. The lead battery carbon intensity is split between the charge and discharge: $C_{batt}^{out} = 172/2 = 86$ gCO$_2$eq/kW·h. Since the charged quantity is counted before the efficiency, its carbon intensity must account for the efficiency: $C_{batt}^{in} = C_{batt}^{out}\,\eta_{batt} = 86 \times 0.81 = 68.66$ gCO$_2$eq/kW·h. The carbon intensity of the electrolysers, accounting for the efficiency, is used for $C_{H_2}^{in} = 5 \times \eta_{H_2} = 1.75$ gCO$_2$eq/kW·h. The carbon intensity of the fuel cells corresponds to $C_{H_2}^{out} = 5$ gCO$_2$eq/kW·h. $\eta_{H_2}$ accounts for both the electrolyser and fuel cell efficiencies. $C_{grid} = 53$ gCO$_2$eq/kW·h is the average French grid carbon intensity. All these values are reported in Table 3.
The simulations use an hourly simulation step t.
We train on the production data from year 2005, validate and select hyperparameters, using best score (50) values, on the year 2006 and test finally on year 2007. Each year lasts 8760 h.
To improve learning, we normalize all state and action inputs and outputs between −1 and 1. For a given value d bounded between $d_{min}$ and $d_{max}$:
$$d_{norm} = 2 \times \frac{d - d_{min}}{d_{max} - d_{min}} - 1 \qquad (61)$$
$d_{norm}$ is then used as an input for the networks.
To accelerate the learning, all gradient descents are performed using Adam [24]. During training, we use the step sizes $\mu = 10^{-3}$ to learn the critic and $\lambda = 10^{-4}$ for the actor. For the stabilization networks, $\tau = 0.001$. To learn, we sample batches of 64 experiences from a memory of $10^6$ experiences. The actor and critic both have 2 hidden layers with a ReLU activation function, with respectively 400 and 300 units. The output layer uses a tanh activation to bound its output. The discount factor $\gamma$ in (54) is optimized as a hyperparameter between 0.995 and 0.9999; we found its best value to be 0.9979.

5.2. Simulation Metrics

We name duration, and note N, the average length of the simulations. When all simulations last the whole year, the hourly carbon impact is evaluated as in (49). To select the best policy, the average score is computed using (50). The self-consumption, defined as the energy provided by the solar panels (directly or indirectly through one of the storages) over the consumption, is computed as:
$$s = \frac{\sum_{t=0}^{T} E_{surplus}(t) + E_{DC}(t) - E_{waste}(t)}{\sum_{t=0}^{T} E_{building}(t) + E_{DC}(t)} \qquad (62)$$
Per the ÉcoBioH2 project, the goal is to reach 35% of self-consumption: $s \ge 0.35$.
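A sketch of how these metrics can be computed from the hourly logs of one run; the list arguments are illustrative assumptions.

```python
# Sketch of the metrics of Section 5.2; argument names are illustrative.
def run_metrics(f_values, rewards, e_surplus, e_dc, e_waste, e_building):
    hourly_impact = sum(f_values) / len(f_values)              # Equation (49)
    score = sum(rewards)                                       # Equation (50)
    self_consumption = ((sum(e_surplus) + sum(e_dc) - sum(e_waste))
                        / (sum(e_building) + sum(e_dc)))       # Equation (62)
    return hourly_impact, score, self_consumption
```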

5.3. Simulation Results

The following learning algorithms are simulated on data from Avignon from 2007 and our building:
  • DDPGTwoBatts: DDPG with actions $a_t = [E_{batt}^{in}(t), E_{batt}^{out}(t), E_{H_2}^{in}(t), E_{H_2}^{out}(t)]$
  • DDPGRepartition: DDPG with actions $a_t = [\delta E_{storage}(t), \alpha_{rep}(t)]$
  • the proposed DDPG α rep with action $a_t = [\alpha_{rep}(t)]$
where DDPGTwoBatts and DDPGRepartition are algorithms similar to DDPG α rep with action spaces of the corresponding formulations respectively (11) and (17). The starting time is randomly selected from any hour of the year.
To test the learned policies, the duration, hourly impact (49), score (50) and self-consumption (62) metrics are computed on the 2007 irradiance data and averaged over all runs. We compute those metrics over 365 different runs, starting at midnight of each day of 2007. For the sake of comparison, we also compute those metrics, when applicable, for the preselected values α rep = 0 and α rep = 0.2 using (47) on the same data. Recall that the fixed $\alpha_{rep}$ values are projected onto the bounds (43) and (44) to ensure the long-term duration.
The metrics over the different runs are displayed in Table 4.
We can see in Table 4 that DDPGTwoBatts and DDPGRepartition do not last the whole year. This shows the importance of our reformulations to reduce the action space dimensions. We observe that all policies using the α rep reformulation last the whole year ( N = 8760 ). This validates our proposed reformulations and dimension reduction.
α rep = 0.2 achieves the lowest carbon impact; however, it cannot ensure the self-consumption target. On the other hand, α rep = 0 achieves the target self-consumption at the price of a higher carbon impact. The proposed DDPG α rep provides a good trade-off between the two by adapting $\alpha_{rep}(t)$ to the state $s_t$. It reaches the target self-consumption minus 0.1% and lowers the carbon impact with respect to α rep = 0. The carbon emission gain over the intuitive policy α rep = 0, which uses hydrogen only as a last resort, is $43.8 \times 10^3$ gCO$_2$eq/year. This shows the interest of learning the policy once the problem is well formulated.

5.4. Reward Normalization Effect

In Section 4.1, we presented two ways to normalize the carbon impact in the reward. In this section, we show that the proposed global normalization (52) yields better results than the local state-specific normalization (51).
In Table 5, we display the duration for both normalizations. We see that policies that use the locally normalized reward have a lower duration than the ones using a globally normalized reward. This confirms that the local normalization is harder to learn as two identical actions have different rewards in different states.
Therefore, the higher dynamic of the local normalization is not worth the variability induced by this normalization. This validates our choice of the global normalization (52) for the proposed DDPG α rep algorithm.

5.5. Hydrogen Storage Efficiency Impact

In our simulations, we have seen the sensitivity of our carbon impact results to the parameters in Table 3. Indeed, the efficiency of the storage has a great impact on the system behavior. Hydrogen storage yields lower carbon emissions when its efficiency $\eta_{H_2}$ is higher than some threshold. The greater $\eta_{H_2}$, the greater $\alpha_{rep}(t)$ can be, so the range for adapting $\alpha_{rep}(t)$ via learning is larger. To find the threshold in $\eta_{H_2}$, we first compute the total carbon intensity of storing one kW·h in a given storage, including the carbon intensity of the energy production. For H$_2$, we obtain:
$$C_{H_2}^{tot} = C_{H_2}^{out} + \frac{C_{H_2}^{in} + C_{solar}}{\eta_{H_2}} \;\; \mathrm{gCO_2eq/kW{\cdot}h} \qquad (63)$$
We display the value of (63) for both storages in Figure 5 with respect to $\eta_{H_2}$; the other parameters are taken from Table 3. When $C_{H_2}^{tot} < C_{batt}^{tot}$, learning is useful since the policy must balance the lower carbon impact (using the hydrogen storage) against the low efficiency (favoring the battery storage). When $C_{H_2}^{tot} > C_{batt}^{tot}$, the learned policy converges to α rep = 0, as both objectives (minimizing the carbon impact and continuously powering the datacenter) align.
From (63) and its battery variant, we calculate the threshold point where $C_{H_2}^{tot} = C_{batt}^{tot}$ to be at efficiency:
$$\eta_{H_2}^{*} = \frac{C_{H_2}^{in} + C_{solar}}{C_{batt}^{tot} - C_{H_2}^{out}} \qquad (64)$$
Using the values of Table 3 in (64), hydrogen improves the carbon impact only when $\eta_{H_2} > \eta_{H_2}^{*} = 0.24$. Since the current value is $\eta_{H_2} = 0.35 > 0.24$, learning is useful, as shown in the simulations of Table 4. We can also expect that as the hydrogen storage efficiency improves in the future, the impact of learning will become even more important.
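As a numerical check, the battery variant of (63) with the Table 3 values gives
$$C_{batt}^{tot} = 86 + \frac{68.66 + 55}{0.81} \approx 238.7~\mathrm{gCO_2eq/kW{\cdot}h}, \qquad \eta_{H_2}^{*} = \frac{1.75 + 55}{238.7 - 5} \approx 0.24,$$
while at the current efficiency $C_{H_2}^{tot} = 5 + (1.75 + 55)/0.35 \approx 167$ gCO$_2$eq/kW·h $< C_{batt}^{tot}$, which is the regime where learning pays off.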

6. Conclusions

We have addressed the problem of controlling the hybrid energy storage of a partially islanded building with the goals of carbon impact minimization and self-consumption. We have reformulated the problem to reduce the number of components of the action to one, $\alpha_{rep}(t)$, the proportion of hydrogen storage given the building state $s_t$. To learn the policy $\pi_\theta: s_t \mapsto \alpha_{rep}(t)$, we have proposed a new DRL algorithm using a reward tailored to our problem, DDPG α rep. The simulation results show that when the hydrogen storage efficiency is large enough, learning $\alpha_{rep}(t)$ decreases the carbon impact while lasting at least one year and maintaining 35% of self-consumption. As hydrogen storage technologies improve, the proposed algorithm should have even more impact.
Learning the policy using the proposed DDPG α rep can also be done when the storage model includes non-linearities. Learning can also adapt to climate change over time by using more recent data. To measure such benefits, we will use the real ÉcoBioH2 data to be measured in the sequel of the project. Learning from real data will reduce the gap between the model and the real system, which should improve performance. The proposed approach could also be used to optimize other environmental metrics with a multi-objective cost in $f(s_t, a_t)$.
With our current formulation, policies cannot assess what day and hour it is, as they only have two state variables from which to infer the time: $E_{solar}(t)$ and $E_{building}(t)$. They cannot differentiate between 1 a.m. and 4 a.m., as those two times have the same consumption and no PV production. They also cannot differentiate between a cloudy summer and a clear winter, as production and consumption are close in those two cases. In the future, we will consider taking the current time into account, to enable the learned policy to adapt its behavior to the time of day and month of the year.

Author Contributions

Conceptualization, L.D., I.F. and P.A.; methodology, L.D., I.F. and P.A.; software, L.D.; validation, L.D. and I.F.; formal analysis, L.D.; investigation, L.D.; resources, L.D.; data curation, L.D.; writing—original draft preparation, L.D.; writing—review and editing, I.F. and P.A.; visualization, L.D.; supervision, I.F. and P.A.; project administration, I.F.; funding acquisition, I.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by French PIA3 ADEME (French Agency For the Environment and Energy Management) for the ÉcoBioH2 project.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available irradiance datasets were analyzed in this study. This data can be found here: http://www.soda-pro.com/web-services/radiation/helioclim-3-archives-for-pay (accessed on: 1 October 2020) based on [18]. Restrictions apply to the availability of consumption data. Data were obtained from ÉcoBio via ZenT and are available at https://zent-eco.com/ with the permission of ZenT.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
DDPG: Deep Deterministic Policy Gradient
DRL: Deep Reinforcement Learning
PV: Photovoltaic

Appendix A. Proof of Proposition 1

Considering (23)–(26) for $H_2$ and $batt$ in all cases, we find further upper and lower bounds on $\delta E_{storage}(t)$.

Appendix A.1. When δEstorage(t) < 0

Using (28), (24) and its battery variant:
$$\delta E_{storage}(t) \ge -E_{batt}^{in\,\max} - E_{H_2}^{in\,\max} \qquad (A1)$$
Using (28), (23) and its battery variant:
$$\delta E_{storage}(t) \ge -\frac{E_{batt}^{\max} - E_{batt}(t-1)}{\eta_{batt}} - \frac{E_{H_2}^{\max} - E_{H_2}(t-1)}{\eta_{H_2}} \qquad (A2)$$
Using (28), (23) and the battery variant of (24):
$$\delta E_{storage}(t) \ge -E_{batt}^{in\,\max} - \frac{E_{H_2}^{\max} - E_{H_2}(t-1)}{\eta_{H_2}} \qquad (A3)$$
Using (28), (24) and the battery variant of (23):
$$\delta E_{storage}(t) \ge -\frac{E_{batt}^{\max} - E_{batt}(t-1)}{\eta_{batt}} - E_{H_2}^{in\,\max} \qquad (A4)$$
We obtain the global lower bound (41) by taking the maximum of (40) and (A1)–(A4).

Appendix A.2. When δEstorage(t) > 0

Using (28), (25) and its battery variant:
$$\delta E_{storage}(t) \le E_{batt}(t-1) + E_{H_2}(t-1) \qquad (A5)$$
Using (28), (26) and its battery variant:
$$\delta E_{storage}(t) \le E_{batt}^{out\,\max} + E_{H_2}^{out\,\max} \qquad (A6)$$
Using (28), (25) and the battery variant of (26):
$$\delta E_{storage}(t) \le E_{batt}^{out\,\max} + E_{H_2}(t-1) \qquad (A7)$$
Using (28), (26) and the battery variant of (25):
$$\delta E_{storage}(t) \le E_{batt}(t-1) + E_{H_2}^{out\,\max} \qquad (A8)$$
We obtain the global upper bound (42) by taking the minimum of (A5)–(A8).

Appendix B. Proof of Proposition 2

Appendix B.1. When δEstorage(t) > 0

Given (29) and (26)
$$\alpha_{rep}(t) \le \frac{E_{H_2}^{out\,\max}}{\delta E_{storage}(t)} \qquad (A9)$$
Given (29) and (25)
$$\alpha_{rep}(t) \le \frac{E_{H_2}(t-1)}{\delta E_{storage}(t)} \qquad (A10)$$
From (30) and the battery variant of (26)
$$(1 - \alpha_{rep}(t))\, \delta E_{storage}(t) \le E_{batt}^{out\,\max} \;\Rightarrow\; 1 - \alpha_{rep}(t) \le \frac{E_{batt}^{out\,\max}}{\delta E_{storage}(t)} \;\Rightarrow\; 1 - \frac{E_{batt}^{out\,\max}}{\delta E_{storage}(t)} \le \alpha_{rep}(t) \qquad (A11)$$
From (30) and the battery variant of (25)
$$(1 - \alpha_{rep}(t))\, \delta E_{storage}(t) \le E_{batt}(t-1) \;\Rightarrow\; 1 - \alpha_{rep}(t) \le \frac{E_{batt}(t-1)}{\delta E_{storage}(t)} \;\Rightarrow\; 1 - \frac{E_{batt}(t-1)}{\delta E_{storage}(t)} \le \alpha_{rep}(t) \qquad (A12)$$

Appendix B.2. When δEstorage(t) < 0

Given (31) and (24)
$$\alpha_{rep}(t)\, \delta E_{storage}(t) \ge -E_{H_2}^{in\,\max} \;\Rightarrow\; \alpha_{rep}(t) \le \frac{-E_{H_2}^{in\,\max}}{\delta E_{storage}(t)} \qquad (A13)$$
Given (31) and (23)
$$\alpha_{rep}(t)\, \delta E_{storage}(t) \ge -\frac{E_{H_2}^{\max} - E_{H_2}(t-1)}{\eta_{H_2}} \;\Rightarrow\; \alpha_{rep}(t) \le \frac{-(E_{H_2}^{\max} - E_{H_2}(t-1))}{\eta_{H_2}\, \delta E_{storage}(t)} \qquad (A14)$$
From (30) and (24) battery variant
$$(1 - \alpha_{rep}(t))\, \delta E_{storage}(t) \ge -E_{batt}^{in\,\max} \;\Rightarrow\; 1 - \alpha_{rep}(t) \le \frac{-E_{batt}^{in\,\max}}{\delta E_{storage}(t)} \;\Rightarrow\; 1 + \frac{E_{batt}^{in\,\max}}{\delta E_{storage}(t)} \le \alpha_{rep}(t) \qquad (A15)$$
From (30) and (23) battery variant
$$(1 - \alpha_{rep}(t))\, \delta E_{storage}(t) \ge -\frac{E_{batt}^{\max} - E_{batt}(t-1)}{\eta_{batt}} \;\Rightarrow\; 1 - \alpha_{rep}(t) \le \frac{-(E_{batt}^{\max} - E_{batt}(t-1))}{\eta_{batt}\, \delta E_{storage}(t)} \;\Rightarrow\; 1 + \frac{E_{batt}^{\max} - E_{batt}(t-1)}{\eta_{batt}\, \delta E_{storage}(t)} \le \alpha_{rep}(t) \qquad (A16)$$
We obtain the global upper bound (44) by taking the minimum of (A9) and (A10) when $\delta E_{storage}(t) > 0$ and of (A13) and (A14) when $\delta E_{storage}(t) < 0$. We obtain the global lower bound (43) by taking the maximum of (A11) and (A12) when $\delta E_{storage}(t) > 0$ and of (A15) and (A16) when $\delta E_{storage}(t) < 0$.

References

  1. PIA3 ADEME (French Agency for the Environment and Energy Management). Project ÉcoBioH2. 2019. Available online: https://ecobioh2.ensea.fr (accessed on 2 June 2021).
  2. Bocklisch, T. Hybrid energy storage systems for renewable energy applications. Energy Procedia 2015, 73, 103–111. [Google Scholar] [CrossRef] [Green Version]
  3. Pu, Y.; Li, Q.; Chen, W.; Liu, H. Hierarchical energy management control for islanding DC microgrid with electric-hydrogen hybrid storage system. Int. J. Hydrogen Energy 2018, 44, 5153–5161. [Google Scholar] [CrossRef]
  4. Diagne, M.; David, M.; Lauret, P.; Boland, J.; Schmutz, N. Review of solar irradiance forecasting methods and a proposition for small-scale insular grids. Renew. Sustain. Energy Rev. 2013, 27, 65–76. [Google Scholar] [CrossRef] [Green Version]
  5. Desportes, L.; Andry, P.; Fijalkow, I.; David, J. Short-term temperature forecasting on a several hours horizon. In Proceedings of the ICANN, Munich, Germany, 17–19 September 2019. [Google Scholar] [CrossRef] [Green Version]
  6. Zhang, Z.; Nagasaki, Y.; Miyagi, D.; Tsuda, M.; Komagome, T.; Tsukada, K.; Hamajima, T.; Ayakawa, H.; Ishii, Y.; Yonekura, D. Stored energy control for long-term continuous operation of an electric and hydrogen hybrid energy storage system for emergency power supply and solar power fluctuation compensation. Int. J. Hydrogen Energy 2019, 44, 8403–8414. [Google Scholar] [CrossRef]
  7. Carapellucci, R.; Giordano, L. Modeling and optimization of an energy generation island based on renewable technologies and hydrogen storage systems. Int. J. Hydrogen Energy 2012, 37, 2081–2093. [Google Scholar] [CrossRef]
  8. Bishop, C.M. Pattern Recognition and Machine Learning, 1st ed.; Springer: New York, NY, USA, 2006; pp. 1–2. [Google Scholar]
  9. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. Openai gym. arXiv 2016, arXiv:1606.01540. [Google Scholar]
  10. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  11. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  12. Vosen, S.; Keller, J. Hybrid energy storage systems for stand-alone electric power systems: Optimization of system performance and cost through control strategies. Int. J. Hydrogen Energy 1999, 24, 1139–1156. [Google Scholar] [CrossRef]
  13. Kozlov, A.N.; Tomin, N.V.; Sidorov, D.N.; Lora, E.E.S.; Kurbatsky, V.G. Optimal Operation Control of PV-Biomass Gasifier-Diesel-Hybrid Systems Using Reinforcement Learning Techniques. Energies 2020, 13, 2632. [Google Scholar] [CrossRef]
  14. François-Lavet, V.; Taralla, D.; Ernst, D.; Fonteneau, R. Deep Reinforcement Learning Solutions for Energy Microgrids Management. In Proceedings of the European Workshop on Reinforcement Learning EWRL Pompeu Fabra University, Barcelona, Spain, 3–4 December 2016. [Google Scholar]
  15. Tommy, A.; Marie-Joseph, I.; Primerose, A.; Seyler, F.; Wald, L.; Linguet, L. Optimizing the Heliosat-II method for surface solar irradiation estimation with GOES images. Can. J. Remote Sens. 2015, 41, 86–100. [Google Scholar] [CrossRef]
  16. David, J. L 2.1 EcoBioH2, Internal Project Report. 9 July 2019. Available online: http://www.soda-pro.com/web-services/radiation/helioclim-3-archives-for-pay (accessed on 1 October 2020).
  17. Soda-Pro. HelioClim-3 Archives for Free. 2019. Available online: http://www.soda-pro.com/web-services/radiation/helioclim-3-archives-for-free (accessed on 11 March 2019).
  18. Rigollier, C.; Lefèvre, M.; Wald, L. The method Heliosat-2 for deriving shortwave solar radiation from satellite images. Solar Energy 2004, 77, 159–169. [Google Scholar] [CrossRef] [Green Version]
  19. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  20. Barto, A.G.; Sutton, R.S.; Anderson, C.W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern. 1983, SMC-13, 834–846. [Google Scholar] [CrossRef]
  21. Ernst, D.; Geurts, P.; Wehenkel, L. Tree-based batch mode reinforcement learning. J. Mach. Learn. Res. 2005, 6, 503–556. [Google Scholar]
  22. Uhlenbeck, G.E.; Ornstein, L.S. On the theory of the Brownian motion. Phys. Rev. 1930, 36, 823. [Google Scholar] [CrossRef]
  23. Lin, L.J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach. Learn. 1992, 8, 293–321. [Google Scholar] [CrossRef]
  24. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
Figure 1. View of our system. Green lines show the solar-only part and purple lines show the grid-only part. Actions are displayed in red.
Figure 2. Repartition formulation (green), $\delta E_{storage}(t)$ and $\alpha_{rep}(t)$, in the 2Dbatt (blue) action space. Actions where one storage is charged and the other discharged are highlighted in red.
Figure 3. Mean impact versus the $\alpha_{rep}$ preset. The hatched area corresponds to rejected $\alpha_{rep}$ values for which the policy does not last the whole year.
Figure 4. Overview of the actor–critic approach. Curved arrows indicate learning. Time passing ($t \leftarrow t+1$) is displayed as $z^{-1}$.
Figure 5. The total hydrogen storage impact depending on the efficiency of storage.
Table 1. Contradictory consequences of carbon impact minimization and datacenter powering.
Minimizing Carbon Impact | Keeping the Datacenter Powered
short duration | long duration
high self-consumption | low self-consumption
use only H$_2$ | charge batteries first
do not need any capacity | need a large hydrogen storage capacity
Table 2. Nomenclature of variables used.
Symbol | Meaning
$E_{H_2}(t)$ | hydrogen storage state of charge at instant t
$E_{H_2}^{in}(t)$ | hydrogen storage charge at instant t
$E_{H_2}^{out}(t)$ | hydrogen storage discharge at instant t
$E_{batt}(t)$ | lead storage state of charge at instant t
$E_{batt}^{in}(t)$ | lead storage charge at instant t
$E_{batt}^{out}(t)$ | lead storage discharge at instant t
$E_{solar}(t)$ | solar production for the hour
$E_{DC}(t)$ | datacenter consumption for the hour
$E_{surplus}(t)$ | energy going from the solar circuit to the general one
$E_{building}(t)$ | energy consumed by the building, excluding the datacenter
$E_{grid}(t)$ | energy coming from the grid
$E_{waste}(t)$ | energy overproduced for the building
$t$ | time step
$a_t$ | action vector at instant t
$s_t$ | state vector at instant t
$f(s, a)$ | carbon impact in state s doing action a
$R(s, a)$ | reward in state s doing action a
$r_t$ | reward in state $s_t$ doing action $a_t$
$\delta E_{batt}(t)$ | lead battery contribution
$\delta E_{H_2}(t)$ | hydrogen storage contribution
$\delta E_{storage}(t)$ | global energy storage contribution
$\alpha_{rep}(t)$ | energy storage contribution repartition
$Q(s_t, a_t)$ | discounted sum of future rewards doing action $a_t$ in state $s_t$
$y_t$ | estimation of $Q(s_t, a_t)$ used in the critic loss
$\gamma$ | discount factor of future rewards
$\pi(s)$ | policy returning an action a in state s
$\phi^i$ | critic parameters at time step i
$\theta^i$ | policy parameters at time step i
$J(\phi^i)$ | critic loss
$\phi_{old}^i$ | stabilization critic parameters at time step i
$\theta_{old}^i$ | stabilization policy parameters at time step i
$\mu$ | step-size for critic learning
$\lambda$ | step-size for actor learning
$\tau$ | stabilization networks update proportion
N | duration: average length of a policy
s | self-consumption ratio (62)
Table 3. Parameters values used during simulations.
Quantity | Value | Unit
$\eta_{solar}^{opacity}$ | 0.6 | –
$\eta_{solar}$ | 0.21 | –
$S_{panels}$ | 1000 | m²
$C_{solar}$ | 55 | gCO$_2$eq/kW·h
$E_{solar}^{\max}$ | 185 | kW·h
$\eta_{batt}$ | 0.81 | –
$C_{batt}^{in}$ | 68.66 | gCO$_2$eq/kW·h
$C_{batt}^{out}$ | 86 | gCO$_2$eq/kW·h
$E_{batt}^{\max}$ | 650/2 | kW·h
$E_{batt}^{in\,\max}$ | $E_{batt}^{\max}$ | kW·h
$E_{batt}^{out\,\max}$ | $E_{batt}^{\max}$ | kW·h
$\eta_{H_2}$ | 0.35 | –
$C_{H_2}^{in}$ | 1.75 | gCO$_2$eq/kW·h
$C_{H_2}^{out}$ | 5 | gCO$_2$eq/kW·h
$E_{H_2}^{\max}$ | 1000 | kW·h
$E_{H_2}^{in\,\max}$ | 2 × 10 | kW·h
$E_{H_2}^{out\,\max}$ | 2 × 5 | kW·h
$C_{grid}$ | 53 | gCO$_2$eq/kW·h
$E_{DC}^{\max}$ | 10 | kW·h
$E_{building}^{\max}$ | 100 | kW·h
Table 4. Results computed on the year 2007. n.a.: not applicable.
Policy | Duration (h) | Hourly Impact (gCO$_2$eq/h) | Score | Self-Consumption (%)
DDPGTwoBatts | 442 | n.a. | 413 | n.a.
DDPGRepartition | 8567 | n.a. | 7850 | 35%
α rep = 0 | 8760 | 4591 | n.a. | 35%
α rep = 0.2 | 8760 | 4510 | n.a. | 33.6%
DDPG α rep | 8760 | 4586 | 8020 | 34.9%
Table 5. Learned policies duration depending on the reward normalization: local or global. Using simulations on 2007 test dataset.
Policy | Local n. | Global n.
DDPGTwoBatts | 248 | 442
DDPGRepartition | 4312 | 8567
DDPG α rep | 8760 | 8760

