# Reinforcement Learning for Energy Community Management: A European-Scale Study


## Abstract


## 1. Introduction

#### 1.1. Hypothesis of This Study

#### 1.2. Enhancing Novelty: Methodological Advancements and Unexplored Territories

#### 1.3. Regional Variations in Energy Dynamics: Insights from Italian Regions and Beyond

## 2. Related Works

#### Paper’s Contributions

## 3. Methods

#### 3.1. Problem Formulation

#### 3.2. Optimal Control Policy

#### 3.3. Reinforcement Learning Approach

where ${s}^{\prime}$ denotes the next state reached after taking an action in state $s$, and $\gamma \in [0, 1]$ is a discount factor for future rewards.
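For context, the discount factor enters the standard discounted-return objective of reinforcement learning, reproduced here as a textbook formulation (not specific to this paper):
$$G_{t}=\sum_{k=0}^{\infty}{\gamma}^{k}{R}_{t+k+1},$$
so that $\gamma$ close to 0 makes the agent myopic, while $\gamma$ close to 1 weights long-horizon rewards almost as heavily as immediate ones.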

#### 3.4. Actor–Critic Architecture

#### 3.5. Optimization Procedure

- Exploitation of optimal control actions: During the training phase, we have access to the optimal actions computed by the MILP algorithm outlined in Section 3.2, and we leverage this information to enhance the agent’s training. By feeding the optimal actions as an additional input to the Value-DNN, we improve the critic’s ability to evaluate the actions taken by the actor.
- Reward penalties for constraint violations: At each time step, the Policy-DNN of the agent generates $U$ actions, one for each entity. As the training objective is to maximize social welfare, we use the social welfare at that time step as the reward signal for the actor’s actions. The Policy-DNN outputs actions in the range $[-1, 1]$; each action is then scaled by the rated power ${r}_{u}$ of the BESS of the corresponding entity to obtain the actual power for charging/discharging the storage system. However, the actor’s actions may violate feasibility constraints. In such cases, actions that do not comply with the constraints are replaced with physically feasible actions, and a penalty for the constraint violation is computed for each action ${a}_{u,t}$ using (10):$${k}_{u,t}=\begin{cases}\max\left(0,\,{a}_{u,t}-{A}_{u,t}^{upper}\right) & {a}_{u,t}>0\\ \max\left(0,\,|{a}_{u,t}|-{A}_{u,t}^{lower}\right) & {a}_{u,t}<0\\ 0 & {a}_{u,t}=0\end{cases}$$The resulting total penalty is the average of the individual penalties:$${K}_{t}=\frac{1}{U}\sum_{u}{k}_{u,t}.$$This overall penalty is subtracted from the social welfare ${W}_{t}$, yielding the reward signal used to train the agent:$${R}_{t}={W}_{t}-\sigma {K}_{t},$$where $\sigma$ weights the penalty term. A minimal code sketch of this computation follows this list.
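As a concrete illustration, below is a minimal NumPy sketch of the penalty of (10) and the resulting reward ${R}_{t}={W}_{t}-\sigma {K}_{t}$. The arrays `a_lower` and `a_upper` stand in for the bounds ${A}_{u,t}^{lower}$ and ${A}_{u,t}^{upper}$; all names are ours, not the authors’.

```python
import numpy as np

def penalties(actions: np.ndarray, a_lower: np.ndarray,
              a_upper: np.ndarray) -> np.ndarray:
    """Per-entity penalty k_{u,t} of Eq. (10) for normalized actions in [-1, 1]."""
    k = np.zeros_like(actions)
    pos, neg = actions > 0, actions < 0
    k[pos] = np.maximum(0.0, actions[pos] - a_upper[pos])   # charge bound violation
    k[neg] = np.maximum(0.0, -actions[neg] - a_lower[neg])  # discharge bound violation
    return k

def reward(welfare: float, actions: np.ndarray, a_lower: np.ndarray,
           a_upper: np.ndarray, sigma: float) -> float:
    """Reward signal R_t = W_t - sigma * K_t, with K_t the mean penalty."""
    K = penalties(actions, a_lower, a_upper).mean()
    return welfare - sigma * K
```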

#### 3.6. Simulation Environment

#### 3.7. Reinforcement Learning Logic Concept

## 4. Results

#### 4.1. Dataset Selection and Rationale

#### Production across Six Zones

#### Energy Purchase and Sale Prices in Four Zones

#### 4.2. Training Details

#### 4.3. Evaluation Scenarios, Baselines, and Metrics

1. FRAN: France, Paris;
2. SVIZ: Switzerland, Berne;
3. SLOV: Slovenia, Ljubljana;
4. GREC: Greece, Athens;
5. NORD: northern Italy;
6. CNOR: central-northern Italy;
7. CSUD: central-southern Italy;
8. SUD: southern Italy;
9. CALA: Calabria region, Italy;
10. SICI: Sicily island, Italy;
11. SARD: Sardinia island, Italy.

- Optimal Controller (OC), as detailed in Section 3.2. The optimal scheduling of the BESSs for each day was determined using a mixed-integer linear programming (MILP) algorithm. This approach assumes complete knowledge of generation and consumption data for all 24 h, yielding the BESS control actions that maximize the daily community welfare (a modeling sketch follows this list).
- Rule-Based Controller (RBC). The BESSs’ actions were determined by the predefined rules of Algorithm 1; rule-based controllers of this kind are commonly employed to schedule the charge and discharge policies of storage systems. For each entity, the RBC charged the BESS with surplus energy as long as the battery had not reached maximum capacity. Conversely, if less energy was produced than required, the loads were supplied with energy from the BESS, if available (see the code sketch after Algorithm 1).
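To make the OC baseline concrete, here is a minimal PuLP sketch of a day-ahead scheduling model written from the nomenclature in the Abbreviations section. It is a simplification, not the authors’ exact formulation: the shared-energy incentive is linearized as the minimum of community export and import (valid here because ${\pi}_{t}^{inc} \ge 0$), one common storage-efficiency convention and an empty initial battery are assumed, and the binary variables a full MILP would use (e.g., to forbid simultaneous charging and discharging) are omitted, so the sketch reduces to an LP.

```python
import pulp

def day_ahead_schedule(U, T, g, l, c, r, eta_cha, eta_dis,
                       pi_egr, pi_igr, pi_sto, pi_inc, dT=1.0):
    """LP sketch of day-ahead BESS scheduling maximizing community welfare.
    g[u][t], l[u][t]: generation/load (kWh); prices in EUR/kWh."""
    m = pulp.LpProblem("rec_schedule", pulp.LpMaximize)
    idx = [(u, t) for u in range(U) for t in range(T)]
    i_gri = pulp.LpVariable.dicts("i_gri", idx, lowBound=0)   # grid import
    e_gri = pulp.LpVariable.dicts("e_gri", idx, lowBound=0)   # grid export
    e_cha = pulp.LpVariable.dicts("e_cha", idx, lowBound=0)   # battery charge
    e_dis = pulp.LpVariable.dicts("e_dis", idx, lowBound=0)   # battery discharge
    e_sto = pulp.LpVariable.dicts("e_sto", idx, lowBound=0)   # battery level
    shared = pulp.LpVariable.dicts("shared", range(T), lowBound=0)
    for u, t in idx:
        # entity energy balance (one possible efficiency convention)
        m += (g[u][t] + i_gri[u, t] + eta_dis[u] * e_dis[u, t]
              == l[u][t] + e_gri[u, t] + e_cha[u, t])
        # storage dynamics and bounds; battery assumed empty at t = 0
        prev = e_sto[u, t - 1] if t > 0 else 0.0
        m += e_sto[u, t] == prev + eta_cha[u] * e_cha[u, t] - e_dis[u, t]
        m += e_sto[u, t] <= c[u]
        m += e_cha[u, t] <= r[u] * dT
        m += e_dis[u, t] <= r[u] * dT
    for t in range(T):
        # shared energy = min(community export, community import), linearized
        m += shared[t] <= pulp.lpSum(e_gri[u, t] for u in range(U))
        m += shared[t] <= pulp.lpSum(i_gri[u, t] for u in range(U))
    # welfare: incentive on shared energy + export revenue - import/storage costs
    m += pulp.lpSum(
        pi_inc[t] * shared[t]
        + pulp.lpSum(pi_egr[u][t] * e_gri[u, t] - pi_igr[u][t] * i_gri[u, t]
                     - pi_sto[u][t] * (e_cha[u, t] + e_dis[u, t])
                     for u in range(U))
        for t in range(T))
    m.solve(pulp.PULP_CBC_CMD(msg=False))
    return m
```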

**Algorithm 1:** Rule-based controller action selection.
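The body of Algorithm 1 is rendered as a table in the original and is not reproduced here; the following is a minimal Python sketch of the rule it describes, for one entity and one time step (efficiencies omitted; names are ours):

```python
def rbc_action(g: float, l: float, e_sto: float, c: float, r: float,
               dt: float = 1.0) -> float:
    """Rule-based BESS action for one entity at one time step.

    g: PV generation (kWh), l: load (kWh), e_sto: battery level (kWh),
    c: battery capacity (kWh), r: rated power (kW), dt: period length (h).
    Returns the charge (+) / discharge (-) energy for the step, in kWh.
    """
    surplus = g - l
    if surplus >= 0:
        # excess PV: charge, limited by free capacity and rated power
        return min(surplus, c - e_sto, r * dt)
    # deficit: discharge to supply the load, limited by stored energy and power
    return -min(-surplus, e_sto, r * dt)
```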

#### 4.4. Results Discussion

## 5. Conclusions

#### Implications and Limitations of This Study

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
|---|---|
| REC | Renewable Energy Community |
| BESS | Battery Energy Storage System |
| PV | Photovoltaic |
| RL | Reinforcement Learning |
| DRL | Deep Reinforcement Learning |
| DNN | Deep Neural Network |
| SAC | Soft Actor–Critic |
| RBC | Rule-Based Controller |
| OC | Optimal Control |
| MILP | Mixed-Integer Linear Programming |

| Symbol | Description |
|---|---|
| **Constants and sets** | |
| $U$ | number of entities forming the community |
| $T$ | number of time periods per day |
| ${\Delta}_{T}$ | duration of a time period (h) |
| $S$ | set of states |
| $A$ | set of actions |
| **Variables** | |
| ${i}_{u,t}^{gri}$ | energy imported from the grid by entity $u$ at time $t$ (kWh) |
| ${e}_{u,t}^{gri}$ | energy exported to the grid by entity $u$ at time $t$ (kWh) |
| ${e}_{u,t}^{sto}$ | energy level of the battery of entity $u$ at time $t$ (kWh) |
| ${e}_{u,t}^{cha}$ | energy supplied to the battery of entity $u$ at time $t$ (kWh) |
| ${e}_{u,t}^{dis}$ | energy withdrawn from the battery of entity $u$ at time $t$ (kWh) |
| **Parameters** | |
| ${c}_{u}$ | maximum capacity of the battery of entity $u$ (kWh) |
| ${g}_{u,t}$ | energy generated by the PV plant of entity $u$ at time $t$ (kWh) |
| ${l}_{u,t}$ | energy demand of entity $u$ at time $t$ (kWh) |
| ${r}_{u}$ | rated power of the battery of entity $u$ (kW) |
| ${\eta}_{u}^{dis}$ | discharging efficiency of the battery of entity $u$ |
| ${\eta}_{u}^{cha}$ | charging efficiency of the battery of entity $u$ |
| ${\pi}_{u,t}^{egr}$ | unit price of energy exported to the grid by entity $u$ at time $t$ (EUR/kWh) |
| ${\pi}_{u,t}^{igr}$ | unit price of energy imported from the grid by entity $u$ at time $t$ (EUR/kWh) |
| ${\pi}_{u,t}^{sto}$ | unit cost for usage of the energy storage of entity $u$ at time $t$ (EUR/kWh) |
| ${\pi}_{t}^{inc}$ | unit incentive for community self-consumption at time $t$ (EUR/kWh) |

## References


**Figure 2.** Illustration of the architecture of the Policy-DNN utilized in this study, showcasing the input sizes derived from both global and individual data after preprocessing. The Policy-DNN consists of approximately 1.8 million parameters, which may vary based on the community’s size. The preprocessing step, critical for model preparation, is visually represented in the figure, with the green block indicating the current state ${S}_{t}$ and the refined input sizes fed into the fully connected layers for decision making.

**Figure 3.** The red block outlines the architecture of the Value-DNN. This model receives the same inputs as the Policy-DNN, plus the optimal actions computed beforehand by the MILP algorithm for the current time step. The optimal actions require no additional encoding, as they are naturally bounded within the interval [−1, 1]. The state ${S}_{t}$ depicted in the red box is processed by a sequence of fully connected layers to compute the value function.
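As an illustration of this design, a minimal PyTorch sketch of such a critic is given below; the concatenation of state features with the MILP optimal actions follows the caption, while the layer sizes and names are our assumptions, not the paper’s.

```python
import torch
import torch.nn as nn

class ValueDNN(nn.Module):
    """Critic sketch: state features are concatenated with the precomputed
    MILP optimal actions (already in [-1, 1], so no extra encoding) and
    passed through fully connected layers to a scalar value estimate."""

    def __init__(self, state_dim: int, n_entities: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_entities, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar value estimate
        )

    def forward(self, state: torch.Tensor,
                opt_actions: torch.Tensor) -> torch.Tensor:
        # One optimal action per entity, appended to the state features
        return self.net(torch.cat([state, opt_actions], dim=-1))
```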

**Figure 4.** In the framework, the agent–environment interaction occurs in multiple steps. Initially, the agent calculates actions based on the current state, denoted as ${A}_{t}$, for each building. These actions reflect the agent’s decisions. Subsequently, the Gym environment checks whether the actions comply with the constraints. Any violations result in penalties, which are factored into the overall reward signal. The environment then computes the community’s social welfare, a key component of the reward signal, reflecting the societal impact of the agent’s decisions. Thus, the reward signal includes penalties for constraint violations and the calculated social welfare, serving as a quantitative feedback mechanism. Finally, the environment provides the agent with the calculated reward and the new state computed from the current state and the agent’s actions.

**Figure 6.** Positions of the energy communities across European states: 7 RECs in different Italian regions and 1 REC in the capital of each of France, Switzerland, Slovenia, and Greece.

**Figure 7.** Monthly distribution of PV generation in six distinct zones. The detailed visualization highlights the high variability among the different geographic areas considered in this study.

**Figure 8.** Daily dynamics of energy market transactions. The figure displays the purchase (import) and sale (export) prices in four distinct zones. The x-axis plots the progression of days, providing a chronological perspective, whereas the y-axis quantifies prices in EUR, providing a standardized metric for evaluation.

| Input | Features |
|---|---|
| Global | ${\pi}_{u,t}^{egr}$, ${\pi}_{u,t}^{igr}$, month, day type, hour |
| Individual | ${l}_{u,t}$, ${g}_{u,t}$, ${e}_{u,t}^{sto}$, ${A}_{u,t}^{lower}$, ${A}_{u,t}^{upper}$ |

| Community | Train | Test1 | Test2 |
|---|---|---|---|
| 3 entities | SUD | SICI | GREC |
| 5 entities | CNOR | SLOV | SARD |
| 7 entities | CALA | SVIZ | CNOR |
| 9 entities | FRAN | CSUD | NORD |

| Entity ID | PV (3 ent.) | BESS (3 ent.) | PV (5 ent.) | BESS (5 ent.) | PV (7 ent.) | BESS (7 ent.) | PV (9 ent.) | BESS (9 ent.) |
|---|---|---|---|---|---|---|---|---|
| 1 | - | - | - | - | - | - | 120 | 140 |
| 2 | - | - | - | - | - | - | 70 | 80 |
| 3 | - | - | - | - | 30 | 60 | 50 | 45 |
| 4 | - | - | - | - | 60 | 70 | 40 | 75 |
| 5 | - | - | 25 | 50 | 50 | 50 | 25 | 50 |
| 6 | - | - | 20 | 30 | 10 | 30 | 20 | 30 |
| 7 | 35 | 20 | 20 | 40 | 35 | 50 | 25 | 35 |
| 8 | 20 | 35 | 30 | 40 | 40 | 50 | 40 | 50 |
| 9 | 25 | 40 | 20 | 35 | 40 | 50 | 30 | 35 |

| Community | Controller | Train | Test1 | Test2 |
|---|---|---|---|---|
| **3 entities** | | SUD | SICI | GREC |
| | RL | 99.55% | 98.72% | 91.91% |
| | RBC | 97.32% | 96.04% | 82.23% |
| **5 entities** | | CNOR | SLOV | SARD |
| | RL | 99.57% | 96.17% | 98.94% |
| | RBC | 95.38% | 93.28% | 96.75% |
| **7 entities** | | CALA | SVIZ | CNOR |
| | RL | 97.70% | 95.77% | 97.61% |
| | RBC | 64.49% | 94.48% | 94.79% |
| **9 entities** | | FRAN | CSUD | NORD |
| | RL | 97.95% | 95.87% | 95.74% |
| | RBC | 56.91% | 94.91% | 93.58% |
