Article

Environment-Friendly Power Scheduling Based on Deep Contextual Reinforcement Learning

Awol Seid Ebrie, Chunhyun Paik, Yongjoo Chung and Young Jin Kim
1 Major in Industrial Data Science and Engineering, Department of Industrial and Data Engineering, Pukyong National University, Busan 48513, Republic of Korea
2 Department of Industrial Management and Big Data Engineering, Dongeui University, Busan 47340, Republic of Korea
3 Department of Global Marketing, Busan University of Foreign Studies, Busan 46234, Republic of Korea
4 Department of Systems Management and Engineering, Pukyong National University, Busan 48513, Republic of Korea
* Author to whom correspondence should be addressed.
Energies 2023, 16(16), 5920; https://doi.org/10.3390/en16165920
Submission received: 16 July 2023 / Revised: 6 August 2023 / Accepted: 8 August 2023 / Published: 10 August 2023
(This article belongs to the Special Issue Power System Analysis Control and Operation)

Abstract

A novel approach to power scheduling is introduced, focusing on minimizing both economic and environmental impacts. This method utilizes deep contextual reinforcement learning (RL) within an agent-based simulation environment. Each generating unit is treated as an independent, heterogeneous agent, and the scheduling dynamics are formulated as Markov decision processes (MDPs). The MDPs are then used to train a deep RL model to determine optimal power schedules. The performance of this approach is evaluated across various power systems, including both small-scale and large-scale systems with up to 100 units. The results demonstrate that the proposed method exhibits superior performance and scalability in handling power systems with a larger number of units.

1. Introduction

Electrical energy generated from fossil fuels emits significant amounts of greenhouse gases (GHGs) and other pollutants, including carbon dioxide (CO2), sulfur dioxide (SO2), and nitrous oxide (N2O). These emissions harm human health and contribute to climate change and global warming. However, the prevailing focus on economic concerns in power generation often results in higher emission levels, since economic costs and environmental impacts tend to be inversely related. This effect has been amplified since the implementation of the emissions trading scheme (ETS) in 2005, aimed at controlling GHGs [1,2]. As a result, power generation based solely on economic costs leads to increased financial penalties for emissions, along with adverse environmental impacts. Therefore, the traditional approach of scheduling power generation based solely on economic costs, which overlooks emission costs under the ETS, is no longer acceptable [2]. Consequently, it becomes imperative to determine efficient and environmentally friendly power generation schedules with lower emission costs, i.e., schedules that release fewer GHGs and other pollutants during operation. The primary objective of environmentally friendly power scheduling is to achieve a sustainable balance between cost and emissions, in which economic concerns are addressed without compromising environmental sustainability. This means meeting electricity demand effectively while minimizing the environmental impact, especially in terms of GHGs and other pollutants. Adopting this approach not only increases the profitability of power generation but also reduces emission levels through efficient management and scheduling of generating units [1,2].
Prior research has explored various model-based approaches, such as conventional and dynamic programming and stochastic optimization, to address the challenges of power scheduling [3]. However, these methods often suffer from the curse of dimensionality, which leads to the use of heuristic rules and simplifications that may not effectively handle real-sized problems [3,4,5]. As power systems continue to grow in size and complexity, even small improvements in efficiency achieved through enhanced power scheduling methods can yield significant economic and environmental benefits [4].
Recently, artificial intelligence (AI) has shown promise in learning optimal strategies without prior knowledge. Particularly, reinforcement learning (RL) can achieve this by employing self-play learning and adapting decision-making policies over time based on feedback from dynamic environments [4,6]. Furthermore, RL does not rely on precise mathematical models [4], making it more suitable for real-world scenarios where power generation dynamics may be uncertain or challenging to accurately model.
However, despite the potential of RL-based models to offer improved power scheduling solutions, only a few studies (such as [4,5,6,7,8,9,10]) are available in the literature. These RL-based models tend to prioritize economic costs and overlook environmental impacts, leading to excessive carbon emissions and neglecting long-term consequences [6]. Additionally, scalability remains a challenge for both the model- and RL-based approaches, as the dimensionality grows exponentially with an increasing number of units [11]. Consequently, simplified power scheduling definitions are often used, which may not adequately represent realistic power systems.
This study aims to address these limitations by proposing a novel deep RL-based method for power scheduling that minimizes both economic and environmental costs. The algorithm utilizes an agent-based contextual simulation environment, where generating units are represented as agents. The simulation environment automatically corrects illegitimate commitments and adjusts supply capacity to meet demand, enabling agents to learn optimal behaviors more efficiently. Furthermore, the proposed method mitigates the dimensionality problem associated with large-scale problems, distinguishing it from existing approaches in the literature. The power scheduling dynamics are simulated using a Markov decision process (MDP), and the results are fed into a deep Q-network (DQN) with separate output nodes (ON and OFF) for each agent, allowing for effective decision making. The remainder of this article is organized as follows: The description of the power scheduling problem is presented in Section 2. Then, Section 3 provides the technical details of the proposed methodology, which is demonstrated with a numerical example in Section 4. Concluding remarks follow in Section 5.

2. Problem Description

Given a power system with $n$ generating units and a power scheduling horizon of 24 h, let $z_{it} \in \{0,1\}$ denote the commitment (ON/OFF) status of unit $i$ at period $t$, and let $p_{it} \in [0, \infty)$ be the optimal power output of unit $i$ at period $t$ (MW).

2.1. Objective Function

The operating cost of power production at each period $t$ is often defined as the sum of the production costs ($c_{it}^{prod}$), start-up costs ($c_{it}^{ON}$), and shutdown costs ($c_{i}^{OFF}$) of all units. This definition of total operation cost ignores environmental constraints, which depend on local regulation and emission allowance trading market schemes [12]. A properly formulated power scheduling model should also include other costs that are not linked to fuel prices but are related to fuel consumption and technological efficiency [3]. Hence, it is necessary to include emission costs ($c_{it}^{emis}$) as part of the total operation cost [13]. In existing models, the environmental impact of power generation is primarily addressed in the form of emission constraints and penalties. Emission constraints involve setting limits on the maximum allowable GHGs and other pollutants, whereas the penalty method assigns penalty costs to units emitting beyond the allowed limits. However, a common issue with both the emission constraint and penalty methods is their lack of flexibility, as they do not account for dynamic adjustments based on real-time changes, such as demand fluctuations, unit outages, and fuel and other operational expenses. Both methods tend to prioritize compliance with emission limits rather than actively pursuing emission reduction strategies. To address these limitations, this study proposes the adoption of emission cost parameters integrated into the main objective function. By representing the emissions produced per MW for different types of units, this approach offers a continuous and gradual representation of environmental impacts [13]. This not only allows for more nuanced and flexible decision making by treating emissions as a continuous variable rather than a binary constraint, but it also enables the assessment of the emission-reduction potential of different power plants.
Thus, the objective function representing the total operational economic and environmental costs of the entire planning horizon ($C$) can be expressed as Equation (1):
$C = \sum_{t=1}^{24} \sum_{i=1}^{n} \left[ z_{it}\left(c_{it}^{prod} + c_{it}^{emis}\right) + z_{it}\left(1 - z_{i,t-1}\right) c_{it}^{ON} + \left(1 - z_{it}\right) z_{i,t-1}\, c_{i}^{OFF} \right]$ (1)
where
$c_{it}^{prod} = \alpha_i p_{it}^2 + \beta_i p_{it} + \delta_i$ (2)
$c_{it}^{emis} = p_{it} \sum_{h=1}^{m} \phi_h \psi_{ih}$ (3)
$c_{it}^{ON} = \begin{cases} c_{i}^{hot,ON}, & t_{i}^{*down} \le t_{i,t-1}^{OFF} \le t_{i}^{*down} + t_{i}^{*cold} \\ c_{i}^{cold,ON}, & t_{i,t-1}^{OFF} > t_{i}^{*down} + t_{i}^{*cold} \end{cases}$ (4)
Equation (2) is the production cost function of unit $i$ at period $t$, where $\alpha_i$, $\beta_i$, and $\delta_i$ are the corresponding quadratic, linear, and constant coefficients, respectively. Next, the emission cost of the $m$ types of pollutants released by unit $i$ is represented in Equation (3). Since three emission types (namely, CO2, SO2, and NOx) are considered, $m = 3$ in this study. The emission level is often directly related to fuel consumption and technological efficiency [3]. As a result, the cost of emissions can be expressed as a linear function of the power outputs [13], where $\phi_h$ is the external cost of emission type $h$ ($/g), and $\psi_{ih}$ is the $h$th emission factor of unit $i$ (g/MW). Finally, the start-up cost given in Equation (4) is a function of the time duration $t_{i,t-1}^{OFF}$ for which unit $i$ has been continuously offline (OFF) until period $t$. The shutdown costs are fixed but usually negligible [14] and are mostly considered zero [3].
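To make Equations (1)–(4) concrete, the following minimal Python sketch evaluates one unit's contribution to the period cost. The parameter names mirror the symbols above, while the emission prices, the container layout, and the placeholder values are illustrative assumptions rather than data or code from the test system.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    alpha: float          # quadratic production-cost coefficient (alpha_i)
    beta: float           # linear production-cost coefficient (beta_i)
    delta: float          # constant production-cost coefficient (delta_i)
    emis_factors: tuple   # psi_ih, g/MW for each emission type h
    hot_start: float      # hot start-up cost c_i^{hot,ON}
    cold_start: float     # cold start-up cost c_i^{cold,ON}
    shutdown: float       # shutdown cost c_i^{OFF}, usually negligible
    t_down: int           # minimum down-time t_i^{*down} (h)
    t_cold: int           # additional cold-start threshold t_i^{*cold} (h)

EMIS_PRICE = (0.02, 0.05, 0.04)  # phi_h, $/g for CO2, SO2, NOx (placeholder values)

def production_cost(u: Unit, p: float) -> float:
    """Equation (2): quadratic fuel cost of output p (MW)."""
    return u.alpha * p**2 + u.beta * p + u.delta

def emission_cost(u: Unit, p: float) -> float:
    """Equation (3): emission cost as a linear function of output."""
    return p * sum(phi * psi for phi, psi in zip(EMIS_PRICE, u.emis_factors))

def startup_cost(u: Unit, off_hours: int) -> float:
    """Equation (4): hot start within t_down + t_cold hours offline, cold start beyond."""
    return u.hot_start if off_hours <= u.t_down + u.t_cold else u.cold_start

def period_cost(u: Unit, z_now: int, z_prev: int, p: float, off_hours: int) -> float:
    """One unit's contribution to Equation (1) at a single period t."""
    cost = z_now * (production_cost(u, p) + emission_cost(u, p))
    cost += z_now * (1 - z_prev) * startup_cost(u, off_hours)   # unit is switched ON
    cost += (1 - z_now) * z_prev * u.shutdown                   # unit is switched OFF
    return cost
```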

2.2. Constraints

The objective function in Equation (1) is required to be minimized subject to different unit-specific and system level constraints as presented in Equations (5)–(9):
Capacities: $z_{it}\, p_{i}^{*min} \le p_{it} \le z_{it}\, p_{i}^{*max}$ (5)
Ramp rates: $z_{i,t-1} z_{it} \left(p_{i,t-1} - p_{i}^{*down}\right) \le p_{it} \le z_{i,t-1} z_{it} \left(p_{i,t-1} + p_{i}^{*up}\right)$ (6)
Operating times: $t_{it}^{ON} \ge t_{i}^{*up}$ and $t_{it}^{OFF} \ge t_{i}^{*down}$ (7)
Load balance: $\sum_{i=1}^{n} z_{it}\, p_{it} = d_t$ (8)
Reserve: $\sum_{i=1}^{n} z_{it}\, p_{it}^{max} \ge (1 + r)\, d_t$ (9)
The unit constraints in Equations (5)–(7) affect each unit taken separately and collectively account for the technical specifications of the generating units, while the system constraints in Equations (8) and (9) balance the power supply and demand during each period. Due to the non-convexity of the objective function, the combinatorial nature of the commitments, and the time-dependent technical characteristics of the power supply units, solving the objective function in Equation (1) subject to the constraints in Equations (5)–(9) using classical methods is highly computationally demanding even for a moderate number of units and may also get stuck in local optima. As a result, the power scheduling problem remains a strongly NP-hard problem subject to the curse of dimensionality, and the resulting computational burden has limited the scope of numerical optimization [3]. A model-free RL approach may provide a promising methodological framework for solving the power scheduling problem [5].
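As a complement to the formulation, the short sketch below checks whether a candidate commitment and dispatch for a single period satisfies Equations (5), (6), (8), and (9). It is a schematic validator written for this article's notation, not part of the proposed algorithm; the minimum up/down-time constraint (7) is left to the simulation environment because it requires the operating-time history.

```python
import numpy as np

def feasible_period(z_prev, z_now, p_prev, p_now, p_min, p_max,
                    ramp_down, ramp_up, demand, reserve_rate):
    """Check the period-t constraints of Equations (5), (6), (8) and (9).

    All arguments are NumPy arrays of length n except demand and reserve_rate.
    """
    # Equation (5): capacity limits apply only to committed units.
    if np.any(p_now < z_now * p_min) or np.any(p_now > z_now * p_max):
        return False
    # Equation (6): ramping limits bind for units committed in both periods.
    both_on = (z_prev * z_now).astype(bool)
    if np.any(p_now[both_on] < p_prev[both_on] - ramp_down[both_on]):
        return False
    if np.any(p_now[both_on] > p_prev[both_on] + ramp_up[both_on]):
        return False
    # Equation (8): committed outputs must balance demand.
    if not np.isclose(np.sum(z_now * p_now), demand):
        return False
    # Equation (9): committed maximum capacity must cover demand plus reserve.
    if np.sum(z_now * p_max) < (1.0 + reserve_rate) * demand:
        return False
    return True
```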

3. Proposed Methodological Framework

A novel multi-agent deep contextual RL algorithm for power scheduling is proposed by constructing a specialized environment called an “agent-based contextual simulation environment”, whose main features are explained in Section 3.1. Within the environment, the generating units are represented as cooperative RL agents [15]. The agents actively observe the contextual changes in the environment and can make independent decisions regarding their commitment status and optimal power outputs. At the same time, the agents collaborate to satisfy demand, including the reserve availability at each hour (timestep) described in Equations (8) and (9), while the total operation cost of the entire planning horizon (episode) is to be minimized. The power scheduling dynamics arising from the agent-to-environment interactions follow an agent-based simulation strategy [16] because the agents are heterogeneous, active, and autonomous. These dynamics are simulated in the form of a Markov decision process (MDP), whose key elements are defined below, and then fed as inputs to the deep RL model.

3.1. The Power Scheduling Dynamics as an MDP

Since the planning horizon is a day divided into hours, each hour is considered a timestep $t$. For each timestep $t$ of an episode, the system is in a state $s_t$ consisting of several components: the current timestep $t$; the minimum capacities ($p_{it}^{min}, \forall i$) and maximum capacities ($p_{it}^{max}, \forall i$) derived from the maximum ramp-down ($p_{i}^{*down}, \forall i$) and ramp-up ($p_{i}^{*up}, \forall i$) rates of the units; the current operating (online/offline) time durations ($t_{it}, \forall i$) of the units, constrained by the minimum up-time ($t_{i}^{*up}, \forall i$) and down-time ($t_{i}^{*down}, \forall i$) durations of the units; the commitment statuses ($z_{it}, \forall i$); and the demand ($d_t$) to be satisfied. The state at timestep $t$ is thus defined as $s_t = (t, p_t^{min}, p_t^{max}, t_t, z_t, d_t)$, where $p_t^{min}$ is the vector of minimum capacities, $p_t^{max}$ is the vector of maximum capacities, $t_t$ is the vector of current (online/offline) time durations, $z_t$ is the vector of commitment statuses, and $d_t$ is the demand. Overall, the system’s state space for the entire episode can be described as $S = (t, P^{min}, P^{max}, T, Z, d)$. There are two possible actions (switch-to/stay ON or switch-to/stay OFF) for each of the $n$ agents, which implies a total of $2^n$ action combinations (i.e., unit commitments) in the action space $A$. Each agent $i$ decides its action $a_{it} \in \{0,1\}$ at timestep $t$ (i.e., in state $s_t$), and the decisions of all $n$ agents together constitute an $n$-dimensional vector $a_t \in \{0,1\}^n \subseteq A$. This action vector $a_t$ changes the commitment (ON/OFF) statuses $z_t$ of the units at period $t$ to $z_{t+1}$ at the next period $t+1$. Once the agents take an action $a_t \in A$ in the current state $s_t \in S$, a transition (probability) function $P(s_{t+1} \mid s_t, a_t)$ leads to the next state $s_{t+1}$. The transition function must satisfy all the constraints.
At each timestep $t$ of the planning horizon, the agents first observe the state $s_t = (t, p_t^{min}, p_t^{max}, t_t, z_t, d_t) \in S$. Then, each agent decides to be either ON or OFF, which results in the $2^n$ unit-commitment combinations of the action space $A$; the decisions of all agents constitute the action vector $a_t \in \{0,1\}^n \subseteq A$. The agents then receive an aggregate reward $r_t \in R$ and move to the next state $s_{t+1} \in S$ through a transition (probability) function $P(s_{t+1} = s' \mid s_t = s, a_t)$. Therefore, these power scheduling dynamics can be represented as a 4-tuple $(S, A, P, r)$ Markov decision process, where $S$ is the state space, $A$ is the action space, $P$ is the transition (probability) function, and $r$ is the reward.
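A minimal sketch of how these MDP elements might be represented in code is shown below; the class and field names are illustrative assumptions and do not come from the authors' implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class State:
    """State s_t = (t, p_t^min, p_t^max, t_t, z_t, d_t) for an n-unit system."""
    t: int              # current timestep (hour)
    p_min: np.ndarray   # current minimum capacities, shape (n,)
    p_max: np.ndarray   # current maximum capacities, shape (n,)
    t_op: np.ndarray    # signed operating durations (>0 online, <0 offline), shape (n,)
    z: np.ndarray       # current commitment statuses in {0,1}, shape (n,)
    d: float            # demand to satisfy at period t (MW)

    def as_vector(self) -> np.ndarray:
        """Flatten the state into a single feature vector for the Q-network."""
        return np.concatenate(([self.t], self.p_min, self.p_max,
                               self.t_op, self.z, [self.d])).astype(np.float32)

# A joint action is a vector a_t in {0,1}^n: one ON/OFF decision per agent.
n_units = 5
a_t = np.random.randint(0, 2, size=n_units)  # a random action, for illustration only
```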

3.2. Agent-Based Contextual Simulation Environment Algorithm

The structure of the agent-based simulation environment is roughly similar to that of an OpenAI Gym environment, as described below.
Step 1. The parameters of the supply units are initialized as $s_0 = (0, p_0^{min}, p_0^{max}, t_0, z_0, d_0)$.
Step 2. The minimum and maximum marginal costs of all agents are determined based on the average production cost, Equation (10).
$\lambda_i^{min} = \alpha_i p_i^{*max} + \beta_i + \dfrac{\delta_i}{p_i^{*max}}$ and $\lambda_i^{max} = \alpha_i p_i^{*min} + \beta_i + \dfrac{\delta_i}{p_i^{*min}}$; $\forall i$ (10)
Step 3. The must-ON ($u_{it}^{1}$) and must-OFF ($u_{it}^{0}$) agents are identified based on the operating times, Equation (11).
$u_{it}^{1} = 1$ if $0 < t_{it} < t_i^{*up}$; and $u_{it}^{0} = 1$ if $-t_i^{*down} < t_{it} < 0$; $\forall i$ (11)
Step 4. The agents execute their action $a_t$ in state $s_t$ and then pass to the next state $s_{t+1}$ through a transition (probability) function $P(s_{t+1} \mid s_t, a_t)$ satisfying all constraints.
Step 4.1. The legality of the action $a_{it} \in a_t$ of each agent is confirmed, and the action is legalized if it violates the constraints specified in Equation (11), as shown in Equation (12).
$a_{it} = 1$ if $a_{it} = 0 \mid u_{it}^{1} = 1$; and $a_{it} = 0$ if $a_{it} = 1 \mid u_{it}^{0} = 1$; $\forall i$ (12)
Step 4.2. The aggregate supply capacity of the agents is checked for sufficiency in satisfying the current demand, as well as future demands for which OFF units will not yet have completed their downtime. Contextual capacity adjustments are then made if necessary, and if possible.
If $\sum_{i=1}^{n} a_{it}\, p_{it}^{max} < (1+r)\, d_t$, then set each $a_{it} = 1 \mid u_{it}^{0} = 0$ in increasing order of the $\lambda_i^{min}$'s of Equation (10) until $\sum_{i=1}^{n} a_{it}\, p_{it}^{max} \ge (1+r)\, d_t$. If the capacity shortage cannot be fully corrected due to an insufficient number of unconstrained OFF units, then $s_t$ is labeled as a terminal state ($s_t^{+}$), which results in an incomplete episode ($I_{s_t^{+}} = 1$).
If $\sum_{i=1}^{n} a_{it}\, p_{it}^{min} > (1+r)\, d_t$, then set each $a_{it} = 0 \mid u_{it}^{1} = 0$ in decreasing order of the $\lambda_i^{min}$'s of Equation (10) until $\sum_{i=1}^{n} a_{it}\, p_{it}^{min} \le (1+r)\, d_t$. If the excess capacity cannot be fully adjusted due to an insufficient number of unconstrained ON units, the state $s_t$ is likewise terminal ($s_t^{+}$) and the episode is incomplete ($I_{s_t^{+}} = 1$).
If the current capacity does not satisfy future demands, set each $a_{it} = 1 \mid (t_{it} \ge t_i^{*up})$ in decreasing order of the $\lambda_i^{min}$'s of Equation (10). The current state $s_t$ is also labeled as terminal ($s_t^{+}$) if the future demands cannot be met while the offline units must remain OFF, due to an insufficient number of unconstrained OFF units.
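The two capacity-adjustment rules of Step 4.2 can be sketched as follows; the function signature, variable names, and tie-breaking are illustrative assumptions, and the future-demand check is omitted for brevity.

```python
import numpy as np

def adjust_capacity(a, p_max_t, p_min_t, lam_min, must_on, must_off, demand, r):
    """Contextual correction of Step 4.2 for one period (schematic).

    a        : proposed joint action in {0,1}^n, already legalized by Step 4.1
    lam_min  : minimum marginal costs of Equation (10), used as switching priority
    must_on  : u_it^1 flags, must_off: u_it^0 flags
    Returns (a, terminal), where terminal=True marks an incomplete episode.
    """
    a = a.copy()
    target = (1.0 + r) * demand
    # Shortage: switch ON cheap, unconstrained OFF units until capacity suffices.
    for i in np.argsort(lam_min):                  # increasing marginal cost
        if np.sum(a * p_max_t) >= target:
            break
        if a[i] == 0 and must_off[i] == 0:
            a[i] = 1
    if np.sum(a * p_max_t) < target:
        return a, True                             # terminal: shortage uncorrectable
    # Excess: switch OFF expensive, unconstrained ON units while minimum load is too high.
    for i in np.argsort(lam_min)[::-1]:            # decreasing marginal cost
        if np.sum(a * p_min_t) <= target:
            break
        if a[i] == 1 and must_on[i] == 0:
            a[i] = 0
    if np.sum(a * p_min_t) > target:
        return a, True                             # terminal: excess uncorrectable
    return a, False
```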
Step 5. The total operation cost at timestep $t$ is determined. First, the start-up and shutdown costs are obtained based on the action $a_t$. Second, a lambda-iteration algorithm is used to solve for the optimal power loads $p_{it}$ in Equation (2), which are then used to estimate the emission costs specified in Equation (3). Lastly, the total operation cost is obtained using Equation (13), where $z_{i,t+1} = a_{it}, \forall i$.
$C_t = \sum_{i=1}^{n} \left[ z_{i,t+1}\left(c_{it}^{prod} + c_{it}^{emis}\right) + z_{i,t+1}\left(1 - z_{it}\right) c_{it}^{ON} + \left(1 - z_{i,t+1}\right) z_{it}\, c_{i}^{OFF} \right]$ (13)
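The lambda iteration of Step 5 can be sketched as a simple bisection on the system marginal cost for the quadratic costs of Equation (2); the tolerance, iteration limit, and example data are placeholders, and ramp limits are assumed to be folded into the capacity bounds.

```python
import numpy as np

def lambda_dispatch(alpha, beta, p_min, p_max, demand, tol=1e-6, iters=100):
    """Lambda-iteration economic dispatch for the committed units (schematic)."""
    def outputs(lam):
        # Marginal cost of Equation (2) is 2*alpha*p + beta; invert and clip to limits.
        return np.clip((lam - beta) / (2.0 * alpha), p_min, p_max)

    lo = np.min(2.0 * alpha * p_min + beta)   # lambda giving the minimum total output
    hi = np.max(2.0 * alpha * p_max + beta)   # lambda giving the maximum total output
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        gap = np.sum(outputs(lam)) - demand
        if abs(gap) < tol:
            break
        if gap < 0.0:
            lo = lam        # supply too low: raise lambda
        else:
            hi = lam        # supply too high: lower lambda
    return outputs(lam)

# Example with placeholder data: dispatch three committed units against 400 MW.
p = lambda_dispatch(alpha=np.array([0.001, 0.002, 0.003]),
                    beta=np.array([16.0, 17.0, 18.0]),
                    p_min=np.array([50.0, 40.0, 30.0]),
                    p_max=np.array([300.0, 200.0, 150.0]),
                    demand=400.0)
```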
Step 6. The agents receive an aggregate reward according to the predefined function given in Equation (14), which is the negative of the normalized total operation cost scaled to 100.
$r_t = R(s_t, a_t, s_{t+1}) = -\dfrac{\left(1 - I_{s_t^{+}}\right) C_t + I_{s_t^{+}} C_t^{+} - C^{min}}{C^{max} - C^{min}} \times 100$ (14)
In an episodic RL task, incomplete episodes need to be avoided, and large penalties are recommended in [11]. For this purpose, while the cost function in Equation (13) is used for non-terminal states, the cost for terminal states is defined as $C_t^{+} = C^{max} - \frac{t}{23}\left(C^{max} - C^{p^{max}}\right)$, where $C^{max} = \sum_{i=1}^{n} \lambda_i^{max} p_i^{*max}$ and $C^{p^{max}}$ is the sum of Equations (2) and (3) assuming $p_{it} = p_i^{*max}$. This provides evenly distributed penalties, maintaining the desired proximity to the final timestep.
Step 7. If the current state is terminal (i.e., $I_{s_t^{+}} = 1$) or $t \ge 24$, then go to Step 1 to re-initialize the environment and restart a new episode from state $s_0$. Otherwise, if $t < 24$ and $I_{s_t^{+}} = 0$, the agents pass to the next state $s_{t+1} = (t+1, p_{t+1}^{min}, p_{t+1}^{max}, t_{t+1}, z_{t+1}, d_{t+1})$, where $z_{t+1} = a_t$ is the vector of commitment statuses; $p_{t+1}^{min} = \max\left(p^{*min}, z_t z_{t+1}\left(p_t - p^{*down}\right)\right)$ is the vector of minimum capacities; $p_{t+1}^{max} = \min\left(p^{*max}, z_t z_{t+1}\left(p_t + p^{*up}\right) + I[z_t z_{t+1} = 0]\, p^{*max}\right)$ is the vector of maximum capacities; and the vector of operating time durations is updated element-wise as $t_{t+1} = t_t + 1$ if $z_{t+1} = 1 \mid t_t > 0$, $t_{t+1} = -1$ if $z_{t+1} = 0 \mid t_t > 0$, $t_{t+1} = 1$ if $z_{t+1} = 1 \mid t_t < 0$, and $t_{t+1} = t_t - 1$ if $z_{t+1} = 0 \mid t_t < 0$.
Step 8. Update the must-ON and must-OFF agents for the next timestep $t+1$ as defined in Equation (11): $u_{i,t+1}^{1} = 1$ if $0 < t_{i,t+1} < t_i^{*up}$, and $u_{i,t+1}^{0} = 1$ if $-t_i^{*down} < t_{i,t+1} < 0$; $\forall i$.
Step 9. The execution of $a_t$ in the environment returns the next state ($s_{t+1}$), the reward ($r_t$), an indicator of whether the state is terminal ($I_{s_t^{+}}$), and the load dispatch information together with the legality-confirmed action, $(a_t, p_t)$.
It should be noted that the action $a_t$ returned in Step 9 is not necessarily the same as the action $a_t$ executed in Step 4, since the environment makes contextual corrections in Steps 4.1 and 4.2 using the idea of a contextual search [17]. This agent-based contextual simulation environment, one of the main contributions of this study, can be highly effective in reducing the computing and training time of the multi-agent deep contextual RL model described below.
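Putting Steps 1–9 together, the control flow of the environment can be sketched with a Gym-like interface; the class, method, and attribute names below are illustrative assumptions, and the helper methods are left as placeholders whose logic is given by the steps above.

```python
import numpy as np

class PowerSchedulingEnv:
    """Gym-like skeleton of the agent-based contextual simulation environment."""

    def __init__(self, units, demand_profile, reserve_rate, cost_bounds):
        self.units = units                      # unit parameter records
        self.demand = demand_profile            # 24-hour demand profile d_1..d_24
        self.r = reserve_rate                   # reserve requirement r
        self.c_min, self.c_max = cost_bounds    # normalization bounds for Equation (14)
        self.state = None

    def reset(self):
        # Step 1: re-initialize the state s_0 = (0, p_0^min, p_0^max, t_0, z_0, d_0).
        self.state = self._initial_state()
        return self.state

    def step(self, action):
        s = self.state
        # Steps 3 and 4.1: flag must-ON/must-OFF agents and legalize the action (Eqs. (11)-(12)).
        a = self._legalize(np.asarray(action), s)
        # Step 4.2: contextual capacity adjustment, possibly flagging a terminal state.
        a, terminal = self._adjust_capacity(a, s)
        # Step 5: lambda-iteration dispatch and total period cost of Equation (13).
        loads = self._dispatch(a, s)
        cost = self._terminal_cost(s) if terminal else self._period_cost(a, loads, s)
        # Step 6: aggregate reward of Equation (14).
        reward = -100.0 * (cost - self.c_min) / (self.c_max - self.c_min)
        # Steps 7 and 8: transition to s_{t+1} and update the must-ON/must-OFF flags.
        done = terminal or s.t + 1 >= 24
        if not done:
            self.state = self._transition(s, a, loads)
        # Step 9: return the contextually corrected action and dispatch as extra info.
        return self.state, reward, done, {"action": a, "loads": loads}

    # Placeholder helpers; their logic is specified by Steps 1-8 in Section 3.2.
    def _initial_state(self): raise NotImplementedError
    def _legalize(self, a, s): raise NotImplementedError
    def _adjust_capacity(self, a, s): raise NotImplementedError
    def _dispatch(self, a, s): raise NotImplementedError
    def _period_cost(self, a, loads, s): raise NotImplementedError
    def _terminal_cost(self, s): raise NotImplementedError
    def _transition(self, s, a, loads): raise NotImplementedError
```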

3.3. Deep Contextual Reinforcement Learning

At each timestep $t \in T$, the agents observe a state $s_t \in S$ and select their respective actions $a_t$ from the action space $A$ according to a policy $\pi(a_t \mid s_t)$, where $\pi$ is a mapping from states $s_t$ to actions $a_t$. The agents then receive a reward $r_t$ and proceed to the next state $s_{t+1}$. This process continues until the agents finish the entire episode or reach a terminal state, both of which re-initialize the environment. The agents’ goal is to learn a policy $\pi(a_t \mid s_t)$ that maximizes the long-run cumulative sum of rewards, called the return, defined as $G_t \doteq \sum_{k=0}^{\infty} \gamma^k r_{t+k}$, where $\gamma \in [0,1]$ is the discount rate of the MDP. The expected return of action $a_t$ in state $s_t$ can be expressed as an action-value function $Q^{\pi}(s_t, a_t) = E(G_t \mid s_t, a_t)$. It can be approximated using a deep Q-network (DQN), which can be applied in a high-dimensional state and/or action space [18]. The action-value function can then be written as $Q(s_t, a_t \mid \theta)$, where $\theta$ denotes the parameters of the DQN model, whose inputs are the power scheduling dynamics simulated from the environment in the form of MDPs. The size of the action space $A$ is $2^n$, which may render an exponential growth in computation; thus, it is impractical to parameterize the model with $2^n$ output nodes. Instead, it is parameterized with $2n$ output nodes corresponding to the two possible actions (ON/OFF) of each agent. As a result, the model estimates action values for the $2n$ output nodes, and the decisions of the agents, made using an exponentially decaying epsilon-greedy exploration strategy, collectively constitute the action vector $a_t$.
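The 2n-output parameterization and the per-agent epsilon-greedy selection might look as follows in PyTorch; the layer widths and decay constants are placeholders, and only the ReLU activations on the hidden and output layers are taken from the description in Section 4.

```python
import math
import random
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Feedforward Q-network with 2n output nodes: an (OFF, ON) value pair per agent."""

    def __init__(self, state_dim: int, n_units: int, hidden: int = 128):
        super().__init__()
        self.n_units = n_units
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * n_units), nn.ReLU(),   # ReLU on the output layer, as in Section 4
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # Reshape to (batch, n_units, 2) so each agent has its own pair of action values.
        return self.net(s).view(-1, self.n_units, 2)

def select_actions(q_net: QNetwork, state_vec: torch.Tensor, step: int,
                   eps_start=1.0, eps_end=0.05, eps_decay=5000) -> torch.Tensor:
    """Exponentially decayed epsilon-greedy ON/OFF decision for every agent."""
    eps = eps_end + (eps_start - eps_end) * math.exp(-step / eps_decay)
    with torch.no_grad():
        q_values = q_net(state_vec.unsqueeze(0)).squeeze(0)      # (n_units, 2)
    greedy = q_values.argmax(dim=1)                              # per-agent greedy action
    explore = torch.tensor([random.random() < eps for _ in range(q_net.n_units)])
    random_actions = torch.randint(0, 2, (q_net.n_units,))
    return torch.where(explore, random_actions, greedy)
```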
In a power scheduling problem, the action taken in a particular state affects the rewards and a set of future states, which yields serial correlations among the MDP transitions. In such cases, the direct application of DQN may not be efficient, as it might result in unstable and slow learning [11]. By employing the notion of experience replay, the autocorrelation among the states may be properly addressed, and the training process can be expedited and stabilized [19]. After storing a transition tuple $(s_t, a_t, r_t, s_{t+1})$ in a replay buffer $B$, a batch $b$ of experiences $(s, a, r, s')$ is sampled to approximate the action-value function $Q(s, a \mid \theta)$ and the target network $Q(s', a' \mid \theta^{-})$, where $s$ and $s'$ are of size $(b \times 2n)$, and the sizes of $a$ and $r$ are $(b \times n)$ and $(b \times 1)$, respectively. The DQN algorithm minimizes the mean-squared loss (i.e., the temporal difference) $L_e(\theta)$ defined by
$L_e(\theta) = E_{(s, a, r, s') \sim B}\left[\left(r + \gamma \max_{a'} Q(s', a' \mid \theta^{-}) - Q(s, a \mid \theta)\right)^2\right]$ (15)
where $\theta^{-}$ denotes the parameters of the target network, which are periodically updated from the Q-network parameters $\theta$, and $e$ denotes the iteration index.
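A minimal experience-replay update for the loss in Equation (15) is sketched below; it assumes the QNetwork class from the previous sketch, stores states and actions as tensors, and treats each agent's chosen output node independently, with the buffer capacity, batch size, and target-update rule as illustrative placeholders.

```python
import random
from collections import deque
import torch
import torch.nn as nn

class ReplayBuffer:
    """Fixed-size buffer B of transition tuples (s, a, r, s', done)."""

    def __init__(self, capacity: int = 50_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        # s, s_next: float tensors; a: int64 tensor of shape (n,); r: float; done: 0/1.
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int):
        batch = random.sample(list(self.buffer), batch_size)
        s, a, r, s_next, done = zip(*batch)
        return (torch.stack(s), torch.stack(a), torch.tensor(r, dtype=torch.float32),
                torch.stack(s_next), torch.tensor(done, dtype=torch.float32))

def dqn_update(q_net, target_net, optimizer, buffer, batch_size=64, gamma=0.99):
    """One gradient step on the temporal-difference loss of Equation (15)."""
    if len(buffer.buffer) < batch_size:
        return None
    s, a, r, s_next, done = buffer.sample(batch_size)
    # Q(s, a | theta): gather the value of the action actually taken by each agent.
    q_sa = q_net(s).gather(2, a.unsqueeze(2)).squeeze(2)             # (batch, n_units)
    with torch.no_grad():
        # max_a' Q(s', a' | theta^-): per-agent greedy value from the target network.
        q_next = target_net(s_next).max(dim=2).values                # (batch, n_units)
        target = r.unsqueeze(1) + gamma * (1.0 - done.unsqueeze(1)) * q_next
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# The target parameters theta^- are periodically copied from theta, e.g.:
# target_net.load_state_dict(q_net.state_dict())
```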

4. Demonstrative Example

The applicability of the proposed deep contextual RL method is demonstrated on the power system investigated in [20]. The test system consists of five units, for which the supply and demand profiles and emission parameters can be found in [20]. The deep RL model utilizes a feedforward neural network with a rectified linear unit (ReLU) activation function on both the hidden and output layers. The learning rate and discount factor were set to 0.0001 and 0.99, respectively, and the Adam optimizer was used.

Using the popular genetic algorithm, the optimal cost reported in [20] was $430,331, comprising start-up, production, and emission costs of $3140 (0.7%), $289,178 (67.2%), and $138,013 (32.1%), respectively. In contrast, the proposed method yields an improved optimal operation cost of $413,122, as shown in Table 1 and Figure 1 and Figure 2, corresponding to a $17,209 (4.0%) lower daily operation cost than the results reported in [20], as compared in Table 2 and Figure 3. This total is composed of $2230 (0.5%) for start-up costs, $275,962 (66.8%) for production costs, and $134,931 (32.6%) for emission costs. The proposed method is also expected to be computationally efficient owing to the adoption of experience replay.

The scalability of the proposed method is further tested on large-scale power systems comprising larger numbers of generating units. Duplicating the five-unit test system multiple times and scaling the demands proportionately, the proposed method has been applied to obtain the optimal power scheduling of the individual units. The optimal costs of the duplicated large-scale power systems are summarized in Table 3. It is worth noting that the optimal operating cost of each test system is lower than the proportionally scaled optimal operating cost of the original five-unit system, implying that the proposed method can be extended to render economically and environmentally better solutions for larger-scale power systems.
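Under the reported settings (a feedforward ReLU network, learning rate 0.0001, discount factor 0.99, and the Adam optimizer), the training configuration might be wired as in the short sketch below; the hidden width and the flattened state dimension of 4n + 2 (derived from the state definition in Section 3.1) are assumptions rather than values stated in the paper.

```python
import copy
import torch
import torch.nn as nn

# Five units; the state s_t = (t, p^min, p^max, t_op, z, d) flattens to 4n + 2 features.
n_units = 5
state_dim = 4 * n_units + 2

# Feedforward network with ReLU on the hidden and output layers, as reported in Section 4;
# the hidden width of 128 is a placeholder that the paper does not specify.
q_net = nn.Sequential(
    nn.Linear(state_dim, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 2 * n_units), nn.ReLU(),
)
target_net = copy.deepcopy(q_net)                           # target network with parameters theta^-

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)   # learning rate 0.0001 (Section 4)
gamma = 0.99                                                # discount factor (Section 4)
```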

5. Concluding Remarks

This article presents a novel RL-based algorithm designed to optimize the economic and environmental cost of power scheduling. The algorithm utilizes a contextually corrective agent-based RL environment, which simulates power scheduling dynamics using the framework of MDP. To evaluate the applicability and performance of the proposed method, the algorithm is tested on different test systems comprising up to 100 generating units. It is demonstrated that the algorithm provides superior solutions and is scalable to handle larger power systems. The potential for incorporating renewable power sources and investigating their impacts further highlights the versatility and applicability of the proposed method in addressing real-world power scheduling challenges.

Author Contributions

Conceptualization, A.S.E. and Y.J.K.; methodology, A.S.E. and C.P.; software, A.S.E.; validation, C.P. and Y.C.; formal analysis, A.S.E.; investigation, C.P. and Y.C.; resources, Y.J.K.; data curation, Y.C.; writing—original draft preparation, A.S.E.; writing—review and editing, Y.J.K.; supervision, Y.J.K.; project administration, Y.J.K.; funding acquisition, Y.J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (2021R1I1A3047456).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Nomenclature

Indices
$n$: Number of units.
$m$: Number of emission types.
$N = \{1, 2, \ldots, n\}$: Indices of all units, $i \in N$.
$M = \{1, 2, \ldots, m\}$: Indices of all types of emissions, $h \in M$.
$T = \{1, 2, \ldots, 24\}$: Indices of all periods, $t \in T$.
Units and Demand Profiles
$p_i^{*max}, p_i^{*min}$: Max, min capacity of unit $i$ (MW).
$p_{it}^{max}, p_{it}^{min}$: Max, min capacity of unit $i$ at period $t$ (MW).
$p_{it}$: Power output of unit $i$ at period $t$ (MW).
$t_i^{*up}, t_i^{*down}$: Min online, offline duration of unit $i$ (hour).
$t_{it}$: Operating (online/offline) duration of unit $i$ at period $t$ (hour).
$t_{it}^{ON}, t_{it}^{OFF}$: Online (up), offline (down) duration of unit $i$ at period $t$ (hour).
$u_{it}^{1}, u_{it}^{0}$: Indicators of whether unit $i$ must be ON, must be OFF at period $t$.
$d_t$: Demand at period $t$ (MW).
$r$: Percentage of demand required as reserve capacity.
Objective Function
$C, C_t$: Total generation cost of a day and at period $t$.
$\alpha_i, \beta_i, \delta_i$: Quadratic, linear, constant parameters of the cost function of unit $i$.
$\phi_h, \psi_{ih}$: Externality cost of emission type $h$ ($/g); emission factor of unit $i$ for type $h$ (g/MW).
$c_{it}^{ON}, c_i^{OFF}$: Start-up cost of unit $i$ at period $t$; shutdown cost of unit $i$.
Others
$E$: Expected value.
$I$: Indicator function.

References

  1. Asokan, K.; Ashokkumar, R. Emission controlled Profit based Unit commitment for GENCOs using MPPD Table with ABC algorithm under Competitive Environment. WSEAS Trans. Syst. 2014, 13, 523–542. [Google Scholar]
  2. Roque, L.; Fontes, D.; Fontes, F. A multi-objective unit commitment problem combining economic and environmental criteria in a metaheuristic approach. In Proceedings of the 4th International Conference on Energy and Environment Research, Porto, Portugal, 17–20 July 2017. [Google Scholar]
  3. Montero, L.; Bello, A.; Reneses, J. A review on the unit commitment problem: Approaches, techniques, and resolution methods. Energies 2022, 15, 1296. [Google Scholar] [CrossRef]
  4. De Mars, P.; O’Sullivan, A. Applying reinforcement learning and tree search to the unit commitment problem. Appl. Energy 2021, 302, 117519. [Google Scholar] [CrossRef]
  5. De Mars, P.; O’Sullivan, A. Reinforcement learning and A* search for the unit commitment problem. Energy AI 2022, 9, 100179. [Google Scholar] [CrossRef]
  6. Jasmin, E.A.; Imthias Ahamed, T.P.; Jagathy Raj, V.P. Reinforcement learning solution for unit commitment problem through pursuit method. In Proceedings of the 2009 International Conference on Advances in Computing, Control, and Telecommunication Technologies, Bangalore, India, 28–29 December 2009. [Google Scholar]
  7. Jasmin, E.A.T.; Remani, T. A function approximation approach to reinforcement learning for solving unit commitment problem with photo voltaic sources. In Proceedings of the 2016 IEEE International Conference on Power Electronics, Drives and Energy Systems, Trivandrum, India, 14–17 December 2016. [Google Scholar]
  8. Li, F.; Qin, J.; Zheng, W. Distributed Q-learning-based online optimization algorithm for unit commitment and dispatch in smart grid. IEEE Trans. Cybern. 2019, 50, 4146–4156. [Google Scholar] [CrossRef] [PubMed]
  9. Navin, N.; Sharma, R. A fuzzy reinforcement learning approach to thermal unit commitment problem. Neural Comput. Appl. 2019, 31, 737–750. [Google Scholar] [CrossRef]
  10. Dalal, G.; Mannor, S. Reinforcement learning for the unit commitment problem. In Proceedings of the 2015 IEEE Eindhoven PowerTech, Eindhoven, Netherlands, 29 June–2 July 2015. [Google Scholar]
  11. Qin, J.; Yu, N.; Gao, Y. Solving unit commitment problems with multi-step deep reinforcement learning. In Proceedings of the 2021 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids, Aachen, Germany, 25–28 October 2021. [Google Scholar]
  12. Ongsakul, W.; Petcharaks, N. Unit commitment by enhanced adaptive Lagrangian relaxation. IEEE Trans. Power Syst. 2004, 19, 620–628. [Google Scholar] [CrossRef]
  13. Nemati, M.; Braun, M.; Tenbohlen, S. Optimization of unit commitment and economic dispatch in microgrids based on genetic algorithm and mixed integer linear programming. Appl. Energy 2018, 2018, 944–963. [Google Scholar] [CrossRef]
  14. Trüby, J. Thermal Power Plant Economics and Variable Renewable Energies: A Model-Based Case Study for Germany; International Energy Agency: Paris, France, 2014.
  15. Zhang, K.; Yang, Z.; Başar, T. Multi-agent reinforcement learning: A selective overview of theories and algorithms. In Handbook of Reinforcement Learning and Control; Vamvoudakis, K.G., Wan, Y., Lewis, F.L., Cansever, D., Eds.; Springer: Berlin/Heidelberg, Germany, 2021; pp. 321–384. [Google Scholar]
  16. Wilensky, U.; Rand, W. An Introduction to Agent-Based Modeling: Modeling Natural, Social, and Engineered Complex Systems with NetLogo; The MIT Press: London, UK, 2015. [Google Scholar]
  17. Sutton, R.; Barto, A. Reinforcement Learning: An Introduction; The MIT Press: London, UK, 2018. [Google Scholar]
  18. Matzliach, B.; Ben-Gal, I.; Kagan, E. Detection of static and mobile targets by an autonomous agent with deep Q-learning abilities. Entropy 2022, 24, 1168. [Google Scholar] [CrossRef] [PubMed]
  19. Adam, S.; Busoniu, L.; Babuska, R. Experience replay for real-time reinforcement learning control. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2012, 42, 201–212. [Google Scholar] [CrossRef]
  20. Yildirim, M.; Özcan, M. Unit commitment problem with emission cost constraints by using genetic algorithm. Gazi Univ. J. Sci. 2022, 35, 957–967. [Google Scholar]
Figure 1. Optimal loads of test power system I using the proposed method.
Figure 2. Costs of test power system I using the proposed method.
Figure 3. Radar plot of the hourly optimal costs using the genetic algorithm (GA) and the proposed reinforcement learning (RL).
Table 1. Optimal commitments, optimal loads, and available reserve of test power system I using the proposed method.

| Hour (t) | z1t | z2t | z3t | z4t | z5t | p1t (MW) | p2t (MW) | p3t (MW) | p4t (MW) | p5t (MW) | r (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | 0 | 400.0 | 0 | 0 | 0 | 0 | 13.8 |
| 2 | 1 | 0 | 1 | 0 | 0 | 426.5 | 0 | 23.5 | 0 | 0 | 30.0 |
| 3 | 1 | 0 | 1 | 0 | 0 | 450.9 | 0 | 29.1 | 0 | 0 | 21.9 |
| 4 | 1 | 0 | 1 | 0 | 0 | 455.0 | 0 | 45.0 | 0 | 0 | 17.0 |
| 5 | 1 | 0 | 1 | 0 | 0 | 455.0 | 0 | 75.0 | 0 | 0 | 10.4 |
| 6 | 1 | 1 | 1 | 0 | 0 | 455.0 | 36.6 | 58.4 | 0 | 0 | 30.0 |
| 7 | 1 | 1 | 1 | 0 | 0 | 455.0 | 52.0 | 73.0 | 0 | 0 | 23.3 |
| 8 | 1 | 1 | 1 | 0 | 0 | 455.0 | 62.3 | 82.7 | 0 | 0 | 19.2 |
| 9 | 1 | 1 | 1 | 0 | 0 | 455.0 | 72.6 | 92.4 | 0 | 0 | 15.3 |
| 10 | 1 | 1 | 1 | 1 | 0 | 455.0 | 77.7 | 97.3 | 20 | 0 | 22.3 |
| 11 | 1 | 1 | 1 | 1 | 0 | 455.0 | 93.1 | 111.9 | 20 | 0 | 16.9 |
| 12 | 1 | 1 | 1 | 1 | 0 | 455.0 | 103.3 | 121.7 | 20 | 0 | 13.6 |
| 13 | 1 | 1 | 1 | 1 | 0 | 455.0 | 77.7 | 97.3 | 20 | 0 | 22.3 |
| 14 | 1 | 1 | 1 | 0 | 0 | 455.0 | 72.5 | 92.5 | 0 | 0 | 15.3 |
| 15 | 1 | 1 | 1 | 0 | 0 | 455.0 | 62.3 | 82.7 | 0 | 0 | 19.2 |
| 16 | 1 | 1 | 1 | 0 | 0 | 455.0 | 36.6 | 58.4 | 0 | 0 | 30.0 |
| 17 | 1 | 0 | 1 | 0 | 0 | 455.0 | 0 | 45.0 | 0 | 0 | 17.0 |
| 18 | 1 | 0 | 1 | 1 | 0 | 455.0 | 0 | 75.0 | 20 | 0 | 20.9 |
| 19 | 1 | 0 | 1 | 1 | 0 | 455.0 | 0 | 125.0 | 20 | 0 | 10.8 |
| 20 | 1 | 0 | 1 | 1 | 1 | 455.0 | 0 | 130.0 | 55 | 10 | 10.8 |
| 21 | 1 | 0 | 1 | 1 | 0 | 455.0 | 0 | 125.0 | 20 | 0 | 10.8 |
| 22 | 1 | 0 | 1 | 1 | 0 | 455.0 | 0 | 75.0 | 20 | 0 | 20.9 |
| 23 | 1 | 0 | 1 | 0 | 0 | 455.0 | 0 | 45.0 | 0 | 0 | 17.0 |
| 24 | 1 | 0 | 1 | 0 | 0 | 426.4 | 0 | 23.6 | 0 | 0 | 30.0 |
Table 2. Comparison of the optimal costs (start-up cost, production cost, emission cost, and total cost) of test power system I between the genetic algorithm (GA) [20] and the proposed RL method.

| Hour (t) | GA Start-Up ($) | GA Production ($) | GA Emission ($) | GA Total ($) | RL Start-Up ($) | RL Production ($) | RL Emission ($) | RL Total ($) |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 8466 | 4824 | 13,290 | 0 | 7553 | 4241 | 11,793 |
| 2 | 0 | 8466 | 4828 | 13,294 | 560 | 9061 | 4702 | 14,324 |
| 3 | 60 | 10,564 | 5044 | 15,668 | 0 | 9560 | 5004 | 14,564 |
| 4 | 0 | 10,564 | 5044 | 15,608 | 0 | 9893 | 5170 | 15,063 |
| 5 | 1120 | 11,327 | 5825 | 18,271 | 0 | 10,395 | 5401 | 15,797 |
| 6 | 0 | 11,327 | 5825 | 17,151 | 1100 | 11,427 | 5555 | 18,082 |
| 7 | 0 | 11,327 | 5825 | 17,151 | 0 | 11,930 | 5786 | 17,717 |
| 8 | 60 | 13,425 | 6045 | 19,529 | 0 | 12,267 | 5940 | 18,207 |
| 9 | 0 | 13,425 | 6045 | 19,469 | 0 | 12,604 | 6094 | 18,699 |
| 10 | 340 | 13,523 | 6145 | 20,008 | 340 | 13,591 | 6251 | 20,183 |
| 11 | 30 | 15,621 | 6365 | 22,016 | 0 | 14,099 | 6483 | 20,582 |
| 12 | 0 | 15,621 | 6365 | 21,986 | 0 | 14,439 | 6637 | 21,076 |
| 13 | 0 | 13,523 | 6145 | 19,668 | 0 | 13,591 | 6251 | 19,843 |
| 14 | 30 | 13,425 | 6045 | 19,499 | 0 | 12,604 | 6094 | 18,699 |
| 15 | 1100 | 13,456 | 6045 | 20,601 | 0 | 12,267 | 5940 | 18,207 |
| 16 | 0 | 11,358 | 5825 | 17,182 | 0 | 11,427 | 5555 | 16,982 |
| 17 | 0 | 11,358 | 5825 | 17,182 | 0 | 9893 | 5170 | 15,063 |
| 18 | 0 | 11,358 | 5825 | 17,182 | 170 | 11,213 | 5481 | 16,865 |
| 19 | 340 | 13,554 | 6145 | 20,039 | 0 | 12,059 | 5866 | 17,926 |
| 20 | 0 | 13,554 | 6145 | 19,699 | 60 | 13,862 | 6085 | 20,007 |
| 21 | 0 | 13,554 | 6145 | 19,699 | 0 | 12,059 | 5866 | 17,926 |
| 22 | 0 | 11,358 | 5825 | 17,182 | 0 | 11,213 | 5481 | 16,695 |
| 23 | 60 | 10,564 | 5044 | 15,668 | 0 | 9893 | 5170 | 15,063 |
| 24 | 0 | 8466 | 4824 | 13,290 | 0 | 9061 | 4702 | 13,764 |
| Total | 3140 | 289,178 | 138,013 | 430,331 | 2230 | 275,962 | 134,931 | 413,122 |
Table 3. Optimal costs of power production for large-scale systems.

| Number of Units | Start-Up ($) | Production ($) | Emission ($) | Total ($) |
|---|---|---|---|---|
| 10 | 4840 | 545,837 | 270,724 | 821,401 |
| 20 | 9300 | 1,083,979 | 540,476 | 1,633,754 |
| 30 | 14,370 | 1,622,473 | 811,866 | 2,448,710 |
| 40 | 18,980 | 2,160,114 | 1,082,666 | 3,261,760 |
| 50 | 31,660 | 2,744,090 | 1,329,071 | 4,104,821 |
| 80 | 51,320 | 4,369,093 | 2,129,526 | 6,549,939 |
| 100 | 58,960 | 5,456,700 | 2,674,207 | 8,189,867 |
