Article

An Efficient Simulation-Based Policy Improvement with Optimal Computing Budget Allocation Based on Accumulated Samples

1 Department of Artificial Intelligence Convergence, Pukyong National University, Busan 48513, Korea
2 Department of Electronic and Electrical Engineering, Ewha Womans University, Seoul 03760, Korea
* Author to whom correspondence should be addressed.
Electronics 2022, 11(7), 1141; https://doi.org/10.3390/electronics11071141
Submission received: 8 March 2022 / Revised: 29 March 2022 / Accepted: 1 April 2022 / Published: 4 April 2022

Abstract:
Markov decision processes (MDPs) are widely used to model stochastic systems to deduce optimal decision-making policies. As the transition probabilities are usually unknown in MDPs, simulation-based policy improvement (SBPI) using a base policy to derive optimal policies when the state transition probabilities are unknown is suggested. However, estimating the Q-value of each action to determine the best action in each state requires many simulations, which results in efficiency problems for SBPI. In this study, we propose a method to improve the overall efficiency of SBPI using optimal computing budget allocation (OCBA) based on accumulated samples. Previous works have mainly focused on improving SBPI efficiency for a single state and without using the previous simulation samples. In contrast, the proposed method improves the overall efficiency until an optimal policy can be found in consideration of the state traversal property of the SBPI. The proposed method accumulates simulation samples across states to estimate the unknown transition probabilities. These probabilities are then used to estimate the mean and variance of the Q-value for each action, which allows the OCBA to allocate the simulation budget efficiently to find the best action in each state. As the SBPI traverses the state, the accumulated samples allow appropriate allocation of OCBA; thus, the optimal policy can be obtained with a lower budget. The experimental results demonstrate the improved efficiency of the proposed method compared to previous works.

1. Introduction

A Markov decision process (MDP) is a discrete-time stochastic control scheme that aims to solve stochastic decision-making problems. Research on stochastic decision-making problems has been widely reported in numerous fields, such as physics [1,2], finance [3], and biology [4]. Intuitively, decision-making in MDP-based complex systems involves a process of finding an optimal policy. This has been leveraged to simulate real dynamic environments of complex systems to derive optimal solutions (i.e., policies) to predict or improve system performance; see these examples of a robot motion planning system [5], a dataflow system [6], and a mobile edge computing system [7]. The MDP consists of a set of discrete states and a finite set of actions. The MDP policy involves mapping from states to actions. When an action is implemented following the policy in a given state, it is transferred to a new state according to the transition probability and receives a reward, as shown in Figure 1. The objective of the MDP is to find an optimal policy, and the optimal policy consists of the best actions that maximize the expected sum of discounted rewards (i.e., Q-value) in each state. Since the MDP mimics practical management systems, the action space is typically large and transition probabilities are usually not known in advance; thus, directly finding the optimal policy is impractical and time consuming.
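To make these components concrete, the sketch below shows one possible way to encode a small MDP in Python. Every state name, action name, probability, and reward here is an illustrative placeholder, not a model used later in the paper:

```python
# A minimal, illustrative MDP encoding: states, actions, transition
# probabilities (unknown to the agent in practice), and rewards received
# on arrival at a state.
mdp = {
    "states": ["s1", "s2"],
    "actions": {"s1": ["a1", "a2"], "s2": ["a1", "a2"]},
    # P[(s, a)][s_next] = probability of moving from s to s_next when taking a in s
    "P": {
        ("s1", "a1"): {"s1": 0.3, "s2": 0.7},
        ("s1", "a2"): {"s1": 0.6, "s2": 0.4},
        ("s2", "a1"): {"s1": 0.5, "s2": 0.5},
        ("s2", "a2"): {"s1": 0.1, "s2": 0.9},
    },
    # r[s_next] = reward received on arriving at state s_next
    "r": {"s1": 0.0, "s2": 1.0},
}
# A deterministic policy maps each state to an action.
policy = {"s1": "a1", "s2": "a2"}
```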
When it is not feasible to directly determine an optimal policy, a more appropriate approach is to improve upon a given base policy, which is effective and often available in engineering practice. Simulation-based policy improvement (SBPI) (also known as rollout) [8] is a heuristic method that gradually improves the base policy through simulations. In a given state, SBPI estimates the Q-value of each action using simulations and updates the policy in that state with the best action selected according to these values. Owing to its ease of implementation, SBPI has been widely applied to many problems, including electric vehicle charging [9] and post-hazard recovery [10]. However, when the number of available actions in each state is large or the reachable states are numerous, the collected simulation samples may exhibit large variance. Thus, to select the best action accurately, a large simulation budget (i.e., a large number of simulation replications) is required to estimate the Q-value of each action, which results in an efficiency problem.
Ranking and selection (R&S) procedures can be used to address the above problem because they can efficiently select the best action under a limited simulation budget by allocating the budget based on statistical inference. There are various types of R&S procedures, such as indifference-zone [11], uncertainty evaluation [12], and optimal computing budget allocation (OCBA) [13]. Among them, OCBA is used in many fields owing to its excellent efficiency, simplicity, and strong theoretical background. It allocates the simulation budget to asymptotically maximize the lower bound of the probability of correct selection based on the ratio of the sample mean to sample variance. Based on the merits of OCBA, Jia et al. [14] applied it to improve the efficiency of SBPI and efficiently find the best action for a given state. Wu et al. [15] developed a sample path sharing procedure to further improve upon the above work. For a given state, a sample path is obtained by selecting an action and thereafter following the given policy. When the number of sample paths increases, the overlaps between the sample paths generated by different actions allow accurate estimation of the Q-value for each action. They reported that the sample path sharing procedure dramatically improves the efficiency of SBPI compared to the previous methods [14].
To derive the optimal policy, the SBPI should traverse all states until the base policy of each state can no longer be improved. However, the above works focus on improving the efficiency of the SBPI in a single state; i.e., there is room to further improve the efficiency of finding the optimal policy with SBPI. In this work, we propose a method to improve the overall efficiency of SBPI using OCBA and sample accumulation. Specifically, the proposed method accumulates simulation samples across all states to estimate the unknown state transition probabilities. These probabilities are used to estimate the mean and variance of the Q-value for each action. As the SBPI traverses the states, the probabilities become more accurate, thereby enabling precise estimations of the mean and variance. They allow the OCBA to allocate a budget suited to each action and to select the best action for a low cost. Thus, the proposed method can reduce the total budget required to derive the optimal policy with SBPI compared to other methods, which is demonstrated using two MDP examples. The present work was adapted from our previous work [16], which used accumulated samples to estimate the mean of the Q-values. We expanded the use of these samples to further estimate the variance of the Q-values by fully adapting to the OCBA workflow.

2. Problem Definition

Herein, we consider a discrete-time MDP with a discrete state space $S$ and a discrete action space $A$. In the MDP, the policy learner or decision-maker is called the agent. Assume that the agent is at state $s$ and performs an action $a$. It then transits to a new state in the set of reachable states $S_s^a = \{s_1, s_2, s_3\}$, determined by the unknown transition probabilities $P_{ss_1}^a, P_{ss_2}^a, P_{ss_3}^a$. The agent receives one of the numerical rewards $r(s_1), r(s_2), r(s_3)$ based on the state of arrival. We define a random variable $h(s,a)$ whose possible outcomes are the rewards received when arriving at the corresponding state by taking the action $a$:

$$h(s,a) \in \{ r(s_1), r(s_2), r(s_3) \}. \tag{1}$$

If the transition probabilities are known, the expected reward for the action $a$ in state $s$ can be calculated as

$$\mathbb{E}[h(s,a)] = P_{ss_1}^a \cdot r(s_1) + P_{ss_2}^a \cdot r(s_2) + P_{ss_3}^a \cdot r(s_3). \tag{2}$$
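As a worked illustration of Equation (2), and assuming the toy probabilities and rewards sketched above are available (in practice they are not), the expected one-step reward is simply the probability-weighted sum of arrival rewards:

```python
def expected_reward(P_sa: dict, r: dict) -> float:
    """E[h(s, a)] = sum over reachable states s_next of P(s_next | s, a) * r(s_next)."""
    return sum(p * r[s_next] for s_next, p in P_sa.items())

# e.g., for the illustrative pair ("s1", "a1") above:
# expected_reward(mdp["P"][("s1", "a1")], mdp["r"])  ->  0.3*0.0 + 0.7*1.0 = 0.7
```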
Now, we formulate the policy improvement problem. In this study, we only consider a deterministic stationary policy $\pi$ (i.e., a mapping from $S$ to $A$), which serves as a guideline for the agent on the action that should be taken in a particular state. Assume that there exists a base policy $\pi$. For a given state $s$, if an action $a$ is taken and the base policy $\pi$ is followed afterward, the Q-value of the action $a \in A$ can be defined as

$$Q_\pi(s,a) = \lim_{T\to\infty} \mathbb{E}\!\left[ h(s,a) + \mathbb{E}_\pi\!\left[ \sum_{t=1}^{T-1} \gamma^t h(s_t, \pi(s_t)) \right] \,\middle|\, s, a \right], \tag{3}$$

where $T$ is the terminal time index, $\gamma \in (0,1)$ is the discount rate, and $s_t$ is one of the reachable states at time $t$ (i.e., $s_t \in S_{s_{t-1}}^{\pi(s_{t-1})}$). $\mathbb{E}_\pi$ indicates the expected sum of discounted rewards obtained by taking actions following the given policy $\pi$ from time $1$ to $T-1$. Using the definition above, the policy improvement at state $s$ can be defined as

$$\pi_{PI}(s) = a_b = \arg\max_{a \in \{a_1, a_2, \ldots, a_k\}} Q_\pi(s,a), \tag{4}$$

where $\pi_{PI}$ is the policy improved from $\pi$ by updating the previous action $\pi(s)$ with the best action $a_b$ in $s$. In practice, the transition probabilities are usually unknown, and $Q_\pi(s,a)$ cannot be calculated directly. Thus, Equation (3) can only be calculated using an infinite number of simulation replications $n$; i.e.,

$$Q_\pi(s,a) = \lim_{T\to\infty} \lim_{n\to\infty} \frac{1}{n} \sum_{l=1}^{n} \left[ h(s,a) + \sum_{t=1}^{T-1} \gamma^t h(s_t, \pi(s_t)) \,\middle|\, s, a \right]. \tag{5}$$
Since it is practically infeasible to take actions infinitely in a simulation replication, $T$ becomes a decision variable called the epoch, and Equation (5) is approximated by

$$Q_\pi^T(s,a) = \lim_{n\to\infty} \frac{1}{n} \sum_{l=1}^{n} \left[ h(s,a) + \sum_{t=1}^{T-1} \gamma^t h(s_t, \pi(s_t)) \,\middle|\, s, a \right]. \tag{6}$$

That is, in a single replication, $T$ actions, including the action $a$ taken in state $s$, are sequentially taken depending on $\pi$, and a sample trajectory of $Q_\pi^T(s,a)$ can be obtained as

$$\dot{Q}_\pi^T(s,a) = \left[ h(s,a) + \sum_{t=1}^{T-1} \gamma^t h(s_t, \pi(s_t)) \,\middle|\, s, a \right]. \tag{7}$$

In practice, the number of simulation replications $n$ is typically limited; thus, $Q_\pi^T(s,a)$ can be estimated from the average of the sample trajectories:

$$\bar{Q}_\pi^T(s,a) = \frac{1}{n} \sum_{l=1}^{n} \dot{Q}_\pi^{T,l}(s,a). \tag{8}$$

When $n$ is large, due to the central limit theorem, it is reasonable to assume that $\bar{Q}_\pi^T(s,a)$ follows a normal distribution with mean $Q_\pi^T(s,a)$ [17].
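To make Equations (7) and (8) concrete, a minimal Monte Carlo sketch is given below. The dictionary-based MDP layout and the function names are illustrative assumptions; the transition probabilities are accessed only by the simulator, not by the agent:

```python
import random

def step(P, s, a):
    """Sample the next state according to the simulator's transition probabilities."""
    states, probs = zip(*P[(s, a)].items())
    return random.choices(states, weights=probs)[0]

def rollout_q(P, r, policy, s, a, gamma, T):
    """One sample trajectory (Eq. 7): take a in s, then follow the base policy for T-1 steps."""
    s_next = step(P, s, a)
    q = r[s_next]                      # h(s, a)
    for t in range(1, T):
        s_next = step(P, s_next, policy[s_next])
        q += (gamma ** t) * r[s_next]  # gamma^t * h(s_t, pi(s_t))
    return q

def q_bar(P, r, policy, s, a, gamma, T, n):
    """Sample mean over n trajectories (Eq. 8)."""
    return sum(rollout_q(P, r, policy, s, a, gamma, T) for _ in range(n)) / n
```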
For each $s \in S$, the SBPI estimates $\bar{Q}_\pi^T(s,a)$ for every available action using many simulation replications and improves $\pi$ by replacing the base action $\pi(s)$ with the estimated best action $a_e$:

$$\pi_{SBPI}(s) = a_e = \arg\max_{a \in \{a_1, a_2, \ldots, a_k\}} \bar{Q}_\pi^T(s,a). \tag{9}$$

To improve $\pi$ exactly using the SBPI, the selection of $a_e$ should be correct (i.e., $a_e = a_b$) at each $s$. From this point of view, the probability of correct selection $P_{CS}$ can be defined according to [14] as

$$P_{CS} = P\left( a_e = a_b \right) = P\left( \tilde{Q}_\pi(s, a_e) \ge \tilde{Q}_\pi(s, a) - \epsilon, \ \forall a \ne a_e \right). \tag{10}$$

Here, $\epsilon \ge 0$ is the tolerance level, and $\tilde{Q}_\pi(s,a)$ is the posterior distribution of $Q_\pi(s,a)$. Increasing $n$ and $T$ for each action can maximize $P_{CS}$, but it causes the efficiency problem of the SBPI, as mentioned earlier.
To solve this problem, the existing methods [14,15] apply OCBA to allocate a given simulation budget $N$ efficiently; i.e.,

$$\arg\max_{n_1, \ldots, n_k, T} P_{CS}, \quad \text{s.t.} \ \sum_{i=1}^{k} n_i T = N, \ n_i \ge 0, \tag{11}$$

where $n_i$ is the number of simulation replications allocated to estimate the Q-value of the $i$th action. Under this definition, the OCBA aims to accurately allocate $N$ to each action so that the best action can be correctly selected with a higher $P_{CS}$, thereby improving the efficiency of SBPI. The allocation rule of OCBA [14] is defined as follows:

$$\frac{n_i}{n_j} = \left( \frac{\sigma_i / (\delta_{e,i}^T + \epsilon - 2c)}{\sigma_j / (\delta_{e,j}^T + \epsilon - 2c)} \right)^{2}, \quad n_e = \sigma_e \sqrt{ \sum_{i=1, i \ne e}^{k} \left( \frac{n_i}{\sigma_i} \right)^{2} }, \quad i \ne j \ne e, \tag{12}$$

where $\delta_{e,i}^T \equiv \bar{Q}_\pi^T(s, a_e) - \bar{Q}_\pi^T(s, a_i)$, $\sigma_i^2$ is the variance of the Q-value for action $a_i$, and $0 \le c \le \epsilon/2$ is a constant determined by $\epsilon$. In practice, $\sigma_i^2$ is unknown in advance and is therefore approximated by the sample variance [13]. In Equation (11), $T$ is a decision variable, which is an important hyperparameter for determining the optimal simulation length while ensuring that the estimate for each action is as close as possible to the estimate with infinite simulation length. To determine the optimal $T$ for each simulation sample, the authors of [14] proposed

$$T = \left\lceil \frac{\log\left( c(1-\gamma)/F \right)}{\log \gamma} \right\rceil, \tag{13}$$

where $\lceil \cdot \rceil$ is the ceiling function and $F$ is the maximum absolute reward, i.e., $\max_{s \in S} |r(s)|$.
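A possible implementation of the allocation rule in Equation (12), as reconstructed above, and the epoch formula in Equation (13) is sketched below. The function names, the treatment of the tolerance term, and the rounding of ratios to integer replication counts are assumptions for illustration, not details taken from [14]:

```python
import math

def epoch_length(c, gamma, F):
    """T = ceil( log(c * (1 - gamma) / F) / log(gamma) )   (Eq. 13)."""
    return math.ceil(math.log(c * (1.0 - gamma) / F) / math.log(gamma))

def ocba_ratios(q_bar, sigma, e, eps, c):
    """Relative allocation ratios following Eq. (12); e indexes the current best action."""
    k = len(q_bar)
    ratios = [0.0] * k
    for i in range(k):
        if i == e:
            continue
        delta = q_bar[e] - q_bar[i]                       # delta_{e,i}
        ratios[i] = (sigma[i] / (delta + eps - 2.0 * c)) ** 2
    # n_e = sigma_e * sqrt( sum_{i != e} (n_i / sigma_i)^2 ); homogeneous, so ratios suffice
    ratios[e] = sigma[e] * math.sqrt(
        sum((ratios[i] / sigma[i]) ** 2 for i in range(k) if i != e)
    )
    return ratios

def allocate(ratios, total_replications):
    """Scale the relative ratios so that they sum (approximately) to the available budget."""
    s = sum(ratios)
    return [max(0, round(total_replications * r / s)) for r in ratios]
```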
To find the optimal policy through SBPI, it is necessary to repeat the SBPI and traverse across states until the policy can no longer be improved. Here, the quality of the policy can be evaluated as

$$V_\pi(s_\alpha) = \lim_{T\to\infty} \mathbb{E}_\pi\!\left[ \sum_{t=0}^{T-1} \gamma^t h(s_t, \pi(s_t)) \,\middle|\, s_\alpha \right], \tag{14}$$

where $V_\pi(s_\alpha)$ is the expected sum of discounted rewards obtained by sequentially taking actions based on $\pi$ from the initial state $s_\alpha$. $V_\pi(s_\alpha)$ is maximized if $\pi$ is the optimal policy $\pi^*$. The iteration number of SBPI is denoted as $m$. When the total simulation budget is given as $B$, the problem of finding $\pi^*$ using SBPI can be defined as

$$\max \ V_{\pi_{PI}^m}(s_\alpha) \quad \text{s.t.} \quad N \cdot m = B, \tag{15}$$

where $\pi_{PI}^m$ represents the policy improved by the $m$th SBPI iteration. If $N$ increases, the existing methods [14,15] can correctly select the best action in each state and improve the policy. However, before $\pi$ converges to $\pi^*$, the selected best action may not be the actual best action. In other words, the best action in the same state may change as the policy is updated, as shown in Equation (9). When $B$ is fixed, increasing $N$ causes insufficient iterations of the SBPI. Hence, regardless of how correctly the best action is selected in each state, the existing methods may not converge to $\pi^*$. Considering this issue, it is necessary to decrease $N$ and increase $m$ to find the optimal policy. However, the existing methods may not be able to correctly select the best action when $N$ is small, since they discard the previous measurements after each update. In the next section, we propose a method of accumulating simulation samples that allows the OCBA to allocate a small $N$ efficiently and select the best action correctly.

3. Proposed Method

Herein, we illustrate the proposed method in two parts. First, we show how to utilize the accumulated samples to estimate the unknown transition probabilities. Second, we utilize the probability estimates to derive the mean and variance of the Q-values so that OCBA can accurately allocate the budget. As defined in Equation (3), if the transition probabilities are known, the Q-value can be unrolled as

$$Q_\pi(s,a) = \sum_{s' \in S_s^a} P_{ss'}^a r(s') + \gamma \sum_{s' \in S_s^a} P_{ss'}^a \sum_{s'' \in S_{s'}^{\pi(s')}} P_{s's''}^{\pi(s')} r(s'') + \cdots + \gamma^{T-1} \sum_{s' \in S_s^a} P_{ss'}^a \cdots \sum_{s^{(T)} \in S_{s^{(T-1)}}^{\pi(s^{(T-1)})}} P_{s^{(T-1)} s^{(T)}}^{\pi(s^{(T-1)})} r(s^{(T)}). \tag{16}$$

The formulation above can be further rewritten in a recursive form by extracting the common factor $\sum_{s' \in S_s^a} P_{ss'}^a$:

$$Q_\pi(s,a) = \sum_{s' \in S_s^a} P_{ss'}^a \left[ r(s') + \gamma Q_\pi(s', \pi(s')) \right]. \tag{17}$$

With Equation (17), $Q_\pi(s,a)$ can be computed for all the action candidates in state $s$, and the best action can be accurately selected using Equation (4). However, the transition probabilities are unknown beforehand. To this end, our method accumulates all the state–action pairs generated by each simulation sample to estimate the unknown transition probabilities.
Let $N_{ss'}^a$ be the cumulative number of times the agent arrives at a state $s'$ when the action $a$ is taken in state $s$. Then, the transition probability can be estimated as

$$\hat{P}_{ss'}^a = \frac{N_{ss'}^a}{\sum_{s_i \in S_s^a} N_{ss_i}^a}. \tag{18}$$

We use a table to store all the counts $N_{ss'}^a$ for estimating and updating the probabilities, as shown in Figure 2. The first column in Figure 2 lists the possible state–action pairs, and the first row lists the reachable states.
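One possible realization of the counting table in Figure 2 and the estimator in Equation (18) is a nested dictionary of cumulative visit counts, updated after every simulated transition. The class and method names below are illustrative assumptions:

```python
from collections import defaultdict

class TransitionTable:
    """Cumulative counts N[(s, a)][s_next] accumulated across all SBPI iterations."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def record(self, s, a, s_next):
        """Store one observed transition (s, a) -> s_next."""
        self.counts[(s, a)][s_next] += 1

    def estimate(self, s, a):
        """P_hat(s_next | s, a) = N[(s,a)][s_next] / sum_i N[(s,a)][s_i]   (Eq. 18)."""
        row = self.counts[(s, a)]
        total = sum(row.values())
        return {s_next: n / total for s_next, n in row.items()} if total else {}
```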
As the simulation progresses, the state–action pairs accumulate, resulting in increasingly accurate estimates of the transition probabilities. We therefore use these estimates to derive the mean and variance of the Q-values, which allows the OCBA to accurately allocate the given simulation budget to each available action based on the sample accumulation. To derive the mean with the estimated probabilities, we substitute Equation (18) into (17):

$$\hat{Q}_\pi(s,a) = \sum_{s' \in S_s^a} \hat{P}_{ss'}^a \left[ r(s') + \gamma \hat{Q}_\pi(s', \pi(s')) \right]. \tag{19}$$

When sufficient samples are accumulated, the probability estimates are approximately equal to the real probability distribution, thereby ensuring that $\hat{Q}_\pi(s,a)$ is an unbiased estimate of the mean of the Q-value. For the variance of the Q-value, we utilize the definition of the variance of a random variable as follows:
$$\sigma_\pi^2(s,a) = \mathbb{E}_\pi\!\left[ \left( h(s,a) + \sum_{t=1}^{T-1} \gamma^t h(s_t, \pi(s_t)) \right)^{2} \,\middle|\, s, a \right] - Q_\pi(s,a)^2. \tag{20}$$

Unrolling the quadratic sum in Equation (20), we then have

$$\sigma_\pi^2(s,a) = \sum_{s' \in S_s^a} P_{ss'}^a r(s')^2 + 2 \sum_{s' \in S_s^a} P_{ss'}^a r(s') \, \mathbb{E}_\pi\!\left[ \sum_{t=1}^{T-1} \gamma^t h(s_t, \pi(s_t)) \,\middle|\, s', \pi(s') \right] + \sum_{s' \in S_s^a} P_{ss'}^a \, \mathbb{E}_\pi\!\left[ \left( \sum_{t=1}^{T-1} \gamma^t h(s_t, \pi(s_t)) \right)^{2} \,\middle|\, s', \pi(s') \right] - Q_\pi(s,a)^2. \tag{21}$$

From the equation above, we can observe that the first term $\mathbb{E}_\pi\!\left[ \sum_{t=1}^{T-1} \gamma^t h(s_t, \pi(s_t)) \right]$ has a form similar to Equation (3). The difference here is that it carries a factor of $\gamma$ from the first state and follows the base policy from the beginning. Further, the second term $\mathbb{E}_\pi\!\left[ \left( \sum_{t=1}^{T-1} \gamma^t h(s_t, \pi(s_t)) \right)^{2} \right]$ has a form similar to the expectation in Equation (20). Thus, we extract $\gamma$ from the first term and $\gamma^2$ from the second term:

$$\sigma_\pi^2(s,a) = \sum_{s' \in S_s^a} P_{ss'}^a r(s')^2 + 2\gamma \sum_{s' \in S_s^a} P_{ss'}^a r(s') \, \mathbb{E}_\pi\!\left[ \sum_{t=1}^{T-1} \gamma^{t-1} h(s_t, \pi(s_t)) \,\middle|\, s', \pi(s') \right] + \gamma^2 \sum_{s' \in S_s^a} P_{ss'}^a \, \mathbb{E}_\pi\!\left[ \left( \sum_{t=1}^{T-1} \gamma^{t-1} h(s_t, \pi(s_t)) \right)^{2} \,\middle|\, s', \pi(s') \right] - Q_\pi(s,a)^2. \tag{22}$$

It is clear that the term $\mathbb{E}_\pi\!\left[ \sum_{t=1}^{T-1} \gamma^{t-1} h(s_t, \pi(s_t)) \right]$ can be substituted by $Q_\pi(s', \pi(s'))$ based on Equation (3), and the term $\mathbb{E}_\pi\!\left[ \left( \sum_{t=1}^{T-1} \gamma^{t-1} h(s_t, \pi(s_t)) \right)^{2} \right]$ can be rewritten as $\sigma_\pi^2(s', \pi(s')) + Q_\pi(s', \pi(s'))^2$ according to Equation (20). Thus, we have

$$\begin{aligned} \sigma_\pi^2(s,a) &= \sum_{s' \in S_s^a} P_{ss'}^a \left[ \gamma^2 Q_\pi(s', \pi(s'))^2 + 2\gamma r(s') Q_\pi(s', \pi(s')) + r(s')^2 \right] + \gamma^2 \sum_{s' \in S_s^a} P_{ss'}^a \sigma_\pi^2(s', \pi(s')) - Q_\pi(s,a)^2 \\ &= \sum_{s' \in S_s^a} P_{ss'}^a \left[ r(s') + \gamma Q_\pi(s', \pi(s')) \right]^{2} + \gamma^2 \sum_{s' \in S_s^a} P_{ss'}^a \sigma_\pi^2(s', \pi(s')) - Q_\pi(s,a)^2. \end{aligned} \tag{23}$$
To simplify the equation above, we rewrite it in a recursive form. Let $R(s')$ be a new reward function:

$$R(s') = \left[ r(s') + \gamma Q_\pi(s', \pi(s')) \right]^{2} - \frac{Q_\pi(s,a)^2}{\sum_{s' \in S_s^a} P_{ss'}^a}. \tag{24}$$

Then, Equation (23) can be rewritten with Equation (24) as

$$\sigma_\pi^2(s,a) = \sum_{s' \in S_s^a} P_{ss'}^a \left[ R(s') + \gamma^2 \sigma_\pi^2(s', \pi(s')) \right]. \tag{25}$$

As the unknown transition probabilities can be calculated from the table, as shown in Figure 2, the variance of the Q-values can be estimated as

$$\hat{\sigma}_\pi^2(s,a) = \sum_{s' \in S_s^a} \hat{P}_{ss'}^a \left[ \hat{R}(s') + \gamma^2 \hat{\sigma}_\pi^2(s', \pi(s')) \right]. \tag{26}$$
Although $\hat{\sigma}_\pi^2(s,a)$ for each action's Q-value may not be accurate at the beginning of the iteration, the allocation rule of the OCBA assigns additional simulation replications to the promising actions as the number of iterations increases by updating the measurements for each action. Thus, the estimation of the probabilities becomes accurate as more samples are accumulated, which results in an accurate estimate of the variance. To estimate $\hat{Q}_\pi(s,a)$ and $\hat{\sigma}_\pi^2(s,a)$, we rewrite them in finite-horizon recursive forms. Since it is not feasible to use an infinite $T$ for the estimation, we use Equation (13) to determine $T$. Then, $\hat{Q}_\pi(s,a)$ and $\hat{\sigma}_\pi^2(s,a)$ are approximated as

$$\hat{Q}_\pi^T(s,a) = \sum_{s' \in S_s^a} \hat{P}_{ss'}^a \left[ r(s') + \gamma \hat{Q}_\pi^{T-1}(s', \pi(s')) \right], \tag{27}$$

$$\hat{\sigma}_\pi^{2,T}(s,a) = \sum_{s' \in S_s^a} \hat{P}_{ss'}^a \left[ \hat{R}^T(s') + \gamma^2 \hat{\sigma}_\pi^{2,T-1}(s', \pi(s')) \right]. \tag{28}$$
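The following sketch shows one way Equations (27) and (28) could be evaluated recursively from the estimated probabilities. The base case at $T = 0$, the callable `P_hat` (e.g., `TransitionTable.estimate` above), and the helper names are assumptions for illustration; memoization could be added to avoid repeated subproblems:

```python
def q_hat(P_hat, r, policy, s, a, gamma, T):
    """Recursive estimate of Q_hat^T(s, a) from Eq. (27); Q_hat^0 is assumed to be 0."""
    if T == 0:
        return 0.0
    return sum(
        p * (r[s_next] + gamma * q_hat(P_hat, r, policy, s_next, policy[s_next], gamma, T - 1))
        for s_next, p in P_hat(s, a).items()
    )

def var_hat(P_hat, r, policy, s, a, gamma, T):
    """Recursive estimate of sigma_hat^{2,T}(s, a) from Eqs. (24) and (28)."""
    if T == 0:
        return 0.0
    q_sa = q_hat(P_hat, r, policy, s, a, gamma, T)
    total_p = sum(P_hat(s, a).values())   # normalization term in Eq. (24)
    var = 0.0
    for s_next, p in P_hat(s, a).items():
        q_next = q_hat(P_hat, r, policy, s_next, policy[s_next], gamma, T - 1)
        R = (r[s_next] + gamma * q_next) ** 2 - q_sa ** 2 / total_p
        var += p * (R + gamma ** 2 * var_hat(P_hat, r, policy, s_next, policy[s_next], gamma, T - 1))
    return var
```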
In the existing methods, the simulation budget allocated to estimate each $\bar{Q}_\pi^T(s,a)$ in a given state is limited to $N$. Thus, when $N$ is small, inaccurate estimates of $\bar{Q}_\pi^T(s,a)$ and the sample variance degrade the effectiveness of OCBA at allocating simulation replications to each action and result in low efficiency of SBPI. In contrast, the proposed method accumulates simulation samples from the previous $m$ updates (i.e., $1 \le m \le N$) to estimate and update the transition probabilities. These probabilities are then used to compute $\hat{Q}_\pi^T(s,a)$ and $\hat{\sigma}_\pi^{2,T}(s,a)$ for the OCBA allocation rule of Equation (12), so that the OCBA can accurately allocate $N$ to each action in each state and help improve the overall efficiency of SBPI. The proposed method is summarized in Algorithm 1.
Algorithm 1 Efficient simulation-based policy improvement with optimal computing budget allocation based on accumulated samples.
Require: a base policy $\pi$, an incremental replication $\triangle$, simulation budget $N$, an initial state $s_\alpha \in S$, and total simulation budget $B$. Initialize the probability table. Set the iteration number of SBPI $m \leftarrow 1$. Determine $T$ using Equation (13).
1:  while $N \cdot m \le B$ do
2:      for $s$ in $S$ do
3:          Initialize $l \leftarrow 0$
4:          if SBPI has never been applied to $s$ then
5:              Set $n_1^l = \cdots = n_k^l = n_0$
6:              Run $n_0$ replications for each $a \in \{a_1, \ldots, a_k\}$, and store $N_{ss'}^a$
7:              Estimate $\hat{P}_{ss'}^a$ by (18) and calculate $\hat{Q}_\pi^T(s,a)$, $\hat{\sigma}_\pi^{2,T}(s,a)$ for each $a$
8:          else
9:              Calculate $\hat{Q}_\pi^T(s,a)$, $\hat{\sigma}_\pi^{2,T}(s,a)$ by $\hat{P}_{ss'}^a$ and
10:             set $n_1^l = \cdots = n_k^l = 0$
11:         end if
12:         Select $a_e \leftarrow \arg\max_{a \in \{a_1, \ldots, a_k\}} \hat{Q}_\pi^T(s,a)$
13:         while $\sum_{i=1}^{k} n_i^l T < N$ do
14:             Increase the simulation budget by $\triangle$
15:             Compute the new allocation $n_1^{l+1}, \ldots, n_k^{l+1}$ with $\hat{Q}_\pi^T(s,a)$ and $\hat{\sigma}_\pi^{2,T}(s,a)$ using (12)
16:             Run additional $\max(0, n_i^{l+1} - n_i^l)$ simulations for each $a$
17:             Update $\hat{P}_{ss'}^a$ and compute $\hat{Q}_\pi^T(s,a)$, $\hat{\sigma}_\pi^{2,T}(s,a)$
18:             Select $a_e \leftarrow \arg\max_{a \in \{a_1, \ldots, a_k\}} \hat{Q}_\pi^T(s,a)$ and set $l \leftarrow l + 1$
19:         end while
20:         return $\pi_{PI}(s) \leftarrow a_e$
21:         if $N \cdot m \ge B$ then
22:             break
23:         else
24:             $m \leftarrow m + 1$
25:         end if
26:     end for
27: end while
In Algorithm 1, lines 12 to 16 show the procedure of OCBA using $\hat{Q}_\pi^T(s,a)$ and $\hat{\sigma}_\pi^{2,T}(s,a)$ to allocate the simulation budget for each action. Note that Algorithm 1 initially allocates $n_0$ simulation replications to each action only when the SBPI has never been applied to that state, whereas the existing methods allocate $n_0$ to each action regardless of whether the state has been visited. The reason is that our method stores the accumulated samples from the previous updates and calculates the prior information for the state (i.e., lines 5 and 6); it therefore does not spend $n_0$ on prior information at the next visit. For SBPI, the estimated best action $a_e$ is considered the best action for a given state and is used to update the base policy. If the base policy is updated as the SBPI proceeds, the previously selected $a_e$ may no longer be the best for the updated policy, even in the same state. Owing to this property of SBPI, existing methods may waste some of the simulation budget spent on previous policy updates when the SBPI is applied to each state to obtain the optimal policy. To avoid this, Algorithm 1 accumulates simulation samples from the previous updates and applies them to compute $\hat{Q}_\pi^T(s,a)$ and $\hat{\sigma}_\pi^{2,T}(s,a)$ to accurately deduce the optimal action in the following iterations. As Algorithm 1 proceeds, the accumulated samples help SBPI obtain the optimal policy with the minimum $m$, thereby reducing the total simulation budget required. This suggests that the proposed method, which accumulates the samples in consideration of this property of SBPI, is effective at improving the overall efficiency of SBPI.
The proposed method can be computationally inefficient owing to the recursive forms of $\hat{Q}_\pi^T(s,a)$ and $\hat{\sigma}_\pi^{2,T}(s,a)$. However, this issue is tolerable compared to the simulation cost of an actual system. For a real-world MDP system (e.g., water resource management), a long duration of time is needed to return a reward. In this respect, the computational time taken by $\hat{Q}_\pi^T(s,a)$ and $\hat{\sigma}_\pi^{2,T}(s,a)$ is negligible relative to the running time required to obtain the reward. Moreover, compared with the existing methods, the proposed method has higher efficiency at finding the optimal policy, as shown in the Experiments section. This indicates that the superior efficiency of the proposed method is sufficient to mitigate its limitation with respect to the recursive calculation.

4. Experiments

Herein, we compare our method with five existing methods (EA, OCBAPI [14], OCBA-S [15], EA-sample accumulation (SA), and OCBAPI-SA) using two MDP models, namely, a two-state example and its extended version. The extended version is used to verify the effectiveness and efficiency of the proposed method in a more complex setting. The descriptions of these methods, together with the proposed OCBAPI-SA2, are summarized in Table 1.
The two-state example was expanded by increasing the available actions in state $s_2$, as shown in Figure 3. The example had two states, and 20 actions were available in each state. Actions $\alpha_i$ and $\beta_i$ were the $i$th elements of the action vectors $A = B = \{0.0, 0.05, \ldots, 0.95\}$. The state transition probabilities were determined by the choice of actions. For example, if the agent selected action $\alpha_2$ in state $s_1$, the probability of remaining in state $s_1$ was 0.05 and the probability of transferring to state $s_2$ was 0.95. The agent always received a reward of 0 when arriving at state $s_1$ but received a reward of 1 when arriving at state $s_2$. The extended version of the two-state example was obtained by increasing the number of states from 2 to 10, as shown in Figure 4. The agent in the extended model received a reward of 5 only when it reached state $s_{10}$.
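Based on this description, the two-state example could be reconstructed as follows. This is an illustrative reconstruction from the text, not the authors' experiment code; in particular, the transition structure in $s_2$ is assumed symmetric to that in $s_1$:

```python
# Two-state example: taking action alpha in s1 keeps the agent in s1 with
# probability alpha and moves it to s2 with probability 1 - alpha
# (the analogous structure for beta in s2 is an assumption).
actions = [round(0.05 * i, 2) for i in range(20)]     # 0.0, 0.05, ..., 0.95

P = {}
for a in actions:
    P[("s1", a)] = {"s1": a, "s2": 1.0 - a}
    P[("s2", a)] = {"s2": a, "s1": 1.0 - a}

r = {"s1": 0.0, "s2": 1.0}            # reward 0 on arriving at s1, 1 on arriving at s2
base_policy = {"s1": 0.5, "s2": 0.5}  # base policy: select action 0.5 in each state
```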
For both examples, the base policy was set to select the action 0.5 in each state, and state $s_1$ was set as the initial state. In the two-state example, we let the discount rate $\gamma = 0.7$ and the tolerance level $\epsilon = 0.1$. Then, $c = \epsilon/2 = 0.05$ and $T = \lceil \log\left(c(1-\gamma)/F\right) / \log\gamma \rceil = 12$, where $F = 1$. The simulation budget for each method in each state was $N = 60T$. The number of iterations of the SBPI was $m = 20$, so that the total simulation budget was $B = N \cdot m = 1200T$. In the extended example, we only changed the tolerance level to $\epsilon = 0.5$ and the iteration number to $m = 100$. For the methods using OCBA, we set $n_0 = 2T$ and the incremental replication $\triangle = 2T$. The value function and $P_{CS}$ of each method were estimated over 5000 independent replicated experiments, and the results are shown in Figure 5. In both examples, the proposed OCBAPI-SA2 converged to the optimal policy faster than the other methods, and the gap increased as the problem complexity increased. All experiments were implemented in Python (version 3.7.9).
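As a quick arithmetic check of the two-state setup above, the following few lines reproduce $T$, $N$, and $B$ (a sketch; the variable names are illustrative):

```python
import math

c, gamma, F = 0.05, 0.7, 1.0
T = math.ceil(math.log(c * (1 - gamma) / F) / math.log(gamma))  # Eq. (13) -> 12
N = 60 * T        # per-state simulation budget: 720 replications
B = N * 20        # total budget for m = 20 iterations: 1200*T = 14400
print(T, N, B)    # -> 12 720 14400
```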
The results of EA-SA, OCBAPI-SA, and OCBAPI-SA2 indicate the effectiveness of sample accumulation in the SBPI. As shown in Figure 5A,C, these methods had superior efficiency to their original versions, i.e., EA and OCBAPI. As $m$ increased, the accumulated samples allowed precise estimates of $\hat{Q}_\pi^T(s,a)$, which enabled the methods to select the best action correctly, as shown in Figure 5B,D. Meanwhile, OCBA-S achieved a higher $P_{CS}$ than EA-SA in the early iterations of SBPI, which can be attributed to its efficient allocation rule and sample path sharing. However, since its sample path sharing was limited to a single state, EA-SA with sample accumulation surpassed it as $m$ increased.
While OCBAPI-SA only used $\hat{Q}_\pi^T(s,a)$ and the sample variance for the OCBA allocation, OCBAPI-SA2 used $\hat{Q}_\pi^T(s,a)$ and $\hat{\sigma}_\pi^{2,T}(s,a)$ evaluated from the accumulated samples. When $N$ was small, OCBAPI-SA could not efficiently allocate $\triangle$ to each action owing to inaccurate sample estimates. In contrast, OCBAPI-SA2 overcame the limited $N$ by accumulating samples from previous updates to estimate $\hat{Q}_\pi^T(s,a)$ and $\hat{\sigma}_\pi^{2,T}(s,a)$. As the SBPI proceeded, OCBAPI-SA2 could efficiently allocate $\triangle$ to the promising actions, resulting in a higher $P_{CS}$ than OCBAPI-SA. Although the gap between them was relatively insignificant in the two-state example owing to the small number of states, it became large in the more complex problem, as shown in Figure 5C,D.

5. Conclusions

In this study, we proposed a method called OCBAPI-SA2 that uses OCBA to improve the overall efficiency of SBPI. Unlike existing methods, OCBAPI-SA2 aims to improve the overall efficiency by considering the state traversal property of SBPI. To achieve this, OCBAPI-SA2 applies SBPI to traverse across states and accumulates the simulation samples to estimate the unknown transition probabilities. It then utilizes these probabilities to compute the mean and variance of the Q-values so that OCBA can efficiently allocate the simulation budget. With the accumulation of samples, OCBAPI-SA2 allows SBPI to obtain the optimal policy with a lower simulation budget, which is important in practice for complex systems with limited budgets. The experimental results demonstrate the superior efficiency of OCBAPI-SA2 compared to the existing methods. Considering the properties of the SBPI, the order in which states are visited for policy improvement also has a significant impact on the required simulation budget. To further improve the overall efficiency of SBPI, our future work will focus on the optimal sequence of state traversal.

Author Contributions

X.H. and S.H.C. conceived the methodology and designed the experiments; X.H. conducted the experiments; X.H. and S.H.C. analyzed the experimental results; X.H. wrote the manuscript; X.H. and S.H.C. reviewed and revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Ewha Womans University Research Grant of 2022.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

1. Hendy, A.S.; Zaky, M.A.; Doha, E.H. On a discrete fractional stochastic Grönwall inequality and its application in the numerical analysis of stochastic FDEs involving a martingale. Int. J. Nonlinear Sci. Numer. Simul. 2021.
2. Hendy, A.S.; Zaky, M.A.; Suragan, D. Discrete fractional stochastic Grönwall inequalities arising in the numerical analysis of multi-term fractional order stochastic differential equations. Math. Comput. Simul. 2022, 193, 269–279.
3. Moghaddam, B.; Mendes Lopes, A.; Tenreiro Machado, J.; Mostaghim, Z. Computational scheme for solving nonlinear fractional stochastic differential equations with delay. Stoch. Anal. Appl. 2019, 37, 893–908.
4. Moghaddam, B.; Zhang, L.; Lopes, A.; Tenreiro Machado, J.; Mostaghim, Z. Sufficient conditions for existence and uniqueness of fractional stochastic delay differential equations. Stochastics 2020, 92, 379–396.
5. Jahanshahi, H.; Jafarzadeh, M.; Sari, N.N.; Pham, V.T.; Huynh, V.V.; Nguyen, X.Q. Robot motion planning in an unknown environment with danger space. Electronics 2019, 8, 201.
6. Tibaldi, M.; Palermo, G.; Pilato, C. Dynamically-Tunable Dataflow Architectures Based on Markov Queuing Models. Electronics 2022, 11, 555.
7. Ouyang, W.; Chen, Z.; Wu, J.; Yu, G.; Zhang, H. Dynamic Task Migration Combining Energy Efficiency and Load Balancing Optimization in Three-Tier UAV-Enabled Mobile Edge Computing System. Electronics 2021, 10, 190.
8. Bertsekas, D.P.; Castanon, D.A. Rollout algorithms for stochastic scheduling problems. J. Heuristics 1999, 5, 89–108.
9. Huang, Q.L.; Jia, Q.S.; Qiu, Z.F.; Guan, X.H.; Deconinck, G. Matching EV charging load with uncertain wind: A simulation-based policy improvement approach. IEEE Trans. Smart Grid 2015, 6, 1425–1433.
10. Sarkale, Y.; Nozhati, S.; Chong, E.K.P.; Ellingwood, B.R.; Mahmoud, H. Solving Markov decision processes for network-level post-hazard recovery via simulation optimization and rollout. In Proceedings of the IEEE 14th International Conference on Automation Science and Engineering, Munich, Germany, 20–24 August 2018; pp. 906–912.
11. Kim, S.H.; Nelson, B.L. A fully sequential procedure for indifference-zone selection in simulation. ACM Trans. Model. Comput. Simul. 2001, 11, 251–273.
12. Choi, S.H.; Kim, T.G. Efficient ranking and selection for stochastic simulation model based on hypothesis test. IEEE Trans. Syst. Man Cybern. Syst. 2017, 48, 1555–1565.
13. Chen, C.H.; Lin, J.W.; Yücesan, E.; Chick, S.E. Simulation budget allocation for further enhancing the efficiency of ordinal optimization. J. Discr. Event Dyn. Syst. Theory Appl. 2000, 10, 251–270.
14. Jia, Q.S. Efficient computing budget allocation for simulation-based policy improvement. IEEE Trans. Autom. Sci. Eng. 2012, 9, 342–352.
15. Wu, D.; Jia, Q.S.; Chen, C.H. Sample path sharing in simulation-based policy improvement. In Proceedings of the IEEE International Conference on Robotics and Automation, Hong Kong, China, 31 May–5 June 2014; pp. 3291–3296.
16. Huang, X.L.; Choi, S.H. A Simulation Sample Accumulation Method for Efficient Simulation-based Policy Improvement in Markov Decision Process. J. Korea Multimed. Soc. 2020, 23, 830–839.
17. DeGroot, M.H. Optimal Statistical Decisions; John Wiley & Sons: Hoboken, NJ, USA, 2005; Volume 82.
Figure 1. An example of MDP, where a circle represents a state and a square represents an action that is available in the state. P represents an unknown state transition probability, a reward is denoted by r, and π is the policy.
Figure 2. Table of the cumulative number of state–action pairs, where 0 represents the unreachable states.
Figure 3. A two-state Markov decision process.
Figure 4. An extended version of the two-state example.
Figure 5. Graphs indicate the value function of the improved policy and $P_{CS}$ of each method for the two examples: (A,B) the results of the two-state example; (C,D) the results of the extended version.
Table 1. Summary of comparison methods.

Method | Description | Allocation Rule | Best Action Selection
EA | Standard SBPI | $n_i = N/k$ | $a_e = \arg\max_a \bar{Q}_\pi^T(s,a)$
OCBAPI | Using OCBA to allocate the simulation budget efficiently | Equation (12) | $a_e = \arg\max_a \bar{Q}_\pi^T(s,a)$
OCBA-S | Improving efficiency of OCBAPI with sample path sharing | Equation (12) | $a_e = \arg\max_a \bar{Q}_\pi^T(s,a)$ ᵃ
EA-SA | Using sample accumulation for EA to select the best action via Equation (27) | $n_i = N/k$ | $a_e = \arg\max_a \hat{Q}_\pi^T(s,a)$
OCBAPI-SA | Using sample accumulation for OCBAPI to select the best action via Equation (27) | Equation (12) | $a_e = \arg\max_a \hat{Q}_\pi^T(s,a)$
OCBAPI-SA2 (Algorithm 1) | Using the estimated mean from Equation (27) and variance from Equation (28) of the Q-value to efficiently allocate the computing budget for OCBAPI-SA | Equation (12) with $\hat{Q}_\pi^T(s,a)$, $\hat{\sigma}_\pi^{2,T}(s,a)$ | $a_e = \arg\max_a \hat{Q}_\pi^T(s,a)$

ᵃ Sample path sharing is used to calculate $\bar{Q}_\pi^{T,S}(s,a)$; see [15] for more details.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
