Article

An Efficient Simulation-Based Policy Improvement with Optimal Computing Budget Allocation Based on Accumulated Samples

1 Department of Artificial Intelligence Convergence, Pukyong National University, Busan 48513, Korea
2 Department of Electronic and Electrical Engineering, Ewha Womans University, Seoul 03760, Korea
* Author to whom correspondence should be addressed.
Electronics 2022, 11(7), 1141; https://doi.org/10.3390/electronics11071141
Submission received: 8 March 2022 / Revised: 29 March 2022 / Accepted: 1 April 2022 / Published: 4 April 2022

Abstract:
Markov decision processes (MDPs) are widely used to model stochastic systems to deduce optimal decision-making policies. As the transition probabilities are usually unknown in MDPs, simulation-based policy improvement (SBPI) using a base policy to derive optimal policies when the state transition probabilities are unknown is suggested. However, estimating the Q-value of each action to determine the best action in each state requires many simulations, which results in efficiency problems for SBPI. In this study, we propose a method to improve the overall efficiency of SBPI using optimal computing budget allocation (OCBA) based on accumulated samples. Previous works have mainly focused on improving SBPI efficiency for a single state and without using the previous simulation samples. In contrast, the proposed method improves the overall efficiency until an optimal policy can be found in consideration of the state traversal property of the SBPI. The proposed method accumulates simulation samples across states to estimate the unknown transition probabilities. These probabilities are then used to estimate the mean and variance of the Q-value for each action, which allows the OCBA to allocate the simulation budget efficiently to find the best action in each state. As the SBPI traverses the state, the accumulated samples allow appropriate allocation of OCBA; thus, the optimal policy can be obtained with a lower budget. The experimental results demonstrate the improved efficiency of the proposed method compared to previous works.

1. Introduction

A Markov decision process (MDP) is a discrete-time stochastic control scheme that aims to solve stochastic decision-making problems. Research on stochastic decision-making problems has been widely reported in numerous fields, such as physics [1,2], finance [3], and biology [4]. Intuitively, decision-making in MDP-based complex systems involves a process of finding an optimal policy. This has been leveraged to simulate real dynamic environments of complex systems to derive optimal solutions (i.e., policies) to predict or improve system performance; see these examples of a robot motion planning system [5], a dataflow system [6], and a mobile edge computing system [7]. The MDP consists of a set of discrete states and a finite set of actions. The MDP policy involves mapping from states to actions. When an action is implemented following the policy in a given state, it is transferred to a new state according to the transition probability and receives a reward, as shown in Figure 1. The objective of the MDP is to find an optimal policy, and the optimal policy consists of the best actions that maximize the expected sum of discounted rewards (i.e., Q-value) in each state. Since the MDP mimics practical management systems, the action space is typically large and transition probabilities are usually not known in advance; thus, directly finding the optimal policy is impractical and time consuming.
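To make these components concrete, the sketch below shows one possible way to encode a small MDP in Python. Every state name, action name, probability, and reward here is an illustrative placeholder, not a model used later in the paper:

```python
# A minimal, illustrative MDP encoding: states, actions, transition
# probabilities (unknown to the agent in practice), and rewards received
# on arrival at a state.
mdp = {
    "states": ["s1", "s2"],
    "actions": {"s1": ["a1", "a2"], "s2": ["a1", "a2"]},
    # P[(s, a)][s_next] = probability of moving from s to s_next when taking a in s
    "P": {
        ("s1", "a1"): {"s1": 0.3, "s2": 0.7},
        ("s1", "a2"): {"s1": 0.6, "s2": 0.4},
        ("s2", "a1"): {"s1": 0.5, "s2": 0.5},
        ("s2", "a2"): {"s1": 0.1, "s2": 0.9},
    },
    # r[s_next] = reward received on arriving at state s_next
    "r": {"s1": 0.0, "s2": 1.0},
}
# A deterministic policy maps each state to an action.
policy = {"s1": "a1", "s2": "a2"}
```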
When it is not feasible to directly determine an optimal policy, a more appropriate approach is to improve upon a given base policy, which is effective and often available in engineering practice. Simulation-based policy improvement (SBPI) (also known as rollout) [8] is a heuristic method that gradually improves the base policy through simulations. In a given state, SBPI estimates the Q-value of each action using simulations and updates the policy in that state with the best action selected according to these values. Owing to its ease of implementation, SBPI has been widely applied to many problems, including electric vehicle charging [9] and post-hazard recovery [10]. However, when the number of available actions in each state is large or the reachable states are numerous, the collected simulation samples may exhibit large variance. Thus, to select the best action accurately, a large simulation budget (i.e., a large number of simulation replications) is required to estimate the Q-value of each action, which results in an efficiency problem.
Ranking and selection (R&S) procedures can be used to address the above problem because they can efficiently select the best action under a limited simulation budget by allocating the budget based on statistical inference. There are various types of R&S procedures, such as indifference-zone [11], uncertainty evaluation [12], and optimal computing budget allocation (OCBA) [13]. Among them, OCBA is used in many fields owing to its excellent efficiency, simplicity, and strong theoretical background. It allocates the simulation budget to asymptotically maximize the lower bound of the probability of correct selection based on the ratio of the sample mean to sample variance. Based on the merits of OCBA, Jia et al. [14] applied it to improve the efficiency of SBPI and efficiently find the best action for a given state. Wu et al. [15] developed a sample path sharing procedure to further improve upon the above work. For a given state, a sample path is obtained by selecting an action and thereafter following the given policy. When the number of sample paths increases, the overlaps between the sample paths generated by different actions allow accurate estimation of the Q-value for each action. They reported that the sample path sharing procedure dramatically improves the efficiency of SBPI compared to the previous methods [14].
To derive the optimal policy, the SBPI should traverse all states until the base policy of each state can no longer be improved. However, the above works focus on improving the efficiency of the SBPI in a single state; i.e., there is room to further improve the efficiency of finding the optimal policy with SBPI. In this work, we propose a method to improve the overall efficiency of SBPI using OCBA and sample accumulation. Specifically, the proposed method accumulates simulation samples across all states to estimate the unknown state transition probabilities. These probabilities are used to estimate the mean and variance of the Q-value for each action. As the SBPI traverses the states, the probabilities become more accurate, thereby enabling precise estimations of the mean and variance. They allow the OCBA to allocate a budget suited to each action and to select the best action for a low cost. Thus, the proposed method can reduce the total budget required to derive the optimal policy with SBPI compared to other methods, which is demonstrated using two MDP examples. The present work was adapted from our previous work [16], which used accumulated samples to estimate the mean of the Q-values. We expanded the use of these samples to further estimate the variance of the Q-values by fully adapting to the OCBA workflow.

2. Problem Definition

Herein, we consider a discrete-time MDP with a discrete state space $S$ and a discrete action space $A$. In the MDP, the policy learner or decision-maker is called the agent. Assume that the agent is at state $s$ and performs an action $a$. It then transits to a new state in the set of reachable states $S_s^a = \{s_1, s_2, s_3\}$, determined by the unknown transition probabilities $P_{ss_1}^a, P_{ss_2}^a, P_{ss_3}^a$. The agent receives one of the numerical rewards $r(s_1), r(s_2), r(s_3)$ based on the state of arrival. We define a random variable $h(s,a)$ whose possible outcomes are the rewards received when arriving at the corresponding state by taking the action $a$:

$$h(s,a) \in \{ r(s_1), r(s_2), r(s_3) \}. \tag{1}$$

If the transition probabilities are known, the expected reward for the action $a$ in state $s$ can be calculated as

$$\mathbb{E}[h(s,a)] = P_{ss_1}^a \cdot r(s_1) + P_{ss_2}^a \cdot r(s_2) + P_{ss_3}^a \cdot r(s_3). \tag{2}$$
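As a worked illustration of Equation (2), and assuming the toy probabilities and rewards sketched above are available (in practice they are not), the expected one-step reward is simply the probability-weighted sum of arrival rewards:

```python
def expected_reward(P_sa: dict, r: dict) -> float:
    """E[h(s, a)] = sum over reachable states s_next of P(s_next | s, a) * r(s_next)."""
    return sum(p * r[s_next] for s_next, p in P_sa.items())

# e.g., for the illustrative pair ("s1", "a1") above:
# expected_reward(mdp["P"][("s1", "a1")], mdp["r"])  ->  0.3*0.0 + 0.7*1.0 = 0.7
```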
Now, we formulate the policy improvement problem. In this study, we only consider a deterministic stationary policy $\pi$ (i.e., a mapping from $S$ to $A$), which serves as a guideline for the agent on the action that should be taken in a particular state. Assume that there exists a base policy $\pi$. For a given state $s$, if an action $a$ is taken and the base policy $\pi$ is followed afterward, the Q-value of the action $a \in A$ can be defined as

$$Q_\pi(s,a) = \lim_{T\to\infty} \mathbb{E}\!\left[ h(s,a) + \mathbb{E}_\pi\!\left[ \sum_{t=1}^{T-1} \gamma^t h(s_t, \pi(s_t)) \right] \,\middle|\, s, a \right], \tag{3}$$

where $T$ is the terminal time index, $\gamma \in (0,1)$ is the discount rate, and $s_t$ is one of the reachable states at time $t$ (i.e., $s_t \in S_{s_{t-1}}^{\pi(s_{t-1})}$). $\mathbb{E}_\pi$ indicates the expected sum of discounted rewards obtained by taking actions following the given policy $\pi$ from time $1$ to $T-1$. Using the definition above, the policy improvement at state $s$ can be defined as

$$\pi_{PI}(s) = a_b = \arg\max_{a \in \{a_1, a_2, \ldots, a_k\}} Q_\pi(s,a), \tag{4}$$

where $\pi_{PI}$ is the policy improved from $\pi$ by updating the previous action $\pi(s)$ with the best action $a_b$ in $s$. In practice, the transition probabilities are usually unknown, and $Q_\pi(s,a)$ cannot be calculated directly. Thus, Equation (3) can only be calculated using an infinite number of simulation replications $n$; i.e.,

$$Q_\pi(s,a) = \lim_{T\to\infty} \lim_{n\to\infty} \frac{1}{n} \sum_{l=1}^{n} \left[ h(s,a) + \sum_{t=1}^{T-1} \gamma^t h(s_t, \pi(s_t)) \,\middle|\, s, a \right]. \tag{5}$$
Since it is practically infeasible to take actions infinitely in a simulation replication, $T$ becomes a decision variable called the epoch, and Equation (5) is approximated by

$$Q_\pi^T(s,a) = \lim_{n\to\infty} \frac{1}{n} \sum_{l=1}^{n} \left[ h(s,a) + \sum_{t=1}^{T-1} \gamma^t h(s_t, \pi(s_t)) \,\middle|\, s, a \right]. \tag{6}$$

That is, in a single replication, $T$ actions, including the action $a$ taken in state $s$, are sequentially taken depending on $\pi$, and a sample trajectory of $Q_\pi^T(s,a)$ can be obtained as

$$\dot{Q}_\pi^T(s,a) = \left[ h(s,a) + \sum_{t=1}^{T-1} \gamma^t h(s_t, \pi(s_t)) \,\middle|\, s, a \right]. \tag{7}$$

In practice, the number of simulation replications $n$ is typically limited; thus, $Q_\pi^T(s,a)$ can be estimated from the average of the sample trajectories:

$$\bar{Q}_\pi^T(s,a) = \frac{1}{n} \sum_{l=1}^{n} \dot{Q}_\pi^{T,l}(s,a). \tag{8}$$

When $n$ is large, due to the central limit theorem, it is reasonable to assume that $\bar{Q}_\pi^T(s,a)$ follows a normal distribution with mean $Q_\pi^T(s,a)$ [17].
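To make Equations (7) and (8) concrete, a minimal Monte Carlo sketch is given below. The dictionary-based MDP layout and the function names are illustrative assumptions; the transition probabilities are accessed only by the simulator, not by the agent:

```python
import random

def step(P, s, a):
    """Sample the next state according to the simulator's transition probabilities."""
    states, probs = zip(*P[(s, a)].items())
    return random.choices(states, weights=probs)[0]

def rollout_q(P, r, policy, s, a, gamma, T):
    """One sample trajectory (Eq. 7): take a in s, then follow the base policy for T-1 steps."""
    s_next = step(P, s, a)
    q = r[s_next]                      # h(s, a)
    for t in range(1, T):
        s_next = step(P, s_next, policy[s_next])
        q += (gamma ** t) * r[s_next]  # gamma^t * h(s_t, pi(s_t))
    return q

def q_bar(P, r, policy, s, a, gamma, T, n):
    """Sample mean over n trajectories (Eq. 8)."""
    return sum(rollout_q(P, r, policy, s, a, gamma, T) for _ in range(n)) / n
```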
For each $s \in S$, the SBPI estimates $\bar{Q}_\pi^T(s,a)$ for every available action using many simulation replications and improves $\pi$ by replacing the base action $\pi(s)$ with the estimated best action $a_e$:

$$\pi_{SBPI}(s) = a_e = \arg\max_{a \in \{a_1, a_2, \ldots, a_k\}} \bar{Q}_\pi^T(s,a). \tag{9}$$

To improve $\pi$ exactly using the SBPI, the selection of $a_e$ should be correct (i.e., $a_e = a_b$) at each $s$. From this point of view, the probability of correct selection $P_{CS}$ can be defined according to [14] as

$$P_{CS} = P\left( a_e = a_b \right) = P\left( \tilde{Q}_\pi(s, a_e) \ge \tilde{Q}_\pi(s, a) - \epsilon, \ \forall a \ne a_e \right). \tag{10}$$

Here, $\epsilon \ge 0$ is the tolerance level, and $\tilde{Q}_\pi(s,a)$ is the posterior distribution of $Q_\pi(s,a)$. Increasing $n$ and $T$ for each action can maximize $P_{CS}$, but it causes the efficiency problem of the SBPI, as mentioned earlier.
To solve this problem, the existing methods [14,15] apply OCBA to allocate a given simulation budget $N$ efficiently; i.e.,

$$\arg\max_{n_1, \ldots, n_k, T} P_{CS}, \quad \text{s.t.} \ \sum_{i=1}^{k} n_i T = N, \ n_i \ge 0, \tag{11}$$

where $n_i$ is the number of simulation replications allocated to estimate the Q-value of the $i$th action. Under this definition, the OCBA aims to accurately allocate $N$ to each action so that the best action can be correctly selected with a higher $P_{CS}$, thereby improving the efficiency of SBPI. The allocation rule of OCBA [14] is defined as follows:

$$\frac{n_i}{n_j} = \left( \frac{\sigma_i / (\delta_{e,i}^T + \epsilon - 2c)}{\sigma_j / (\delta_{e,j}^T + \epsilon - 2c)} \right)^{2}, \quad n_e = \sigma_e \sqrt{ \sum_{i=1, i \ne e}^{k} \left( \frac{n_i}{\sigma_i} \right)^{2} }, \quad i \ne j \ne e, \tag{12}$$

where $\delta_{e,i}^T \equiv \bar{Q}_\pi^T(s, a_e) - \bar{Q}_\pi^T(s, a_i)$, $\sigma_i^2$ is the variance of the Q-value for action $a_i$, and $0 \le c \le \epsilon/2$ is a constant determined by $\epsilon$. In practice, $\sigma_i^2$ is unknown in advance and is therefore approximated by the sample variance [13]. In Equation (11), $T$ is a decision variable, which is an important hyperparameter for determining the optimal simulation length while ensuring that the estimate for each action is as close as possible to the estimate with infinite simulation length. To determine the optimal $T$ for each simulation sample, the authors of [14] proposed

$$T = \left\lceil \frac{\log\left( c(1-\gamma)/F \right)}{\log \gamma} \right\rceil, \tag{13}$$

where $\lceil \cdot \rceil$ is the ceiling function and $F$ is the maximum absolute reward, i.e., $\max_{s \in S} |r(s)|$.
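A possible implementation of the allocation rule in Equation (12), as reconstructed above, and the epoch formula in Equation (13) is sketched below. The function names, the treatment of the tolerance term, and the rounding of ratios to integer replication counts are assumptions for illustration, not details taken from [14]:

```python
import math

def epoch_length(c, gamma, F):
    """T = ceil( log(c * (1 - gamma) / F) / log(gamma) )   (Eq. 13)."""
    return math.ceil(math.log(c * (1.0 - gamma) / F) / math.log(gamma))

def ocba_ratios(q_bar, sigma, e, eps, c):
    """Relative allocation ratios following Eq. (12); e indexes the current best action."""
    k = len(q_bar)
    ratios = [0.0] * k
    for i in range(k):
        if i == e:
            continue
        delta = q_bar[e] - q_bar[i]                       # delta_{e,i}
        ratios[i] = (sigma[i] / (delta + eps - 2.0 * c)) ** 2
    # n_e = sigma_e * sqrt( sum_{i != e} (n_i / sigma_i)^2 ); homogeneous, so ratios suffice
    ratios[e] = sigma[e] * math.sqrt(
        sum((ratios[i] / sigma[i]) ** 2 for i in range(k) if i != e)
    )
    return ratios

def allocate(ratios, total_replications):
    """Scale the relative ratios so that they sum (approximately) to the available budget."""
    s = sum(ratios)
    return [max(0, round(total_replications * r / s)) for r in ratios]
```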
To find the optimal policy through SBPI, it is necessary to repeat the SBPI and traverse across states until the policy can no longer be improved. Here, the quality of the policy can be evaluated as

$$V_\pi(s_\alpha) = \lim_{T\to\infty} \mathbb{E}_\pi\!\left[ \sum_{t=0}^{T-1} \gamma^t h(s_t, \pi(s_t)) \,\middle|\, s_\alpha \right], \tag{14}$$

where $V_\pi(s_\alpha)$ is the expected sum of discounted rewards obtained by sequentially taking actions based on $\pi$ from the initial state $s_\alpha$. $V_\pi(s_\alpha)$ is maximized if $\pi$ is the optimal policy $\pi^*$. The iteration number of SBPI is denoted as $m$. When the total simulation budget is given as $B$, the problem of finding $\pi^*$ using SBPI can be defined as

$$\max \ V_{\pi_{PI}^m}(s_\alpha) \quad \text{s.t.} \quad N \cdot m = B, \tag{15}$$

where $\pi_{PI}^m$ represents the policy improved by the $m$th SBPI iteration. If $N$ increases, the existing methods [14,15] can correctly select the best action in each state and improve the policy. However, before $\pi$ converges to $\pi^*$, the selected best action may not be the actual best action. In other words, the best action in the same state may change as the policy is updated, as shown in Equation (9). When $B$ is fixed, increasing $N$ causes insufficient iterations of the SBPI. Hence, regardless of how correctly the best action is selected in each state, the existing methods may not converge to $\pi^*$. Considering this issue, it is necessary to decrease $N$ and increase $m$ to find the optimal policy. However, the existing methods may not be able to correctly select the best action when $N$ is small, since they discard the previous measurements after each update. In the next section, we propose a method of accumulating simulation samples that allows the OCBA to allocate a small $N$ efficiently and select the best action correctly.

3. Proposed Method

Herein, we illustrate the proposed method in two parts. First, we show how to utilize the accumulated samples to estimate the unknown transition probabilities. Second, we utilize the probability estimates to derive the mean and variance of the Q-values so that OCBA can accurately allocate the budget. As defined in Equation (3), if the transition probabilities are known, the Q-value can be unrolled as

$$Q_\pi(s,a) = \sum_{s' \in S_s^a} P_{ss'}^a r(s') + \gamma \sum_{s' \in S_s^a} P_{ss'}^a \sum_{s'' \in S_{s'}^{\pi(s')}} P_{s's''}^{\pi(s')} r(s'') + \cdots + \gamma^{T-1} \sum_{s' \in S_s^a} P_{ss'}^a \cdots \sum_{s^{(T)} \in S_{s^{(T-1)}}^{\pi(s^{(T-1)})}} P_{s^{(T-1)} s^{(T)}}^{\pi(s^{(T-1)})} r(s^{(T)}). \tag{16}$$

The formulation above can be further rewritten in a recursive form by extracting the common factor $\sum_{s' \in S_s^a} P_{ss'}^a$:

$$Q_\pi(s,a) = \sum_{s' \in S_s^a} P_{ss'}^a \left[ r(s') + \gamma Q_\pi(s', \pi(s')) \right]. \tag{17}$$

With Equation (17), $Q_\pi(s,a)$ can be computed for all the action candidates in state $s$, and the best action can be accurately selected using Equation (4). However, the transition probabilities are unknown beforehand. To this end, our method accumulates all the state–action pairs generated by each simulation sample to estimate the unknown transition probabilities.
Let $N_{ss'}^a$ be the cumulative number of times the agent arrives at a state $s'$ when the action $a$ is taken in state $s$. Then, the transition probability can be estimated as

$$\hat{P}_{ss'}^a = \frac{N_{ss'}^a}{\sum_{s_i \in S_s^a} N_{ss_i}^a}. \tag{18}$$

We use a table to store all the counts $N_{ss'}^a$ for estimating and updating the probabilities, as shown in Figure 2. The first column in Figure 2 lists the possible state–action pairs, and the first row lists the reachable states.
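One possible realization of the counting table in Figure 2 and the estimator in Equation (18) is a nested dictionary of cumulative visit counts, updated after every simulated transition. The class and method names below are illustrative assumptions:

```python
from collections import defaultdict

class TransitionTable:
    """Cumulative counts N[(s, a)][s_next] accumulated across all SBPI iterations."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def record(self, s, a, s_next):
        """Store one observed transition (s, a) -> s_next."""
        self.counts[(s, a)][s_next] += 1

    def estimate(self, s, a):
        """P_hat(s_next | s, a) = N[(s,a)][s_next] / sum_i N[(s,a)][s_i]   (Eq. 18)."""
        row = self.counts[(s, a)]
        total = sum(row.values())
        return {s_next: n / total for s_next, n in row.items()} if total else {}
```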
As the simulation progresses, the state–action pairs accumulate, resulting in increasingly accurate estimates of the transition probabilities. We therefore use these estimates to derive the mean and variance of the Q-values, which allows the OCBA to accurately allocate the given simulation budget to each available action based on the sample accumulation. To derive the mean with the estimated probabilities, we substitute Equation (18) into (17):

$$\hat{Q}_\pi(s,a) = \sum_{s' \in S_s^a} \hat{P}_{ss'}^a \left[ r(s') + \gamma \hat{Q}_\pi(s', \pi(s')) \right]. \tag{19}$$

When sufficient samples are accumulated, the probability estimates are approximately equal to the real probability distribution, thereby ensuring that $\hat{Q}_\pi(s,a)$ is an unbiased estimate of the mean of the Q-value. For the variance of the Q-value, we utilize the definition of the variance of a random variable as follows:
$$\sigma_\pi^2(s,a) = \mathbb{E}_\pi\!\left[ \left( h(s,a) + \sum_{t=1}^{T-1} \gamma^t h(s_t, \pi(s_t)) \right)^{2} \,\middle|\, s, a \right] - Q_\pi(s,a)^2. \tag{20}$$

Unrolling the quadratic sum in Equation (20), we then have

$$\sigma_\pi^2(s,a) = \sum_{s' \in S_s^a} P_{ss'}^a r(s')^2 + 2 \sum_{s' \in S_s^a} P_{ss'}^a r(s') \, \mathbb{E}_\pi\!\left[ \sum_{t=1}^{T-1} \gamma^t h(s_t, \pi(s_t)) \,\middle|\, s', \pi(s') \right] + \sum_{s' \in S_s^a} P_{ss'}^a \, \mathbb{E}_\pi\!\left[ \left( \sum_{t=1}^{T-1} \gamma^t h(s_t, \pi(s_t)) \right)^{2} \,\middle|\, s', \pi(s') \right] - Q_\pi(s,a)^2. \tag{21}$$

From the equation above, we can observe that the first term $\mathbb{E}_\pi\!\left[ \sum_{t=1}^{T-1} \gamma^t h(s_t, \pi(s_t)) \right]$ has a form similar to Equation (3). The difference here is that it carries a factor of $\gamma$ from the first state and follows the base policy from the beginning. Further, the second term $\mathbb{E}_\pi\!\left[ \left( \sum_{t=1}^{T-1} \gamma^t h(s_t, \pi(s_t)) \right)^{2} \right]$ has a form similar to the expectation in Equation (20). Thus, we extract $\gamma$ from the first term and $\gamma^2$ from the second term:

$$\sigma_\pi^2(s,a) = \sum_{s' \in S_s^a} P_{ss'}^a r(s')^2 + 2\gamma \sum_{s' \in S_s^a} P_{ss'}^a r(s') \, \mathbb{E}_\pi\!\left[ \sum_{t=1}^{T-1} \gamma^{t-1} h(s_t, \pi(s_t)) \,\middle|\, s', \pi(s') \right] + \gamma^2 \sum_{s' \in S_s^a} P_{ss'}^a \, \mathbb{E}_\pi\!\left[ \left( \sum_{t=1}^{T-1} \gamma^{t-1} h(s_t, \pi(s_t)) \right)^{2} \,\middle|\, s', \pi(s') \right] - Q_\pi(s,a)^2. \tag{22}$$

It is clear that the term $\mathbb{E}_\pi\!\left[ \sum_{t=1}^{T-1} \gamma^{t-1} h(s_t, \pi(s_t)) \right]$ can be substituted by $Q_\pi(s', \pi(s'))$ based on Equation (3), and the term $\mathbb{E}_\pi\!\left[ \left( \sum_{t=1}^{T-1} \gamma^{t-1} h(s_t, \pi(s_t)) \right)^{2} \right]$ can be rewritten as $\sigma_\pi^2(s', \pi(s')) + Q_\pi(s', \pi(s'))^2$ according to Equation (20). Thus, we have

$$\begin{aligned} \sigma_\pi^2(s,a) &= \sum_{s' \in S_s^a} P_{ss'}^a \left[ \gamma^2 Q_\pi(s', \pi(s'))^2 + 2\gamma r(s') Q_\pi(s', \pi(s')) + r(s')^2 \right] + \gamma^2 \sum_{s' \in S_s^a} P_{ss'}^a \sigma_\pi^2(s', \pi(s')) - Q_\pi(s,a)^2 \\ &= \sum_{s' \in S_s^a} P_{ss'}^a \left[ r(s') + \gamma Q_\pi(s', \pi(s')) \right]^{2} + \gamma^2 \sum_{s' \in S_s^a} P_{ss'}^a \sigma_\pi^2(s', \pi(s')) - Q_\pi(s,a)^2. \end{aligned} \tag{23}$$
To simplify the equation above, we rewrite it in a recursive form. Let $R(s')$ be a new reward function:

$$R(s') = \left[ r(s') + \gamma Q_\pi(s', \pi(s')) \right]^{2} - \frac{Q_\pi(s,a)^2}{\sum_{s' \in S_s^a} P_{ss'}^a}. \tag{24}$$

Then, Equation (23) can be rewritten with Equation (24) as

$$\sigma_\pi^2(s,a) = \sum_{s' \in S_s^a} P_{ss'}^a \left[ R(s') + \gamma^2 \sigma_\pi^2(s', \pi(s')) \right]. \tag{25}$$

As the unknown transition probabilities can be calculated from the table, as shown in Figure 2, the variance of the Q-values can be estimated as

$$\hat{\sigma}_\pi^2(s,a) = \sum_{s' \in S_s^a} \hat{P}_{ss'}^a \left[ \hat{R}(s') + \gamma^2 \hat{\sigma}_\pi^2(s', \pi(s')) \right]. \tag{26}$$
Although $\hat{\sigma}_\pi^2(s,a)$ for each action's Q-value may not be accurate at the beginning of the iteration, the allocation rule of the OCBA assigns additional simulation replications to the promising actions as the number of iterations increases by updating the measurements for each action. Thus, the estimation of the probabilities becomes accurate as more samples are accumulated, which results in an accurate estimate of the variance. To estimate $\hat{Q}_\pi(s,a)$ and $\hat{\sigma}_\pi^2(s,a)$, we rewrite them in finite-horizon recursive forms. Since it is not feasible to use an infinite $T$ for the estimation, we use Equation (13) to determine $T$. Then, $\hat{Q}_\pi(s,a)$ and $\hat{\sigma}_\pi^2(s,a)$ are approximated as

$$\hat{Q}_\pi^T(s,a) = \sum_{s' \in S_s^a} \hat{P}_{ss'}^a \left[ r(s') + \gamma \hat{Q}_\pi^{T-1}(s', \pi(s')) \right], \tag{27}$$

$$\hat{\sigma}_\pi^{2,T}(s,a) = \sum_{s' \in S_s^a} \hat{P}_{ss'}^a \left[ \hat{R}^T(s') + \gamma^2 \hat{\sigma}_\pi^{2,T-1}(s', \pi(s')) \right]. \tag{28}$$
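The following sketch shows one way Equations (27) and (28) could be evaluated recursively from the estimated probabilities. The base case at $T = 0$, the callable `P_hat` (e.g., `TransitionTable.estimate` above), and the helper names are assumptions for illustration; memoization could be added to avoid repeated subproblems:

```python
def q_hat(P_hat, r, policy, s, a, gamma, T):
    """Recursive estimate of Q_hat^T(s, a) from Eq. (27); Q_hat^0 is assumed to be 0."""
    if T == 0:
        return 0.0
    return sum(
        p * (r[s_next] + gamma * q_hat(P_hat, r, policy, s_next, policy[s_next], gamma, T - 1))
        for s_next, p in P_hat(s, a).items()
    )

def var_hat(P_hat, r, policy, s, a, gamma, T):
    """Recursive estimate of sigma_hat^{2,T}(s, a) from Eqs. (24) and (28)."""
    if T == 0:
        return 0.0
    q_sa = q_hat(P_hat, r, policy, s, a, gamma, T)
    total_p = sum(P_hat(s, a).values())   # normalization term in Eq. (24)
    var = 0.0
    for s_next, p in P_hat(s, a).items():
        q_next = q_hat(P_hat, r, policy, s_next, policy[s_next], gamma, T - 1)
        R = (r[s_next] + gamma * q_next) ** 2 - q_sa ** 2 / total_p
        var += p * (R + gamma ** 2 * var_hat(P_hat, r, policy, s_next, policy[s_next], gamma, T - 1))
    return var
```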
In the existing methods, the simulation budget allocated to estimate each $\bar{Q}_\pi^T(s,a)$ in a given state is limited to $N$. Thus, when $N$ is small, inaccurate estimates of $\bar{Q}_\pi^T(s,a)$ and the sample variance degrade the effectiveness of OCBA at allocating simulation replications to each action and result in low efficiency of SBPI. In contrast, the proposed method accumulates simulation samples from the previous $m$ updates (i.e., $1 \le m \le N$) to estimate and update the transition probabilities. These probabilities are then used to compute $\hat{Q}_\pi^T(s,a)$ and $\hat{\sigma}_\pi^{2,T}(s,a)$ for the OCBA allocation rule of Equation (12), so that the OCBA can accurately allocate $N$ to each action in each state and help improve the overall efficiency of SBPI. The proposed method is summarized in Algorithm 1.
Algorithm 1 Efficient simulation-based policy improvement with optimal computing budget allocation based on accumulated samples.
Require: a base policy $\pi$, an incremental replication $\triangle$, simulation budget $N$, an initial state $s_\alpha \in S$, and total simulation budget $B$. Initialize the probability table. Set the iteration number of SBPI $m \leftarrow 1$. Determine $T$ using Equation (13).
1:  while $N \cdot m \le B$ do
2:      for $s$ in $S$ do
3:          Initialize $l \leftarrow 0$
4:          if SBPI has never been applied to $s$ then
5:              Set $n_1^l = \cdots = n_k^l = n_0$
6:              Run $n_0$ replications for each $a \in \{a_1, \ldots, a_k\}$, and store $N_{ss'}^a$
7:              Estimate $\hat{P}_{ss'}^a$ by (18) and calculate $\hat{Q}_\pi^T(s,a)$, $\hat{\sigma}_\pi^{2,T}(s,a)$ for each $a$
8:          else
9:              Calculate $\hat{Q}_\pi^T(s,a)$, $\hat{\sigma}_\pi^{2,T}(s,a)$ by $\hat{P}_{ss'}^a$ and
10:             set $n_1^l = \cdots = n_k^l = 0$
11:         end if
12:         Select $a_e \leftarrow \arg\max_{a \in \{a_1, \ldots, a_k\}} \hat{Q}_\pi^T(s,a)$
13:         while $\sum_{i=1}^{k} n_i^l T < N$ do
14:             Increase the simulation budget by $\triangle$
15:             Compute the new allocation $n_1^{l+1}, \ldots, n_k^{l+1}$ with $\hat{Q}_\pi^T(s,a)$ and $\hat{\sigma}_\pi^{2,T}(s,a)$ using (12)
16:             Run additional $\max(0, n_i^{l+1} - n_i^l)$ simulations for each $a$
17:             Update $\hat{P}_{ss'}^a$ and compute $\hat{Q}_\pi^T(s,a)$, $\hat{\sigma}_\pi^{2,T}(s,a)$
18:             Select $a_e \leftarrow \arg\max_{a \in \{a_1, \ldots, a_k\}} \hat{Q}_\pi^T(s,a)$ and set $l \leftarrow l + 1$
19:         end while
20:         return $\pi_{PI}(s) \leftarrow a_e$
21:         if $N \cdot m \ge B$ then
22:             break
23:         else
24:             $m \leftarrow m + 1$
25:         end if
26:     end for
27: end while
In Algorithm 1, lines 12 to 16 show the procedure of OCBA using $\hat{Q}_\pi^T(s,a)$ and $\hat{\sigma}_\pi^{2,T}(s,a)$ to allocate the simulation budget for each action. Note that Algorithm 1 initially allocates $n_0$ simulation replications to each action only when the SBPI has never been applied to that state, whereas the existing methods allocate $n_0$ to each action regardless of whether the state has been visited. The reason is that our method stores the accumulated samples from the previous updates and calculates the prior information for the state (i.e., lines 5 and 6); it therefore does not spend $n_0$ on prior information at the next visit. For SBPI, the estimated best action $a_e$ is considered the best action for a given state and is used to update the base policy. If the base policy is updated as the SBPI proceeds, the previously selected $a_e$ may no longer be the best for the updated policy, even in the same state. Owing to this property of SBPI, existing methods may waste some of the simulation budget spent on previous policy updates when the SBPI is applied to each state to obtain the optimal policy. To avoid this, Algorithm 1 accumulates simulation samples from the previous updates and applies them to compute $\hat{Q}_\pi^T(s,a)$ and $\hat{\sigma}_\pi^{2,T}(s,a)$ to accurately deduce the optimal action in the following iterations. As Algorithm 1 proceeds, the accumulated samples help SBPI obtain the optimal policy with the minimum $m$, thereby reducing the total simulation budget required. This suggests that the proposed method, which accumulates the samples in consideration of this property of SBPI, is effective at improving the overall efficiency of SBPI.
The proposed method can be computationally inefficient owing to the recursive forms of $\hat{Q}_\pi^T(s,a)$ and $\hat{\sigma}_\pi^{2,T}(s,a)$. However, this issue is tolerable compared to the simulation cost of an actual system. For a real-world MDP system (e.g., water resource management), a long duration of time is needed to return a reward. In this respect, the computational time taken by $\hat{Q}_\pi^T(s,a)$ and $\hat{\sigma}_\pi^{2,T}(s,a)$ is negligible relative to the running time required to obtain the reward. Moreover, compared with the existing methods, the proposed method has higher efficiency at finding the optimal policy, as shown in the Experiments section. This indicates that the superior efficiency of the proposed method is sufficient to mitigate its limitation with respect to the recursive calculation.

4. Experiments

Herein, we compare our method with five existing methods (EA, OCBAPI [14], OCBA-S [15], EA-sample accumulation (SA), and OCBAPI-SA) using two MDP models, namely, a two-state example and its extended version. The extended version is used to verify the effectiveness and efficiency of the proposed method in a more complex setting. The descriptions of these methods, together with the proposed OCBAPI-SA2, are summarized in Table 1.
The two-state example was expanded by increasing the available actions in state $s_2$, as shown in Figure 3. The example had two states, and 20 actions were available in each state. Actions $\alpha_i$ and $\beta_i$ were the $i$th elements of the action vectors $A = B = \{0.0, 0.05, \ldots, 0.95\}$. The state transition probabilities were determined by the choice of actions. For example, if the agent selected action $\alpha_2$ in state $s_1$, the probability of remaining in state $s_1$ was 0.05 and the probability of transferring to state $s_2$ was 0.95. The agent always received a reward of 0 when arriving at state $s_1$ but received a reward of 1 when arriving at state $s_2$. The extended version of the two-state example was obtained by increasing the number of states from 2 to 10, as shown in Figure 4. The agent in the extended model received a reward of 5 only when it reached state $s_{10}$.
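Based on this description, the two-state example could be reconstructed as follows. This is an illustrative reconstruction from the text, not the authors' experiment code; in particular, the transition structure in $s_2$ is assumed symmetric to that in $s_1$:

```python
# Two-state example: taking action alpha in s1 keeps the agent in s1 with
# probability alpha and moves it to s2 with probability 1 - alpha
# (the analogous structure for beta in s2 is an assumption).
actions = [round(0.05 * i, 2) for i in range(20)]     # 0.0, 0.05, ..., 0.95

P = {}
for a in actions:
    P[("s1", a)] = {"s1": a, "s2": 1.0 - a}
    P[("s2", a)] = {"s2": a, "s1": 1.0 - a}

r = {"s1": 0.0, "s2": 1.0}            # reward 0 on arriving at s1, 1 on arriving at s2
base_policy = {"s1": 0.5, "s2": 0.5}  # base policy: select action 0.5 in each state
```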
For both examples, the base policy was set to select the action 0.5 in each state, and state $s_1$ was set as the initial state. In the two-state example, we let the discount rate $\gamma = 0.7$ and the tolerance level $\epsilon = 0.1$. Then, $c = \epsilon/2 = 0.05$ and $T = \lceil \log\left(c(1-\gamma)/F\right) / \log\gamma \rceil = 12$, where $F = 1$. The simulation budget for each method in each state was $N = 60T$. The number of iterations of the SBPI was $m = 20$, so that the total simulation budget was $B = N \cdot m = 1200T$. In the extended example, we only changed the tolerance level to $\epsilon = 0.5$ and the iteration number to $m = 100$. For the methods using OCBA, we set $n_0 = 2T$ and the incremental replication $\triangle = 2T$. The value function and $P_{CS}$ of each method were estimated over 5000 independent replicated experiments, and the results are shown in Figure 5. In both examples, the proposed OCBAPI-SA2 converged to the optimal policy faster than the other methods, and the gap increased as the problem complexity increased. All experiments were implemented in Python (version 3.7.9).
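As a quick arithmetic check of the two-state setup above, the following few lines reproduce $T$, $N$, and $B$ (a sketch; the variable names are illustrative):

```python
import math

c, gamma, F = 0.05, 0.7, 1.0
T = math.ceil(math.log(c * (1 - gamma) / F) / math.log(gamma))  # Eq. (13) -> 12
N = 60 * T        # per-state simulation budget: 720 replications
B = N * 20        # total budget for m = 20 iterations: 1200*T = 14400
print(T, N, B)    # -> 12 720 14400
```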
The results of EA-SA, OCBAPI-SA, and OCBAPI-SA2 indicate the effectiveness of sample accumulation in the SBPI. As shown in Figure 5A,C, these methods had superior efficiency to their original versions, i.e., EA and OCBAPI. As $m$ increased, the accumulated samples allowed precise estimates of $\hat{Q}_\pi^T(s,a)$, which enabled the methods to select the best action correctly, as shown in Figure 5B,D. Meanwhile, OCBA-S achieved a higher $P_{CS}$ than EA-SA in the early iterations of SBPI, which can be attributed to its efficient allocation rule and sample path sharing. However, since its sample path sharing was limited to a single state, EA-SA with sample accumulation surpassed it as $m$ increased.
While OCBAPI-SA only used $\hat{Q}_\pi^T(s,a)$ and the sample variance for the OCBA allocation, OCBAPI-SA2 used $\hat{Q}_\pi^T(s,a)$ and $\hat{\sigma}_\pi^{2,T}(s,a)$ evaluated from the accumulated samples. When $N$ was small, OCBAPI-SA could not efficiently allocate $\triangle$ to each action owing to inaccurate sample estimates. In contrast, OCBAPI-SA2 overcame the limited $N$ by accumulating samples from previous updates to estimate $\hat{Q}_\pi^T(s,a)$ and $\hat{\sigma}_\pi^{2,T}(s,a)$. As the SBPI proceeded, OCBAPI-SA2 could efficiently allocate $\triangle$ to the promising actions, resulting in a higher $P_{CS}$ than OCBAPI-SA. Although the gap between them was relatively insignificant in the two-state example owing to the small number of states, it became large in the more complex problem, as shown in Figure 5C,D.

5. Conclusions

In this study, we proposed a method called OCBAPI-SA2 that uses OCBA to improve the overall efficiency of SBPI. Unlike existing methods, OCBAPI-SA2 aims to improve the overall efficiency by considering the state traversal property of SBPI. To achieve this, OCBAPI-SA2 applies SBPI to traverse across states and accumulates the simulation samples to estimate the unknown transition probabilities. It then utilizes these probabilities to compute the mean and variance of the Q-values so that OCBA can efficiently allocate the simulation budget. With the accumulation of samples, OCBAPI-SA2 allows SBPI to obtain the optimal policy with a lower simulation budget, which is important in practice for complex systems with limited budgets. The experimental results demonstrate the superior efficiency of OCBAPI-SA2 compared to the existing methods. Considering the properties of the SBPI, the order in which states are visited for policy improvement also has a significant impact on the required simulation budget. To further improve the overall efficiency of SBPI, our future work will focus on the optimal sequence of state traversal.

Author Contributions

X.H. and S.H.C. conceived the methodology and designed the experiments; X.H. conducted the experiments; X.H. and S.H.C. analyzed the experimental results; X.H. wrote the manuscript; X.H. and S.H.C. reviewed and revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Ewha Womans University Research Grant of 2022.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

1. Hendy, A.S.; Zaky, M.A.; Doha, E.H. On a discrete fractional stochastic Grönwall inequality and its application in the numerical analysis of stochastic FDEs involving a martingale. Int. J. Nonlinear Sci. Numer. Simul. 2021.
2. Hendy, A.S.; Zaky, M.A.; Suragan, D. Discrete fractional stochastic Grönwall inequalities arising in the numerical analysis of multi-term fractional order stochastic differential equations. Math. Comput. Simul. 2022, 193, 269–279.
3. Moghaddam, B.; Mendes Lopes, A.; Tenreiro Machado, J.; Mostaghim, Z. Computational scheme for solving nonlinear fractional stochastic differential equations with delay. Stoch. Anal. Appl. 2019, 37, 893–908.
4. Moghaddam, B.; Zhang, L.; Lopes, A.; Tenreiro Machado, J.; Mostaghim, Z. Sufficient conditions for existence and uniqueness of fractional stochastic delay differential equations. Stochastics 2020, 92, 379–396.
5. Jahanshahi, H.; Jafarzadeh, M.; Sari, N.N.; Pham, V.T.; Huynh, V.V.; Nguyen, X.Q. Robot motion planning in an unknown environment with danger space. Electronics 2019, 8, 201.
6. Tibaldi, M.; Palermo, G.; Pilato, C. Dynamically-Tunable Dataflow Architectures Based on Markov Queuing Models. Electronics 2022, 11, 555.
7. Ouyang, W.; Chen, Z.; Wu, J.; Yu, G.; Zhang, H. Dynamic Task Migration Combining Energy Efficiency and Load Balancing Optimization in Three-Tier UAV-Enabled Mobile Edge Computing System. Electronics 2021, 10, 190.
8. Bertsekas, D.P.; Castanon, D.A. Rollout algorithms for stochastic scheduling problems. J. Heuristics 1999, 5, 89–108.
9. Huang, Q.L.; Jia, Q.S.; Qiu, Z.F.; Guan, X.H.; Deconinck, G. Matching EV charging load with uncertain wind: A simulation-based policy improvement approach. IEEE Trans. Smart Grid 2015, 6, 1425–1433.
10. Sarkale, Y.; Nozhati, S.; Chong, E.K.P.; Ellingwood, B.R.; Mahmoud, H. Solving Markov decision processes for network-level post-hazard recovery via simulation optimization and rollout. In Proceedings of the IEEE 14th International Conference on Automation Science and Engineering, Munich, Germany, 20–24 August 2018; pp. 906–912.
11. Kim, S.H.; Nelson, B.L. A fully sequential procedure for indifference-zone selection in simulation. ACM Trans. Model. Comput. Simul. 2001, 11, 251–273.
12. Choi, S.H.; Kim, T.G. Efficient ranking and selection for stochastic simulation model based on hypothesis test. IEEE Trans. Syst. Man Cybern. Syst. 2017, 48, 1555–1565.
13. Chen, C.H.; Lin, J.W.; Yücesan, E.; Chick, S.E. Simulation budget allocation for further enhancing the efficiency of ordinal optimization. J. Discr. Event Dyn. Syst. Theory Appl. 2000, 10, 251–270.
14. Jia, Q.S. Efficient computing budget allocation for simulation-based policy improvement. IEEE Trans. Autom. Sci. Eng. 2012, 9, 342–352.
15. Wu, D.; Jia, Q.S.; Chen, C.H. Sample path sharing in simulation-based policy improvement. In Proceedings of the IEEE International Conference on Robotics and Automation, Hong Kong, China, 31 May–5 June 2014; pp. 3291–3296.
16. Huang, X.L.; Choi, S.H. A Simulation Sample Accumulation Method for Efficient Simulation-based Policy Improvement in Markov Decision Process. J. Korea Multimed. Soc. 2020, 23, 830–839.
17. DeGroot, M.H. Optimal Statistical Decisions; John Wiley & Sons: Hoboken, NJ, USA, 2005; Volume 82.
Figure 1. An example of MDP, where a circle represents a state and a square represents an action that is available in the state. P represents an unknown state transition probability, a reward is denoted by r, and π is the policy.
Figure 2. Table of the cumulative number of state–action pairs, where 0 represents the unreachable states.
Figure 3. A two-state Markov decision process.
Figure 4. An extended version of the two-state example.
Figure 5. Graphs indicate the value function of the improved policy and $P_{CS}$ of each method for the two examples: (A,B) the results of the two-state example; (C,D) the results of the extended version.
Table 1. Summary of comparison methods.

Method | Description | Allocation Rule | Best Action Selection
EA | Standard SBPI | $n_i = N/k$ | $a_e = \arg\max_a \bar{Q}_\pi^T(s,a)$
OCBAPI | Using OCBA to allocate the simulation budget efficiently | Equation (12) | $a_e = \arg\max_a \bar{Q}_\pi^T(s,a)$
OCBA-S | Improving efficiency of OCBAPI with sample path sharing | Equation (12) | $a_e = \arg\max_a \bar{Q}_\pi^T(s,a)$ ᵃ
EA-SA | Using sample accumulation for EA to select the best action via Equation (27) | $n_i = N/k$ | $a_e = \arg\max_a \hat{Q}_\pi^T(s,a)$
OCBAPI-SA | Using sample accumulation for OCBAPI to select the best action via Equation (27) | Equation (12) | $a_e = \arg\max_a \hat{Q}_\pi^T(s,a)$
OCBAPI-SA2 (Algorithm 1) | Using the estimated mean from Equation (27) and variance from Equation (28) of the Q-value to efficiently allocate the computing budget for OCBAPI-SA | Equation (12) with $\hat{Q}_\pi^T(s,a)$, $\hat{\sigma}_\pi^{2,T}(s,a)$ | $a_e = \arg\max_a \hat{Q}_\pi^T(s,a)$

ᵃ Sample path sharing is used to calculate $\bar{Q}_\pi^{T,S}(s,a)$; see [15] for more details.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
