Article

An ϵ-Greedy Multiarmed Bandit Approach to Markov Decision Processes †

Isa Muqattash 1,† and Jiaqiao Hu 2
1 Independent Researcher, Stony Brook, NY 11794-3600, USA
2 Department of Applied Mathematics and Statistics, The State University of New York at Stony Brook, Stony Brook, NY 11794-3600, USA
* Author to whom correspondence should be addressed.
† This work was completed while the author was a PhD student at the Department of Applied Mathematics and Statistics, SUNY at Stony Brook, Stony Brook, NY 11794-3600, USA.
Stats 2023, 6(1), 99-112; https://doi.org/10.3390/stats6010006
Submission received: 19 November 2022 / Revised: 22 December 2022 / Accepted: 23 December 2022 / Published: 1 January 2023
(This article belongs to the Special Issue Feature Paper Special Issue: Reinforcement Learning)

Abstract

We present REGA, a new adaptive-sampling-based algorithm for the control of finite-horizon Markov decision processes (MDPs) with very large state spaces and small action spaces. We apply a variant of the $\epsilon$-greedy multiarmed bandit algorithm to each stage of the MDP in a recursive manner, thus computing an estimation of the “reward-to-go” value at each stage of the MDP. We provide a finite-time analysis of REGA. In particular, we provide a bound on the probability that the approximation error exceeds a given threshold, where the bound is given in terms of the number of samples collected at each stage of the MDP. We empirically compare REGA against another sampling-based algorithm called RASA by running simulations against the SysAdmin benchmark problem with $2^{10}$ states. The results show that REGA and RASA achieved similar performance. Moreover, REGA and RASA empirically outperformed an implementation of the algorithm that uses the “original” $\epsilon$-greedy algorithm that commonly appears in the literature.

1. Introduction

Consider a finite-horizon Markov decision process (MDP) with non-negative rewards. Let $A$ and $S$ denote finite action and state spaces, respectively, and let $H$ denote the horizon of the MDP. For any stage $i = 0, \ldots, H-1$ and state $s_i \in S$, choosing a control $a_i \in A$ (stochastically) leads to a future state $s_{i+1}$. The system dynamics can be captured by $s_{i+1} = f(s_i, a_i, w_i)$, where $w_i \sim U(0,1)$ is a uniformly distributed random variable representing the uncertainty of the system (c.f. [1,2]). Let $R(s_i, a_i, w_i)$ denote the immediate reward, and let $\Pi$ be the set of all possible deterministic nonstationary Markovian policies $\pi = \{\pi_i \mid \pi_i : S \to A,\ i = 0, \ldots, H-1\}$. To simplify our notation, let $w = (w_i, w_{i+1}, \ldots, w_{H-1})$. Given a policy $\pi \in \Pi$, the reward-to-go value function is given by $V_i^{\pi}(s_i) = E_w^{\pi}\left[\sum_{t=i}^{H-1} R\left(s_t, \pi_t(s_t), w_t\right)\right]$, with $V_H^{\pi}(s) = 0$ for all $s \in S$. Let the optimal reward-to-go value function be denoted by $V_i^{*}(s) = \max_{\pi \in \Pi} V_i^{\pi}(s)$, with $V_H^{*}(s) = 0$ for all $s \in S$; then $V_i^{*}(s)$ can be computed recursively using
$$V_i^{*}(s) = \max_{a \in A} Q_i^{*}(s,a), \qquad Q_i^{*}(s,a) = E_w\left[R(s,a,w) + V_{i+1}^{*}\big(f(s,a,w)\big)\right].$$
In this paper, we state our results for the objective of maximizing the total reward-to-go value for a fixed initial state $s_0$. When clear from the context, we omit the subscript for the initial horizon and state and use $V^{*}(s)$ instead of $V_0^{*}(s_0)$. The results can be trivially restated for minimization problems.
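To make the simulation model concrete, the following minimal Python sketch encodes an MDP through exactly these primitives, the next-state function $f(s,a,w)$ and the reward $R(s,a,w)$ driven by $w \sim U(0,1)$, together with a Monte Carlo estimate of $V_0^{\pi}(s_0)$ for a fixed policy. The class and function names are our own illustrative choices, not notation from the paper.

```python
import random

class SimulatedMDP:
    """A finite-horizon MDP specified in simulation form: the caller supplies
    the admissible-action map A(s), the dynamics f(s, a, w), and the reward
    R(s, a, w), where w ~ U(0, 1) carries all of the randomness."""

    def __init__(self, actions, f, R, horizon):
        self.actions = actions   # callable: state -> list of admissible actions A(s)
        self.f = f               # callable: next state f(s, a, w)
        self.R = R               # callable: immediate reward R(s, a, w)
        self.H = horizon         # horizon H

def estimate_policy_value(mdp, policy, s0, num_rollouts=1000):
    """Monte Carlo estimate of V_0^pi(s0) = E[ sum_{t=0}^{H-1} R(s_t, pi_t(s_t), w_t) ]."""
    total = 0.0
    for _ in range(num_rollouts):
        s, ret = s0, 0.0
        for i in range(mdp.H):
            a = policy(i, s)               # nonstationary Markovian policy pi_i(s)
            w = random.random()            # w_i ~ U(0, 1)
            ret += mdp.R(s, a, w)
            s = mdp.f(s, a, w)
        total += ret
    return total / num_rollouts
```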
Recent years have seen theoretical progress in the area of simulation-based methods for solving large finite-horizon MDPs. Two algorithms closely related to the method that we propose in this paper are the recursive automata sampling algorithm (RASA) of [3] and the adaptive multistage sampling algorithm (AMS) of [4]. RASA recursively applies a learning automata algorithm at each stage of the MDP in order to calculate the reward-to-go value function. When applied to the initial state of the MDP, the algorithm produces an approximated value for the MDP. The objective is to maximize the total reward criterion. In [3], the authors give two probabilistic bounds on the performance of RASA as a function of the number of samples at each stage: a lower bound on the probability that the algorithm samples the optimal action, and an upper bound on the probability that the error of the algorithm exceeds a given threshold.
The AMS algorithm is very similar to RASA, except that the learning automata are replaced with an upper confidence bound (UCB) multiarmed bandit (MAB) algorithm to drive the simulation at each stage of the MDP. In [4], the authors showed that the MDP value function estimator produced by AMS is asymptotically unbiased. This asymptotic convergence result is weaker than the convergence-in-probability result shown for RASA. The value function estimator required to show the theoretical results for AMS is based on a weighted average of the Q-function estimates of the sampled actions, where the weights are proportional to the number of times each action is chosen for simulation by the underlying MAB algorithm. Thus, the resultant value function estimator is biased below the true value function of the MDP.
Other simulation-based approaches for solving MDPs include stochastic annealing ([5]), neurodynamic programming ([6]), stochastic neural networks ([7]), and Q-learning ([8,9]).
In this paper, we propose a new algorithm called the recursive epsilon-greedy algorithm (REGA), which combines ideas from RASA and AMS. Like AMS, the sampling strategy at each stage of the MDP is based on a multiarmed bandit model. In particular, we use a generalized variant of Auer et al.’s ϵ -greedy algorithm (c.f. [10]). On the other hand, our theoretical analysis and the value function estimator that we use are similar to those of RASA. We give a finite-time analysis of REGA following the same ideas as those found in [3]. In particular, we give an upper bound on the probability that the approximated reward-to-go generated by REGA has an absolute error larger than a tolerated threshold ϵ > 0 . We provide a convergence rate in terms of the number of samples captured at each stage of the MDP. Much credit goes to [3] for the ideas behind our theoretical analysis.
The rest of the paper is organized as follows: The REGA algorithm is introduced in Section 2. Section 3 provides a finite-time analysis of REGA. Empirical results in comparison to the RASA algorithm appear in Section 4. We conclude with some remarks in Section 5.
This article is based on Chapter 2 of the first author’s PhD thesis [11].

2. The Algorithm

MAB problems are well-studied in the literature (c.f. [10,12,13,14]), and provide a mechanism for the trade-off between exploration and exploitation. In this paper, we consider finite-horizon MDPs with finite action and state spaces A and S and finite horizon H. Suppose that the system dynamics of the MDP are not known. The objective is to use a simulation-based technique to approximate the total reward of the MDP. Instead of simulating all permissible actions an equal number of times, it is reasonable to allocate more simulations towards actions with a likely higher reward to go. This becomes more important in settings where simulation resources are limited.
Consider a single stage $i = 0, 1, \ldots, H-1$ and current state $s \in S$, and let $c_i \geq 1$ be tuneable parameters that control the amount of pure exploration that takes place at each stage $i$ of the algorithm. Let $\hat{Q}_{i,M_i}(s,a)$ and $\hat{V}_{i,M_i}(s) = \max_{a \in A(s)} \hat{Q}_{i,M_i}(s,a)$ denote estimates of $Q_i^{*}(s,a)$ and $V_i^{*}(s)$, respectively, when $M_i$ simulations are allocated to stage $i$. When clear from the context, we drop the suffix $M_i$ and use the simplified notations $\hat{Q}_i(s,a)$ and $\hat{V}_i(s)$. By convention, we set $\hat{V}_H(s) = 0$. Moreover, let $M_a^i(s)$ denote the number of times action $a$ is sampled at stage $i$. It always holds that $M_i = \sum_{a \in A(s)} M_a^i(s)$. Lastly, for a non-negative integer $m$ and a state $s$, let $\{\epsilon_m^s\}_m$ be the nonincreasing sequence given by
$$\epsilon_m^s = \min\left\{1, \frac{c_i\,|A(s)|}{\sqrt{m}}\right\}, \qquad (1)$$
where $A(s)$ is the set of available actions at state $s$, and $|A(s)|$ is the size of $A(s)$. When the state $s$ is clear from the context, we drop the superscript and use the simplified notation $\epsilon_m$ instead of $\epsilon_m^s$. The REGA algorithm is outlined in Algorithm 1.
The widely cited $\epsilon$-greedy method of [10] defines $\epsilon_m^s$ differently: the denominator $\sqrt{m}$ in Equation (1) is replaced with $m$. See Section 4 for a comparison of empirical results.
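As a quick illustration (our own, with hypothetical parameter values), the two exploration schedules can be written side by side; the only difference is the $\sqrt{m}$ versus $m$ decay.

```python
import math

def epsilon_rega(m, num_actions, c):
    """Exploration probability of Equation (1): min{1, c*|A(s)|/sqrt(m)}."""
    return min(1.0, c * num_actions / math.sqrt(m))

def epsilon_original(m, num_actions, c):
    """Schedule with the sqrt(m) denominator replaced by m, as in the original
    epsilon-greedy method of [10] (the OREGA variant compared in Section 4)."""
    return min(1.0, c * num_actions / m)

# Example with c = 2 and |A(s)| = 5: the sqrt(m) schedule stays in pure
# exploration (epsilon = 1) while m <= (c*|A(s)|)^2 = 100 and then decays
# slowly, whereas the 1/m schedule drops below 1 already at m = 11.
print([round(epsilon_rega(m, 5, 2), 2) for m in (1, 10, 100, 1000)])
print([round(epsilon_original(m, 5, 2), 2) for m in (1, 10, 100, 1000)])
```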
Algorithm 1: Recursive Epsilon-Greedy Algorithm (REGA)
Input: Stage $i < H$, state $s \in S$, $c_i \geq 1$, and $M_i \geq 1$.
Initialization: $\hat{V}_H(s) = 0$, $\hat{Q}_i(s,a) = 0$, and $M_a^i(s) = 0$ for all $a \in A(s)$.
Loop (For $m = 1, 2, \ldots, M_i$):
{
1. Update the current optimal action $\hat{a}_i$:
$$\hat{a}_i = \arg\max_{a \in A(s)} \hat{Q}_i(s,a).$$
2. Pick the next action to sample:
$$\bar{a}_i = \begin{cases} \hat{a}_i, & \text{w.p. } (1-\epsilon_m), \\ \text{(uniformly) random arm}, & \text{w.p. } \epsilon_m. \end{cases}$$
3. Generate $w_m^i \sim U(0,1)$, then simulate the next state $f(s, \bar{a}_i, w_m^i)$ and the immediate reward $R(s, \bar{a}_i, w_m^i)$.
4. Update
$$\hat{Q}_i(s, \bar{a}_i) = \frac{M_{\bar{a}_i}^i(s)\,\hat{Q}_i(s, \bar{a}_i) + \hat{V}_{i+1}\big(f(s, \bar{a}_i, w_m^i)\big) + R(s, \bar{a}_i, w_m^i)}{M_{\bar{a}_i}^i(s) + 1}. \qquad (2)$$
5. Update $M_{\bar{a}_i}^i(s) = M_{\bar{a}_i}^i(s) + 1$.
}
Output: $\hat{V}_i(s) = \max_{a \in A(s)} \hat{Q}_i(s,a)$.
The REGA algorithm treats each available action at the current state as an arm in a MAB setting. As such, the next arm to sample is chosen in accordance with the $\epsilon$-greedy MAB method. Let $m$ be the number of simulations run at this stage thus far. At first, all arms are uniformly sampled up to a certain total number of purely exploratory/training simulations. After that, with ties broken arbitrarily, the arm believed to be optimal is sampled with increasing probability $1-\epsilon_m$. For each simulation and chosen action, the MDP advances (stochastically) to a future state $s_{i+1}$ according to the system dynamics $s_{i+1} = f(s_i, a_i, w_i)$ (c.f. [1,2]). The next state $s_{i+1}$ is once again treated as a MAB problem. We continue to apply the $\epsilon$-greedy method recursively until the maximal horizon $H$ is reached. Via backward induction, the optimal reward to go at each stage is estimated, and when applied to the initial state of the MDP, we obtain an estimation of the optimal value function of the MDP. The main advantage of REGA is that the reward to go need not be computed for every state of the MDP, as the algorithm only considers the states visited during sampling. As such, REGA is well-suited for MDPs with very large state spaces.
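The following Python sketch follows Algorithm 1 step by step, reusing the SimulatedMDP interface sketched in Section 1; the nesting of stages is realized as a recursive function call. It is meant only as a reading aid under those interface assumptions, not as the authors' implementation (which was custom C++ code; see Section 4).

```python
import math
import random

def rega(mdp, i, s, c, M):
    """Estimate V_i(s) by treating the actions A(s) as arms of an epsilon-greedy
    bandit and recursing on the sampled next states (cf. Algorithm 1).

    mdp : SimulatedMDP with fields actions, f, R, H (earlier sketch)
    c, M: lists of per-stage exploration constants c_i and sample budgets M_i
    """
    if i >= mdp.H:                       # by convention, V_H(s) = 0
        return 0.0
    arms = mdp.actions(s)
    Q = {a: 0.0 for a in arms}           # Q-hat_i(s, a)
    n = {a: 0 for a in arms}             # M_a^i(s): times arm a has been sampled
    for m in range(1, M[i] + 1):
        eps = min(1.0, c[i] * len(arms) / math.sqrt(m))       # Equation (1)
        greedy = max(arms, key=lambda a: Q[a])                # step 1 (ties arbitrary)
        a = greedy if random.random() > eps else random.choice(arms)   # step 2
        w = random.random()                                   # step 3: w_m^i ~ U(0, 1)
        next_s = mdp.f(s, a, w)
        reward = mdp.R(s, a, w)
        v_next = rega(mdp, i + 1, next_s, c, M)               # recursive reward-to-go
        Q[a] = (n[a] * Q[a] + v_next + reward) / (n[a] + 1)   # step 4
        n[a] += 1                                             # step 5
    return max(Q.values())               # output: V-hat_i(s)
```

Calling `rega(mdp, 0, s0, c, M)` returns an estimate of $V_0^{*}(s_0)$; note that the recursion performs on the order of $\prod_i M_i$ simulator calls, which is why small horizons (or the rolling-horizon device of Section 4) are assumed.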

3. Theoretical Analysis

In this section, we provide a theoretical treatment of the convergence properties of REGA. Before introducing the main result, the following lemma is needed to justify some algebraic steps. The proof can be found in Appendix A.
Lemma 1. 
Let $c \geq 1$ be a fixed real number, and let $M \geq 1$ and $r > (c^2+c)^2$ be fixed integers. Then, the following inequalities hold:
$$\frac{cM}{\sqrt{r+M+1}} \;<\; \sum_{i=r+1}^{r+M} -\log\left(1-\frac{c}{\sqrt{i}}\right) \;<\; \frac{(c+1)M}{\sqrt{r}}, \qquad (3)$$
$$\sum_{i=r+1}^{r+M} \log^2\left(1-\frac{c}{\sqrt{i}}\right) \;<\; (c+1)^2\log\left(1+\frac{M}{r}\right). \qquad (4)$$
The discrete probability distribution of the sum of independent Bernoulli trials that are not identically distributed is known as the Poisson binomial distribution. The following theorem gives a bound on the error of estimating the Poisson binomial distribution CDF with the “usual” Poisson distribution CDF (c.f. [15], Thm. 5.1).
Theorem 1. 
Let $X_1, \ldots, X_n$ be independent Bernoulli variables with respective success probabilities $p_1, \ldots, p_n$. Assume that $0 < p_i < 1$, and let $\lambda_i = -\log(1-p_i)$ for all $1 \leq i \leq n$. Let $Y(\lambda)$ denote a random variable having the Poisson distribution with mean parameter $\lambda$. Then, for each integer $m \geq 1$, we have
$$P\left(Y\Big(\sum_{i=1}^{n}\lambda_i\Big) \leq m\right) \;\leq\; P\left(\sum_{i=1}^{n} X_i \leq m\right) \;\leq\; P\left(Y\Big(\sum_{i=1}^{n}\lambda_i\Big) \leq m\right) + \frac{1}{2}\sum_{i=1}^{n}\lambda_i^2.$$
Before analyzing REGA, we first consider a simpler version of the algorithm in which $\hat{V}_{i+1}$ is replaced with $V_{i+1}^{*}$ in Equation (2). This removes the recursion from REGA under the assumption that the true reward to go at the next stage is known, and it thus reduces the problem from a multistage problem to a single-stage problem. We refer to this algorithm as the nonrecursive epsilon-greedy algorithm (NREGA). Later on, we derive results pertaining to REGA.
In what follows, we use the following additional notation, in which $R_{\max}$ denotes the maximum possible one-stage reward. Note that $\xi_\epsilon > 0$ for $\epsilon > 0$:
$$\Lambda = \max\left\{(c^2+c)^2,\ \big(c\,|A(s)|\big)^2\right\},$$
$$\xi_\epsilon = \frac{R_{\max}H + \epsilon}{R_{\max}H}\,\log\left(\frac{R_{\max}H + \epsilon}{R_{\max}H}\right) - \frac{\epsilon}{R_{\max}H}.$$
The next lemma gives a finite time bound on the estimation errors of the Q-functions induced by NREGA at the exit of the algorithm as a function of the number of allocated simulations at each stage. The result is analogous to Lemma 3.2 in [3].
Lemma 2. 
Let $\epsilon > 0$ and $\alpha \in (1,2)$ be fixed. Consider a single stage of NREGA applied to an MDP using a parameter $c \geq 2\sqrt{3}$ and a large enough total number of allocated simulations $M$, such that $M \geq \Lambda^{1/\alpha}$ and $M^{\alpha} \geq M+2$. Then, for any action $a$ and state $s$, we have
$$P\left(\left|\hat{Q}_{i,M}(s,a) - Q_i^{*}(s,a)\right| \geq \epsilon\right) \;<\; \frac{(c+1)^2}{2}\log\left(1+M^{1-\alpha}\right) + e\,c\left(c\,\exp\Big(\frac{2-c}{2}\Big)\right)^{M^{1-\alpha/2}} + 2\exp\left(-\xi_\epsilon M^{1-\alpha/2}\right). \qquad (5)$$
Proof. 
Consider a fixed but arbitrary stage i of the algorithm, state s, and action a of the MDP. Denote by a i ( t ) the arm played at iteration t of the algorithm. Let I { a i ( t ) = a } be a random variable indicating whether or not the chosen arm a i ( t ) is in fact arm a, and let I U { a i ( t ) = a } denote whether or not the arm was chosen (independently) per the uniformly random rule in step 2 of the algorithm. Let N a i , M ( s ) = t = 1 M I { a i ( t ) = a } be a random variable denoting the total number of times action a is played during the first M iterations of the algorithm at stage s. Similarly, let U a i , M ( s ) = t = 1 M I U { a i ( t ) = a } .
In an argument identical to the one in the proof of Lemma 3.2 in [3], it can be shown for any positive integer $N$ that
$$P\left(\left|\hat{Q}_{i,M}(s,a) - Q_i^{*}(s,a)\right| \geq \epsilon\right) \;<\; 2\exp\left(-\xi_\epsilon N\right) + P\left(N_a^{i,M}(s) \leq N\right). \qquad (6)$$
We first bound the term $P\left(N_a^{i,M}(s) \leq N\right)$ and then replace $N$ with an appropriate function of $M$.
Indeed, $N_a^{i,M}(s) \geq U_a^{i,M}(s)$ for all $a \in A(s)$, and $N_a^{i,M}(s) > U_a^{i,M}(s)$ for some $a \in A(s)$. Therefore, $N_a^{i,M}(s)$ dominates $U_a^{i,M}(s)$ almost surely. Hence,
$$P\left(N_a^{i,M}(s) \leq N\right) \;\leq\; P\left(U_a^{i,M}(s) \leq N\right) \;=\; P\left(\sum_{t=1}^{M} I^U\{a^i(t)=a\} \leq N\right). \qquad (7)$$
For iterations $t = 1, \ldots, M$ of the algorithm, let $\epsilon_t$ be defined as per Equation (1). Per the algorithm description, the random variables $I^U\{a^i(t)=a\}$ are independent and have probabilities
$$P\left(I^U\{a^i(t)=a\} = 1\right) = \frac{\epsilon_t}{|A(s)|} = \begin{cases} \dfrac{c}{\sqrt{t}}, & t \geq \big(c\,|A(s)|\big)^2, \\[4pt] \dfrac{1}{|A(s)|}, & \text{o/w}. \end{cases}$$
Let the integer $r \geq \Lambda$ be a constant “shift” variable, and let $X_1, \ldots, X_M$ be independent Bernoulli variables with success probabilities $p_t = \frac{c}{\sqrt{r+t}}$. Since $p_t \leq \epsilon_t/|A(s)|$ for all $t = 1, \ldots, M$, each random variable $X_t$ is (first-order) stochastically dominated by the corresponding random variable $I^U\{a^i(t)=a\}$ for any fixed $t$. Hence, it holds that
$$P\left(\sum_{t=1}^{M} I^U\{a^i(t)=a\} \leq N\right) \;\leq\; P\left(\sum_{t=1}^{M} X_t \leq N\right). \qquad (8)$$
Combining Equations (7) and (8) yields
$$P\left(N_a^{i,M}(s) \leq N\right) \;\leq\; P\left(\sum_{t=1}^{M} X_t \leq N\right). \qquad (9)$$
Let $\lambda_t = -\log\left(1-\frac{c}{\sqrt{t}}\right)$. We now apply Theorem 1 to Equation (9), which yields
$$P\left(N_a^{i,M}(s) \leq N\right) \;\leq\; P\left(Y\Big(\sum_{t=r+1}^{r+M}\lambda_t\Big) \leq N\right) + \frac{1}{2}\sum_{t=r+1}^{r+M}\lambda_t^2. \qquad (10)$$
Applying Chernoff bounds to a Poisson random variable $Y(\lambda)$ yields
$$P\left(Y(\lambda) \leq y\right) \;\leq\; \frac{e^{-\lambda}(e\lambda)^{y}}{y^{y}}, \qquad \text{for } y \leq \lambda. \qquad (11)$$
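For completeness, this lower-tail bound follows from the usual exponential-moment argument (a standard derivation sketch added here for the reader's convenience, not part of the original text): for any $t \geq 0$, Markov's inequality gives
$$P\left(Y(\lambda) \leq y\right) = P\left(e^{-tY(\lambda)} \geq e^{-ty}\right) \;\leq\; e^{ty}\,E\left[e^{-tY(\lambda)}\right] = \exp\left(ty + \lambda\left(e^{-t}-1\right)\right),$$
and minimizing the exponent at $t = \log(\lambda/y)$, which is nonnegative precisely when $y \leq \lambda$, yields $\exp\left(y\log(\lambda/y) + y - \lambda\right) = e^{-\lambda}(e\lambda)^{y}/y^{y}$.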
Consider $N \leq \frac{cM}{\sqrt{r+M+1}}$. It follows from Lemma 1 that $N \leq \sum_{t=r+1}^{r+M}\lambda_t$. Hence, we can apply the Chernoff bound from Equation (11) to Equation (10) to obtain
$$P\left(N_a^{i,M}(s) \leq N\right) \;\leq\; \frac{1}{2}\sum_{t=r+1}^{r+M}\lambda_t^2 + \exp\left(-\sum_{t=r+1}^{r+M}\lambda_t\right)\left(\frac{e\sum_{t=r+1}^{r+M}\lambda_t}{N}\right)^{N}. \qquad (12)$$
We wish to give an elementary upper bound on the right-hand side of (12). Since $e^{-\lambda}(e\lambda)^{y}y^{-y}$ is decreasing in $\lambda$ for $y \leq \lambda$, it follows from Lemma 1 that
$$P\left(N_a^{i,M}(s) \leq N\right) \;\leq\; \frac{(c+1)^2}{2}\log\left(1+\frac{M}{r}\right) + \exp\left(-\frac{cM}{\sqrt{r+M+1}}\right)\left(\frac{e\,cM}{N\sqrt{r+M+1}}\right)^{N}. \qquad (13)$$
Now, let $r = \lceil M^{\alpha} \rceil$ and $N = \lceil M^{1-\alpha/2} \rceil$. For the bound in Equation (13) to converge to zero as $M \to \infty$, it is necessary that $\alpha \in (1,2)$. Moreover, the condition $N \leq \frac{cM}{\sqrt{r+M+1}}$ is indeed satisfied, since $c \geq 2\sqrt{3}$ implies that
$$M \;\geq\; \left(\frac{\sqrt{3}}{c-\sqrt{3}}\right)^{\frac{2}{2-\alpha}} \;\Longrightarrow\; M^{1-\alpha/2}+1 \;\leq\; \frac{c\,M^{1-\alpha/2}}{\sqrt{3}} \;=\; \frac{cM}{\sqrt{3M^{\alpha}}} \;\leq\; \frac{cM}{\sqrt{M^{\alpha}+M+2}} \;\leq\; \frac{cM}{\sqrt{r+M+1}},$$
so that $N \leq M^{1-\alpha/2}+1 \leq \frac{cM}{\sqrt{r+M+1}}$, where the second inequality in the chain follows from the restriction $M^{\alpha} \geq M+2$.
Substituting $r$ and $N$ into Equation (13) and relaxing the bound yields
$$P\left(N_a^{i,M}(s) \leq N\right) \;\leq\; \frac{(c+1)^2}{2}\log\left(1+\frac{M}{M^{\alpha}}\right) + \exp\left(-\frac{cM}{\sqrt{M^{\alpha}+M+2}}\right)\left(\frac{e\,c\,M^{\alpha/2}}{\sqrt{M^{\alpha}+M+1}}\right)^{M^{1-\alpha/2}+1}. \qquad (14)$$
Through the condition $M^{\alpha} \geq M+2$, we have
$$\exp\left(-\frac{cM}{\sqrt{M^{\alpha}+M+2}}\right)\left(\frac{e\,c\,M^{\alpha/2}}{\sqrt{M^{\alpha}+M+1}}\right)^{M^{1-\alpha/2}+1} \;\leq\; \exp\left(-\frac{cM}{2M^{\alpha/2}}\right)\left(e\,c\right)^{M^{1-\alpha/2}+1} \;=\; \exp\left(-\frac{c\,M^{1-\alpha/2}}{2} + M^{1-\alpha/2} + 1\right)c^{\,M^{1-\alpha/2}+1} \;=\; e\,c\left(c\,\exp\Big(\frac{2-c}{2}\Big)\right)^{M^{1-\alpha/2}}. \qquad (15)$$
Combining Equations (14) and (15) yields
$$P\left(N_a^{i,M}(s) \leq N\right) \;\leq\; \frac{(c+1)^2}{2}\log\left(1+M^{1-\alpha}\right) + e\,c\left(c\,\exp\Big(\frac{2-c}{2}\Big)\right)^{M^{1-\alpha/2}}.$$
Recall that $N = \lceil M^{1-\alpha/2} \rceil \geq M^{1-\alpha/2}$, and $\xi_\epsilon > 0$ for $\epsilon > 0$. Therefore, Equation (6) can be relaxed to
$$P\left(\left|\hat{Q}_{i,M}(s,a) - Q_i^{*}(s,a)\right| \geq \epsilon\right) \;<\; \frac{(c+1)^2}{2}\log\left(1+M^{1-\alpha}\right) + e\,c\left(c\,\exp\Big(\frac{2-c}{2}\Big)\right)^{M^{1-\alpha/2}} + 2\exp\left(-\xi_\epsilon M^{1-\alpha/2}\right),$$
which completes the proof. □
Corollary 1. 
Suppose the conditions of Lemma 2 hold. Let $\Phi = c\,\exp\left(\frac{2-c}{2}\right) < 1$. If $M$ is large enough such that $\Phi^{M^{1-\alpha/2}} \leq M^{1-\alpha}$ and $\exp\left(-\xi_\epsilon M^{1-\alpha/2}\right) \leq M^{1-\alpha}$, then it follows that
$$P\left(\left|\hat{Q}_{i,M}(s,a) - Q_i^{*}(s,a)\right| \geq \epsilon\right) \;<\; \left(\frac{(c+1)^2}{2} + e\,c + 2\right)M^{1-\alpha}.$$
Proof. 
It is well-known (c.f. [16]) that $\log(1+x) \leq x$ for $x > -1$. The result immediately follows by relaxing Equation (5) of Lemma 2. □
The result in Corollary 1 can be made arbitrarily close to O ( 1 / M ) in the limit by selecting α arbitrarily close to 2. However, as α is increased towards 2, the minimal value of M for which the finite time bound applies grows exponentially.
For the purposes of our theoretical analysis, we now assume that the optimal action at each stage of the algorithm is unique, so that the MDP has a unique optimal policy. This assumption is identical to that in [3].
Assumption 1. 
$\theta = \min_{s \in S,\, i=0,\ldots,H-1} \theta_i(s) > 0$, where $\theta_i(s) = \min_{a \neq a^{*}}\left(V_i^{*}(s) - Q_i^{*}(s,a)\right)$ for all $s \in S$, $a \in A(s)$, and $i = 0, \ldots, H-1$.
A small value of $\theta$ means increased difficulty in differentiating between the truly optimal action and the best suboptimal action(s). Therefore, $\theta$ provides a measure of the “size” of the problem.
We are finally ready to leap from NREGA to the recursive REGA algorithm. We start by restating Lemma 3.3 of [3] in a manner that is useful and applicable to REGA. We refer the reader to the original manuscript for the proof but note that Equation (11), as it originally appears in Lemma 3.3 of [3], is unnecessarily loose: $N_a^i(x)$ is bounded from above by $K_i + 1$ instead of $K_i$. Using the lemma as-is causes our lower bound on the probability in Theorem 2 to converge to 0 instead of 1 as the number of simulations at each stage of the MDP is increased to infinity, whereas one would expect the bound to improve as the number of samples is increased. We necessarily use the tighter bound.
Lemma 3. 
Consider REGA applied with $M_i$ simulations allocated to each stage $i = 0, \ldots, H-1$. Suppose that Assumption 1 holds, and suppose that $\delta_i \in (0,1)$ satisfies
$P\left(\left|\hat{Q}_{i,M_i}(s,a) - Q_i^{*}(s,a)\right| > \frac{\theta}{2^{i+2}}\right) < \delta_i$ for all $i$ and $s \in S$. At the exit step, we have
$$P\left(\left|\hat{V}_{0,M_0}(s_0) - V_0^{*}(s_0)\right| < \frac{\theta}{2}\right) \;>\; (1-\delta_0)\prod_{i=1}^{H-1}\left(1-\delta_i\right)^{\prod_{j=0}^{i-1} M_j}.$$
The following definition is needed in the next theorem:
$$\xi_i = \frac{R_{\max}H + \theta/2^{i+2}}{R_{\max}H}\,\log\left(\frac{R_{\max}H + \theta/2^{i+2}}{R_{\max}H}\right) - \frac{\theta/2^{i+2}}{R_{\max}H}.$$
We now state our main result pertaining to the REGA algorithm in the form of a finite-time bound on the probability that the approximation error of the MDP value function exceeds half the size of the problem, i.e., $\theta/2$. The bound is given as a function of the number of simulations $M_i$ allocated at each stage.
Theorem 2. 
Suppose that Assumption 1 holds and that the requirements of Corollary 1 are satisfied for each stage $i$ of the MDP, where $\xi_\epsilon$ is replaced with $\xi_i$. Let $\phi > 1$ be fixed, and suppose that $M_i = \left(\prod_{j=0}^{i-1} M_j\right)^{\phi/(\alpha-1)}$ for all $i = 1, \ldots, H-1$. Denote $\Psi = \frac{(c+1)^2}{2} + e\,c + 2$. If $M_0$ is large enough such that $\Psi M_0^{1-\alpha} < 1$, then at the exit step of REGA we have
$$P\left(\left|\hat{V}_{0,M_0}(s_0) - V_0^{*}(s_0)\right| < \frac{\theta}{2}\right) \;>\; \left(1 - \Psi M_0^{1-\alpha}\right)^{H\,M_0^{(\alpha-1)/\phi}}.$$
Proof. 
Via Corollary 1, it holds for each stage $i = 0, \ldots, H-1$ that
$$P\left(\left|\hat{Q}_{i,M_i}(s,a) - Q_i^{*}(s,a)\right| > \frac{\theta}{2^{i+2}}\right) \;<\; \Psi M_i^{1-\alpha}.$$
Via Lemma 3, we have
$$P\left(\left|\hat{V}_{0,M_0}(s_0) - V_0^{*}(s_0)\right| < \frac{\theta}{2}\right) \;>\; \left(1-\Psi M_0^{1-\alpha}\right)\prod_{i=1}^{H-1}\left(1-\Psi M_i^{1-\alpha}\right)^{\prod_{j=0}^{i-1} M_j} \;=\; \left(1-\Psi M_0^{1-\alpha}\right)\prod_{i=1}^{H-1}\left(1-\Psi M_i^{1-\alpha}\right)^{M_i^{(\alpha-1)/\phi}} \;>\; \left(1-\Psi M_0^{1-\alpha}\right)^{H\,M_0^{(\alpha-1)/\phi}},$$
where the last inequality follows from the fact that $\left(1-\Psi x^{1-\alpha}\right)^{x^{(\alpha-1)/\phi}}$ is increasing in $x$. This completes the proof. □
We conclude by noting that the bound on the probability in Theorem 2 goes to 1 as $M_0 \to \infty$.
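As a purely numerical illustration of the quantities in Theorem 2 (our own helper code, with hypothetical default values for $\alpha$ and $\phi$), the following snippet computes a per-stage sample allocation satisfying $M_i = \left(\prod_{j<i} M_j\right)^{\phi/(\alpha-1)}$ and evaluates the resulting lower bound; it does not check the additional requirements of Corollary 1.

```python
import math

def stage_budgets(M0, H, alpha=1.5, phi=1.2):
    """Sample budgets M_0, ..., M_{H-1} with M_i = (prod_{j<i} M_j)^(phi/(alpha-1)),
    rounded to integers for illustration."""
    budgets = [M0]
    for _ in range(1, H):
        prod = 1
        for Mj in budgets:
            prod *= Mj
        budgets.append(round(prod ** (phi / (alpha - 1))))
    return budgets

def theorem2_lower_bound(M0, H, c, alpha=1.5, phi=1.2):
    """Evaluate (1 - Psi*M0^(1-alpha))^(H*M0^((alpha-1)/phi)) with
    Psi = (c+1)^2/2 + e*c + 2; returns 0.0 when the bound is vacuous."""
    Psi = (c + 1) ** 2 / 2 + math.e * c + 2
    base = 1 - Psi * M0 ** (1 - alpha)
    if base <= 0:
        return 0.0
    return base ** (H * M0 ** ((alpha - 1) / phi))
```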

4. Empirical Results

Value iteration and policy iteration are classical exact methods for solving MDPs. One iteration of these methods has runtime complexity $O(|S|^2|A|)$ and $O(|S|^2|A| + |S|^3)$, respectively, which is nonlinear in the size of the state space $|S|$. On the other hand, REGA and RASA have runtime complexity $O\left(\left(\max_{h=0,\ldots,H-1} N_h\right)^{H}\right)$, where $N_h$ denotes the number of simulations allocated to stage $h$, which is independent of the size of the state space. Although the runtime complexities of REGA and RASA are also independent of the size of the action space, the accuracies of the algorithms are certainly dependent on the size of the action space. As such, REGA and RASA are generally not well-suited for MDPs with large action spaces. Due to the backward induction process, the runtime complexities of REGA and RASA grow exponentially as the sampling horizon $H$ is increased. Thus, REGA and RASA are particularly attractive algorithms for problems with a large state space $S$, a small action space $A$, and a small horizon $H$.
One solution to the exponential runtime complexity issue pertaining to a large horizon H is to use rolling-horizon control (RH), also known in the literature as receding/moving horizon control. RH provides an approximate solution to infinite horizon control problems in an online manner. The basic idea is to consider a fixed horizon length to determine the best policy/action at the current stage, ignoring stages far out into the future in the hope that they have little effect on the decision being made at the current stage. If the length of this “sliding” lookahead window is chosen to be large enough, then the error induced by this approach can be made small. For further information, we refer the reader to [17].
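A minimal sketch of how the rolling-horizon idea could wrap the REGA sketch from Section 2 (the function names and structure are our own; only the first action of each short-horizon plan is executed before the window slides forward):

```python
import math
import random

def rega_greedy_action(mdp, s, c, M):
    """Run the stage-0 loop of Algorithm 1 at state s, but return the greedy arm
    argmax_a Q-hat_0(s, a) rather than the value estimate."""
    arms = mdp.actions(s)
    Q = {a: 0.0 for a in arms}
    n = {a: 0 for a in arms}
    for m in range(1, M[0] + 1):
        eps = min(1.0, c[0] * len(arms) / math.sqrt(m))
        a = max(arms, key=lambda x: Q[x]) if random.random() > eps else random.choice(arms)
        w = random.random()
        v_next = rega(mdp, 1, mdp.f(s, a, w), c, M)      # reward-to-go over the window
        Q[a] = (n[a] * Q[a] + v_next + mdp.R(s, a, w)) / (n[a] + 1)
        n[a] += 1
    return max(arms, key=lambda a: Q[a])

def rolling_horizon(mdp, s0, total_steps, c, M):
    """Online control: replan over the short lookahead window mdp.H at every
    step, execute only the first action, and accumulate the realized reward."""
    s, collected = s0, 0.0
    for _ in range(total_steps):
        a = rega_greedy_action(mdp, s, c, M)
        w = random.random()
        collected += mdp.R(s, a, w)
        s = mdp.f(s, a, w)
    return collected, s
```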

4.1. SysAdmin Problem

In what follows, we demonstrate the performance of REGA when applied to the well-known SysAdmin discrete-time problem (c.f. [18]). In the SysAdmin problem, a system administrator manages $B$ computing machines that are connected via a known network configuration, so that each machine $b = 1, \ldots, B$ is connected to a subset of “neighboring” machines $N(b)$. Let $I_b^h$ denote the state of machine $b$ at horizon $h = 1, \ldots, H$, where $I_b^h = 1$ indicates that the machine is in a working state, and $I_b^h = 0$ indicates that the machine is in a faulted state. The state of the system at time $h$ can be captured as a binary vector $\langle I_1^h, I_2^h, \ldots, I_B^h\rangle$, so the size of the state space is $2^{|B|}$, which is exponential in the number of machines. At each time step, the system administrator collects the rewards $r(b)$ associated with the machines $b$ that are in a working state. The objective is to maximize $E\left[\sum_{h=1}^{H}\sum_{b=1}^{B} r(b)\,I_b^h\right]$, the expected total (nondiscounted) reward over the entire horizon of the problem, where the expectation is taken with respect to the machine states.
At each time step, the system administrator has to decide to either reboot a particular machine or not reboot any machine at all. Only one machine can be rebooted at each time step; hence, the action space includes $|B| + 1$ actions. We denote the action taken at horizon $h$ by a binary vector $\langle A_1^h, A_2^h, \ldots, A_B^h\rangle$, where $A_b^h = 1$ if machine $b$ is rebooted, and $A_b^h = 0$ otherwise. Since, at each time step, at most one machine can be rebooted, we have $0 \leq \sum_{b=1}^{B} A_b^h \leq 1$.
Machine states are controlled via the following (stochastic) rules: if any machine in $N(b)$ is in a faulted state, then $b$ also fails with high probability $p_1$. On the other hand, if none of the machines in $N(b)$ is in a faulted state, then $b$ can still fail, with a much smaller probability $p_2$. When a machine is rebooted, it fails to start up in a working state with a very small probability $p_3$. The transition probabilities for each machine can be formally stated as follows (a simulator sketch in code follows the list):
1. A faulted machine that is not rebooted remains in a faulted state:
$$P\left(I_b^{h+1} = 0 \,\middle|\, I_b^h = 0,\ A_b^h = 0\right) = 1.$$
2. A machine in a working state that is not rebooted enters a faulted state with a probability that depends on the states of its neighboring machines. Let statement $T$ denote that $\exists\, \beta \in N(b)$ s.t. $I_\beta^h = 0$:
$$P\left(I_b^{h+1} = 0 \,\middle|\, I_b^h = 1,\ A_b^h = 0\right) = \begin{cases} p_1, & T \text{ true}, \\ p_2, & \text{o/w}, \end{cases}$$
$$P\left(I_b^{h+1} = 1 \,\middle|\, I_b^h = 1,\ A_b^h = 0\right) = \begin{cases} 1-p_1, & T \text{ true}, \\ 1-p_2, & \text{o/w}. \end{cases}$$
3. Rebooting a machine can result in it entering a faulted state (regardless of its current state):
$$P\left(I_b^{h+1} = 0 \,\middle|\, I_b^h \in \{0,1\},\ A_b^h = 1\right) = p_3,$$
$$P\left(I_b^{h+1} = 1 \,\middle|\, I_b^h \in \{0,1\},\ A_b^h = 1\right) = 1 - p_3.$$
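The sketch below encodes these transition rules as a SimulatedMDP (reusing the interface and the REGA sketch from earlier sections). The encoding of states as 0/1 tuples, the convention that action $B$ means “reboot nothing”, and the timing of the reward collection are our own illustrative choices, not specifications taken from the paper.

```python
import random

def make_sysadmin_mdp(neighbors, rewards, horizon, p1=0.7, p2=0.1, p3=0.01):
    """SysAdmin as a SimulatedMDP: neighbors[b] lists the machines connected to b,
    rewards[b] is r(b), and a state is a tuple of 0/1 machine indicators."""
    B = len(rewards)

    def actions(state):
        return list(range(B + 1))       # reboot machine 0..B-1, or B = reboot nothing

    def step(state, action, w):
        rng = random.Random(w)          # derive a random stream from w ~ U(0, 1)
        nxt = []
        for b, up in enumerate(state):
            if action == b:             # rule 3: rebooted machine
                nxt.append(0 if rng.random() < p3 else 1)
            elif up == 0:               # rule 1: faulted and not rebooted
                nxt.append(0)
            else:                       # rule 2: working and not rebooted
                faulted_neighbor = any(state[nb] == 0 for nb in neighbors[b])
                nxt.append(0 if rng.random() < (p1 if faulted_neighbor else p2) else 1)
        return tuple(nxt)

    def reward(state, action, w):
        # One convention: collect r(b) for every machine currently online.
        return sum(rewards[b] for b, up in enumerate(state) if up == 1)

    return SimulatedMDP(actions, step, reward, horizon)
```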
Typically, in the literature, the SysAdmin problem is modeled as a finite horizon discrete-time MDP with total (undiscounted) reward criteria. Due to the size of the state space, it is often cited in articles related to factored MDPs (c.f. [18]).

4.2. Results

We consider two network topologies: a ring configuration and a star configuration. In the ring configuration, each machine $b \in B$ is connected to the two neighboring machines $b-1$ and $b+1$ (arithmetic taken modulo $|B|$). In the star configuration, there is one central server, and all other machines are connected only to the central server. The two topologies are depicted in Figure 1.
In addition to REGA and RASA, we also consider a modification to REGA that is consistent with the original $\epsilon$-greedy method of [10] that is heavily cited in the literature, where the denominator $\sqrt{m}$ in Equation (1) is replaced with $m$. We refer to this method as original REGA (OREGA). We also consider a trivial purely greedy strategy (PGS), where each action is sampled exactly once, after which each iteration of the algorithm plays the machine with the highest observed average reward. Using custom C++ code running on a quad-core CPU and 12 GB of RAM, and allowing each run to execute for one minute, we simulated REGA, RASA, OREGA, and PGS against the cycle and star network topologies of 10 machines, with an initial state of all machines being online. The problems have state spaces with $2^{10}$ states and action spaces with 11 actions. We set the horizon of the problem to $H = 3$. The number of simulations was set to a constant across all stages (i.e., $M_1 = M_2 = M_3$). The reward of machine $i$ being online was set to $i$ in order to eliminate any simplifications introduced by symmetry, where trivial strategies can perform well. The machine failure rates were set to $p_1 = 0.7$, $p_2 = 0.1$, and $p_3 = 0.01$. For REGA and OREGA, we set the parameters $c_i$ to $c_1 = c_2 = c_3 = 6$ (the performance of the $\epsilon$-greedy method, and thus of REGA, is generally sensitive to the choice of the values $c_i$; our choice is not necessarily the best, but it performed well). For RASA, we set the tuneable parameter $\mu_i = 1 - 2^{-1/M_i}$, as was chosen in the simulations described in the original paper [3]. Each simulation was repeated 30 times in order to calculate the average reward and standard error for each setting. REGA and RASA achieved similar performances and, as expected, both outperformed the trivial PGS policy. Interestingly, OREGA did not perform as well as REGA and RASA; this will be investigated in more detail in future work. Table 1 and Table 2 summarize the results of our simulations. The same results are plotted in Figure 2 and Figure 3. For both networks, one can verify via value iteration that the optimal value is $V^{*} = 149.93$.
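For orientation only, the snippet below strings the earlier sketches together for a ring of 10 machines with the parameters reported above (the smallest per-stage budget, $M_i = 35$, keeps the run quick). The helper names and the machine-numbering convention are ours, and the paper's own experiments used custom C++ code, so numbers produced this way are illustrative rather than a reproduction of the tables.

```python
# Ring (cycle) topology on 10 machines: neighbors of b are b-1 and b+1 modulo 10.
B = 10
neighbors = {b: [(b - 1) % B, (b + 1) % B] for b in range(B)}
rewards = [b + 1 for b in range(B)]       # machine i earns reward i (numbered 1..10 here)

mdp = make_sysadmin_mdp(neighbors, rewards, horizon=3, p1=0.7, p2=0.1, p3=0.01)
s0 = tuple([1] * B)                       # initial state: all machines online

c = [6, 6, 6]                             # c_i = 6 at every stage
M = [35, 35, 35]                          # equal per-stage sample budgets
print(rega(mdp, 0, s0, c, M))             # one estimate of V_0^*(s0)
```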

5. Discussion

We introduced REGA, a recursive method based on the well-known $\epsilon$-greedy multiarmed bandit algorithm that is useful for numerically solving MDPs with (possibly large) state spaces but small action spaces. Indeed, the curse of dimensionality renders the algorithm most suitable for small finite-horizon settings, but techniques such as rolling-horizon control can aid in such situations. Due to the numerical nature of REGA, it can be of particular interest to use the algorithm to train model-free reinforcement learning and deep reinforcement learning models. Indeed, the algorithm does not impose any restrictions on how the agent interacts with the environment, and as long as the state transition function $f(s, \bar{a}_i, w_m^i)$ and the immediate reward $R(s, \bar{a}_i, w_m^i)$ can be computed (or simulated), the Q-value estimates can be updated as per the REGA algorithm specification. Furthermore, if training the reinforcement learning models can occur offline with less sensitivity to running time, then exploring larger horizon settings may be feasible. We leave the investigation of applying REGA to appropriate popular and/or benchmark prediction and control problems from reinforcement learning as future work.
We provided a finite-time analysis of REGA and showed that the bound on the error induced by the algorithm approaches zero as the number of sample points at each stage of the MDP is increased. The algorithm was tested empirically against the SysAdmin benchmark problem with up to $2^{10}$ states. In our experiments, we found that a modification to the exploration/exploitation trade-off model of the $\epsilon$-greedy method commonly cited in the literature yields improved results in practice, and this will be further investigated in future work. Other directions of future study include constructing a counterpart algorithm that is useful for MDPs with large action spaces and small state spaces. It would also be of interest to carry out a formal theoretical analysis comparing the expected performance of REGA and RASA.

Author Contributions

Methodology, I.M. and J.H.; software, I.M.; formal analysis, I.M. and J.H.; writing—original draft preparation, I.M.; writing—review and editing, I.M.; supervision, J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Lemma 1

Let $x > (c^2+c)^2$, and consider the function $b(x) = -\log\left(1 - \frac{c}{\sqrt{x}}\right)$. We have $\lim_{x\to\infty} b(x) = 0$ and $b'(x) < 0$. Therefore, $b$ is positive and decreasing in $x$, so the definite integral of $b$ can be bounded from below by the right-hand Riemann sum, and from above by the left-hand Riemann sum, yielding
$$0 \;<\; \int_{r+1}^{r+M+1} b(x)\,dx \;\leq\; \sum_{i=r+1}^{r+M} b(i) \;\leq\; \int_{r}^{r+M} b(x)\,dx. \qquad (A1)$$
To prove the first part, we bound the function $b$ from below. The identity $\log(1-y) < -y$ holds for $y \in (0,1)$. Since $c/\sqrt{x} \in (0,1)$, we obtain
$$b(x) > \frac{c}{\sqrt{x}}. \qquad (A2)$$
Combining Equations (A1) and (A2) yields
$$\sum_{i=r+1}^{r+M} b(i) \;\geq\; \int_{r+1}^{r+M+1} b(x)\,dx \;>\; \int_{r+1}^{r+M+1} \frac{c}{\sqrt{x}}\,dx \;=\; 2c\left(\sqrt{r+M+1} - \sqrt{r+1}\right) \;=\; \frac{2cM}{\sqrt{r+M+1} + \sqrt{r+1}} \;>\; \frac{cM}{\sqrt{r+M+1}},$$
which proves the first part. We now turn our attention to the second part of our claim. For brevity, denote $u = \frac{(c^2+c)(\sqrt{x}-c)}{\sqrt{x}-c^2-c}$. Basic algebraic steps show that
$$-\log\left(1-\frac{c}{\sqrt{x}}\right) \;=\; \log\frac{(u-c)(c+1)}{c\,u} \;=\; \log\left(1-\frac{c}{u}\right) - \log\left(1-\frac{c}{c^2+c}\right) \;=\; \int_{c^2+c}^{u} \frac{c}{t(t-c)}\,dt \;\leq\; \int_{c^2+c}^{u} \frac{c+1}{t^2}\,dt \;=\; \frac{c}{\sqrt{x}-c} \;<\; \frac{c+1}{\sqrt{x}}. \qquad (A3)$$
Combining Equations (A1) and (A3) yields
$$\sum_{i=r+1}^{r+M} b(i) \;\leq\; \int_{r}^{r+M} b(x)\,dx \;\leq\; \int_{r}^{r+M} \frac{c+1}{\sqrt{x}}\,dx \;=\; 2(c+1)\left(\sqrt{r+M} - \sqrt{r}\right) \;=\; \frac{2(c+1)M}{\sqrt{r+M} + \sqrt{r}} \;<\; \frac{(c+1)M}{\sqrt{r}}.$$
This completes the proof of Equation (3). We now prove Equation (4). Squaring both sides of Equation (A3) yields
$$\log^2\left(1-\frac{c}{\sqrt{x}}\right) \;<\; \frac{(c+1)^2}{x}.$$
Therefore,
$$\sum_{i=r+1}^{r+M} \log^2\left(1-\frac{c}{\sqrt{i}}\right) \;<\; \sum_{i=r+1}^{r+M} \frac{(c+1)^2}{i}.$$
However, $1/x$ is positive and decreasing for $x > 0$; therefore, its definite integral can be bounded from below by the function's right-hand Riemann sum. Hence,
$$\sum_{i=r+1}^{r+M} \frac{(c+1)^2}{i} \;\leq\; \int_{r}^{r+M} \frac{(c+1)^2}{x}\,dx \;=\; (c+1)^2\log\left(1+\frac{M}{r}\right),$$
which completes the proof.

References

  1. Hernández-Lerma, O. Adaptive Markov Control Processes; Applied Mathematical Sciences; Springer: New York, NY, USA, 1989. [Google Scholar]
  2. Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st ed.; John Wiley & Sons, Inc.: New York, NY, USA, 1994. [Google Scholar]
  3. Chang, H.S.; Fu, M.C.; Hu, J.; Marcus, S.I. Recursive Learning Automata Approach to Markov Decision Processes. IEEE Trans. Automat. Control 2007, 52, 1349–1355. [Google Scholar] [CrossRef]
  4. Chang, H.S.; Fu, M.C.; Hu, J.; Marcus, S.I. Simulation-Based Algorithms for Markov Decision Processes; Communications and Control Engineering; Springer: London, UK, 2007. [Google Scholar]
  5. Hu, J.; Chang, H.S. An Approximate Stochastic Annealing algorithm for finite horizon Markov decision processes. In Proceedings of the 2010 49th IEEE Conference on Decision and Control (CDC), Atlanta, GA, USA, 15–17 December 2010; pp. 5338–5343. [Google Scholar] [CrossRef]
  6. Bertsekas, D.P.; Tsitsiklis, J.N. Neuro-Dynamic Programming, 1st ed.; Athena Scientific: Nashua, NH, USA, 1996. [Google Scholar]
  7. Santharam, G.; Sastry, P.S. A Reinforcement Learning Neural Network for Adaptive Control of Markov Chains. Trans. Syst. Man Cybern.—Part A Syst. Hum. 1997, 27, 588–600. [Google Scholar] [CrossRef]
  8. Watkins, C.J.C.H. Learning from Delayed Rewards. Ph.D. Thesis, King’s College, Cambridge, UK, 1989. [Google Scholar]
  9. Watkins, C.; Dayan, P. Technical Note: Q-Learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  10. Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time Analysis of the Multiarmed Bandit Problem. Mach. Learn. 2002, 47, 235–256. [Google Scholar] [CrossRef]
  11. Muqattash, I.M. Multi-Armed Bandits with Applications to Markov Decision Processes and Scheduling Problems. Ph.D. Thesis, Stony Brook University, Stony Brook, NY, USA, 2014. [Google Scholar]
  12. Auer, P.; Cesa-Bianchi, N.; Freund, Y.; Schapire, R.E. The Nonstochastic Multiarmed Bandit Problem. SIAM J. Comput. 2003, 32, 48–77. [Google Scholar] [CrossRef]
  13. Bubeck, S.; Cesa-Bianchi, N. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Found. Trends Mach. Learn. 2012, 5, 1–122. [Google Scholar] [CrossRef] [Green Version]
  14. Pandey, S.; Chakrabarti, D.; Agarwal, D. Multi-armed Bandit Problems with Dependent Arms. In ICML ’07: Proceedings of the 24th International Conference on Machine Learning; ACM: New York, NY, USA, 2007; pp. 721–728. [Google Scholar] [CrossRef] [Green Version]
  15. Serfling, R. Some Elementary Results on Poisson Approximation in a Sequence of Bernoulli Trials. SIAM Rev. 1978, 20, 567–579. [Google Scholar] [CrossRef]
  16. Topsøe, F. Some Bounds for the Logarithmic Function. Res. Rep. Collect. 2004, 7, 1–20. [Google Scholar]
  17. Hernández-Lerma, O.; Lasserre, J.B. Error bounds for rolling horizon policies in discrete-time Markov control processes. IEEE Trans. Autom. Control 1990, 35, 1118–1124. [Google Scholar] [CrossRef]
  18. Guestrin, C.; Koller, D.; Parr, R.; Venkataraman, S. Efficient Solution Algorithms for Factored MDPs. J. Artif. Int. Res. 2003, 19, 399–468. [Google Scholar] [CrossRef] [Green Version]
Figure 1. Example network topologies.
Figure 2. Performance of REGA, RASA, and PGS against a cycle network configuration of 10 machines with machine failure probabilities $p_1 = 0.7$, $p_2 = 0.1$, $p_3 = 0.01$.
Figure 3. Performance of REGA, RASA, and PGS against a star network configuration of 10 machines with machine failure probabilities $p_1 = 0.7$, $p_2 = 0.1$, $p_3 = 0.01$.
Table 1. Performance of REGA, RASA, OREGA, and PGS against a cycle network configuration of 10 machines with machine failure probabilities $p_1 = 0.7$, $p_2 = 0.1$, $p_3 = 0.01$ (average total reward over 30 repetitions; standard errors in parentheses). The optimal value is $V^{*} = 149.93$.
$M_i$   REGA           RASA           OREGA          PGS
35      160.8 (0.99)   158.7 (0.85)   161.0 (0.92)   139.8 (0.77)
75      151.0 (0.81)   150.5 (0.84)   149.6 (1.03)   135.1 (0.43)
100     148.1 (0.88)   147.1 (0.65)   143.7 (0.58)   132.7 (0.51)
125     144.6 (0.43)   145.0 (0.52)   140.2 (0.62)   132.3 (0.42)
150     144.5 (0.52)   144.5 (0.73)   138.7 (0.49)   131.4 (0.33)
175     142.6 (0.50)   141.8 (0.46)   137.0 (0.36)   131.2 (0.33)
200     142.1 (0.41)   141.9 (0.56)   136.3 (0.31)   131.8 (0.35)
225     140.7 (0.42)   141.5 (0.57)   135.5 (0.29)   130.3 (0.27)
250     140.7 (0.39)   140.8 (0.48)   135.0 (0.27)   131.2 (0.39)
Table 2. Performance of REGA, RASA, OREGA, and PGS against a star network configuration of 10 machines with machine failure probabilities $p_1 = 0.7$, $p_2 = 0.1$, $p_3 = 0.01$ (average total reward over 30 repetitions; standard errors in parentheses). The optimal value is $V^{*} = 149.93$.
$M_i$   REGA           RASA           OREGA          PGS
35      162.7 (0.58)   159.5 (0.93)   163.6 (0.34)   145.2 (0.76)
75      153.6 (0.79)   152.4 (0.73)   152.9 (0.65)   139.2 (0.46)
100     149.6 (0.52)   149.9 (0.60)   146.7 (0.58)   138.4 (0.47)
125     149.4 (0.61)   147.3 (0.57)   143.5 (0.50)   137.8 (0.40)
150     147.8 (0.44)   148.0 (0.47)   142.4 (0.32)   138.0 (0.37)
175     146.1 (0.48)   146.3 (0.41)   141.7 (0.39)   137.9 (0.46)
200     145.8 (0.51)   145.9 (0.50)   141.3 (0.29)   136.4 (0.40)
225     145.4 (0.40)   145.0 (0.49)   140.5 (0.27)   137.0 (0.27)
250     145.1 (0.37)   144.7 (0.48)   140.3 (0.26)   136.5 (0.37)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

