Article

Multi-Gear Bandits, Partial Conservation Laws, and Indexability †

Department of Statistics, Carlos III University of Madrid, 28903 Getafe, Spain
This paper was presented by the author at the 32nd European Conference on Operational Research (EURO 2022), Espoo, Finland, 3–6 July 2022. An earlier preliminary draft was presented by the author at the 3rd International Conference on Performance Evaluation Methodologies and Tools (ValueTools 2008), Athens, Greece, 20–24 October 2008, where an extended abstract was published in its proceedings.
Mathematics 2022, 10(14), 2497; https://doi.org/10.3390/math10142497
Submission received: 31 May 2022 / Revised: 13 July 2022 / Accepted: 14 July 2022 / Published: 18 July 2022
(This article belongs to the Section Probability and Statistics)

Abstract:
This paper considers what we propose to call multi-gear bandits, which are Markov decision processes modeling a generic dynamic and stochastic project fueled by a single resource and which admit multiple actions representing gears of operation naturally ordered by their increasing resource consumption. The optimal operation of a multi-gear bandit aims to strike a balance between project performance costs or rewards and resource usage costs, which depend on the resource price. A computationally convenient and intuitive optimal solution is available when such a model is indexable, meaning that its optimal policies are characterized by a dynamic allocation index (DAI), a function of state–action pairs representing critical resource prices. Motivated by the lack of general indexability conditions and efficient index-computing schemes, and focusing on the infinite-horizon finite-state and -action discounted case, we present a verification theorem ensuring that, if a model satisfies two proposed PCL-indexability conditions with respect to a postulated family of structured policies, then it is indexable and such policies are optimal, with its DAI being given by a marginal productivity index computed by a downshift adaptive-greedy algorithm in A N steps, with A + 1 actions and N states. The DAI is further used as the basis of a new index policy for the multi-armed multi-gear bandit problem.

1. Introduction

There is a substantial literature analyzing the indexability of infinite-horizon discrete-time binary-action Markov decision processes (MDP), i.e., their optimal solution by index policies, starting with the seminal work of Bellman in [1] on a Bernoulli bandit model. Such MDPs can be interpreted as models of a generic dynamic and stochastic project, which can be operated in a passive or an active mode. We propose to refer to such operating modes as gears, to reflect their natural ordering by increasing activity level.
The model in [1] had the property that, while the project is passive, its state does not change, which corresponds to a classic bandit setting. It was later shown that, given a finite collection of classic bandits, one of which must be active at each time, the policy that maximizes the expected total discounted reward in such a multi-armed bandit problem has a remarkably simple structure which overcomes the curse of dimensionality of a standard dynamic programming approach: it suffices to evaluate what Gittins and Jones [2] called the dynamic allocation index (DAI), later known as the Gittins index, of each project, which is a function of its state, and then activate at each time a project with largest index. See, e.g., the seminal work of Gittins and Jones [2] and Gittins [3], the monograph by Gittins [4], and alternative proofs by Whittle [5], Weber [6], and Bertsimas and Niño-Mora [7].
The assumption in classic bandit models that passive projects do not change state was removed by Whittle in [8], introducing restless bandits. That paper further introduced an index for restless bandits (the Whittle index), which characterizes the optimal operation of a single restless project and provides a suboptimal heuristic policy for scheduling multiple such projects, in the so-called multi-armed restless bandit problem. The latter has huge modeling power but is computationally intractable, and Whittle’s index policy has proven effective in an ever-increasing variety of models for multifarious applications. Thus, e.g., to name a few, scheduling multi-class make-to-stock queues [9], scheduling multi-class queues with finite buffers [10], admission control and routing to parallel queues with reneging [11], obsolescence mitigation strategies [12], sensor scheduling and dynamic channel selection [13,14,15,16], group maintenance [17], multi-target tracking with Kalman filter dynamics [18,19], scheduling multi-armed bandits with switching costs [20] or switching delays [21], the dynamic prioritization of medical treatments or interventions [22,23], and resource allocation with varying requests and with resources shared by multiple requests [24].
Yet, while the Gittins index is well defined for any classic bandit, the Whittle index only exists for some restless bandits, called indexable. Whittle pointed out in [8] the need of finding sufficient conditions for indexability (the existence of the index).
While researchers have deployed a wide variety of ingenious ad hoc techniques for proving the indexability of particular restless bandit models and computing their Whittle indices, the author has developed over the last two decades in a series of papers a systematic approach to accomplish such goals. Thus, Ref. [25] introduced a framework for establishing both the indexability of a general finite-state restless bandit and the optimality of a postulated family of structured policies, based on the concept of partial conservation laws (PCLs), also introduced there. If project performance metrics satisfy so-called PCL-indexability conditions, then the project’s Whittle index can be efficiently computed in N steps, where N is the number of states, by an adaptive-greedy index algorithm. This algorithm is an extension of the classic index-computing algorithm of Klimov [26] for computing the indices that characterize the optimal policy for scheduling a multi-class queue with feedback. Note that Klimov’s algorithm was adapted in [7] for computing the Gittins index, based on a framework of generalized conservation laws.
PCLs extend classical conservation laws in stochastic scheduling. See, e.g., the conservation laws in Coffman and Mitrani [27], the strong conservation laws in Shanthikumar and Yao [28], and the generalized conservation laws in the work of Bertsimas and Niño-Mora [7].
The author further developed the PCL framework for analyzing the indexability of finite-state restless bandits in [29], which introduced projects fueled by a generic resource with a general resource consumption function. The framework was extended to the countably infinite state space case in [30] and to the bias optimality criterion in [10]. Such early work is surveyed in the discussion paper  [31]. The framework was then extended to projects with a continuous real state in [32], motivated by sensor scheduling applications. As for the adaptive-greedy algorithm for the Whittle index, an efficient computational implementation was presented in [33].
The extension of the concept of indexability from two-gear bandits to multi-gear bandits was first outlined by Weber [34] in the setting of an illustrative example given by a three-action queueing admission control model. Ref. [34] further outlined an index-computing algorithm for such a three-action model extending the aforementioned adaptive-greedy algorithm for the Whittle index. Such an insightful outline was, however, not theoretically supported.
The author formalized and extended Weber’s [34] outline to introduce in [35] a general multi-armed multi-mode bandit problem with finite-state projects, motivated by a model of optimal dynamic power allocation to multiple users sharing a wireless downlink communication channel subject to a peak energy constraint, proposing an index policy when individual projects are indexable. Ref. [35] further outlined an extension of the PCL framework along with an index algorithm, yet without proofs or analyses.
See also the recent work of Zayas-Cabán et al. [36] on a finite-horizon multi-armed multi-gear bandit problem and that of Killian et al. [37] on multi-action bandits, yet without a focus on indexability.
A related strand of research is the work on the optimality of structured policies in MDP models, mostly focusing on the monotonicity of optimal actions on the state. See, e.g., Serfozo [38] and the book chapter by Heyman and Sobel ([39] Ch. 8). In some models with a one-dimensional state, most notably those arising in queueing theory, researchers have established the optimality of policies given by multiple thresholds. Thus, e.g., Crabill [40] shows that such policies are optimal for selecting the speed of a single server catering to a queue. Such a model is an example of what we call here a multi-gear bandit. Similar results are obtained, e.g., in Sabeti [41], Ata and Shneorson [42], and Mayorga et al. [43]. The methods proposed herein also serve to establish the optimality of postulated families of structured policies, different from prevailing approaches, typically based on submodularity.
Another related line of work is the computational complexity of solving discounted finite-state and -action MDPs with general-purpose algorithms, most notably the classical methods of value iteration, policy iteration, and linear optimization. Such methods are not strongly polynomial in that the number of iterations required to compute an optimal policy depends not only on the number of states and actions but also on other factors, most notably the discount factor. Thus, e.g., Ye [44] showed that the number of iterations required by policy iteration is bounded by O((N²A/(1 − β)) log(N/(1 − β))), which shows that policy iteration is strongly polynomial but only if the discount factor is fixed and hence not part of the input. Such a bound has been improved by Scherrer [45] to O((NA/(1 − β)) log(1/(1 − β))). Otherwise, policy iteration is known to have exponential worst-case complexity. See Hollanders et al. [46] and references therein. In contrast, as we will see, the algorithm presented herein solves a multi-gear bandit model with N states and A + 1 actions in precisely AN steps and is hence a strongly polynomial algorithm.
In contrast with the aforementioned outlines in the earlier work [34,35], this paper presents a theoretically supported extension of the PCL-based sufficient indexability conditions from two-gear bandits to multi-gear bandits, along with an intuitive and efficient algorithm for computing the model’s index.
The main contribution is a verification theorem (Theorem 1) which ensures that, if the performance metrics of a multi-gear bandit model under a postulated family of structured policies satisfy two PCL-indexability conditions, then both the model is indexable and such policies are optimal, with the model’s index being computed by a downshift adaptive-greedy algorithm in A N steps as pointed out above.
The remainder of the paper is organized as follows. Section 2 describes the multi-gear bandit model and formulates the main result, the verification theorem for indexability. Section 3, Section 4, Section 5 and Section 6 lay out the groundwork needed to prove the verification theorem. Thus, Section 3 discusses the linear optimization reformulation of the relevant MDP model. Section 4 presents the required relations between project performance metrics. Section 5 analyzes the output of the proposed index-computing algorithm. Section 6 presents the framework of partial conservation laws in the present setting. Then, Section 7 draws on the above to present our proof of the verification theorem. Section 8 applies the indexability property to provide a performance bound and a novel index policy for the multi-armed multi-gear bandit problem. Section 9 outlines some extensions, in particular to the long-run average cost criterion (Section 9.1), to models with uncontrollable states (Section 9.2) and to models with a countably infinite state space (Section 9.3). Finally, Section 10 concludes with a discussion of the results.

2. Preliminaries and Formulation of the Main Result

2.1. Multi-Gear Bandits

We next describe a general MDP model for the optimal operation of a multi-gear dynamic and stochastic project, which we call the multi-gear bandit problem. Consider a general discrete-time infinite-horizon discounted MDP model of a controlled dynamic and stochastic project that consumes a single resource. At the start of each time period t = 0, 1, …, the controller observes the current project state s(t), which moves through the finite state space N ≜ {1, …, N}, and then selects an action a(t) from the finite action space A = {0, 1, …, A}. The choice of action at each time t is based on a possibly randomized function of the system's history H(t) ≜ {s(t)} ∪ {(s(t′), a(t′)): t′ = 0, …, t − 1}, consisting of the current state s(t) and the previous states visited and actions taken, if any. This corresponds to adopting a control policy (policy for short) π from the class Π of history-dependent randomized policies (see ([47] Sec. 2.1.5)). We will call such policies admissible.
We will refer to action 0 as the passive action, as it models the project’s evolution in the absence of control, and to 1, …, A as the active actions. Such actions model distinct gears for operating the project which are naturally ordered by their increasing resource consumption. Henceforth, we will use the terms action and gear interchangeably.
When the project occupies state s(t) = i at the start of a period and action a(t) = a is selected, it incurs a holding cost h_i^a and consumes a quantity q_i^a of the resource in the period, time-discounted with factor 0 < β < 1. Then, the project state moves in a Markovian fashion from s(t) = i to s(t + 1) = j with probability p_{ij}^a.
Consistently with the interpretation of actions a as operating gears ordered by increasing activity levels, we shall assume that higher gears entail larger resource consumptions, so the resource consumption q_i^a is monotone increasing in the gear a for each state i:
0 ≤ q_i^0 < q_i^1 < ⋯ < q_i^A,  i ∈ N.
Intuitively, to compensate for their larger resource consumptions, higher gears should be more beneficial in some sense than lower gears, e.g., they might tend to drive the project towards less costly states or yield lower holding costs.
We further introduce a scalar parameter λ ∈ ℝ modeling the resource unit price. Note that λ could take negative values, in which case it would represent a subsidy for using the resource. We shall consider the project’s λ-price problem, which is to find an admissible project operating policy π(λ) minimizing the expected total discounted holding and resource usage cost for any initial state. Writing as E_i^π[·] the expectation under policy π starting from state i, we denote by
V_i(λ, π) ≜ E_i^π[ Σ_{t=0}^∞ (h_{s(t)}^{a(t)} + λ q_{s(t)}^{a(t)}) β^t ]
the corresponding expected total discounted cost incurred by the project when resource usage is charged at price λ. The resulting optimal (project) cost function is
V_i(λ) ≜ inf { V_i(λ, π) : π ∈ Π },  i ∈ N.
We can thus formulate the project’s λ-price problem as
(P_λ)  find π(λ) ∈ Π :  V_i(λ, π(λ)) = V_i(λ),  i ∈ N.
We shall refer to a policy π ( λ ) solving the λ -price problem ( P λ ) as a λ-optimal policy.
Denoting by Π^SD the class of stationary deterministic policies (see [47] (Sec. 2.1.5)), which base action choice on the current state only, standard results in MDP theory (see [47] (Theorem 6.2.10.a)) ensure the existence of a λ-optimal policy π(λ) ∈ Π^SD. Both the optimal cost function V_i(λ) and the optimal stationary deterministic policies for the λ-price problem (P_λ) are determined by Bellman’s discounted-cost optimality equations
V_i(λ) = min_{a ∈ A} { h_i^a + λ q_i^a + β Σ_{j∈N} p_{ij}^a V_j(λ) },  i ∈ N.
It is well known that the optimal cost function V i ( λ ) is the unique solution to such equations and that a stationary deterministic policy is optimal iff (i.e., if and only if) it selects an action attaining the minimum in the right-hand side of (4) for each state i. We shall also call such actions λ -optimal. For fixed λ , the Bellman equations can be solved numerically by classical methods, such as value iteration, policy iteration, and linear optimization. See, e.g., ([47] Sec. 6).
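For a fixed λ, the following minimal sketch (Python/NumPy; the array names h, q, P and the toy instance are our own illustrative assumptions, not data from the paper) solves the Bellman Equation (4) by value iteration and reads off a greedy λ-optimal stationary deterministic policy.

```python
import numpy as np

def solve_lambda_price(h, q, P, lam, beta, tol=1e-10):
    """Value iteration for the lambda-price problem (P_lambda).

    h, q : arrays of shape (N, A+1) holding h_i^a and q_i^a.
    P    : array of shape (A+1, N, N); P[a] is the transition matrix under gear a.
    Returns the (approximately) optimal cost vector and a greedy stationary policy.
    """
    N, _ = h.shape
    V = np.zeros(N)
    while True:
        # Q[i, a] = h_i^a + lam * q_i^a + beta * sum_j p_ij^a V_j
        Q = h + lam * q + beta * np.einsum("aij,j->ia", P, V)
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmin(axis=1)
        V = V_new

# Toy instance: N = 4 states, A + 1 = 3 gears, with q_i^0 <= q_i^1 <= q_i^2.
rng = np.random.default_rng(0)
N, A1 = 4, 3
h = rng.uniform(0.0, 10.0, size=(N, A1))
q = np.sort(rng.uniform(0.0, 5.0, size=(N, A1)), axis=1)
P = rng.dirichlet(np.ones(N), size=(A1, N))  # P[a, i] is a pmf over next states j
V, policy = solve_lambda_price(h, q, P, lam=1.0, beta=0.9)
print(V, policy)
```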

2.2. Indexability

Instead of solving the λ-price problem for specific values of the parameter λ, we shall pursue an alternative approach aiming at a complete understanding of optimal policies over the entire parameter space, by fully characterizing the optimal actions for the parametric collection P ≜ {P_λ : λ ∈ ℝ} of all λ-price problems. Such a characterization will be given in terms of critical parameter values λ_{i,a}, as defined next.
Note that, below and throughout the paper, we use the standard abbreviation iff for if and only if.
Definition 1
(Indexability and DAI). We call the above multi-gear bandit model indexable if there exist critical resource prices λ_{i,a} for every state i and active action (gear) a ≥ 1 satisfying λ_{i,A} ≤ ⋯ ≤ λ_{i,1}, such that, for any such state and resource price λ ∈ ℝ: (i) action 0 is λ-optimal in state i iff λ ≥ λ_{i,1}; (ii) action 1 ≤ a ≤ A − 1 is λ-optimal in state i iff λ_{i,a+1} ≤ λ ≤ λ_{i,a}; and (iii) action A is λ-optimal in state i iff λ ≤ λ_{i,A}. We call λ_{i,a} the model’s dynamic allocation index (DAI), viewed as a function of (i, a).
Remark 1.
(i) 
The definition of indexability for multi-action bandits was first outlined by Weber [34] in the setting of a three-action project model and was first formalized by the author [35] in the general setting considered herein. The latter paper further introduced the multi-armed multi-mode bandit problem and proposed to use the above DAI as the basis for a heuristic index policy for it. The concept of indexability for two-gear (active/passive) bandits has its roots in the work of Bellman [1], where he characterized the optimal policies for operating a Bernoulli bandit in terms of critical parameter values. Gittins and Jones [2] showed that such critical values (later known as Gittins indices) provide a tractable optimal policy for the classic multi-armed bandit problem, involving the optimal sequential activation of a collection of two-gear bandits that do not change state when passive. The idea of indexability was extended by Whittle [8] to two-gear bandits that can change state when passive, called restless bandits. He also proposed to use the corresponding Whittle index policy as a heuristic for the intractable multi-armed restless bandit problem, when the individual bandits (projects) are indexable, which need not be the case as there are nonindexable bandits.
(ii) 
Writing as V_i(λ, ⟨a, ∗⟩) ≜ h_i^a + λ q_i^a + β Σ_{j∈N} p_{ij}^a V_j(λ) the optimal cost function with initial action a, indexability means that there exist critical prices λ_{i,a} as in Definition 1 such that, for each state i,
V_i(λ, ⟨0, ∗⟩) ≤ V_i(λ, ⟨a, ∗⟩) for a ≥ 1  ⟺  λ ≥ λ_{i,1};
for 0 < a < A:  V_i(λ, ⟨a, ∗⟩) ≤ V_i(λ, ⟨a′, ∗⟩) for a′ ≠ a  ⟺  λ_{i,a+1} ≤ λ ≤ λ_{i,a};
V_i(λ, ⟨A, ∗⟩) ≤ V_i(λ, ⟨a, ∗⟩) for a < A  ⟺  λ ≤ λ_{i,A}.
(iii) 
In intuitive terms, when an indexable project model occupies state i, it is λ-optimal to select the lowest (passive) gear 0 iff the resource is expensive enough (λ ≥ λ_{i,1}); it is λ-optimal to select the highest gear A iff the resource is cheap enough (λ ≤ λ_{i,A}); and it is λ-optimal to select the intermediate gear 0 < a < A iff the resource price lies between the critical prices λ_{i,a+1} and λ_{i,a}.
(iv) 
In an indexable model, λ_{i,a} is the unique critical resource price λ for which gears a − 1 and a are both λ-optimal in state i; hence, it is the unique solution to the equation
V_i(λ, ⟨a − 1, ∗⟩) = V_i(λ, ⟨a, ∗⟩).
Yet, note that for a nonindexable model, Equation (6) need not have a unique solution.
(v) 
Let A_i(λ) be the set of λ-optimal actions in state i. If the model is indexable then, for each active action a ≥ 1, A_i(λ) ∩ {a, …, A} ≠ ∅ (i.e., there is an optimal action greater than or equal to a) iff λ_{i,a} ≥ λ, and A_i(λ) ∩ {0, …, a − 1} ≠ ∅ (i.e., there is an optimal action less than a) iff λ_{i,a} ≤ λ.
We shall address the following research goals: (1) identify sufficient conditions ensuring that the above multi-gear bandit model is indexable and (2) for models satisfying such conditions, provide an efficient means of computing the DAI.
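As a small illustration of the structure in Definition 1, the following hypothetical helper (ours, not the paper's) maps a resource price λ to a λ-optimal gear in a fixed state of an indexable model, given that state's critical prices λ_{i,1} ≥ ⋯ ≥ λ_{i,A}.

```python
def optimal_gear(lam, dai):
    """Return a lambda-optimal gear for one state of an indexable model.

    dai : list [lam_{i,1}, ..., lam_{i,A}] of critical prices for the state,
          nonincreasing in the gear (lam_{i,A} <= ... <= lam_{i,1}).
    Per Definition 1: gear 0 is optimal iff lam >= lam_{i,1}; gear A iff
    lam <= lam_{i,A}; gear a (0 < a < A) iff lam_{i,a+1} <= lam <= lam_{i,a}.
    """
    A = len(dai)
    gear = 0
    for a in range(1, A + 1):
        # Highest gear a whose critical price lam_{i,a} is still at least lam.
        if lam <= dai[a - 1]:
            gear = a
    return gear

print(optimal_gear(0.5, dai=[2.0, 1.0, 0.2]))  # gear 2, since 0.2 <= 0.5 <= 1.0
```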

2.3. Project Performance Metrics and Their Characterization

To formulate the main result and facilitate the required analyses, we consider certain project performance metrics. We measure the holding cost incurred by the project under a policy π ∈ Π starting from the initial-state distribution s(0) ∼ p = (p_i)_{i∈N}, so P{s(0) = i} = p_i for i ∈ N, by the (holding) cost metric
F_p(π) = E_p^π[ Σ_{t=0}^∞ h_{s(t)}^{a(t)} β^t ],
where E_p^π[·] denotes expectation under policy π starting from s(0) ∼ p. Similarly, we measure the corresponding resource usage by the resource (usage) metric
G_p(π) = E_p^π[ Σ_{t=0}^∞ q_{s(t)}^{a(t)} β^t ].
When s(0) = i, we write F_i(π) and G_i(π). Note that F_p(π) = Σ_i p_i F_i(π) and G_p(π) = Σ_i p_i G_i(π). As for the total cost metric V_i(λ, π) in (2), we can thus express it as
V_i(λ, π) = F_i(π) + λ G_i(π).
We shall similarly write V_p(λ, π) when s(0) ∼ p.
We next address the characterization of such metrics for stationary deterministic policies. We will represent any such policy by the partition S = (S_a)_{a∈A} = (S_0, …, S_A) it naturally induces on the state space N, where S_a is the subset of states where the policy selects gear a. We shall refer to it as the S-policy or policy S.
The performance metrics F_i(S) and G_i(S) for the S-policy are thus characterized as the unique solutions to the linear equation systems
F_i(S) = h_i^a + β Σ_{j∈N} p_{ij}^a F_j(S),  i ∈ S_a, a ∈ A,
and
G_i(S) = q_i^a + β Σ_{j∈N} p_{ij}^a G_j(S),  i ∈ S_a, a ∈ A.
In the sequel we will find it convenient to use vector notation, denoting vectors and matrices in boldface and writing, e.g., F(S) = (F_i(S))_{i∈N} and F_B(S) = (F_i(S))_{i∈B} for B ⊆ N, and similarly for G(S), h^a and q^a. We will also write P^a = (p_{ij}^a)_{i,j∈N}, P_{BB′}^a = (p_{ij}^a)_{i∈B, j∈B′} and P_{B·}^a = (p_{ij}^a)_{i∈B, j∈N} for B, B′ ⊆ N.
The above equations characterizing F(S) and G(S) are thus formulated as
F_{S_a}(S) = h_{S_a}^a + β P_{S_a·}^a F(S),  a ∈ A,
and
G_{S_a}(S) = q_{S_a}^a + β P_{S_a·}^a G(S),  a ∈ A.
We shall further consider corresponding marginal metrics. Denote by ⟨a, S⟩ the policy that selects gear a at time t = 0 and then adopts the S-policy thereafter. Note that
F_i(⟨a, S⟩) = h_i^a + β Σ_{j∈N} p_{ij}^a F_j(S)
and
G_i(⟨a, S⟩) = q_i^a + β Σ_{j∈N} p_{ij}^a G_j(S).
For given actions a′ ≠ a, we define the marginal (holding) cost metric
f_i^{a′,a}(S) ≜ F_i(⟨a′, S⟩) − F_i(⟨a, S⟩) = h_i^{a′} − h_i^a + β Σ_{j∈N} p_{ij}^{a′} F_j(S) − β Σ_{j∈N} p_{ij}^a F_j(S),
which measures the decrement in the holding cost metric that results from the shifting of the initial gear from a′ to a starting from state i, provided that the S-policy is followed thereafter.
We also define the marginal resource (usage) metric
g_i^{a′,a}(S) ≜ G_i(⟨a, S⟩) − G_i(⟨a′, S⟩) = q_i^a − q_i^{a′} + β Σ_{j∈N} p_{ij}^a G_j(S) − β Σ_{j∈N} p_{ij}^{a′} G_j(S),
which measures the corresponding increment in the resource metric.
In vector notation, we can write the above identities as
f^{a′,a}(S) ≜ F(⟨a′, S⟩) − F(⟨a, S⟩) = h^{a′} − h^a + β (P^{a′} − P^a) F(S)
and
g^{a′,a}(S) ≜ G(⟨a, S⟩) − G(⟨a′, S⟩) = q^a − q^{a′} + β (P^a − P^{a′}) G(S).
If g_i^{a′,a}(S) > 0 for certain i, a′, a, and S, we further define the marginal productivity (MP) metric as the ratio of the marginal cost metric to the marginal resource metric:
m_i^{a′,a}(S) ≜ f_i^{a′,a}(S) / g_i^{a′,a}(S).
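The metrics above are straightforward to evaluate numerically. The sketch below (Python/NumPy; function and argument names are ours) solves the linear systems (7)–(10) for F(S) and G(S) and then forms the marginal and MP metrics (11)–(13) for a chosen pair of gears; it is an illustrative transcription of the definitions, not code from the paper.

```python
import numpy as np

def policy_metrics(h, q, P, S, beta):
    """Cost and resource metrics F(S), G(S) for a stationary policy S.

    S[i] is the gear selected in state i; h, q have shape (N, A+1),
    P has shape (A+1, N, N).  Solves (I - beta * P_S) F = h_S, likewise for G.
    """
    N = len(S)
    P_S = P[S, np.arange(N), :]            # row i is the row i of P^{S[i]}
    M = np.eye(N) - beta * P_S
    F = np.linalg.solve(M, h[np.arange(N), S])
    G = np.linalg.solve(M, q[np.arange(N), S])
    return F, G

def marginal_metrics(h, q, P, S, beta, a_lo, a_hi):
    """Marginal metrics f^{a_lo,a_hi}(S), g^{a_lo,a_hi}(S) and their ratio m."""
    F, G = policy_metrics(h, q, P, S, beta)
    f = h[:, a_lo] - h[:, a_hi] + beta * (P[a_lo] - P[a_hi]) @ F
    g = q[:, a_hi] - q[:, a_lo] + beta * (P[a_hi] - P[a_lo]) @ G
    return f, g, f / g
```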
We next present a preliminary result, on which we draw later on, establishing further relations between metrics F(S) and f^{a′,a}(S) and between G(S) and g^{a′,a}(S). Note that, below, 0_{S_a} denotes a vector of zeros with components indexed by S_a.
Lemma 1.
For any stationary deterministic policy S and action a ,
(a) 
(I − β P^a) F(S) − h^a = ( (f_{S_{a′}}^{a′,a}(S))_{a′ ∈ A∖{a}}, 0_{S_a} );
(b) 
q^a − (I − β P^a) G(S) = ( (g_{S_{a′}}^{a′,a}(S))_{a′ ∈ A∖{a}}, 0_{S_a} ).
Proof. 
(a) For a′ ≠ a, using in turn (7) and (11) we obtain
f_{S_{a′}}^{a′,a}(S) = h_{S_{a′}}^{a′} − h_{S_{a′}}^{a} + β (P_{S_{a′}·}^{a′} − P_{S_{a′}·}^{a}) F(S) = (F_{S_{a′}}(S) − h_{S_{a′}}^{a′} − β P_{S_{a′}·}^{a′} F(S)) + (h_{S_{a′}}^{a′} − h_{S_{a′}}^{a} + β (P_{S_{a′}·}^{a′} − P_{S_{a′}·}^{a}) F(S)) = F_{S_{a′}}(S) − β P_{S_{a′}·}^{a} F(S) − h_{S_{a′}}^{a},
and
F_{S_a}(S) − β P_{S_a·}^{a} F(S) − h_{S_a}^{a} = 0_{S_a}.
Part (b) follows similarly.    □

2.4. Main Result: A Verification Theorem for Indexability

We next present our main result, giving sufficient conditions for indexability. The conditions correspond to the framework of PCL-indexability, which is extended here from the two-gear setting in [25,29,30,32] to the multi-gear setting.
We will use the following notation. Given a stationary deterministic policy S = (S_0, …, S_A), actions a′ ≠ a, and a state j ∈ S_a, the policy denoted by Ŝ = T_j^{a,a′} S is defined by Ŝ_{a′} = S_{a′} ∪ {j}, Ŝ_a = S_a ∖ {j}, and Ŝ_{a″} = S_{a″} for a″ ≠ a, a′. Thus, T_j^{a,a′} S is obtained from S by shifting the gear selected in state j from a to a′.
The verification theorem below refers to indexability relative to a structured family F of stationary deterministic policies, which one needs to postulate a priori, based on insight on the particular model at hand. We shall thus refer to the family of F-policies S ∈ F, and to F-indexability, as defined below.
Definition 2
(F-indexability). We call the model F-indexable if (i) it is indexable and (ii) F-policies are optimal for the λ-price problem (P_λ) in (3), for any λ ∈ ℝ.
Note that Definition 2(ii) refers to the optimality of F-policies for all λ-price problems (P_λ). By this we mean that, for any λ ∈ ℝ, there exists a λ-optimal policy S(λ) ∈ F.
We require F to satisfy the following connectedness assumption, which is motivated by algorithmic considerations. The assumption ensures that it is possible to go from policy (∅, …, ∅, N) to (N, ∅, …, ∅), both of which must be in F, through a sequence of policies in F where each policy in the sequence is obtained from the previous one by downshifting the gear selected in a single state to the next lower gear. Conversely, it is possible to go from (N, ∅, …, ∅) to (∅, …, ∅, N) through a sequence of policies in F where each policy in the sequence is obtained from the previous one by upshifting the gear selected in a single state to the next higher gear.
Assumption 1.
The family of policies F satisfies the following conditions:
(i) 
(∅, …, ∅, N) ∈ F and (N, ∅, …, ∅) ∈ F;
(ii) 
For each S ∈ F ∖ {(N, ∅, …, ∅)} there exist a ≥ 1 and j ∈ S_a such that T_j^{a,a−1} S ∈ F;
(iii) 
For each S ∈ F ∖ {(∅, …, ∅, N)} there exist a < A and j ∈ S_a such that T_j^{a,a+1} S ∈ F.
Note that the above concepts of downshifting and upshifting gears naturally induce a partial ordering ⪯ on the class of all stationary deterministic policies S and in particular on F. Thus, given S and S′, we write S ⪯ S′ if, at every state, S does not select a higher gear than S′. If, further, S ≠ S′, we write S ≺ S′. Assumption 1 shows that the poset (partially ordered set) (F, ⪯) contains the least element (N, ∅, …, ∅) and the largest element (∅, …, ∅, N).
The verification theorem refers to the downshift adaptive-greedy index algorithm DS(F) shown in Algorithm 1. This takes as input the model parameters and, in K ≜ AN steps, produces as output a sequence of distinct state–action pairs (j_k, a_k) spanning N × (A ∖ {0}) along with corresponding sequences of F-policies S_k and scalars m_{j_k,a_k} for k = 1, …, K. Actually, the sequence S_k is a chain of the poset (F, ⪯) ordered as
(N, ∅, …, ∅) = S_{K+1} ≺ S_K ≺ ⋯ ≺ S_1 = (∅, …, ∅, N).
Algorithm 1: Downshift adaptive-greedy index algorithm DS(F).
Output:  {(j_k, a_k), S_k, m_{j_k,a_k}}_{k=1}^{K}
Initialization:  S_1 := (∅, …, ∅, N);  a_1 := A
pick  j_1 ∈ arg min { m_j^{a_1−1, a_1}(S_1) : j ∈ N, T_j^{a_1, a_1−1} S_1 ∈ F }
m_{j_1,a_1} := m_{j_1}^{a_1−1, a_1}(S_1);  S_2 := T_{j_1}^{a_1, a_1−1} S_1
Loop:
for k := 2 to K do
      pick  (j_k, a_k) ∈ arg min { m_j^{a−1, a}(S_k) : (j, a) with j ∈ S_a^k, T_j^{a, a−1} S_k ∈ F },
         with  m_j^{a−1,a}(S_k) := m_{j_{k−1},a_{k−1}} + [g_j^{a−1,a}(S_{k−1}) / g_j^{a−1,a}(S_k)] (m_j^{a−1,a}(S_{k−1}) − m_{j_{k−1},a_{k−1}})
      m_{j_k,a_k} := m_{j_k}^{a_k−1, a_k}(S_k);  S_{k+1} := T_{j_k}^{a_k, a_k−1} S_k
end {for}
Remark 2.
(i) 
Algorithm DS ( F ) extends to multi-gear bandits the adaptive-greedy algorithm for computing the Whittle index for restless (two-gear) bandits introduced by the author in [25] and further developed in [29]. In turn, this has its early roots in Klimov’s adaptive-greedy index algorithm [26] for computing the indices that give the optimal policy for scheduling a multi-class queue with feedback. Note that Klimov’s algorithm was first adapted in [7] to compute the Gittins index for classic bandits (i.e., two-gear bandits that do not change state when passive).
(ii) 
The term downshift refers to the way in which the algorithm generates the sequence S_k of F-policies. It starts with policy S_1, which selects the highest gear A in every state. Then, at each step k of the algorithm, it selects a state j_k in which to downshift gears from a_k to a_k − 1, keeping the same gears in the other states, thus obtaining the next policy S_{k+1}. The selection of such a state j_k is performed in an adaptive-greedy fashion, by choosing a state in which such a downshifting change entails a minimal MP decrease, as measured by the MP metric m_j^{a−1,a}(S_k), while also ensuring that the next policy S_{k+1} will be in F. One can visualize the workings of the algorithm in terms of balls trickling down a grid in which states are positioned in columns and gears in rows, with gear A at the top. Initially, all balls are in the top row, as the algorithm starts with policy S_1 = (∅, …, ∅, N), so gear A is chosen in every state. Then, at each step of the algorithm, one ball trickles down from its current row to that immediately below, which represents another F-policy. The algorithm ends in K steps when all N balls have trickled down to the bottom row, which corresponds to policy S_{K+1} = (N, ∅, …, ∅), so gear 0 is chosen in every state.
(iii) 
Note that, by construction, (1) each policy S_k in the sequence produced by the algorithm satisfies S_k ∈ F, i.e., it is an F-policy, for k = 1, …, K + 1, and (2) the state–action pairs (j_k, a_k) produced by the algorithm are all distinct, spanning the K state–action pairs (j, a) ∈ N × (A ∖ {0}) corresponding to active gears a ≥ 1.
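The following sketch (Python/NumPy, ours rather than the paper's implementation) runs the downshift adaptive-greedy recipe of Algorithm 1 for the unconstrained family F of all stationary deterministic policies. For simplicity it recomputes the MP metrics m_j^{a−1,a}(S_k) directly from the linear systems of Section 2.3 at each step, instead of using the recursive update (14), and it reports whether the produced index sequence is nondecreasing, i.e., whether condition (PCLI2) below holds on the given instance.

```python
import numpy as np

def downshift_adaptive_greedy(h, q, P, beta):
    """Downshift adaptive-greedy index sketch (unconstrained family F).

    Returns the list of (state, gear, index) triples in the order produced,
    plus a flag telling whether the index sequence is nondecreasing (PCLI2).
    h, q: (N, A+1); P: (A+1, N, N) transition matrices.
    """
    N, A1 = h.shape
    A = A1 - 1
    S = np.full(N, A)                 # S_1: highest gear A selected in every state
    output, prev_m, monotone = [], -np.inf, True
    for _ in range(A * N):            # K = A * N downshift steps in total
        P_S = P[S, np.arange(N), :]
        M = np.eye(N) - beta * P_S
        F = np.linalg.solve(M, h[np.arange(N), S])
        G = np.linalg.solve(M, q[np.arange(N), S])
        best = None
        for j in range(N):
            a = S[j]
            if a == 0:
                continue              # gear 0 cannot be downshifted further
            f = h[j, a-1] - h[j, a] + beta * (P[a-1, j] - P[a, j]) @ F
            g = q[j, a] - q[j, a-1] + beta * (P[a, j] - P[a-1, j]) @ G
            m = f / g                 # MP metric m_j^{a-1,a}(S_k); needs (PCLI1): g > 0
            if best is None or m < best[2]:
                best = (j, a, m)
        j, a, m = best
        monotone &= (m >= prev_m - 1e-12)
        prev_m = m
        output.append((j, a, m))
        S[j] = a - 1                  # downshift the gear selected in state j
    return output, monotone
```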
Definition 3
(PCL ( F ) -indexability and MP index). We call a multi-gear bandit model PCL-indexable with respect to F -policies, or PCL ( F ) -indexable, if the following hold:
(PCLI1)
g_j^{a−1,a}(S) > 0 for every policy S ∈ F, active action a ≥ 1, and state j ∈ N;
(PCLI2)
Algorithm DS(F) computes the m_{j_k,a_k} in nondecreasing order:
m_{j_1,a_1} ≤ m_{j_2,a_2} ≤ ⋯ ≤ m_{j_K,a_K}.
In such a case, we call m j , a the project’s MP index or MPI for short.
Remark 3.
(i) 
Condition (PCLI1) means that the marginal resource metric corresponding to upshifting gears in any state, relative to any F-policy, is positive. Note that it is equivalent to requiring that g_j^{a′,a}(S) > 0 for a′ < a, since g_j^{a′,a}(S) = g_j^{a′,a′+1}(S) + ⋯ + g_j^{a−1,a}(S). Such a condition and the fact that the S_k are in F ensure that the MP index m_{j,a} computed by the algorithm is well defined.
(ii) 
The recursive formula used in the algorithm for computing the MP metrics as
m_j^{a−1,a}(S_k) := m_{j_{k−1},a_{k−1}} + [g_j^{a−1,a}(S_{k−1}) / g_j^{a−1,a}(S_k)] (m_j^{a−1,a}(S_{k−1}) − m_{j_{k−1},a_{k−1}})
is justified by Lemma 8.
We next state the verification theorem.
Theorem 1.
If a multi-gear bandit model is PCL ( F ) -indexable, then it is F -indexable with its DAI being given by its MPI, i.e., λ j , a = m j , a .
The proof of Theorem 1 is presented in Section 7. It requires substantial preliminary groundwork, which is laid out in Section 3, Section 4, Section 5 and Section 6.

3. Linear Optimization Reformulation of the λ -Price Problem

We start the required groundwork by reviewing the standard linear optimization (LO) formulation of a finite-state and -action MDP (see, e.g., ([47] Sec. 6.9)), since it applies to the λ-price problem (P_λ) in (3), as it is needed in subsequent analyses. It is well known that such an MDP can be reformulated as an LO problem on variables x_j^a for state–action pairs (j, a) ∈ K ≜ N × A, which represent discounted state–action occupancy measures. Thus, variable x_j^a corresponds to the measure
x_{pj}^a(π) ≜ E_p^π[ Σ_{t=0}^∞ 1_{{(j,a)}}(s(t), a(t)) β^t ],
where 1_B(·) denotes the indicator function of a set B, so x_{pj}^a(π) is the expected total discounted number of times that action a is selected in state j under policy π starting from s(0) ∼ p. We write it as x_{ij}^a(π) when s(0) = i.
Such a standard LO formulation is
(L_λ(p)):  minimize  Σ_{(j,a)∈K} (h_j^a + λ q_j^a) x_j^a
subject to:  x_j^a ≥ 0,  (j, a) ∈ K,
Σ_{a∈A} x_j^a − β Σ_{(i,a)∈K} p_{ij}^a x_i^a = p_j,  j ∈ N,
or, in vector notation, writing the probability mass function p as a vector p ,
(L_λ(p)):  minimize  Σ_{a∈A} x^a (h^a + λ q^a)
subject to:  x^a ≥ 0,  a ∈ A,
Σ_{a∈A} x^a (I − β P^a) = p.
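As an illustration, the LO problem (L_λ(p)) can be assembled and handed to an off-the-shelf LP solver. The sketch below (ours; it uses scipy.optimize.linprog and flattens x by state–action pair) is a toy rendering of (16), not the paper's computational setup.

```python
import numpy as np
from scipy.optimize import linprog

def solve_lo_lambda(h, q, P, p, lam, beta):
    """Solve the LO formulation (L_lambda(p)) of the lambda-price problem.

    Variables x[j, a] are discounted state-action occupancy measures,
    flattened in row-major order.  h, q: (N, A1); P: (A1, N, N); p: (N,).
    """
    N, A1 = h.shape
    c = (h + lam * q).reshape(N * A1)
    # Equality constraints: sum_a x_j^a - beta * sum_{i,a} p_ij^a x_i^a = p_j.
    A_eq = np.zeros((N, N * A1))
    for j in range(N):
        for i in range(N):
            for a in range(A1):
                A_eq[j, i * A1 + a] = (i == j) - beta * P[a, i, j]
    res = linprog(c, A_eq=A_eq, b_eq=p, bounds=(0, None), method="highs")
    return res.x.reshape(N, A1), res.fun
```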
We next use the above LO formulation to show that the λ -price problem ( P λ ) in (3) can be reformulated into an equivalent problem in which holding costs under action A are zero, a result on which we draw later on.
Thus, define
ĥ^a ≜ h^a − (I − β P^a)(I − β P^A)^{−1} h^A,  a ∈ A,
and note that
ĥ^A = 0.
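In code, the cost modification (17) amounts to a couple of linear solves; the following sketch (ours, with assumed array shapes) returns ĥ^a for all gears, and by (18) its last column is zero.

```python
import numpy as np

def modified_costs(h, P, beta):
    """Modified holding costs h_hat^a = h^a - (I - beta P^a)(I - beta P^A)^{-1} h^A."""
    N, A1 = h.shape
    w = np.linalg.solve(np.eye(N) - beta * P[-1], h[:, -1])   # (I - beta P^A)^{-1} h^A
    return h - np.stack([(np.eye(N) - beta * P[a]) @ w for a in range(A1)], axis=1)
```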
Denote by F̂_i(π), f̂_i^{a′,a}(S), and m̂_i^{a′,a}(S) the cost, marginal cost, and marginal productivity metrics defined in Section 2.3, but for the model with modified holding costs ĥ_i^a in (17). The following result clarifies the relations between such metrics and those for the original holding costs h_i^a. Recall that we denote by S_1 = (∅, …, ∅, N) the policy that selects the highest gear in every state.
Lemma 2.
For any admissible policy π, stationary deterministic policy S, and actions a′ ≠ a:
(a) 
F̂_p(π) = F_p(π) − F_p(S_1);
(b) 
f̂_i^{a′,a}(S) = f_i^{a′,a}(S);
(c) 
m̂_i^{a′,a}(S) = m_i^{a′,a}(S).
Proof. 
(a) The measures x_p^a(π) = (x_{pj}^a(π))_{j∈N}, viewed as row vectors, satisfy the constraints in (16):
Σ_{a=0}^{A} x_p^a(π) (I − β P^a) = p.
Hence, since the matrices I − β P^a are invertible, we can write
x_p^A(π) + Σ_{a=0}^{A−1} x_p^a(π) (I − β P^a)(I − β P^A)^{−1} = p (I − β P^A)^{−1}.
We thus obtain
F_p(π) = Σ_{a=0}^{A} x_p^a(π) h^a = Σ_{a=0}^{A−1} x_p^a(π) h^a + x_p^A(π) h^A = p (I − β P^A)^{−1} h^A + Σ_{a=0}^{A−1} x_p^a(π) [h^a − (I − β P^a)(I − β P^A)^{−1} h^A] = p (I − β P^A)^{−1} h^A + Σ_{a=0}^{A−1} x_p^a(π) ĥ^a = F_p(S_1) + F̂_p(π),
where the last line is obtained from the previous one by using Equation (7), which yields
F(S_1) = (I − β P^A)^{−1} h^A.
(b) We have, using (11) and (17), and part (a),
f̂^{a′,a}(S) ≜ ĥ^{a′} − ĥ^a + β (P^{a′} − P^a) F̂(S) = h^{a′} − h^a + β (P^{a′} − P^a)(I − β P^A)^{−1} h^A + β (P^{a′} − P^a)(F(S) − F(S_1)) = h^{a′} − h^a + β (P^{a′} − P^a) F(S) = f^{a′,a}(S).
(c) This part follows directly from part (b) and (13), since m̂_i^{a′,a}(S) ≜ f̂_i^{a′,a}(S) / g_i^{a′,a}(S) = f_i^{a′,a}(S) / g_i^{a′,a}(S) = m_i^{a′,a}(S).    □
Corollary 1.
For any state j ,
(a) 
f_j^{a−1,a}(S_1) = ĥ_j^{a−1} − ĥ_j^a,  a ≥ 1;
(b) 
m_j^{a,A}(S_1) = ĥ_j^a / g_j^{a,A}(S_1),  a < A.
Proof. 
(a) We have
f^{a−1,a}(S_1) = f̂^{a−1,a}(S_1) = ĥ^{a−1} − ĥ^a + β (P^{a−1} − P^a) F̂(S_1) = ĥ^{a−1} − ĥ^a,
where we have used in turn Lemma 2(b), (11), (18) and (20).
(b) This part follows from (a), since m_j^{a,A}(S_1) = f_j^{a,A}(S_1) / g_j^{a,A}(S_1) = ĥ_j^a / g_j^{a,A}(S_1).    □
Denote by (P̂_λ) the modified λ-price problem where the h^a in problem (P_λ) are replaced by ĥ^a. Write as V̂_p(λ, π) the project cost metric for the modified model.
Lemma 3.
Problems (P_λ) and (P̂_λ) are equivalent, since V̂_p(λ, π) = V_p(λ, π) − F_p(S_1).
Proof. 
We have, using Lemma 2(a),
V̂_p(λ, π) = F̂_p(π) + λ G_p(π) = V_p(λ, π) − F_p(S_1).
   □

4. Relations between Performance Metrics

This section presents relations between performance metrics that we will need to prove the verification theorem. We start with a result giving decomposition identities that relate the metrics F p ( π ) and G p ( π ) for an admissible policy π to the metrics F p ( S ) and G p ( S ) under a particular stationary deterministic policy S, where p is the initial-state distribution. The resulting decomposition identities involve marginal cost metrics f j a , a ( S ) and marginal resource metrics g j a , a ( S ) .
Lemma 4
(Performance metrics decomposition). For any admissible policy π and stationary deterministic policy S:
(a) 
F_p(S) + Σ_{a′<a} Σ_{j∈S_a} f_j^{a′,a}(S) x_{pj}^{a′}(π) = F_p(π) + Σ_{a<a′} Σ_{j∈S_a} f_j^{a,a′}(S) x_{pj}^{a′}(π);
(b) 
G_p(π) + Σ_{a′<a} Σ_{j∈S_a} g_j^{a′,a}(S) x_{pj}^{a′}(π) = G_p(S) + Σ_{a<a′} Σ_{j∈S_a} g_j^{a,a′}(S) x_{pj}^{a′}(π).
Proof. 
(a) We can write, using in turn (19), Lemma 1(a), and f_j^{a,a′}(S) = −f_j^{a′,a}(S),
0 = a x p a ( π ) ( I β P a ) p F ( S ) = a x p a ( π ) ( I β P a ) F ( S ) p F ( S ) = a x p a ( π ) ( I β P a ) F ( S ) h a p F ( S ) + a x p a ( π ) h a = a x p a ( π ) ( I β P a ) F ( S ) h a F p ( S ) + F p ( π ) = a a x p S a a ( π ) f S a a , a ( S ) F p ( S ) + F p ( π ) = a > a x p S a a ( π ) f S a a , a ( S ) + a < a x p S a a ( π ) f S a a , a ( S ) F p ( S ) + F p ( π ) = a < a x p S a a ( π ) f S a a , a ( S ) a > a x p S a a ( π ) f S a a , a ( S ) F p ( S ) + F p ( π ) = a < a j S a x p j a ( π ) f j a , a ( S ) a > a j S a x p j a ( π ) f j a , a ( S ) F p ( S ) + F p ( π ) .
Part (b) follows similarly as part (a).    □
The following result draws on the above to relate metrics F_p(S) and F_p(T_j^{a,a′}S) to G_p(S) and G_p(T_j^{a,a′}S), respectively, which clarifies the interpretation of marginal metrics f_j^{a,a′}(S) and g_j^{a,a′}(S). Recall that T_j^{a,a′}S is the modification of policy S that results by shifting the gear in state j from a to a′.
Lemma 5.
For any actions a′ ≠ a and state j ∈ S_a:
(a) 
F_p(S) = F_p(T_j^{a,a′}S) + f_j^{a,a′}(S) x_{pj}^{a′}(T_j^{a,a′}S);
(b) 
F_p(T_j^{a,a′}S) = F_p(S) + f_j^{a′,a}(T_j^{a,a′}S) x_{pj}^{a}(S);
(c) 
G_p(T_j^{a,a′}S) = G_p(S) + g_j^{a,a′}(S) x_{pj}^{a′}(T_j^{a,a′}S);
(d) 
G_p(S) = G_p(T_j^{a,a′}S) + g_j^{a′,a}(T_j^{a,a′}S) x_{pj}^{a}(S).
Proof. 
(a) Writing Lemma 4(a) as
F p ( S ) = F p ( π ) + l k j S l f j l , k ( S ) x p j k ( π )
and taking π = T j a , a S in the latter expression, with j S a , gives
F p ( S ) = F p ( T j a , a S ) + l k j S l f j l , k ( S ) x p j k ( T j a , a S ) .
Now, on the one hand, for l a , policy T j a , a S selects gear l in states j S l . Hence, for k l and such j , x p j k ( T j a , a S ) = 0 .
On the other hand, for l = a , policy π = T j a , a S selects gear l = a in states j S l { j } and selects gear a in state j = j . Hence, x p j k ( T j a , a S ) = 0 for j S l { j } , since k l = a and x p j k ( T j a , a S ) = 0 for j = j if k a .
Thus, the only positive x p j k ( T j a , a S ) in (21) can be x p j a ( T j a , a S ) ; hence, (21) reduces to
F p ( S ) = F p ( T j a , a S ) + f j a , a ( S ) x p j a ( T j a , a S ) .
(b) Let S = T j a , a S and note that j S a and S = T j a , a S . Hence, by part (a),
F p ( T j a , a S ) = F p ( S ) = F p ( T j a , a S ) + f j a , a ( S ) x p j a ( T j a , a S ) = F p ( S ) + f j a , a ( T j a , a S ) x p j a ( S ) .
Parts (c) and (d) follow similarly as (a) and (b).    □
The following result, which follows easily from Lemma 5(c,d), clarifies the interpretation of PCL ( F ) -indexability condition (PCLI1) in Definition 3, as a natural monotonicity property of the resource metric G i ( S ) .
Proposition 1.
Condition (PCLI1) is equivalent to the following: for S ∈ F, actions a″ < a < a′, and state j ∈ S_a,
(a) 
G_i(T_j^{a,a″}S) ≤ G_i(S) ≤ G_i(T_j^{a,a′}S), i ≠ j, and G_j(T_j^{a,a″}S) < G_j(S) < G_j(T_j^{a,a′}S).
(b) 
If p has full support, G_p(T_j^{a,a″}S) < G_p(S) < G_p(T_j^{a,a′}S).
Remark 4.
Proposition 1 yields the following intuitive interpretation of condition (PCLI1): it means that, for any F -policy S, modifying S by downshifting gears in one state results in a lower or equal value of the resource usage metric; and, conversely, modifying S by upshifting gears in one state results in a higher or equal resource usage metric. When the initial state is drawn from a distribution with full support, downshifting gears leads to a strictly lower resource usage, and upshifting gears leads to a strictly higher resource usage.
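The interpretation in Remark 4 can be checked numerically on a given instance and policy. The sketch below (ours; names and tolerances are illustrative) compares G under S with G under every single-state gear shift and tests the orderings of Proposition 1(a).

```python
import numpy as np

def resource_metric(q, P, S, beta):
    """G(S): expected total discounted resource usage under stationary policy S."""
    N = len(S)
    M = np.eye(N) - beta * P[S, np.arange(N), :]
    return np.linalg.solve(M, q[np.arange(N), S])

def check_resource_monotonicity(q, P, S, beta):
    """Check the Proposition 1 / Remark 4 orderings for all single-state shifts of S."""
    N, A1 = q.shape
    G = resource_metric(q, P, S, beta)
    ok = True
    for j, a in enumerate(S):
        for a_new in range(A1):
            G_shift = resource_metric(q, P, np.r_[S[:j], a_new, S[j+1:]], beta)
            if a_new < a:     # downshift: resource usage should not increase
                ok &= bool(np.all(G_shift <= G + 1e-10) and G_shift[j] < G[j])
            elif a_new > a:   # upshift: resource usage should not decrease
                ok &= bool(np.all(G_shift >= G - 1e-10) and G_shift[j] > G[j])
    return ok
```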
The next result shows that, under (PCLI1), the increment F_p(T_j^{a,a′}S) − F_p(S) is proportional to G_p(S) − G_p(T_j^{a,a′}S), with proportionality constant m_j^{a,a′}(S), which is equal to m_j^{a,a′}(T_j^{a,a′}S).
Lemma 6.
Under condition (PCLI1), for any actions a′ ≠ a and state j ∈ S_a,
(a) 
F_p(T_j^{a,a′}S) − F_p(S) = m_j^{a,a′}(S) (G_p(S) − G_p(T_j^{a,a′}S)) = m_j^{a,a′}(T_j^{a,a′}S) (G_p(S) − G_p(T_j^{a,a′}S));
(b) 
m_j^{a,a′}(S) = m_j^{a,a′}(T_j^{a,a′}S).
Proof. 
(a) We have, using Lemma 5(a, c),
F_p(T_j^{a,a′}S) − F_p(S) = −f_j^{a,a′}(S) x_{pj}^{a′}(T_j^{a,a′}S) = −f_j^{a,a′}(S) [G_p(T_j^{a,a′}S) − G_p(S)] / g_j^{a,a′}(S) = m_j^{a,a′}(S) (G_p(S) − G_p(T_j^{a,a′}S)).
On the other hand, using Lemma 5(b, d) we obtain
F_p(T_j^{a,a′}S) − F_p(S) = f_j^{a′,a}(T_j^{a,a′}S) x_{pj}^{a}(S) = f_j^{a′,a}(T_j^{a,a′}S) [G_p(S) − G_p(T_j^{a,a′}S)] / g_j^{a′,a}(T_j^{a,a′}S) = m_j^{a,a′}(T_j^{a,a′}S) (G_p(S) − G_p(T_j^{a,a′}S)).
(b) This part follows directly from part (a).    □
The following result shows in its part (a) that, under condition (PCLI1), the increment f_j^{a′,a}(S) − f_j^{a′,a}(T_i^{ā,ā′}S) is proportional to g_j^{a′,a}(S) − g_j^{a′,a}(T_i^{ā,ā′}S), with the proportionality constant being m_i^{ā,ā′}(S). Then, its part (b) draws on this result to obtain a relation between MP metrics that we use in Algorithm 1.
Lemma 7.
Under condition (PCLI1), for any actions a′ ≠ a and ā′ ≠ ā with T_i^{ā,ā′}S ∈ F and states i ∈ S_ā and j,
(a) 
f_j^{a′,a}(S) − f_j^{a′,a}(T_i^{ā,ā′}S) = m_i^{ā,ā′}(S) (g_j^{a′,a}(S) − g_j^{a′,a}(T_i^{ā,ā′}S));
(b) 
g_j^{a′,a}(T_i^{ā,ā′}S) (m_j^{a′,a}(T_i^{ā,ā′}S) − m_i^{ā,ā′}(S)) = g_j^{a′,a}(S) (m_j^{a′,a}(S) − m_i^{ā,ā′}(S)).
Proof. 
(a) From (11) and (12), we obtain
f a , a ( S ) f a , a ( T i a ¯ , a ¯ S ) = β ( P a P a ) ( F ( S ) F ( T i a ¯ , a ¯ S ) )
and
g a , a ( S ) g a , a ( T i a ¯ , a ¯ S ) = β P a P a ( G ( S ) G ( T i a ¯ , a ¯ S ) )
Now, combining Lemma 6(a) with this gives
f a , a ( S ) f a , a ( T i a ¯ , a ¯ S ) = m i a ¯ , a ¯ ( S ) ( g a , a ( S ) g a , a ( T i a ¯ , a ¯ S ) ) .
(b) We can write
m j a , a ( T i a ¯ , a ¯ S ) f j a , a ( T i a ¯ , a ¯ S ) g j a , a ( T i a ¯ , a ¯ S ) = f j a , a ( S ) g j a , a ( T i a ¯ , a ¯ S ) m i a ¯ , a ¯ ( S ) g j a , a ( S ) g j a , a ( T i a ¯ , a ¯ S ) 1 = g j a , a ( S ) g j a , a ( T i a ¯ , a ¯ S ) f j a , a ( S ) g j a , a ( S ) m i a ¯ , a ¯ ( S ) g j a , a ( S ) g j a , a ( T i a ¯ , a ¯ S ) 1 = g j a , a ( S ) g j a , a ( T i a ¯ , a ¯ S ) m j a , a ( S ) m i a ¯ , a ¯ ( S ) g j a , a ( S ) g j a , a ( T i a ¯ , a ¯ S ) 1 = m i a ¯ , a ¯ ( S ) + g j a , a ( S ) g j a , a ( T i a ¯ , a ¯ S ) m j a , a ( S ) m i a ¯ , a ¯ ( S ) ,
where the second line is obtained by using part (a).    □

5. Analysis of the Output of Algorithm DS ( F )

This section derives further relations between project performance metrics that will play the role of key tools for elucidating and analyzing the output of the downshift adaptive-greedy index algorithm DS ( F ) in Algorithm 1 and hence to prove the verification theorem. Throughout this section, { ( j k , a k ) , S k , m j k , a k } k = 1 K is an output of algorithm DS ( F ) .
We start with a result justifying recursive index update Formula (14) in the algorithm.
Lemma 8.
Let condition (PCLI1) hold. Then, for any state j, active action a ≥ 1, and k = 2, …, K,
m_j^{a−1,a}(S_k) = m_{j_{k−1},a_{k−1}} + [g_j^{a−1,a}(S_{k−1}) / g_j^{a−1,a}(S_k)] (m_j^{a−1,a}(S_{k−1}) − m_{j_{k−1},a_{k−1}}),
or, equivalently,
m_j^{a−1,a}(S_{k−1}) = m_{j_{k−1},a_{k−1}} + [g_j^{a−1,a}(S_k) / g_j^{a−1,a}(S_{k−1})] (m_j^{a−1,a}(S_k) − m_{j_{k−1},a_{k−1}}).
Proof. 
The result follows by noting that m_{j_{k−1},a_{k−1}} = m_{j_{k−1}}^{a_{k−1}−1,a_{k−1}}(S_{k−1}) and taking i = j_{k−1}, ā′ = a_{k−1} − 1, ā = a_{k−1}, S = S_{k−1}, and a′ = a − 1 in Lemma 7(b), since T_i^{ā,ā′}S = S_k ∈ F.    □
The following result expresses, in its part (b), the MP metric m j l a l 1 , a l ( S k ) , for k < l , as the sum of index m j k , a k and a positive linear combination of the index differences m j n , a n m j n 1 , a n 1 for n = k + 1 , , l . Part (a) is a preliminary result needed to prove part (b).
Lemma 9.
Under condition (PCLI1), the following holds for 1 k < l K :
(a) 
For l = k + 1 , , l ,
m j l a l 1 , a l ( S k ) = m j k , a k + 1 g j l a l 1 , a l ( S k ) [ n = k + 1 l 1 g j l a l 1 , a l ( S n ) ( m j n , a n m j n 1 , a n 1 ) + g j l a l 1 , a l ( S l ) m j l a l 1 , a l ( S l ) m j l 1 , a l 1 ] .
(b) 
f j l a l 1 , a l ( S k ) = g j l a l 1 , a l ( S k ) m j k , a k + n = k + 1 l g j l a l 1 , a l ( S n ) ( m j n , a n m j n 1 , a n 1 ) , or, equivalently,
m j l a l 1 , a l ( S k ) = m j k , a k + n = k + 1 l g j l a l 1 , a l ( S n ) g j l a l 1 , a l ( S k ) ( m j n , a n m j n 1 , a n 1 ) .
Proof. 
(a) We prove the result by induction on l = k + 1 , , l . For l = k + 1 , Equation (24) holds because, by Equation (23) in Lemma 8, we have
m j l a l 1 , a l ( S k ) = m j k , a k + g j l a l 1 , a l ( S k + 1 ) g j l a l 1 , a l ( S k ) m j l a l 1 , a l ( S k + 1 ) m j k , a k .
Suppose now that Equation (24) holds for some l with k < l < l . We will use that, by Equation (23) in Lemma 8, we have
m j l a l 1 , a l ( S l ) = m j l , a l + g j l a l 1 , a l ( S l + 1 ) g j l a l 1 , a l ( S l ) m j l a l 1 , a l ( S l + 1 ) m j l , a l .
Now, substituting the right-hand side of the latter identity for m j l a l 1 , a l ( S l ) in Equation (24) gives
m j l a l 1 , a l ( S k ) = m j k , a k + 1 g j l a l 1 , a l ( S k ) [ n = k + 1 l 1 g j l a l 1 , a l ( S n ) ( m j n , a n m j n 1 , a n 1 ) + g j l a l 1 , a l ( S l ) m j l , a l + g j l a l 1 , a l ( S l + 1 ) g j l a l 1 , a l ( S l ) m j l a l 1 , a l ( S l + 1 ) m j l , a l m j l 1 , a l 1 ] = m j k , a k + 1 g j l a l 1 , a l ( S k ) [ n = k + 1 l g j l a l 1 , a l ( S n ) ( m j n , a n m j n 1 , a n 1 ) + g j l a l 1 , a l ( S l + 1 ) m j l a l 1 , a l ( S l + 1 ) m j l , a l ] ,
which shows that the result also holds for l + 1 and hence completes the induction.
(b) This part corresponds to the case l = l in part (a), noting that m j l a l 1 , a l ( S l ) = m j l , a l .    □
In the following result and henceforth we use the notation a_k(j) to denote the action selected in state j by policy S_k. Thus, e.g., a_1(j) = A for every state j, since S_1 = (∅, …, ∅, N), and a_{K+1}(j) = 0 for every state j, since S_{K+1} = (N, ∅, …, ∅).
Lemma 10.
Under condition (PCLI1), the following holds for 1 k l K :
f j l a l 1 , a k ( j l ) ( S k ) = g j l a l 1 , a k ( j l ) ( S k ) m j k , a k + n = k + 1 l g j l a l 1 , a n ( j l ) ( S n ) ( m j n , a n m j n 1 , a n 1 ) ,
or, equivalently,
m j l a l 1 , a k ( j l ) ( S k ) = m j k , a k + n = k + 1 l g j l a l 1 , a n ( j l ) ( S n ) g j l a l 1 , a k ( j l ) ( S k ) ( m j n , a n m j n 1 , a n 1 ) .
Proof. 
Fix k. We prove the result by induction on l = k , , K . For l = k , using that a k ( j k ) = a k gives that (26) reduces to f j l a k 1 , a k ( S k ) = g j k a k 1 , a k ( S k ) m j k , a k , which holds by construction, since m j k , a k is defined in the algorithm precisely as m j k a k 1 , a k ( S k ) = f j l a k 1 , a k ( S k ) / g j k a k 1 , a k ( S k ) .
Suppose now that (26) holds up to and including some l with k < l < K . We will prove that it must then hold for l + 1 . For such a purpose, we distinguish two cases, depending on whether j l + 1 = j l or j l + 1 j l . Start with the case j l + 1 = j l . To simplify the argument below, we write j l as j and a l as a. In this case, the algorithm downshifts in step l the gear in state j from a to a 1 and in step l + 1 downshifts again in state j from gear a 1 to a 2 . Hence, a l ( j ) = a and a l + 1 ( j ) = a l + 1 = a 1 . We can write
f j l + 1 a l + 1 1 , a k ( j l + 1 ) ( S k ) = f j l + 1 a l + 1 1 , a l + 1 ( S k ) + f j l + 1 a l + 1 , a k ( j ) ( S k ) = f j a 2 , a 1 ( S k ) + f j a 1 , a k ( j ) ( S k ) = g j a 2 , a 1 ( S k ) m j k , a k + n = k + 1 l + 1 g j a 2 , a 1 ( S n ) ( m j n , a n m j n 1 , a n 1 ) + g j a 1 , a k ( j ) ( S k ) m j k , a k + n = k + 1 l g j a 1 , a n ( j ) ( S n ) ( m j n , a n m j n 1 , a n 1 ) = g j a 2 , a k ( j ) ( S k ) m j k , a k + n = k + 1 l + 1 g j a 2 , a n ( j ) ( S n ) ( m j n , a n m j n 1 , a n 1 ) = g j l + 1 a l + 1 1 , a k ( j ) ( S k ) m j k , a k + n = k + 1 l + 1 g j l + 1 a l + 1 1 , a n ( j ) ( S n ) ( m j n , a n m j n 1 , a n 1 ) ,
where we have used in turn the elementary property f j a , a ( S ) = f j a , a ( S ) + f j a , a ( S ) , Lemma 9(b), the induction hypothesis, and a l + 1 ( j ) = a 1 . Therefore, the result holds for l + 1 in this case.
Consider now the case j l + 1 j l , in which a l + 1 = a l + 1 ( j l + 1 ) = a l ( j l + 1 ) . To simplify the argument below, we write j l + 1 as j and a l + 1 as a. In this case, the algorithm downshifts in step l + 1 at state j from gear a to a 1 . If a < A , the previous downshift at state j, from gear a + 1 to a, occurred at some earlier step l < l , so j l = j , a l 1 = a , and
a l + 1 ( j ) = = a l ( j ) = a l + 1 ( j ) = a .
We can now write
f j l + 1 a l + 1 1 , a k ( j l + 1 ) ( S k ) = f j l + 1 a l + 1 1 , a l + 1 ( S k ) + f j l + 1 a l + 1 , a k ( j ) ( S k ) = f j a 1 , a ( S k ) + f j a , a k ( j ) ( S k ) = g j a 1 , a ( S k ) m j k , a k + n = k + 1 l + 1 g j a 1 , a ( S n ) ( m j n , a n m j n 1 , a n 1 ) + g j l a l 1 , a k ( j l ) ( S k ) m j k , a k + n = k + 1 l g j l a l 1 , a n ( j l ) ( S n ) ( m j n , a n m j n 1 , a n 1 ) = g j a 1 , a k ( j ) ( S k ) m j k , a k + n = k + 1 l g j a 1 , a n ( j ) ( S n ) ( m j n , a n m j n 1 , a n 1 ) + n = l + 1 l + 1 g j a 1 , a ( S n ) ( m j n , a n m j n 1 , a n 1 ) = g j l + 1 a l + 1 1 , a k ( j ) ( S k ) m j k , a k + n = k + 1 l + 1 g j l + 1 a l + 1 1 , a n ( j ) ( S n ) ( m j n , a n m j n 1 , a n 1 ) ,
where we have used in turn Lemma 9(b), the induction hypothesis, and (27).
Suppose now that a = A . Then,
a k ( j ) = a k + 1 ( j ) = = a l + 1 ( j ) = A ,
and we can write
f j l + 1 a l + 1 1 , a k ( j l + 1 ) ( S k ) = f j A 1 , A ( S k ) = g j A 1 , A ( S k ) m j k , a k + n = k + 1 l + 1 g j A 1 , A ( S n ) ( m j n , a n m j n 1 , a n 1 ) = g j l + 1 a l + 1 1 , a k ( j ) ( S k ) m j k , a k + n = k + 1 l + 1 g j l + 1 a l + 1 1 , a n ( j ) ( S n ) ( m j n , a n m j n 1 , a n 1 ) ,
where we have used in turn Lemma 9(b) and (28).
Hence, the result also holds for l + 1 in the case j l + 1 j l , which completes the induction proof.    □
The next result expresses the modified holding costs h ^ j a as positive linear combinations of m j 1 , a 1 and the differences m j n , a n m j n 1 , a n 1 . We will use it in Lemmas 14 and 15 to reformulate the modified cost objective V ^ p ( λ , π ) .
Lemma 11.
Under condition (PCLI1),
g j 1 a 1 1 , A ( S 1 ) m j 1 , a 1 = h ^ j 1 a 1 1 g j l a l 1 , A ( S 1 ) m j 1 , a 1 + n = 2 l g j l a l 1 , a n ( j l ) ( S n ) ( m j n , a n m j n 1 , a n 1 ) = h ^ j l a l 1 , l = 2 , , K .
Proof. 
The result follows from Corollary 1(a), which shows that f j a 1 , A ( S 1 ) = h ^ j a 1 h ^ j A = h ^ j a 1 (since h ^ j A = 0 ), and Lemma 10, used with k = 1 , noting that a 1 ( j ) = A .    □
The following result shows that m j k 1 , a k 1 can be expressed in two different ways in terms of MP metrics.
Lemma 12.
Under condition (PCLI1),
m_{j_{k−1}}^{a_{k−1}−1,a_{k−1}}(S_k) = m_{j_{k−1}}^{a_{k−1}−1,a_{k−1}}(S_{k−1}) = m_{j_{k−1},a_{k−1}},  k = 2, …, K + 1.
Proof. 
The first identity follows from the result m_j^{a,a′}(T_j^{a,a′}S) = m_j^{a,a′}(S) in Lemma 6(b), taking S = S_{k−1}, a = a_{k−1}, a′ = a_{k−1} − 1, and j = j_{k−1}, and noting that S_k = T_{j_{k−1}}^{a_{k−1},a_{k−1}−1} S_{k−1}. The second identity follows by definition of m_{j_{k−1},a_{k−1}}.    □
The following result relates metrics under two successive policies as generated by the index algorithm.
Lemma 13.
Under condition (PCLI1),
(a) 
F_p(S_k) = F_p(S_{k−1}) + f_{j_{k−1}}^{a_{k−1}−1,a_{k−1}}(S_k) x_{p j_{k−1}}^{a_{k−1}}(S_{k−1}),  k = 2, …, K + 1;
(b) 
F_p(S_k) = F_p(S_{k+1}) − f_{j_k}^{a_k−1,a_k}(S_{k+1}) x_{p j_k}^{a_k}(S_k),  k = 1, …, K;
(c) 
G_p(S_k) = G_p(S_{k−1}) − g_{j_{k−1}}^{a_{k−1}−1,a_{k−1}}(S_k) x_{p j_{k−1}}^{a_{k−1}}(S_{k−1}),  k = 2, …, K + 1;
(d) 
G_p(S_k) = G_p(S_{k+1}) + g_{j_k}^{a_k−1,a_k}(S_{k+1}) x_{p j_k}^{a_k}(S_k),  k = 1, …, K;
(e) 
V_p(λ, S_k) = V_p(λ, S_{k−1}) − (λ − m_{j_{k−1},a_{k−1}}) g_{j_{k−1}}^{a_{k−1}−1,a_{k−1}}(S_k) x_{p j_{k−1}}^{a_{k−1}}(S_{k−1}),  k = 2, …, K + 1.
(f) 
V_p(λ, S_k) = V_p(λ, S_{k+1}) − (m_{j_k,a_k} − λ) g_{j_k}^{a_k−1,a_k}(S_{k+1}) x_{p j_k}^{a_k}(S_k),  k = 1, …, K.
Proof. 
(a) This part follows from Lemma 5(a) by taking S = S k , a = a k 1 1 , a = a k 1 , and j = j k 1 , noting that T j k 1 a k 1 1 , a k 1 S k = S k 1 in F p ( S ) = F p ( T j a , a S ) + f j a , a ( S ) x p j a ( T j a , a S ) .
(b) This part follows directly from (a).
(c) This part follows from Lemma 5(c) by taking S, a, a , and j as in part (a) in G p ( T j a , a S ) = G p ( S ) + g j a , a ( S ) x p j a ( T j a , a S ) .
(d) This part follows directly from (c).
(e) The result follows from
V p ( λ , S k ) = F p ( S k ) + λ G p ( S k ) = F p ( S k 1 ) + f j k 1 a k 1 1 , a k 1 ( S k ) x p j k 1 a k 1 ( S k 1 ) + λ G p ( S k 1 ) g j k 1 a k 1 1 , a k 1 ( S k ) x p j k 1 a k 1 ( S k 1 ) = V p ( λ , S k 1 ) + f j k 1 a k 1 1 , a k 1 ( S k ) λ g j k 1 a k 1 1 , a k 1 ( S k ) x p j k 1 a k 1 ( S k 1 ) = V p ( λ , S k 1 ) + m j k 1 a k 1 1 , a k 1 ( S k ) λ g j k 1 a k 1 1 , a k 1 ( S k ) x p j k 1 a k 1 ( S k 1 ) = V p ( λ , S k 1 ) λ m j k 1 , a k 1 g j k 1 a k 1 1 , a k 1 ( S k ) x p j k 1 a k 1 ( S k 1 ) ,
where we have used parts (a, c) and Lemma 12.
(f) The result follows from
V p ( λ , S k ) = F p ( S k ) + λ G p ( S k ) = F p ( S k + 1 ) f j k a k 1 , a k ( S k + 1 ) x p j k a k ( S k ) + λ G p ( S k + 1 ) + g j k a k 1 , a k ( S k + 1 ) x p j k a k ( S k ) = V p ( λ , S k + 1 ) f j k a k 1 , a k ( S k + 1 ) λ g j k a k 1 , a k ( S k + 1 ) x p j k a k ( S k ) = V p ( λ , S k + 1 ) m j k a k 1 , a k ( S k + 1 ) λ g j k a k 1 , a k ( S k + 1 ) x p j k a k ( S k ) = V p ( λ , S k 1 ) λ m j k 1 , a k 1 g j k 1 a k 1 1 , a k 1 ( S k ) x p j k 1 a k 1 ( S k 1 ) ,
where we have used parts (b, d) and Lemma 12.    □

6. Partial Conservation Laws

This section shows that, under condition (PCLI1) in Definition 3, project performance metrics satisfy certain partial conservation laws (PCLs), which extend those previously introduced by the author for finite-state restless (two-gear) bandits in [25,29]. It further uses those PCLs to lay further groundwork towards the proof of Theorem 1.
In the following result, we assume that the initial-state distribution p has full support, which we write as p > 0 .
Proposition 2
(PCLs). Suppose that (PCLI1) holds and let p > 0 . Then, metrics G p ( π ) and x p j a ( π ) , for states j and actions a < A , satisfy the following: for any admissible policy π and S F ,
(a.1) 
G p ( π ) + a < a j S a g j a , a ( S ) x p j a ( π ) G p ( S ) , with equality (conservation law),
G p ( π ) + a < a j S a g j a , a ( S ) x p j a ( π ) = G p ( S ) ,
iff π selects gears a a in states j S a (so π S ), for a = 0 , , A 1 .
(a.2) 
In particular, for S = S K + 1 = ( N , , , ) , it holds that G p ( π ) G p ( S K + 1 ) , with equality iff π selects gear 0 in every state.
(a.3) 
In the case $S = S_1 = (\emptyset, \dots, \emptyset, N)$, we have the conservation law
$G_p(\pi) + \sum_{a < A,\, j \in N} g_j^{a,A}(S_1)\, x_p^{j a}(\pi) = G_p(S_1).$
(b) 
$\sum_{a' < a,\, j \in S^a} g_j^{a',a}(S)\, x_p^{j a'}(\pi) \ge 0$, with equality iff $\pi$ selects gears $a' \ge a$ in states $j \in S^a$ (so $S \preceq \pi$), for $a = 1, \dots, A$.
Proof. 
(a.1) From Lemma 4(b), we obtain, under condition (PCLI1),
$G_p(\pi) + \sum_{a' < a,\, j \in S^a} g_j^{a',a}(S)\, x_p^{j a'}(\pi) = G_p(S) + \sum_{a < a',\, j \in S^a} g_j^{a,a'}(S)\, x_p^{j a'}(\pi) \ge G_p(S),$
with equality iff (using the fact that the initial-state distribution $p$ has full support) $x_p^{j a'}(\pi) = 0$ for $j \in S^a$ with $a < a'$, i.e., iff $\pi$ selects gears $a' \le a$ in states $j \in S^a$, for $a = 0, \dots, A-1$.
Parts (a.2) and (a.3) are direct consequences of (a.1).
Part (b) follows directly from (PCLI1).    □
The next result shows how to reformulate the equivalent modified holding cost metric $\hat{F}_p(\pi) \triangleq \sum_{a=0}^{A-1} \sum_{j \in N} \hat{h}_j^a\, x_p^{j a}(\pi)$ (see Section 3) in terms of the output $\{(j_k, a_k), S_k, m_{j_k,a_k}\}_{k=1}^K$ of algorithm $\mathrm{DS}(\mathcal{F})$ in Algorithm 1 and expressions arising in the PCLs in Proposition 2.
Lemma 14.
Under condition (PCLI1),
$\hat{F}_p(\pi) = m_{j_1,a_1} \sum_{a' < a,\, j \in S_1^a} g_j^{a',a}(S_1)\, x_p^{j a'}(\pi) + \sum_{k=2}^K (m_{j_k,a_k} - m_{j_{k-1},a_{k-1}}) \sum_{a' < a,\, j \in S_k^a} g_j^{a',a}(S_k)\, x_p^{j a'}(\pi).$
Proof. 
The result follows from Lemma 11, which yields
$\hat{F}_p(\pi) \triangleq \sum_{a=0}^{A-1} \sum_{j \in N} \hat{h}_j^a\, x_p^{j a}(\pi) = \sum_{l=1}^K \hat{h}_{j_l}^{a_l-1}\, x_p^{j_l,\, a_l-1}(\pi)$
$= g_{j_1}^{a_1-1,\,A}(S_1)\, m_{j_1,a_1}\, x_p^{j_1,\, a_1-1}(\pi) + \sum_{l=2}^K \Big[ g_{j_l}^{a_l-1,\,A}(S_1)\, m_{j_1,a_1} + \sum_{k=2}^l g_{j_l}^{a_l-1,\, a_k(j_l)}(S_k)\, (m_{j_k,a_k} - m_{j_{k-1},a_{k-1}}) \Big]\, x_p^{j_l,\, a_l-1}(\pi)$
$= m_{j_1,a_1} \sum_{l=1}^K g_{j_l}^{a_l-1,\,A}(S_1)\, x_p^{j_l,\, a_l-1}(\pi) + \sum_{k=2}^K (m_{j_k,a_k} - m_{j_{k-1},a_{k-1}}) \sum_{l=k}^K g_{j_l}^{a_l-1,\, a_k(j_l)}(S_k)\, x_p^{j_l,\, a_l-1}(\pi)$
$= m_{j_1,a_1} \sum_{a' < a,\, j \in S_1^a} g_j^{a',a}(S_1)\, x_p^{j a'}(\pi) + \sum_{k=2}^K (m_{j_k,a_k} - m_{j_{k-1},a_{k-1}}) \sum_{a' < a,\, j \in S_k^a} g_j^{a',a}(S_k)\, x_p^{j a'}(\pi).$
   □
The following result draws on the previous one by showing how to reformulate the cost metric $\hat{V}_p(\lambda, \pi)$ of the equivalent modified $\lambda$-price problem (see Lemma 3) in terms of the output of algorithm $\mathrm{DS}(\mathcal{F})$.
Lemma 15.
Under (PCLI1), $\hat{V}_p(\lambda, \pi)$ can be reformulated into the following equivalent expressions:
(a) 
$\hat{V}_p(\lambda, \pi) = \lambda G_p(S_1) + (m_{j_1,a_1} - \lambda) \sum_{a' < a,\, j \in S_1^a} g_j^{a',a}(S_1)\, x_p^{j a'}(\pi) + \sum_{k=2}^K (m_{j_k,a_k} - m_{j_{k-1},a_{k-1}}) \sum_{a' < a,\, j \in S_k^a} g_j^{a',a}(S_k)\, x_p^{j a'}(\pi);$
(b) 
for any $l$ with $2 \le l \le K$,
$\hat{V}_p(\lambda, \pi) = m_{j_1,a_1} G_p(S_1) + \sum_{k=2}^{l-1} (m_{j_k,a_k} - m_{j_{k-1},a_{k-1}}) \Big[ G_p(\pi) + \sum_{a' < a,\, j \in S_k^a} g_j^{a',a}(S_k)\, x_p^{j a'}(\pi) \Big] + (\lambda - m_{j_{l-1},a_{l-1}}) \Big[ G_p(\pi) + \sum_{a' < a,\, j \in S_l^a} g_j^{a',a}(S_l)\, x_p^{j a'}(\pi) \Big] + (m_{j_l,a_l} - \lambda) \sum_{a' < a,\, j \in S_l^a} g_j^{a',a}(S_l)\, x_p^{j a'}(\pi) + \sum_{k=l+1}^K (m_{j_k,a_k} - m_{j_{k-1},a_{k-1}}) \sum_{a' < a,\, j \in S_k^a} g_j^{a',a}(S_k)\, x_p^{j a'}(\pi);$
(c) 
$\hat{V}_p(\lambda, \pi) = m_{j_1,a_1} G_p(S_1) + \sum_{k=2}^K (m_{j_k,a_k} - m_{j_{k-1},a_{k-1}}) \Big[ G_p(\pi) + \sum_{a' < a,\, j \in S_k^a} g_j^{a',a}(S_k)\, x_p^{j a'}(\pi) \Big] + (\lambda - m_{j_K,a_K})\, G_p(\pi).$
Proof. 
All three parts follow from the definition $\hat{V}_p(\lambda, \pi) \triangleq \hat{F}_p(\pi) + \lambda G_p(\pi)$ and Lemma 14 by suitably rearranging terms.    □
The next result draws on the above to reformulate the cost metrics $\hat{V}_p(\lambda, S_l)$ in terms of the output of algorithm $\mathrm{DS}(\mathcal{F})$.
Lemma 16.
Under condition (PCLI1),
(a) 
$\hat{V}_p(\lambda, S_1) = \lambda G_p(S_1);$
(b) 
For $2 \le l \le K$,
$\hat{V}_p(\lambda, S_l) = m_{j_1,a_1} G_p(S_1) + \sum_{k=2}^{l-1} (m_{j_k,a_k} - m_{j_{k-1},a_{k-1}})\, G_p(S_k) + (\lambda - m_{j_{l-1},a_{l-1}})\, G_p(S_l);$
(c) 
$\hat{V}_p(\lambda, S_{K+1}) = m_{j_1,a_1} G_p(S_1) + \sum_{k=2}^K (m_{j_k,a_k} - m_{j_{k-1},a_{k-1}})\, G_p(S_k) + (\lambda - m_{j_K,a_K})\, G_p(S_{K+1}).$
Proof. 
Each part follows from the corresponding part in Lemma 15, using the PCLs in Proposition 2 to simplify the resulting expressions.    □

7. Proof of the Verification Theorem

We are now ready to prove Theorem 1.
Proof of Theorem 1.
We will prove the result by showing the following: (i) policy $S_1$ is $\lambda$-optimal iff $\lambda \le m_{j_1,a_1}$; (ii) for $2 \le l \le K$, policy $S_l$ is $\lambda$-optimal iff $m_{j_{l-1},a_{l-1}} \le \lambda \le m_{j_l,a_l}$; and (iii) policy $S_{K+1}$ is $\lambda$-optimal iff $\lambda \ge m_{j_K,a_K}$. Note that (i, ii, iii) imply that the model is $\mathcal{F}$-indexable, with its DAI $\lambda_{j,a}$ being given by the MPI $m_{j,a}$.
We consider below that p > 0 , i.e., the initial-state distribution p has full support.
Start with (i). If $\lambda \le m_{j_1,a_1}$, we have, for any policy $\pi$,
$\hat{V}_p(\lambda, \pi) = \lambda G_p(S_1) + (m_{j_1,a_1} - \lambda) \sum_{a' < a,\, j \in S_1^a} g_j^{a',a}(S_1)\, x_p^{j a'}(\pi) + \sum_{k=2}^K (m_{j_k,a_k} - m_{j_{k-1},a_{k-1}}) \sum_{a' < a,\, j \in S_k^a} g_j^{a',a}(S_k)\, x_p^{j a'}(\pi) \ge \lambda G_p(S_1) = \hat{V}_p(\lambda, S_1),$
where we have used Lemmas 15(a) and 16(a) and conditions (PCLI1, PCLI2). Hence, policy S 1 is λ -optimal.
Conversely, suppose that policy S 1 is λ -optimal. Then, using Lemma 13(f), we obtain
$0 \le V_p(\lambda, S_2) - V_p(\lambda, S_1) = (m_{j_1,a_1} - \lambda)\, g_{j_1}^{a_1-1,\,a_1}(S_2)\, x_p^{j_1 a_1}(S_1).$
Now, since $g_{j_1}^{a_1-1,\,a_1}(S_2) > 0$ (by (PCLI1)) and $x_p^{j_1 a_1}(S_1) > 0$ (because $p$ has full support), it follows that $\lambda \le m_{j_1,a_1}$.
Consider now (ii). If $m_{j_{l-1},a_{l-1}} \le \lambda \le m_{j_l,a_l}$ for some $l$ with $2 \le l \le K$, we have, for any policy $\pi$,
$\hat{V}_p(\lambda, \pi) = m_{j_1,a_1} G_p(S_1) + \sum_{k=2}^{l-1} (m_{j_k,a_k} - m_{j_{k-1},a_{k-1}}) \Big[ G_p(\pi) + \sum_{a' < a,\, j \in S_k^a} g_j^{a',a}(S_k)\, x_p^{j a'}(\pi) \Big] + (\lambda - m_{j_{l-1},a_{l-1}}) \Big[ G_p(\pi) + \sum_{a' < a,\, j \in S_l^a} g_j^{a',a}(S_l)\, x_p^{j a'}(\pi) \Big] + (m_{j_l,a_l} - \lambda) \sum_{a' < a,\, j \in S_l^a} g_j^{a',a}(S_l)\, x_p^{j a'}(\pi) + \sum_{k=l+1}^K (m_{j_k,a_k} - m_{j_{k-1},a_{k-1}}) \sum_{a' < a,\, j \in S_k^a} g_j^{a',a}(S_k)\, x_p^{j a'}(\pi) \ge \hat{V}_p(\lambda, S_l),$
where we have further used Lemmas 15(b) and 16(b), Proposition 2, and conditions (PCLI1, PCLI2). Hence, policy S l is λ -optimal.
Conversely, suppose that policy S l is λ -optimal. Then, using Lemma 13(e, f), we obtain
$0 \le V_p(\lambda, S_{l+1}) - V_p(\lambda, S_l) = (m_{j_l,a_l} - \lambda)\, g_{j_l}^{a_l-1,\,a_l}(S_{l+1})\, x_p^{j_l a_l}(S_l)$
and
$0 \le V_p(\lambda, S_{l-1}) - V_p(\lambda, S_l) = (\lambda - m_{j_{l-1},a_{l-1}})\, g_{j_{l-1}}^{a_{l-1}-1,\,a_{l-1}}(S_l)\, x_p^{j_{l-1} a_{l-1}}(S_{l-1}).$
Now, since $g_{j_l}^{a_l-1,\,a_l}(S_{l+1}) > 0$, $g_{j_{l-1}}^{a_{l-1}-1,\,a_{l-1}}(S_l) > 0$, $x_p^{j_l a_l}(S_l) > 0$, and $x_p^{j_{l-1} a_{l-1}}(S_{l-1}) > 0$, it follows that $m_{j_{l-1},a_{l-1}} \le \lambda \le m_{j_l,a_l}$.
Finally, consider (iii). If $\lambda \ge m_{j_K,a_K}$, we can write, for any policy $\pi$,
$\hat{V}_p(\lambda, \pi) = m_{j_1,a_1} G_p(S_1) + \sum_{k=2}^K (m_{j_k,a_k} - m_{j_{k-1},a_{k-1}}) \Big[ G_p(\pi) + \sum_{a' < a,\, j \in S_k^a} g_j^{a',a}(S_k)\, x_p^{j a'}(\pi) \Big] + (\lambda - m_{j_K,a_K})\, G_p(\pi) \ge \hat{V}_p(\lambda, S_{K+1}),$
where we have further used Lemmas 15(c) and 16(c), Proposition 2, and conditions (PCLI1, PCLI2). Hence, policy S K + 1 is λ -optimal.
Conversely, suppose that policy $S_{K+1}$ is $\lambda$-optimal. Then, using Lemma 13(f), we obtain
$0 \le V_p(\lambda, S_K) - V_p(\lambda, S_{K+1}) = (\lambda - m_{j_K,a_K})\, g_{j_K}^{a_K-1,\,a_K}(S_{K+1})\, x_p^{j_K a_K}(S_K).$
Now, since $g_{j_K}^{a_K-1,\,a_K}(S_{K+1}) > 0$ and $x_p^{j_K a_K}(S_K) > 0$, it follows that $\lambda \ge m_{j_K,a_K}$. This completes the proof.    □

8. Application to Multi-Armed Multi-Gear Bandit Problem: Bound and Index Policy

8.1. The Multi-Armed Multi-Gear Bandit Problem (MAMGBP)

Besides the intrinsic interest of the indexability property in Definition 1 for the optimal solution of the multi-gear bandit model, we next discuss, as further motivation for such a property, its application to the design of a suboptimal heuristic policy for the intractable multi-armed multi-gear bandit problem (MAMGBP) introduced by the author in [35] (where it was called the multi-armed multi-mode bandit problem).
The MAMGBP concerns the optimal dynamic allocation of a single shared resource to a finite collection of $L$ projects modeled as multi-gear bandits, subject to a peak resource constraint stating that the total resource usage in each period cannot exceed a given amount $\bar{q}$. Denote by $s_l(t)$ and $a_l(t)$ the state and the action at time $t$ for project $l = 1, \dots, L$, which belong to the state and action spaces $N_l = \{1, \dots, N_l\}$ and $A_l = \{0, \dots, A_l\}$, respectively. The parameters of project $l$ are denoted here by $h_l(j_l, a_l)$, $q_l(j_l, a_l)$, and $p_l^{a_l}(i_l, j_l)$.
The MAMGBP is a multi-dimensional MDP with joint state $\mathbf{s}(t) = (s_l(t))_{l=1}^L$ belonging to the joint state space $\mathbf{N} \triangleq \prod_{l=1}^L N_l$ and joint action $\mathbf{a}(t) = (a_l(t))_{l=1}^L$.
The joint holding cost and joint resource consumption are additive across projects, being $h(\mathbf{i}, \mathbf{a}) \triangleq \sum_{l=1}^L h_l(i_l, a_l)$ and $q(\mathbf{i}, \mathbf{a}) \triangleq \sum_{l=1}^L q_l(i_l, a_l)$ in joint state $\mathbf{i} = (i_l)_{l=1}^L$ under joint action $\mathbf{a} = (a_l)_{l=1}^L$. The set of feasible actions in joint state $\mathbf{i}$, satisfying the aforementioned peak resource constraint, is
$\mathbf{A}(\mathbf{i}) \triangleq \Big\{ \mathbf{a} \in \prod_{l=1}^L A_l :\ q(\mathbf{i}, \mathbf{a}) \le \bar{q} \Big\}.$
To ensure that there always exists a feasible joint action, we require that, for every joint state $\mathbf{i}$,
$q(\mathbf{i}, \mathbf{0}) \le \bar{q}.$
Individual project state transitions are conditionally independent given that the actions at every project have been selected, and hence the joint transition probabilities are multiplicative across projects, being given by $p^{\mathbf{a}}(\mathbf{i}, \mathbf{j}) \triangleq \prod_{l=1}^L p_l^{a_l}(i_l, j_l)$.
Let $\Pi(\bar{q})$ be the class of history-dependent randomized policies for selecting a feasible joint action at each time period, where we make explicit its dependence on $\bar{q}$, and denote by $\mathbb{E}_{\mathbf{i}}^\pi[\cdot]$ the expectation under policy $\pi \in \Pi(\bar{q})$ starting from the joint state $\mathbf{i}$. The expected total discounted holding cost incurred under policy $\pi$ starting from $\mathbf{i}$ is
$F(\mathbf{i}, \pi) \triangleq \mathbb{E}_{\mathbf{i}}^\pi \Big[ \sum_{l=1}^L \sum_{t=0}^\infty h_l(s_l(t), a_l(t))\, \beta^t \Big],$
and hence the optimal holding cost is
$F(\mathbf{i}) \triangleq \inf \{ F(\mathbf{i}, \pi) :\ \pi \in \Pi(\bar{q}) \}.$
We can thus formulate the MAMGBP as follows:
$(\mathrm{P}) \quad \text{find } \pi \in \Pi(\bar{q}) :\ F(\mathbf{i}, \pi) = F(\mathbf{i}), \quad \mathbf{i} \in \mathbf{N}.$
We shall refer to a policy $\pi$ solving the MAMGBP $(\mathrm{P})$ as a P-optimal policy.
Again, standard results in MDP theory ensure the existence of a P-optimal policy $\pi$ in the class $\Pi^{\mathrm{SD}}$ of stationary deterministic policies, which is determined by the Bellman equations
$F(\mathbf{i}) = \min_{\mathbf{a} \in \mathbf{A}(\mathbf{i})} \Big\{ h(\mathbf{i}, \mathbf{a}) + \beta \sum_{\mathbf{j} \in \mathbf{N}} p^{\mathbf{a}}(\mathbf{i}, \mathbf{j})\, F(\mathbf{j}) \Big\}, \quad \mathbf{i} \in \mathbf{N}.$
However, these equations are hindered by the curse of dimensionality, as the size of the state space N grows exponentially with the number L of projects, which renders them computationally intractable in practice for all but small L.
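To make this computational burden concrete, the following minimal sketch (in Python; the data layout h[l][i, a], q[l][i, a], P[l][a][i, j] and the routine name are assumptions of this illustration, not notation from the paper) solves the above Bellman equations by value iteration over the joint state and action spaces. Since both spaces are Cartesian products over projects, the running time grows exponentially in L, so this is only workable for very small instances.

import itertools
import numpy as np

def joint_value_iteration(h, q, P, q_bar, beta, n_iter=500):
    """Value iteration for the MAMGBP Bellman equations on the joint space.

    h[l][i, a], q[l][i, a]: holding cost and resource usage of project l;
    P[l][a][i, j]: transition probabilities of project l under gear a.
    Intended only for a handful of projects: the joint state and action
    spaces are Cartesian products and grow exponentially with L.
    """
    L = len(h)
    states = list(itertools.product(*[range(h[l].shape[0]) for l in range(L)]))
    actions = list(itertools.product(*[range(h[l].shape[1]) for l in range(L)]))
    F = {s: 0.0 for s in states}
    for _ in range(n_iter):
        F_new = {}
        for s in states:
            best = np.inf
            for a in actions:
                # peak resource constraint defining the feasible set A(i)
                if sum(q[l][s[l], a[l]] for l in range(L)) > q_bar:
                    continue
                cost = sum(h[l][s[l], a[l]] for l in range(L))
                # expected continuation cost under the product transition law
                exp_next = 0.0
                for s2 in states:
                    prob = 1.0
                    for l in range(L):
                        prob *= P[l][a[l]][s[l], s2[l]]
                    exp_next += prob * F[s2]
                best = min(best, cost + beta * exp_next)
            F_new[s] = best
        F = F_new
    return F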

8.2. A Bound for the MAMGBP

Ref. [35] introduced a Lagrangian approach for obtaining a lower bound on the optimal value of the MAMGBP, extending that of Whittle [8] for the case of two-gear projects. First, we construct a relaxation of problem $(\mathrm{P})$ by (i) relaxing the class of admissible policies from $\Pi(\bar{q})$ to $\Pi(\infty)$, thus allowing violations of the sample-path peak resource constraints
$q(\mathbf{s}(t), \mathbf{a}(t)) \le \bar{q}, \quad t = 0, 1, \dots,$
and (ii) replacing the latter by the following aggregate relaxed version in expectation:
$\mathbb{E}_{\mathbf{i}}^\pi \Big[ \sum_{t=0}^\infty q(\mathbf{s}(t), \mathbf{a}(t))\, \beta^t \Big] \le \frac{\bar{q}}{1-\beta}.$
This leads to the following relaxation of problem $(\mathrm{P})$ in (35):
$(\hat{\mathrm{P}}) \quad \text{minimize } \mathbb{E}_{\mathbf{i}}^\pi \Big[ \sum_{t=0}^\infty h(\mathbf{s}(t), \mathbf{a}(t))\, \beta^t \Big] \quad \text{subject to: } \pi \in \Pi(\infty), \quad \mathbb{E}_{\mathbf{i}}^\pi \Big[ \sum_{t=0}^\infty q(\mathbf{s}(t), \mathbf{a}(t))\, \beta^t \Big] \le \frac{\bar{q}}{1-\beta}.$
The relaxed problem $(\hat{\mathrm{P}})$ is a constrained MDP (see, e.g., [48]), for which an optimal policy generally depends on the initial state $\mathbf{i}$. Such problems are amenable to a Lagrangian approach. Introducing a non-negative multiplier $\lambda \ge 0$ attached to the constraint in (37), we can dualize the latter, i.e., bring it into the objective, obtaining the Lagrangian relaxation
$(\hat{\mathrm{P}}_\lambda) \quad \text{minimize}_{\pi \in \Pi(\infty)}\ \mathbb{E}_{\mathbf{i}}^\pi \Big[ \sum_{t=0}^\infty h(\mathbf{s}(t), \mathbf{a}(t))\, \beta^t \Big] + \lambda \Big( \mathbb{E}_{\mathbf{i}}^\pi \Big[ \sum_{t=0}^\infty q(\mathbf{s}(t), \mathbf{a}(t))\, \beta^t \Big] - \frac{\bar{q}}{1-\beta} \Big).$
Note that, for any initial joint state $\mathbf{i}$ and multiplier $\lambda \ge 0$, the optimal cost $\hat{V}(\mathbf{i}, \lambda)$ of $(\hat{\mathrm{P}}_\lambda)$ is a lower bound for that of relaxed problem $(\hat{\mathrm{P}})$, which we denote by $\hat{F}(\mathbf{i})$. In turn, the latter gives a lower bound for the optimal cost $F(\mathbf{i})$ of $(\mathrm{P})$. Thus,
$\hat{V}(\mathbf{i}, \lambda) \le \hat{F}(\mathbf{i}) \le F(\mathbf{i}).$
In light of (39), we are interested in finding an optimal multiplier $\lambda(\mathbf{i})$ solving the dual problem
$(\mathrm{D}) \quad \text{maximize}_{\lambda \ge 0}\ \hat{V}(\mathbf{i}, \lambda).$
Since $\hat{V}(\mathbf{i}, \lambda)$ is a concave function of $\lambda$, being a minimum of linear functions of $\lambda$, a local maximum of problem $(\mathrm{D})$ is a global maximum. Furthermore, since the above problems can be formulated as finite linear optimization (LO) problems, the strong duality property of the latter ensures the existence of an optimal multiplier $\lambda(\mathbf{i}) \ge 0$ solving problem $(\mathrm{D})$ which attains the upper bound $\hat{F}(\mathbf{i})$, i.e., with $\hat{V}(\mathbf{i}, \lambda(\mathbf{i})) = \hat{F}(\mathbf{i})$. This corresponds to the satisfaction of the complementary slackness property
$\lambda(\mathbf{i}) \Big( \mathbb{E}_{\mathbf{i}}^{\pi} \Big[ \sum_{t=0}^\infty q(\mathbf{s}(t), \mathbf{a}(t))\, \beta^t \Big] - \frac{\bar{q}}{1-\beta} \Big) = 0,$
where $\pi$ is an optimal policy for problem $(\hat{\mathrm{P}}_{\lambda(\mathbf{i})})$.
Now, since individual project state transitions are conditionally independent given that a joint action has been selected, it suffices, as noted by Whittle [8] for the case of two-gear projects, to consider in $(\hat{\mathrm{P}}_\lambda)$ decoupled policies $\pi = (\pi_l)_{l=1}^L$, where $\pi_l \in \Pi_l$ and $\Pi_l$ is the class of admissible policies for operating project $l$ as if it were in isolation. This allows us to reformulate problem $(\hat{\mathrm{P}}_\lambda)$ as
$(\hat{\mathrm{P}}_\lambda) \quad \text{minimize } \sum_{l=1}^L \mathbb{E}_{i_l}^{\pi_l} \Big[ \sum_{t=0}^\infty \big( h_l(s_l(t), a_l(t)) + \lambda q_l(s_l(t), a_l(t)) \big)\, \beta^t \Big] - \lambda \frac{\bar{q}}{1-\beta} \quad \text{subject to: } \pi_l \in \Pi_l,\ l = 1, \dots, L.$
We can thus decouple problem $(\hat{\mathrm{P}}_\lambda)$ into the individual project subproblems
$(\hat{\mathrm{P}}_{l,\lambda}) \quad \text{minimize}_{\pi_l \in \Pi_l}\ \mathbb{E}_{i_l}^{\pi_l} \Big[ \sum_{t=0}^\infty \big( h_l(s_l(t), a_l(t)) + \lambda q_l(s_l(t), a_l(t)) \big)\, \beta^t \Big],$
for $l = 1, \dots, L$. Denoting by $V_l(i_l, \lambda)$ the minimum cost objective of subproblem $(\hat{\mathrm{P}}_{l,\lambda})$, it follows that the optimal cost $\hat{V}(\mathbf{i}, \lambda)$ of Lagrangian relaxation $(\hat{\mathrm{P}}_\lambda)$ is decoupled as
$\hat{V}(\mathbf{i}, \lambda) = \sum_{l=1}^L V_l(i_l, \lambda) - \lambda \frac{\bar{q}}{1-\beta},$
which allows us to reformulate dual problem $(\mathrm{D})$ in (40) as
$(\mathrm{D}) \quad \text{maximize}_{\lambda \ge 0}\ \sum_{l=1}^L V_l(i_l, \lambda) - \lambda \frac{\bar{q}}{1-\beta}.$
Now, suppose that each project $l$ is indexable with DAI $\lambda_l(j_l, a_l)$, so such indices characterize, as in Definition 1, the optimal policies for the individual project subproblems $(\hat{\mathrm{P}}_{l,\lambda})$, which facilitates the evaluation of the optimal costs $V_l(i_l, \lambda)$ and hence the computational solution of dual problem $(\mathrm{D})$. For such a purpose, one can use the result that, if $\pi_l(\lambda)$ is an optimal policy for subproblem $(\hat{\mathrm{P}}_{l,\lambda})$, then $\mathbb{E}_{i_l}^{\pi_l(\lambda)} \big[ \sum_{t=0}^\infty q_l(s_l(t), a_l(t))\, \beta^t \big]$ is a subgradient of $V_l(i_l, \lambda)$, seen as a function of $\lambda$.
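For illustration, the following sketch exploits this structure to evaluate the Lagrangian bound numerically: since the dual function is concave and its slope at $\lambda$ is the total expected discounted resource usage minus $\bar{q}/(1-\beta)$, a bisection on the sign of that slope locates an optimal multiplier. The routine name project_value(l, i_l, lam), assumed to return the optimal cost of subproblem $(\hat{\mathrm{P}}_{l,\lambda})$ together with the corresponding expected discounted resource usage, is a hypothetical interface introduced only for this example.

def dual_bound(project_value, L, i, q_bar, beta, lam_max=1e3, tol=1e-6):
    """Lagrangian lower bound max_{lam >= 0} V_hat(i, lam) for the MAMGBP.

    project_value(l, i_l, lam) is assumed to return the pair (V_l, G_l):
    the optimal cost of subproblem (P_hat_{l,lam}) and the expected total
    discounted resource usage of an optimal policy, the latter being the
    slope (subgradient) of V_l(i_l, .) at lam.
    """
    target = q_bar / (1.0 - beta)

    def slope(lam):
        # slope of the dual function: total discounted usage minus q_bar/(1-beta)
        return sum(project_value(l, i[l], lam)[1] for l in range(L)) - target

    if slope(0.0) <= 0.0:
        lam_star = 0.0                      # relaxed constraint already slack
    else:
        lo, hi = 0.0, lam_max
        while hi - lo > tol:                # bisection: slope is nonincreasing
            mid = 0.5 * (lo + hi)
            if slope(mid) > 0.0:
                lo = mid
            else:
                hi = mid
        lam_star = 0.5 * (lo + hi)

    value = sum(project_value(l, i[l], lam_star)[0] for l in range(L))
    return lam_star, value - lam_star * target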

8.3. A Downshift Index Policy for the MAMGBP

Assuming that individual projects are indexable, the author proposed in [35] a suboptimal heuristic index policy for the above MAMGBP based on the projects' DAIs. Here we present a different heuristic index policy based on individual project DAIs, which is more easily implementable than that in [35].
Suppose that at time $t$ the joint state is $\mathbf{j} = (j_l)_{l=1}^L$. Consider the project DAIs evaluated at such states, $\lambda_l(j_l, a_l)$, for projects $l = 1, \dots, L$. The proposed index policy is described in Algorithm 2, which specifies how to obtain the joint action $\hat{\mathbf{a}} = (\hat{a}_l)_{l=1}^L$ prescribed in such a joint state.
Algorithm 2: Downshift index policy for the MAMGBP.
Input: $\mathbf{j} = (j_l)_{l=1}^L$ (current joint state)
Output: $\hat{\mathbf{a}} = (\hat{a}_l)_{l=1}^L$ (prescribed joint action)
Initialization: $a_l := A_l$, $l = 1, \dots, L$
Loop:
while $\sum_{l=1}^L q_l(j_l, a_l) > \bar{q}$ or $\min_{1 \le l \le L:\ a_l \ge 1} \lambda_l(j_l, a_l) < 0$ do
          pick $\hat{l} \in \arg\min_{1 \le l \le L:\ a_l \ge 1} \lambda_l(j_l, a_l)$
          $a_{\hat{l}} := a_{\hat{l}} - 1$ (downshift gear in project $\hat{l}$)
end {while}
$\hat{\mathbf{a}} := \mathbf{a} = (a_l)_{l=1}^L$
In short, the algorithm starts by assigning the highest possible gear $A_l$ to each project $l$. If this joint action is feasible, in that it does not violate the peak resource constraint, and all the corresponding DAIs are non-negative, it is the prescribed joint action. Otherwise, the algorithm proceeds by downshifting one of the projects to the next lower gear. The chosen project is one with minimum DAI at its current gear. The algorithm proceeds until the peak resource constraint is satisfied and the DAIs at the projects with prescribed active actions, if any, are non-negative. In light of the above, we call the policy resulting from this algorithm the downshift index policy.
The intuition behind the design of such a policy is that projects should be operated in such a way that two conflicting goals are balanced: (1) higher gears are to be preferred to lower gears whenever possible; and (2) the resulting joint action must be feasible, satisfying the peak resource constraint. The proposed downshift index policy is designed to strike such a balance. If a joint action is not feasible so that a project must be downshifted to a lower gear, the chosen project is one where the loss in performance due to such a downshift, for which the project DAIs are used as a proxy measure, is minimal.
Note that the downshift index policy reduces to the Whittle index policy in the case of two-gear projects.
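A direct implementation of Algorithm 2 might look as follows (a minimal sketch; the table dai[l][state, gear] of precomputed project DAIs and the array-based data layout are assumptions of this illustration).

def downshift_index_policy(j, dai, q, q_bar, A):
    """Downshift index policy of Algorithm 2.

    j[l]: current state of project l; dai[l][state, gear]: its DAI table;
    q[l][state, gear]: its resource usage; A[l]: its highest gear.
    Returns the prescribed joint action as a list of gears.
    """
    L = len(j)
    a = [A[l] for l in range(L)]                    # start from the highest gears

    def usage():
        return sum(q[l][j[l], a[l]] for l in range(L))

    while True:
        # projects that can still be downshifted, with their current DAIs
        cand = [(dai[l][j[l], a[l]], l) for l in range(L) if a[l] >= 1]
        if not cand:
            break
        if usage() <= q_bar and min(cand)[0] >= 0:
            break                                   # feasible and all DAIs non-negative
        _, l_hat = min(cand)                        # project with minimum DAI
        a[l_hat] -= 1                               # downshift its gear
    return a

Note that each pass of the loop decreases one gear by one unit, so at most $\sum_{l=1}^L A_l$ downshifts are ever performed before the prescribed joint action is returned.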

9. Some Extensions

This section presents some extensions to the above framework.

9.1. Extension to the Long-Run Average Cost Criterion

The above results for the discounted cost criterion readily extend to the (long-run) average cost criterion (see, e.g., ([47] Ch. 8)) under appropriate ergodicity conditions. Consider the average cost, including holding and resource usage costs (charged at price $\lambda$), of running the project starting from state $i$ under a policy $\pi \in \Pi$, defined by
$\bar{V}_i(\lambda, \pi) \triangleq \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}_i^\pi \Big[ \sum_{t=0}^{T-1} \big( h_{s(t)}^{a(t)} + \lambda q_{s(t)}^{a(t)} \big) \Big].$
We further define the corresponding optimal cost
$\bar{V}_i(\lambda) \triangleq \inf_{\pi \in \Pi} \bar{V}_i(\lambda, \pi).$
We can thus formulate the project's average $\lambda$-price problem as
$(\bar{\mathrm{P}}_\lambda) \quad \text{find } \pi(\lambda) \in \Pi :\ \bar{V}_i(\lambda, \pi(\lambda)) = \bar{V}_i(\lambda), \quad i \in N.$
We shall refer to a policy $\pi(\lambda)$ solving the average $\lambda$-price problem $(\bar{\mathrm{P}}_\lambda)$ as a $\lambda$-optimal policy.
We shall make the following assumption.
Assumption 2.
The following conditions hold:
(i) 
The model is weakly accessible, so the state space $N$ can be partitioned into two subsets $N^{\mathrm{tr}}$ and $N^{\mathrm{acc}}$, such that (i.a) all states in $N^{\mathrm{tr}}$ are transient under every stationary policy and (i.b) for every two states $i$ and $j$ in $N^{\mathrm{acc}}$, $j$ is accessible from $i$, so there exists a stationary policy $\pi$ and a positive integer $t$ such that $P_i^\pi\{s(t) = j\} > 0$.
(ii) 
Every policy S F is unichain, i.e., it induces a single recurrent class plus possible additional transient states.
Now, by standard results in average-cost MDP theory (see [49] (Sec. 5.2)), Assumption 2(i) ensures the existence of a $\lambda$-optimal policy $\pi(\lambda) \in \Pi^{\mathrm{SD}}$, with the optimal average cost $\bar{V}_i(\lambda)$ being independent of the initial state, i.e., $\bar{V}_i(\lambda) \equiv \bar{V}(\lambda)$.
Now, by using the Laurent series expansions for finite-state and -action MDP models (see Corollaries 8.2.4 and 8.2.5 in [47]), we have the following. For any stationary deterministic policy π Π SD ,
$\bar{x}_i^{j a}(\pi) \triangleq \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_i^\pi \Big[ \sum_{t=0}^{T-1} 1\{ s(t) = j,\, a(t) = a \} \Big] = \lim_{\beta \uparrow 1} (1-\beta)\, x_i^{j a}(\pi),$
$\bar{F}_i(\pi) \triangleq \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_i^\pi \Big[ \sum_{t=0}^{T-1} h_{s(t)}^{a(t)} \Big] = \lim_{\beta \uparrow 1} (1-\beta)\, F_i(\pi),$
$\bar{G}_i(\pi) \triangleq \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_i^\pi \Big[ \sum_{t=0}^{T-1} q_{s(t)}^{a(t)} \Big] = \lim_{\beta \uparrow 1} (1-\beta)\, G_i(\pi).$
Furthermore, for any $S = (S^0, \dots, S^A) \in \mathcal{F}$, Assumption 2(ii) ensures that the above metrics do not depend on the initial state $i$, so we can write $\bar{x}^{j a}(S)$, $\bar{F}(S)$, and $\bar{G}(S)$. Moreover, we have the Laurent series expansions
$F_i(S) = \frac{\bar{F}(S)}{1-\beta} + \varphi_i(S) + O(1-\beta), \quad \text{as } \beta \uparrow 1,$
and
$G_i(S) = \frac{\bar{G}(S)}{1-\beta} + \gamma_i(S) + O(1-\beta), \quad \text{as } \beta \uparrow 1,$
where the bias terms $\varphi_i(S)$ and $\gamma_i(S)$ are determined, up to an additive constant, by the evaluation equations
$\bar{F}(S) + \varphi_i(S) = h_i^a + \sum_{j \in N} p_{ij}^a\, \varphi_j(S), \quad i \in S^a,\ a = 0, \dots, A,$
and
$\bar{G}(S) + \gamma_i(S) = q_i^a + \sum_{j \in N} p_{ij}^a\, \gamma_j(S), \quad i \in S^a,\ a = 0, \dots, A.$
From the above and (9) and (10), we can define the average marginal (holding) cost metric
$\bar{f}_i^{a',a}(S) \triangleq h_i^{a'} - h_i^a + \sum_{j \in N} p_{ij}^{a'}\, \varphi_j(S) - \sum_{j \in N} p_{ij}^a\, \varphi_j(S) = \lim_{\beta \uparrow 1} f_i^{a',a}(S)$
and the average marginal resource (usage) metric
$\bar{g}_i^{a',a}(S) \triangleq q_i^a - q_i^{a'} + \sum_{j \in N} p_{ij}^a\, \gamma_j(S) - \sum_{j \in N} p_{ij}^{a'}\, \gamma_j(S) = \lim_{\beta \uparrow 1} g_i^{a',a}(S).$
If $\bar{g}_i^{a',a}(S) > 0$, we further define the average MP metric
$\bar{m}_i^{a',a}(S) \triangleq \frac{\bar{f}_i^{a',a}(S)}{\bar{g}_i^{a',a}(S)} = \lim_{\beta \uparrow 1} m_i^{a',a}(S).$
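As a concrete illustration of these definitions, the following sketch (Python; the data layout h[i, a], q[i, a], P[a][i, j] and the normalization of the biases at a reference state i0 are assumptions of this example) evaluates $\bar{F}(S)$, $\bar{G}(S)$ and the bias terms of a unichain policy $S$ from the evaluation equations above, and then computes the average marginal metrics following the expressions just given.

import numpy as np

def average_metrics(h, q, P, s_action, i0=0):
    """Average cost, average resource usage, and bias terms of a unichain policy S.

    h[i, a], q[i, a]: holding cost and resource usage; P[a][i, j]: transitions;
    s_action[i]: the gear that policy S selects in state i.
    Biases are normalized so that phi[i0] = gamma[i0] = 0.
    """
    N = h.shape[0]
    P_S = np.array([P[s_action[i]][i, :] for i in range(N)])
    h_S = np.array([h[i, s_action[i]] for i in range(N)])
    q_S = np.array([q[i, s_action[i]] for i in range(N)])
    M = np.eye(N) - P_S
    cols = [c for c in range(N) if c != i0]
    A = np.column_stack([np.ones(N), M[:, cols]])   # unknowns: (F_bar, phi without i0)
    z_f = np.linalg.solve(A, h_S)
    z_g = np.linalg.solve(A, q_S)
    phi = np.zeros(N)
    gamma = np.zeros(N)
    phi[cols] = z_f[1:]
    gamma[cols] = z_g[1:]
    return z_f[0], z_g[0], phi, gamma               # F_bar, G_bar, phi, gamma

def average_marginal_metrics(h, q, P, phi, gamma, i, a_low, a_high):
    """Average marginal metrics for a pair of gears a_low < a_high at state i,
    following the expressions above."""
    f_bar = (h[i, a_low] - h[i, a_high]
             + P[a_low][i, :] @ phi - P[a_high][i, :] @ phi)
    g_bar = (q[i, a_high] - q[i, a_low]
             + P[a_high][i, :] @ gamma - P[a_low][i, :] @ gamma)
    m_bar = f_bar / g_bar if g_bar > 0 else float("nan")
    return f_bar, g_bar, m_bar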
We thus have the following verification theorem, which is the average-criterion counterpart to Theorem 1 for the discounted criterion. Note that it refers to the corresponding concepts for the average criterion of $\mathcal{F}$-indexability and PCL($\mathcal{F}$)-indexability, and to the downshift adaptive-greedy algorithm $\overline{\mathrm{DS}}(\mathcal{F})$, which is as algorithm $\mathrm{DS}(\mathcal{F})$ but uses the average marginal metrics instead of the discounted ones.
Theorem 2.
If the average cost model is PCL($\mathcal{F}$)-indexable, then it is $\mathcal{F}$-indexable, with DAI $\bar{\lambda}_{j,a}$ given by the MPI $\bar{m}_{j,a}$.

9.2. Models with Uncontrollable States

In the above framework, we have assumed that the DAI $\lambda_{j,a}$ is defined for all project states $j \in N$. Yet, in some models, this need not be the case, in particular in those having uncontrollable states. We call a project state $i$ uncontrollable if only one action is available at $i$ or, equivalently, if all actions $a$ give the same transition probabilities, so $p_{ij}^a = p_{ij}^0$ for all $a$. This concept was considered by the author in the corresponding framework for two-gear projects developed in [29]. If there are uncontrollable states, we decompose the state space as $N = N^{\mathrm{cont}} \cup N^{\mathrm{unc}}$, where $N^{\mathrm{cont}}$ is the controllable state space and $N^{\mathrm{unc}}$ is the uncontrollable state space.
In such a case, the above framework carries over by defining the concepts of indexability and DAI with the focus on the controllable state space $N^{\mathrm{cont}}$, so the DAI $\lambda_{j,a}$ will only be defined for states $j \in N^{\mathrm{cont}}$. The required adaptations are straightforward. For example, the policy notation $S = (S^0, \dots, S^A)$ used above can now be interpreted as meaning that, under such a policy, action $a$ is taken in controllable states $j \in S^a$ for $a = 0, \dots, A$, as $S^0, \dots, S^A$ is now a partition of $N^{\mathrm{cont}}$.

9.3. Models with a Countably Infinite State Space

The extension of the above framework to models with a countably infinite state space raises issues mainly in the definition of the downshift adaptive-greedy algorithm. In particular, the algorithm would not terminate, and it might not traverse the entire space of pairs $(j, a)$ for which the index is defined. Furthermore, it might entail choosing among infinitely many state–action pairs $(j, a)$ at each step.
Yet, in some countably infinite state space models such issues are easily addressed. Consider, e.g., a model that might arise in queueing theory, where the state is the number of customers in the system, so the state space is the set of non-negative integers, $N \triangleq \{0, 1, 2, \dots\}$. Imagine that the actions or gears $a$ correspond to server speeds, so higher gears give faster service rates. The holding costs are used to model penalties (possibly nonlinear) for congestion.
In such a setting, it is natural to conjecture that optimal policies should be multi-threshold policies. Any such policy is characterized by thresholds $z_1 \le z_2 \le \dots \le z_A$, with the interpretation that gear 0 is used in states $1 \le j \le z_1$, gear $a$ is used in states $z_a < j \le z_{a+1}$ for $a = 1, \dots, A-1$, and gear $A$ is used in states $j > z_A$. Note that the optimality of such policies has been established in some queueing models, see, e.g., [40,41,42,43].
The present framework would be applied to such a setting as follows. Rather than considering directly such multi-threshold policies and trying to establish their optimality, one would postulate a corresponding family of policies $\mathcal{F}$. Note that in such a model state 0 would be uncontrollable, as there is no meaningful choice of action when the queue is empty. Thus, excluding state 0 from consideration, the postulated family $\mathcal{F}$ would consist of partitions $S = (S^0, \dots, S^A)$ of the controllable state space with the following property: if gear $a$ is selected in a state $j$ (i.e., $j \in S^a$), then at any lower state $j' < j$ a gear $a' \le a$ must be selected.
It is easy to see that, in this setting, the natural extension of the downshift adaptive-greedy algorithm would indeed traverse the entire space of state–action pairs $(j, a)$ for which the DAI is defined, which would provide an alternative to the approaches previously considered in the aforementioned literature for addressing such problems.
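As a small illustration of the postulated family, the following sketch (with a finite truncation n_max of the queueing state space, introduced purely for illustration) maps a vector of thresholds to the corresponding partition $(S^0, \dots, S^A)$ of the controllable states.

def thresholds_to_partition(z, n_max):
    """Maps thresholds z[0] <= ... <= z[A-1] (i.e., z_1 <= ... <= z_A) to the
    partition (S^0, ..., S^A) of the controllable states {1, ..., n_max}:
    gear 0 on 1 <= j <= z_1, gear a on z_a < j <= z_{a+1}, gear A on j > z_A."""
    A = len(z)
    S = [set() for _ in range(A + 1)]
    for j in range(1, n_max + 1):
        if j <= z[0]:
            gear = 0
        elif j > z[A - 1]:
            gear = A
        else:
            gear = next(k for k in range(1, A) if z[k - 1] < j <= z[k])
        S[gear].add(j)
    return S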

10. Discussion

This paper has introduced novel sufficient conditions for the indexability of multi-gear bandits modeling a dynamic and stochastic project consuming a single resource, along with an efficient index-computing algorithm. This can be used to efficiently solve general MDP models that satisfy such conditions, and the index has further been used to design a heuristic index policy for the more complex multi-armed multi-gear bandit problem. This work opens a number of further avenues for developing such an approach, including the following: developing an efficient implementation of and testing the algorithm; deploying the new PCL-indexability conditions in a variety of relevant models arising in applications; extending the approach to models with a countable state space; and extending the approach to models with a continuous state space.

Funding

This research has been funded in part by the Spanish State Research Agency (Agencia Estatal de Investigación, AEI) under grant PID2019-109196GB-I00/AEI/10.13039/501100011033 and by the Comunidad de Madrid in the setting of the multi-year agreement with Carlos III University of Madrid within the line of activity “Excelencia para el Profesorado Universitario”, in the framework of the V Regional Plan of Scientific Research and Technological Innovation 2016–2020.

Conflicts of Interest

The author declares no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

  1. Bellman, R. A problem in the sequential design of experiments. Indian J. Stat. 1956, 16, 221–229. [Google Scholar]
  2. Gittins, J.C.; Jones, D.M. A dynamic allocation index for the sequential design of experiments. In Progress in Statistics (European Meeting of Statisticians, Budapest, 1972); Gani, J., Sarkadi, K., Vincze, I., Eds.; North-Holland: Amsterdam, The Netherlands, 1974; pp. 241–266. [Google Scholar]
  3. Gittins, J.C. Bandit processes and dynamic allocation indices. J. Roy. Stat. Soc. Ser. B 1979, 41, 148–177. [Google Scholar] [CrossRef] [Green Version]
  4. Gittins, J.C. Multi-Armed Bandit Allocation Indices; Wiley: Chichester, UK, 1989. [Google Scholar]
  5. Whittle, P. Multi-armed bandits and the Gittins index. J. Roy. Stat. Soc. Ser. B 1980, 42, 143–149. [Google Scholar] [CrossRef]
  6. Weber, R. On the Gittins index for multiarmed bandits. Ann. Appl. Probab. 1992, 2, 1024–1033. [Google Scholar] [CrossRef]
  7. Bertsimas, D.; Niño-Mora, J. Conservation laws, extended polymatroids and multiarmed bandit problems; a polyhedral approach to indexable systems. Math. Oper. Res. 1996, 21, 257–306. [Google Scholar] [CrossRef]
  8. Whittle, P. Restless bandits: Activity allocation in a changing world. J. Appl. Probab. 1988, 25, 287–298. [Google Scholar] [CrossRef]
  9. Veatch, M.H.; Wein, L.M. Scheduling a multiclass make-to-stock queue: Index policies and hedging points. Oper. Res. 1996, 44, 634–647. [Google Scholar] [CrossRef] [Green Version]
  10. Niño-Mora, J. Marginal productivity index policies for scheduling a multiclass delay-/loss-sensitive queue. Queueing Syst. 2006, 54, 281–312. [Google Scholar] [CrossRef] [Green Version]
  11. Niño-Mora, J. Marginal productivity index policies for admission control and routing to parallel multi-server loss queues with reneging. In Proceedings of the International Conference on Network Control and Optimization (NET-COOP 2007), Avignon, France, 5–7 June 2007; Chahed, T., Tuffin, B., Eds.; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2007; Volume 4465, pp. 138–149. [Google Scholar]
  12. Kumar, U.D.; Saranga, H. Optimal selection of obsolescence mitigation strategies using a restless bandit model. Eur. J. Oper. Res. 2010, 200, 170–180. [Google Scholar] [CrossRef]
  13. Washburn, R.B.; Schneider, M.K. Optimal policies for a class of restless multiarmed bandit scheduling problems with applications to sensor management. J. Adv. Inform. Fusion 2008, 3, 3–13. [Google Scholar]
  14. Liu, K.; Zhao, Q. Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access. IEEE Trans. Inform. Theory 2010, 56, 5547–5567. [Google Scholar] [CrossRef]
  15. Borkar, V.S.; Kasbekar, G.S.; Pattathil, S.; Shetty, P.Y. Opportunistic scheduling as restless bandits. IEEE Trans. Control Netw. Syst. 2018, 5, 1952–1961. [Google Scholar] [CrossRef] [Green Version]
  16. Yang, F.; Luo, X. A restless MAB-based index policy for UL pilot allocation in massive MIMO over Gauss–Markov fading channels. IEEE Trans. Veh. Technol. 2020, 69, 3034–3047. [Google Scholar] [CrossRef]
  17. Abbou, A.; Makis, V. Group maintenance: A restless bandits approach. Informs J. Comput. 2019, 31, 719–731. [Google Scholar] [CrossRef]
  18. La Scala, B.F.; Moran, B. Optimal target tracking with restless bandits. Digit. Signal Process. 2006, 16, 479–487. [Google Scholar] [CrossRef]
  19. Dance, C.R.; Silander, T. Optimal policies for observing time series and related restless bandit problems. J. Mach. Learn. Res. 2019, 20, 35. [Google Scholar]
  20. Niño-Mora, J. A faster index algorithm and a computational study for bandits with switching costs. Informs J. Comput. 2008, 20, 255–269. [Google Scholar] [CrossRef]
  21. Niño-Mora, J. Fast two-stage computation of an index policy for multi-armed bandits with setup delays. Mathematics 2021, 9, 52. [Google Scholar] [CrossRef]
  22. Ayer, T.; Zhang, C.; Bonifonte, A.; Spaulding, A.C.; Chhatwal, J. Prioritizing hepatitis C treatment in US prisons. Oper. Res. 2019, 67, 853–873. [Google Scholar] [CrossRef] [Green Version]
  23. Mate, A.; Madaan, L.; Taneja, A.; Madhiwalla, N.; Verma, S.; Singh, G.; Hegde, A.; Varakantham, P.; Tambe, M. Field study in deploying restless multi-armed bandits: Assisting non-profits in improving maternal and child health. arXiv 2019, arXiv:2109.08075. [Google Scholar] [CrossRef]
  24. Fu, J.; Moran, B.; Taylor, P.G. A restless bandit model for resource allocation, competition, and reservation. Oper. Res. 2022, 70, 416–431. [Google Scholar] [CrossRef]
  25. Niño-Mora, J. Restless bandits, partial conservation laws and indexability. Adv. Appl. Probab. 2001, 33, 76–98. [Google Scholar] [CrossRef] [Green Version]
  26. Klimov, G.P. Time-sharing service systems. I. Theory Probab. Appl. 1974, 19, 532–551. [Google Scholar] [CrossRef]
  27. Coffman, E.G., Jr.; Mitrani, I. A characterization of waiting time performance realizable by single-server queues. Oper. Res. 1980, 28, 810–821. [Google Scholar] [CrossRef]
  28. Shanthikumar, J.G.; Yao, D.D. Multiclass queueing systems: Polymatroidal structure and optimal scheduling control. Oper. Res. 1992, 40, S293–S299. [Google Scholar] [CrossRef]
  29. Niño-Mora, J. Dynamic allocation indices for restless projects and queueing admission control: A polyhedral approach. Math. Program. 2002, 93, 361–413. [Google Scholar] [CrossRef]
  30. Niño-Mora, J. Restless bandit marginal productivity indices, diminishing returns and optimal control of make-to-order/make-to-stock M/G/1 queues. Math. Oper. Res. 2006, 31, 50–84. [Google Scholar] [CrossRef]
  31. Niño-Mora, J. Dynamic priority allocation via restless bandit marginal productivity indices. TOP 2007, 15, 161–198. [Google Scholar] [CrossRef]
  32. Niño-Mora, J. A verification theorem for threshold-indexability of real-state discounted restless bandits. Math. Oper. Res. 2020, 45, 465–496. [Google Scholar] [CrossRef] [Green Version]
  33. Niño-Mora, J. A fast-pivoting algorithm for Whittle’s restless bandit index. Mathematics 2020, 8, 2226. [Google Scholar] [CrossRef]
  34. Weber, R. Comments on: Dynamic priority allocation via restless bandit marginal productivity indices. TOP 2007, 15, 211–216. [Google Scholar] [CrossRef]
  35. Niño-Mora, J. An index policy for multiarmed multimode restless bandits. In Proceedings of the 3rd International Conference on Performance Evaluation Methodologies and Tools (ValueTools’08), Athens, Greece, 20–24 October 2008; ICST: Brussels, Belgium, 2008. [Google Scholar] [CrossRef] [Green Version]
  36. Zayas-Cabán, G.; Jasin, S.; Wang, G. An asymptotically optimal heuristic for general nonstationary finite-horizon restless multi-armed, multi-action bandits. Adv. Appl. Probab. 2019, 51, 745–772. [Google Scholar] [CrossRef]
  37. Killian, J.A.; Biswas, A.; Shah, S.; Tambe, M. Q-learning Lagrange policies for multi-action restless bandits. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; ACM: New York, NY, USA, 2021; pp. 871–881. [Google Scholar]
  38. Serfozo, R.F. Monotone optimal policies for Markov decision processes. In Stochastic Systems: Modeling, Identification and Optimization, II; Wets, R.J.B., Ed.; Mathematical Programming Studies; Springer: Berlin/Heidelberg, Germany, 1976; Volume 6, pp. 202–215. [Google Scholar]
  39. Heyman, D.P.; Sobel, M.J. Stochastic Models in Operations Research. Vol. II: Stochastic Optimization; McGraw-Hill: New York, NY, USA, 1984. [Google Scholar]
  40. Crabill, T.B. Optimal control of a service facility with variable exponential service times and constant arrival rate. Manag. Sci. 1972, 18, 560–566. [Google Scholar] [CrossRef]
  41. Sabeti, H. Optimal selection of service rates in queueing with different cost. J. Oper. Res. Soc. Jpn. 1973, 16, 15–35. [Google Scholar]
  42. Ata, B.; Shneorson, S. Dynamic control of an M/M/1 service system with adjustable arrival and service rates. Manag. Sci. 2006, 52, 1778–1791. [Google Scholar] [CrossRef] [Green Version]
  43. Mayorga, M.E.; Ahn, H.S.; Shanthikumar, J.G. Optimal control of a make-to-stock system with adjustable service rate. Probab. Eng. Inform. Sci. 2006, 20, 609–634. [Google Scholar] [CrossRef] [Green Version]
  44. Ye, Y.Y. The simplex and policy-iteration methods are strongly polynomial for the Markov Decision Problem with a fixed discount rate. Math. Oper. Res. 2011, 36, 593–603. [Google Scholar] [CrossRef] [Green Version]
  45. Scherrer, B. Improved and generalized upper bounds on the complexity of policy iteration. Math. Oper. Res. 2016, 41, 758–774. [Google Scholar] [CrossRef] [Green Version]
  46. Hollanders, R.; Gerencser, B.; Delvenne, J.C.; Jungers, R.M. Improved bound on the worst case complexity of Policy Iteration. Oper. Res. Lett. 2016, 44, 267–272. [Google Scholar] [CrossRef] [Green Version]
  47. Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming; Wiley: New York, NY, USA, 1994. [Google Scholar]
  48. Altman, E. Constrained Markov Decision Processes; Chapman & Hall/CRC: Boca Raton, FL, USA, 1999. [Google Scholar]
  49. Bertsekas, D.P. Dynamic Programming and Optimal Control, 4th ed.; Athena Scientific: Belmont, MA, USA, 2012; Volume 2. [Google Scholar]