Article

A Fast-Pivoting Algorithm for Whittle’s Restless Bandit Index

Department of Statistics, Carlos III University of Madrid, 28903 Getafe, Spain
Mathematics 2020, 8(12), 2226; https://doi.org/10.3390/math8122226
Submission received: 20 November 2020 / Revised: 8 December 2020 / Accepted: 10 December 2020 / Published: 15 December 2020
(This article belongs to the Special Issue Applied Probability)

Abstract

The Whittle index for restless bandits (two-action semi-Markov decision processes) provides an intuitively appealing optimal policy for controlling a single generic project that can be active (engaged) or passive (rested) at each decision epoch, and which can change state while passive. It further provides a practical heuristic priority-index policy for the computationally intractable multi-armed restless bandit problem, which has been widely applied over the last three decades in multifarious settings, yet mostly restricted to project models with a one-dimensional state. This is due in part to the difficulty of establishing indexability (existence of the index) and of computing the index for projects with large state spaces. This paper draws on the author's prior results on sufficient indexability conditions and an adaptive-greedy algorithmic scheme for restless bandits to obtain a new fast-pivoting algorithm that computes the $n$ Whittle index values of an $n$-state restless bandit by performing, after an initialization stage, $n$ steps that entail $(2/3)n^3 + O(n^2)$ arithmetic operations. The algorithm also draws on the parametric simplex method, and is based on elucidating the pattern of parametric simplex tableaux, which makes it possible to exploit special structure to substantially simplify and reduce the complexity of simplex pivoting steps. A numerical study demonstrates substantial runtime speed-ups versus alternative algorithms.

1. Introduction

We consider a general two-action (1: engaged/active; 0: rested/passive) semi-Markov decision process (SMDP) restless bandit model (see, for example, ([1], Ch. 11) and [2]) of a dynamic and stochastic project, whose state $X(t)$ moves over continuous time $t \in [0, \infty)$ across the state space $N$, which is assumed finite, consisting of $n \triangleq |N|$ states. At each of an increasing sequence of decision epochs $\{t_k\}_{k=0}^{\infty}$, starting at $t_0 = 0$ and with $t_k \to \infty$ as $k \to \infty$, the embedded state $X_k \triangleq X(t_k)$ is observed and then an action $A_k \triangleq A(t_k) \in \{0, 1\}$ is chosen, which remains fixed over the subsequent stage $[t_k, t_{k+1})$, so $A(t) = A(t_k)$ for $t \in [t_k, t_{k+1})$. Note that the project state $X(t)$ can change within such stages, and that the processes $X(t)$ and $A(t)$ are piecewise constant and right-continuous.
When the project is in state $X(t) = i$ under action $A(t) = a$, it accrues rewards and consumes a generic resource at rates $R_i^a$ and $Q_i^a$ per unit time, respectively, which are exponentially discounted with rate $\alpha > 0$. Resource consumption rates are assumed to satisfy the natural requirement that $Q_i^1 \geq Q_i^0 \geq 0$ with $Q_i^1 > 0$, so when the project is active, it consumes resource at a positive rate, which is not lower than when it is passive.
To select actions, a policy $\pi$ is adopted from the class $\Pi$ of history-dependent randomized policies, which base the choice of action at each decision epoch $t_k$ on the history $H_k \triangleq \{(X_{k'}, A_{k'}) : 0 \leq k' < k\} \cup \{X_k\}$ of embedded states and actions.
Suppose that the amount of resource consumed by the project must be paid for at the unit price $\lambda$, so, writing as $E_i^\pi[\cdot]$ the expectation starting from $X(0) = i$ under $\pi$,
$$V_i^\pi(\lambda) \triangleq E_i^\pi\left[\int_0^\infty \big(R_{X(t)}^{A(t)} - \lambda Q_{X(t)}^{A(t)}\big)\, e^{-\alpha t}\, dt\right]$$
is the (expected total discounted) project value starting from $i$ under $\pi$, and $V_i^*(\lambda) \triangleq \sup_{\pi \in \Pi} V_i^\pi(\lambda)$ is the optimal project value starting from state $i$. Consider now the $\lambda$-price problem
$$(P_\lambda)\colon \quad \text{find } \pi_\lambda^* \in \Pi \ \text{ such that } \ V_i^{\pi_\lambda^*}(\lambda) = V_i^*(\lambda) \ \text{ for every state } i \in N, \qquad (1)$$
which entails finding a policy that maximizes the project value for every initial state. We will call $P_\lambda$-optimal a policy $\pi_\lambda^*$ solving the $\lambda$-price problem $(P_\lambda)$ in (1).
Under mild assumptions, it is well known from the theory of finite-state and -action SMDPs (see ([1], Ch. 11)) that, to solve problem (1), it suffices to consider the class $\Pi^{SD}$ of stationary deterministic policies, which prescribe a fixed action at each decision epoch based on the current project state. For any given resource price $\lambda$, a $P_\lambda$-optimal policy in $\Pi^{SD}$ can be computed by classic algorithms, such as value or policy iteration.
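As a concrete illustration of the classic algorithms just mentioned, the following sketch runs value iteration on the discrete-stage optimality equations $V_i = \max_a \{ r_i^a - \lambda q_i^a + \sum_j \psi_{ij}^a V_j \}$ reviewed in Section 3. All model data below (rewards, resource rates, discounted transition LSTs) are hypothetical illustration values, not taken from the paper:

```python
import numpy as np

def value_iteration(r, q, psi, lam, tol=1e-10, max_iter=100_000):
    """Value iteration for the lambda-price problem in discrete-stage form:
    V_i = max_a { r[a][i] - lam*q[a][i] + sum_j psi[a][i][j]*V_j }.
    psi[a] are discounted transition matrices (row sums < 1)."""
    n = r[0].shape[0]
    V = np.zeros(n)
    for _ in range(max_iter):
        # One-step values Q[a][i] for each action a in {0, 1}
        Q = np.stack([r[a] - lam * q[a] + psi[a] @ V for a in (0, 1)])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmax(axis=0)  # optimal values and a stationary policy

# Hypothetical 2-state instance with discount factor 0.9
r = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]   # r[a][i]
q = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]   # q[a][i]
P0 = np.array([[0.5, 0.5], [0.2, 0.8]])
P1 = np.array([[0.9, 0.1], [0.3, 0.7]])
psi = [0.9 * P0, 0.9 * P1]

V, policy = value_iteration(r, q, psi, lam=0.5)
```

Since the discounted transition matrices are contractions, the iteration converges geometrically, and the returned policy is a stationary deterministic one, as guaranteed by the SMDP theory cited above.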
Instead, we will treat the resource price $\lambda$ as a parameter, and consider a solution approach for the parametric collection $\mathcal P$ of all $\lambda$-price problems $(P_\lambda)$ as $\lambda$ ranges over $\mathbb{R}$. Thus, let us say that the project, or, more precisely, the parametric problem collection $\mathcal P$, is indexable if, for every project state $i$, there exists a finite critical price $\lambda_i^*$ such that, for any problem $(P_\lambda)$ and at a decision epoch in which the project is in state $i$: (i) it is optimal to engage the project if, and only if, $\lambda_i^* \geq \lambda$; and (ii) it is optimal to rest the project if, and only if, $\lambda_i^* \leq \lambda$. Hence, both actions are optimal in state $i$ if, and only if, $\lambda_i^* = \lambda$.
We will refer to the mapping $i \mapsto \lambda_i^*$ as the project's Whittle index, since it was Whittle who introduced such a concept in [2], in a Markovian setting with resource consumption $Q_i^a \equiv a$. The extension to general resource consumption $Q_i^a$ was introduced in [3]. In the case $Q_i^a \equiv a$, the Whittle index extends the Gittins index for classic (non-restless) bandit projects, which do not change state while passive. Thus, strictly speaking, the index considered herein for general $Q_i^a$ is a generalized Whittle index, which reduces to a generalized Gittins index in the non-restless case.
Considering a semi-Markov instead of a purely Markovian setting significantly expands the modeling power of the resulting restless bandits, as in some applications, the time during which a project remains engaged before the next decision can be made may follow a general distribution. As an example (see [4]), imagine that a “project” represents a queue of jobs with Poisson arrivals and generally distributed service times, where serving a job (active action) is non-preemptive, i.e., it cannot be interrupted once started. In [5], semi-Markov restless bandits are used to model dynamic job assignment for energy-efficient server farms. Semi-Markov restless bandits can also be used to model classic (non-restless) bandits where changing from engaging one project to another entails project-dependent switching times, as shown in [6].
Besides its intrinsic interest for solving the aforementioned single-project parametric problem collection $\mathcal P$, Whittle proposed in [2] to use the index $\lambda_i^*$ as the basis of a widely popular heuristic for the multi-armed restless bandit problem (MARBP), in which $M$ out of $N > M$ restless bandit projects must be selected to be engaged at each time to maximize the value (under a discounted or long-run average criterion) earned from the $N$ projects over an infinite horizon. For a sample of recent applications, see, for example, [5,7,8,9,10,11,12,13,14,15,16,17,18]. While the MARBP is computationally intractable (PSPACE-hard; see [19]), the Whittle index policy is an intuitively appealing heuristic in which, at each time, the $M$ projects with the highest current indices are engaged, so the Whittle index plays the role of a priority index for a project to be engaged. This policy is convenient for practical implementation, as it avoids the curse of dimensionality, since each project has its own Whittle index. Furthermore, an extensive body of numerical evidence accumulated over the last three decades has shown that the policy is often nearly optimal. Though its exact analysis remains elusive, Whittle's conjecture that his proposed policy enjoys a form of asymptotic optimality has been established under certain conditions. See [20,21,22,23].
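In code, the Whittle index policy itself is a one-liner: given the current index value of each of the $N$ projects, engage the $M$ with the highest indices. A minimal sketch (the function name and sample index values are ours, for illustration only):

```python
import numpy as np

def whittle_index_policy(current_indices, M):
    """Return the set of the M projects with the highest current Whittle
    indices; these are the projects to engage at this decision epoch."""
    order = np.argsort(-np.asarray(current_indices, dtype=float))
    return set(order[:M].tolist())

# With hypothetical current indices [0.2, 1.5, 0.7, 1.1] and M = 2,
# projects 1 and 3 (the two highest indices) are engaged.
engaged = whittle_index_policy([0.2, 1.5, 0.7, 1.1], 2)
```

Each project's index is computed from that project's data alone, which is why the policy sidesteps the curse of dimensionality of the joint state space.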
Yet, unlike the Gittins index, which is well defined for any classic bandit, the Whittle index exists only for a limited class of restless bandits, which are called indexable. Typically, researchers use ad hoc analyses to prove indexability and calculate the Whittle index in particular models. In contrast, the author has introduced and developed in [3,24] a methodology to establish indexability and compute the Whittle index for general finite-state restless bandits, extended to the semi-Markov denumerable-state case in [4] and to the continuous-state case in [25]. The effectiveness of such an approach, based on verification of so-called PCL-indexability conditions, so named because they are grounded on the satisfaction by project performance metrics of partial conservation laws (PCLs), has been demonstrated in diverse models. See, for example, [14,26,27,28,29,30,31,32,33,34,35].
In the case of finite-state restless bandits, [3,24] introduced the concept of satisfaction of PCLs by project performance metrics, along with a related one-pass adaptive-greedy index algorithm, which calculates the n index values of an n-state PCL-indexable project in n steps. This has its early roots in Klimov’s algorithm in [36] for calculating the optimal priority indices for scheduling a multiclass queue with Bernoulli feedback, which was extended in [37] to a framework of stochastic scheduling systems satisfying so-called generalized conservation laws, such as the classic (non-restless) multi-armed bandit problem and branching bandits. See also [38,39]. Yet, the aforementioned work has not addressed the efficient computational implementation of such an algorithm, which is necessary for its actual deployment and widespread application.
This paper develops an efficient computational scheme for calculating the Whittle index of a general finite-state PCL-indexable restless bandit, by extending the approach in [40] from discrete-time classic bandits to semi-Markov restless bandits. The main contribution is a new fast-pivoting block implementation of the adaptive-greedy Whittle index algorithm in [3,24] that performs, after an initialization stage that entails solving a block $n \times n$ linear equation system, $(2/3)n^3 + O(n^2)$ arithmetic operations. The complexity of the resulting algorithm can be further reduced in particular models, by exploiting the special structure of the underlying state space and transition matrices. However, it appears unlikely that such an operation count can be further reduced for general restless bandits, since, as shown in [3,24], computing the Whittle index, even if the ordering of the project states were known in advance, entails the solution of an $n \times n$ linear equation system, which would be solved in $(2/3)n^3 + O(n^2)$ operations by Gaussian elimination.
Note that this is the fastest operation count for a general Whittle index algorithm presented to date. In contrast, the Whittle index algorithm in [41], which applies only to the average criterion, has a complexity of $O(n^4 2^n)$ operations, which reduces to $O(n^5)$ for projects with a one-dimensional state that are known beforehand to be both indexable and solved optimally by threshold policies.

Structure of the Paper

The rest of the paper proceeds as follows. Section 2 presents a review of the related literature. Section 3 reviews previous results on Whittle indexation for finite-state restless bandits via PCL-indexability. Section 4 lays the groundwork for an efficient implementation of the adaptive-greedy index algorithm, drawing on dynamic programming (DP) and linear programming (LP) methods. Section 5 applies such results to develop a fast-pivoting computational implementation of the adaptive-greedy index algorithm. Section 6 outlines how to extend the previous results to the average criterion. Section 7 presents the results of a numerical study testing the runtimes of several index algorithms and demonstrating that the proposed fast-pivoting algorithm has significantly lower runtimes than alternative algorithms. Section 8 discusses the results presented in the paper. Section 9 concludes the paper.

2. Review of Related Literature

In this section, we review related literature, focusing on relatively recent work. We refer the reader to the recent monograph [42] on multi-armed bandits, which discusses both the MABP and the MARBP and their widespread applications.
When projects are classic (non-restless), the Whittle index reduces to the Gittins index. The efficient computation of the Gittins index has been extensively investigated in the literature. See, for example, [43,44,45,46,47,48]. While some algorithms proposed in the aforementioned papers perform $O(n^3)$ arithmetic operations for a general $n$-state bandit, [40] presents a fast-pivoting implementation of the adaptive-greedy Gittins index algorithm in [37] that performs $(2/3)n^3 + O(n^2)$ arithmetic operations, thus achieving lower complexity than alternative algorithms and matching that of Gaussian elimination for solving an $n \times n$ linear equation system. It is unlikely that such a complexity count can be improved, since it is shown in [37] that computing the Gittins index reduces precisely to solving an $n \times n$ linear equation system whose solution is subject to certain inequalities. The algorithm in [40] is based on an elucidation of the pattern of parametric simplex tableaux and the exploitation of special structure to lower the computational effort of simplex pivoting steps with respect to standard pivoting.
In contrast, efficient computation of the Whittle index for restless bandits has received relatively scant research attention. Three approaches can be distinguished in the literature: deriving the Whittle index in closed form, iterative index approximation, and exact numerical computation. The first approach, deriving the Whittle index in closed form, is taken, for example, in [5,13,15,16,17,18]. This is to be preferred whenever possible, as the resulting analytical expressions for the Whittle index, besides facilitating its numerical evaluation, may provide valuable insight into the dependence of the index on model parameters. However, obtaining closed-form Whittle index formulae is only possible in relatively simple models, typically with a one-dimensional state.
In more complex models in which the Whittle index cannot be evaluated in closed form, the most widespread approach, which has its roots in the calibration method for the Gittins index in [49], is to apply an iterative procedure for approximately computing the index. This is done, for example, in [7,8,9,10,11,12]. Besides the drawback that the resulting index is only an approximation to the true Whittle index, this approach is typically computationally expensive.
As for the third approach, efficient exact numerical computation of the Whittle index, this has received the least research attention to date. In [3,24], the author introduced an adaptive-greedy algorithm for computing the Whittle index of general finite-state restless bandits, provided that they satisfy the PCL-indexability conditions also introduced in that work. The algorithm computes the $n$ Whittle index values of an $n$-state bandit in $n$ steps, proceeding in a greedy fashion at each step. However, as given in [3,24], such a method provides only an algorithmic scheme, as it is not specified how certain metrics that arise in its description are to be computed in practice. The effectiveness of such an approach is demonstrated in [3] in the setting of a broad birth–death restless bandit model, both under the discounted and long-run average criteria, motivated by queueing admission control and routing problems, where a specific implementation of the adaptive-greedy algorithm for such a case is obtained that computes the first $n$ Whittle indices in $O(n)$ operations, under mild conditions on model parameters. An alternative approach to long-run average birth–death restless bandits is developed in [50]. Yet, to prove indexability, Proposition 2 in that paper assumes both that threshold policies are optimal (which in the birth–death model in [3] is obtained as a byproduct of the PCL-indexability conditions) and that a certain function of steady-state probabilities is strictly increasing, which is nontrivial to verify. Further, [50] does not give an index-computing algorithm, but an expression for the Whittle index in terms of steady-state metrics, which also appears in [3].
A Whittle index-computing algorithm for general continuous-time finite-state restless bandits has been recently proposed in [41], focusing on the average criterion, as it is stated there that the approach does not apply to the discounted criterion. As for the arithmetic-operation complexity of the Whittle index algorithm in [41], which also checks for indexability, Remark 3.5 of that paper states that it performs $O(n^4 2^n)$ operations. Even in the case that indexability is known to hold and that threshold policies are optimal, it is stated in Remark 4.1 of that paper that the complexity of that index algorithm reduces to $O(n^5)$ operations. This is to be contrasted with the Whittle index algorithm presented herein, which has a cubic operation complexity in the number $n$ of bandit states.

3. Review of Finite-State Restless Bandit Whittle Indexation via PCL-Indexability

This section reviews key results of the author’s approach to restless bandit indexation, as it applies to a single finite-state semi-Markov restless bandit project, which we will simply refer to as a “project” in the sequel.

3.1. SMDP Restless Bandits and Their Discrete-Stage Reformulation

Consider an SMDP restless bandit project, as outlined in Section 1. We next describe its standard discrete-stage reformulation (see, for example, ([1], Ch. 11)). If, at decision epoch $t_k$, the project lies in state $X_k = i$ and action $A_k = a$ is chosen, the joint distribution of the length $t_{k+1} - t_k$ of the following $(i,a)$-stage and the embedded state $X_{k+1}$ is characterized by the transition distribution function
$$H_{ij}^a(t) \triangleq P_i^a\{t_{k+1} - t_k \leq t,\ X_{k+1} = j\}, \quad t \geq 0,$$
having Laplace–Stieltjes transform (LST)
$$\psi_{ij}^a(\alpha) \triangleq \int_0^\infty e^{-\alpha t}\, dH_{ij}^a(t) = E_i^a\big[1_{\{X_{k+1} = j\}}\, e^{-\alpha (t_{k+1} - t_k)}\big], \quad \alpha > 0,$$
where $P_i^a$ and $E_i^a$ denote probability and expectation conditional on starting from state $i$ with action $a$ ($X_k = i$, $A_k = a$).
From $H_{ij}^a(t)$, we have that the distribution of the length of an $(i,a)$-stage is
$$H_i^a(t) \triangleq P_i^a\{t_{k+1} - t_k \leq t\} = \sum_{j \in N} H_{ij}^a(t),$$
having LST
$$\psi_i^a(\alpha) \triangleq E_i^a\big[e^{-\alpha (t_{k+1} - t_k)}\big] = \sum_{j \in N} \psi_{ij}^a(\alpha),$$
and mean
$$m_i^a \triangleq E_i^a[t_{k+1} - t_k] = \int_0^\infty t\, dH_i^a(t).$$
The one-stage transition probabilities for the embedded process are
$$p_{ij}^a \triangleq P_i^a\{X_{k+1} = j\} = \lim_{t \to \infty} H_{ij}^a(t) = \lim_{\alpha \downarrow 0} \psi_{ij}^a(\alpha).$$
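As a simple special case, when the length of an $(i,a)$-stage is exponential with rate $\mu_i^a$ and independent of the next embedded state, the LSTs factor as $\psi_{ij}^a(\alpha) = p_{ij}^a\, \mu_i^a / (\mu_i^a + \alpha)$, and letting $\alpha \downarrow 0$ recovers the embedded transition probabilities, as in the last display. A numerical sketch of this special case, under that independence assumption and with hypothetical data:

```python
import numpy as np

def lst_matrix(P, mu, alpha):
    """psi^a(alpha) for exponential stage lengths with per-state completion
    rates mu, assuming the stage length is independent of the next embedded
    state: psi_ij = P_ij * mu_i / (mu_i + alpha)."""
    return (mu / (mu + alpha))[:, None] * P

P = np.array([[0.6, 0.4], [0.3, 0.7]])   # embedded transition probabilities
mu = np.array([2.0, 5.0])                # stage-completion rates
psi = lst_matrix(P, mu, alpha=0.1)       # discounted transition matrix
```

Note that each row of $\psi^a(\alpha)$ sums to $\mu_i^a/(\mu_i^a + \alpha) < 1$, which is the discounting that makes the evaluation equations used later well posed.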
Recall that the process $X(t)$ may change state between successive decision epochs. Its dynamics within an $(i,a)$-stage are characterized by
$$\tilde p_{ij}^a(s) \triangleq P_i^a\{X(t_k + s) = j \mid t_{k+1} - t_k > s\}, \quad s \geq 0,$$
the conditional probability that, $s$ time units after the start of an $(i,a)$-stage, and given that this is still ongoing, the project occupies state $j$. We can thus formulate the expected discounted amount of resource consumed and reward obtained in an $(i,a)$-stage, respectively, as
$$q_i^a \triangleq E_i^a\left[\int_{t_k}^{t_{k+1}} Q_{X(t)}^{A_k}\, e^{-\alpha (t - t_k)}\, dt\right] = \sum_{j \in N} Q_j^a \int_0^\infty \tilde p_{ij}^a(s)\, \bar H_i^a(s)\, e^{-\alpha s}\, ds$$
and
$$r_i^a \triangleq E_i^a\left[\int_{t_k}^{t_{k+1}} R_{X(t)}^{A_k}\, e^{-\alpha (t - t_k)}\, dt\right] = \sum_{j \in N} R_j^a \int_0^\infty \tilde p_{ij}^a(s)\, \bar H_i^a(s)\, e^{-\alpha s}\, ds,$$
where $\bar H_i^a(s) \triangleq 1 - H_i^a(s)$ is the tail distribution of the length of an $(i,a)$-stage.
Recall that we denote by $n \triangleq |N|$ the number of project states.

3.2. Indexability, Whittle Index, and the Achievable Resource–Reward Performance Region

We start by considering the discounted criterion with rate $\alpha > 0$, deferring discussion of the average criterion to Section 6. We consider the following project performance metrics to evaluate a policy $\pi \in \Pi$ starting from a state $i$: the reward metric
$$F_i^\pi \triangleq E_i^\pi\left[\int_0^\infty R_{X(t)}^{A(t)}\, e^{-\alpha t}\, dt\right] = E_i^\pi\left[\sum_{k=0}^\infty r_{X_k}^{A_k}\, e^{-\alpha t_k}\right], \qquad (5)$$
measuring the expected discounted value of rewards obtained, and the resource metric
$$G_i^\pi \triangleq E_i^\pi\left[\int_0^\infty Q_{X(t)}^{A(t)}\, e^{-\alpha t}\, dt\right] = E_i^\pi\left[\sum_{k=0}^\infty q_{X_k}^{A_k}\, e^{-\alpha t_k}\right], \qquad (6)$$
giving the expected discounted quantity of resource consumed. Note that the right-hand side expectations in (5) and (6) are formulated in terms of the aforementioned discrete-stage reformulation.
We will also refer to the metrics obtained by drawing the initial state from a probability mass vector $p = (p_i)_{i \in N}$, given by
$$F_p^\pi \triangleq \sum_{i \in N} p_i F_i^\pi \quad \text{and} \quad G_p^\pi \triangleq \sum_{i \in N} p_i G_i^\pi. \qquad (7)$$
To calibrate the marginal value of engaging the project in each state, we introduce a parameter $\lambda \in \mathbb{R}$ representing the resource (unit) price, and define the (net) project value metric $V_p^\pi(\lambda) \triangleq F_p^\pi - \lambda G_p^\pi$, so the optimal project value is $V_p^*(\lambda) \triangleq \sup_{\pi \in \Pi} V_p^\pi(\lambda)$. We also write $V_i^\pi(\lambda)$ and $V_i^*(\lambda)$ as in Section 1. Consider the parametric family $\mathcal P$ of $\lambda$-price problems $(P_\lambda)$ defined in (1) for $\lambda \in \mathbb{R}$. Thus, for given $\lambda$, problem (1) is to find an admissible policy that maximizes the project value.
The dynamic programming (DP) optimality equations for the $\lambda$-price problem $(P_\lambda)$ are
$$V_i^*(\lambda) = \max_{a \in \{0,1\}} \Big\{ r_i^a - \lambda q_i^a + \sum_{j \in N} \psi_{ij}^a V_j^*(\lambda) \Big\}, \quad i \in N, \qquad (8)$$
where we write $\psi_{ij}^a \equiv \psi_{ij}^a(\alpha)$. Classic results of SMDP theory ensure the existence of a $P_\lambda$-optimal policy $\pi_\lambda^*$ solving (1) that is stationary deterministic ($\pi_\lambda^* \in \Pi^{SD}$).
We will also consider the optimal project value starting from state $i$ with initial action $a$, given by
$$V_i^{a,*}(\lambda) \triangleq r_i^a - \lambda q_i^a + \sum_{j \in N} \psi_{ij}^a V_j^*(\lambda),$$
and say that action $a$ is $P_\lambda$-optimal in state $i$ if $V_i^{a,*}(\lambda) \geq V_i^{1-a,*}(\lambda)$. It will be convenient to reformulate such a definition in terms of the sign of $\Delta_{a=0}^{a=1} V_i^{a,*}(\lambda) \triangleq V_i^{1,*}(\lambda) - V_i^{0,*}(\lambda)$, the marginal value of engaging the project in state $i$. Thus, action $a = 1$ is $P_\lambda$-optimal in $i$ if $\Delta_{a=0}^{a=1} V_i^{a,*}(\lambda) \geq 0$; $a = 0$ is $P_\lambda$-optimal in $i$ if $\Delta_{a=0}^{a=1} V_i^{a,*}(\lambda) \leq 0$; and, hence, both actions are $P_\lambda$-optimal in $i$ if, and only if,
$$\Delta_{a=0}^{a=1} V_i^{a,*}(\lambda) = 0.$$
Since actions are binary, it is convenient to represent a stationary deterministic policy by its active set $S \subseteq N$, which is the subset of states where, at a decision epoch, the policy chooses the active action. We will refer to the $S$-active policy and write $F_i^S$, $G_i^S$, and $V_i^S(\lambda)$.
Thus, we can reformulate the $\lambda$-price problem (1) as the discrete optimization problem
$$(P_\lambda)\colon \quad \text{find } S^* \subseteq N \ \text{ such that } \ V_i^{S^*}(\lambda) = V_i^*(\lambda) \ \text{ for every } i \in N. \qquad (11)$$
For a given resource price $\lambda$, the $P_\lambda$-optimal active and passive sets are given by
$$S^{*,1}(\lambda) \triangleq \big\{ i \in N : \Delta_{a=0}^{a=1} V_i^{a,*}(\lambda) \geq 0 \big\} \quad \text{and} \quad S^{*,0}(\lambda) \triangleq \big\{ i \in N : \Delta_{a=0}^{a=1} V_i^{a,*}(\lambda) \leq 0 \big\}.$$

3.3. Indexability

We will address the parametric problem collection $\mathcal P$ in (1) through the concept of indexability, extended in [2] from classic to restless bandits with resource consumption $Q_i^a \equiv a$, and further extended in [3] to general $Q_i^a$.
Under indexability, the $P_\lambda$-optimal active and passive sets $S^{*,1}(\lambda)$ and $S^{*,0}(\lambda)$ are characterized by an index attached to project states. Note that the definition below refers to the sign function $\operatorname{sgn}\colon \mathbb{R} \to \{-1, 0, 1\}$, and follows the formulation of indexability in ([25], Definition 1).
Definition 1
(Indexability and Whittle index). The project is indexable if there exists a mapping $i \mapsto \lambda_i^*$ for $i \in N$, such that
$$\operatorname{sgn}\big( \Delta_{a=0}^{a=1} V_i^{a,*}(\lambda) \big) = \operatorname{sgn}(\lambda_i^* - \lambda), \quad \text{for every state } i \in N \text{ and price } \lambda \in \mathbb{R}.$$
We call $\lambda_i^*$ the project's (generalized) Whittle index (Gittins index if the project is non-restless).
Thus, under indexability, the $P_\lambda$-optimal active and passive sets $S^{*,1}(\lambda)$ and $S^{*,0}(\lambda)$ are characterized as
$$S^{*,1}(\lambda) = \big\{ i \in N : \lambda_i^* \geq \lambda \big\} \quad \text{and} \quad S^{*,0}(\lambda) = \big\{ i \in N : \lambda_i^* \leq \lambda \big\}.$$
As for the economic interpretation of the Whittle index, it is shown in [3,4,27] that $\lambda_i^*$ measures the marginal productivity of the resource at state $i$.

3.4. PCL-Indexability and Adaptive-Greedy Algorithm

In applications of Whittle indexation to given restless bandit models, researchers are concerned with establishing their indexability analytically and with computing the index efficiently. For that purpose, the author introduced the PCL-indexability conditions in [3,4,24], so named because they are grounded on the satisfaction of partial conservation laws (PCLs). See [38] for a survey on conservation laws in stochastic scheduling and multi-class queueing systems.
Such conditions can be used to establish indexability relative to a nonempty family $\mathcal F \subseteq 2^N$ of active sets that one should posit a priori (based on insight into the model's structure).
Definition 2
($\mathcal F$-indexability). We say that the project, or, more precisely, the parametric collection $\mathcal P$ of $\lambda$-price problems $(P_\lambda)$ in (11), is $\mathcal F$-indexable if: (i) it is indexable; and (ii) for every price $\lambda$, there exists an optimal active set $S^*(\lambda) \in \mathcal F$, such that the $S^*(\lambda)$-active policy is $P_\lambda$-optimal.
Thus, for an $\mathcal F$-indexable project, problem (11) can be reduced to
$$(P_\lambda)\colon \quad \text{find } S^* \in \mathcal F \ \text{ such that } \ V_i^{S^*}(\lambda) = V_i^*(\lambda) \ \text{ for every } i \in N,$$
since, in such a case, $\mathcal F$-policies (those with active sets $S \in \mathcal F$) are optimal for (11).
Algorithmic considerations lead us to require that the active-set family $\mathcal F$ satisfy the connectedness conditions stated next. We will refer to the inner boundary of $S$ with respect to $\mathcal F$, given by
$$\partial_{\mathcal F}^{\,\mathrm{in}} S \triangleq \{ i \in S : S \setminus \{i\} \in \mathcal F \},$$
and to the outer boundary of $S$ with respect to $\mathcal F$, given by
$$\partial_{\mathcal F}^{\,\mathrm{out}} S \triangleq \{ j \in S^c : S \cup \{j\} \in \mathcal F \}.$$
Assumption 1.
(i)
For every nonempty $S \in \mathcal F$, $\partial_{\mathcal F}^{\,\mathrm{in}} S$ is nonempty.
(ii)
For every $S \in \mathcal F \setminus \{N\}$, $\partial_{\mathcal F}^{\,\mathrm{out}} S$ is nonempty.
Note that from the requirement that $\mathcal F$ be nonempty, along with Assumption 1, it follows that $\emptyset, N \in \mathcal F$.
To formulate the following result, we need to consider certain marginal metrics. For an action $a \in \{0,1\}$ and an active set $S \subseteq N$, denote by $\langle a, S \rangle$ the policy taking initially action $a$, and thereafter following the $S$-active policy. For a state $i$ and an active set $S$, define the marginal resource (consumption) metric by
$$g_i^S \triangleq G_i^{\langle 1, S \rangle} - G_i^{\langle 0, S \rangle},$$
that is, as the marginal increase in resource consumed resulting from taking first the active rather than the passive action at state $i$, provided that the $S$-active policy is followed thereafter.
Define also the marginal reward metric by
$$f_i^S \triangleq F_i^{\langle 1, S \rangle} - F_i^{\langle 0, S \rangle},$$
that is, as the marginal increase in the value of rewards gained. Finally, for $g_i^S \neq 0$, define the marginal productivity rate metric by
$$\lambda_i^S \triangleq \frac{f_i^S}{g_i^S}. \qquad (18)$$
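These marginal metrics can be computed directly from the discrete-stage data by solving the linear evaluation equations of the $S$-active policy, e.g., $G_i^S = q_i^{a(i)} + \sum_j \psi_{ij}^{a(i)} G_j^S$ with $a(i) = 1$ iff $i \in S$, and then taking one-step differences at state $i$. The following sketch illustrates this on hypothetical data (the helper names are ours; this is not the paper's fast-pivoting implementation):

```python
import numpy as np

def policy_metric(c, psi, S, n):
    """Solve M = c_pol + Psi_pol @ M for the S-active policy, where the
    row for state i uses action a(i) = 1 if i in S else 0."""
    a = [1 if i in S else 0 for i in range(n)]
    Psi_pol = np.stack([psi[a[i]][i] for i in range(n)])
    c_pol = np.array([c[a[i]][i] for i in range(n)])
    return np.linalg.solve(np.eye(n) - Psi_pol, c_pol)

def marginal_rate(r, q, psi, S, i):
    """lambda_i^S = f_i^S / g_i^S via one-step lookahead from the
    S-active policy's reward (F) and resource (G) metrics."""
    n = r[0].shape[0]
    F = policy_metric(r, psi, S, n)
    G = policy_metric(q, psi, S, n)
    f = (r[1][i] + psi[1][i] @ F) - (r[0][i] + psi[0][i] @ F)
    g = (q[1][i] + psi[1][i] @ G) - (q[0][i] + psi[0][i] @ G)
    return f / g

# Hypothetical 2-state data: passive action earns and consumes nothing
r = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
q = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
psi = [0.9 * np.eye(2), 0.9 * np.array([[0.9, 0.1], [0.3, 0.7]])]
lam_0_empty = marginal_rate(r, q, psi, set(), 0)
```

With $S = \emptyset$ and zero passive data, the metrics of the all-passive policy vanish, so $\lambda_0^{\emptyset} = f_0^{\emptyset}/g_0^{\emptyset} = r_0^1/q_0^1 = 2$ in this instance.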
The following definition refers to the adaptive-greedy index algorithm, which is given in Algorithms 1 and 2 in its top-down and bottom-up versions, respectively. In the former, index values are computed from largest to smallest, whereas in the latter they are computed in the opposite order.  
Algorithm 1: Adaptive-greedy algorithm: top-down version $\mathrm{AG}_{\mathcal F}^{\mathrm{TD}}$.
Output: $\{j_k, \lambda_{j_k}^*\}_{k=1}^n$
$S_0 := \emptyset$
for $k := 1$ to $n$ do
  pick $j_k \in \arg\max \{ \lambda_j^{S_{k-1}} : j \in \partial_{\mathcal F}^{\,\mathrm{out}} S_{k-1} \}$
  $\lambda_{j_k}^* := \lambda_{j_k}^{S_{k-1}}$;  $S_k := S_{k-1} \cup \{j_k\}$
end { for }
Algorithm 2: Adaptive-greedy algorithm: bottom-up version $\mathrm{AG}_{\mathcal F}^{\mathrm{BU}}$.
Output: $\{i_k, \lambda_{i_k}^*\}_{k=1}^n$
$S_0 := N$
for $k := 1$ to $n$ do
  pick $i_k \in \arg\min \{ \lambda_i^{S_{k-1}} : i \in \partial_{\mathcal F}^{\,\mathrm{in}} S_{k-1} \}$
  $\lambda_{i_k}^* := \lambda_{i_k}^{S_{k-1}}$;  $S_k := S_{k-1} \setminus \{i_k\}$
end { for }
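To make the scheme concrete, the sketch below instantiates the top-down version for the unrestricted family $\mathcal F = 2^N$ (so the outer boundary of $S$ is all of $S^c$), computing the marginal rates $\lambda_j^S = f_j^S / g_j^S$ by solving the $S$-active policy's evaluation equations at each step. It is a naive reference implementation on hypothetical data, not the paper's fast-pivoting algorithm:

```python
import numpy as np

def adaptive_greedy_top_down(r, q, psi):
    """AG-TD with F = 2^N: repeatedly add the outside state with the
    largest marginal productivity rate, recording that rate as its index."""
    n = r[0].shape[0]

    def metric(c, S):
        # Evaluation equations of the S-active policy: M = c_pol + Psi_pol M
        a = [1 if i in S else 0 for i in range(n)]
        A = np.eye(n) - np.stack([psi[a[i]][i] for i in range(n)])
        return np.linalg.solve(A, np.array([c[a[i]][i] for i in range(n)]))

    S, out = set(), []
    for _ in range(n):
        F, G = metric(r, S), metric(q, S)
        rates = {}
        for j in set(range(n)) - S:
            f = (r[1][j] + psi[1][j] @ F) - (r[0][j] + psi[0][j] @ F)
            g = (q[1][j] + psi[1][j] @ G) - (q[0][j] + psi[0][j] @ G)
            rates[j] = f / g
        j_k = max(rates, key=rates.get)
        out.append((j_k, rates[j_k]))
        S.add(j_k)
    return out  # [(j_1, lambda*_{j_1}), ..., (j_n, lambda*_{j_n})]

# Hypothetical non-restless 2-state instance (the passive action freezes the
# state and consumes nothing), for which the index is a Gittins-type index
r = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
q = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
psi = [0.9 * np.eye(2), 0.9 * np.array([[0.9, 0.1], [0.3, 0.7]])]
indices = adaptive_greedy_top_down(r, q, psi)
```

Solving a fresh linear system at every step costs $O(n^4)$ overall; the fast-pivoting implementation developed in Section 5 removes this overhead by updating simplex tableaux instead of re-solving.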
Definition 3
(PCL($\mathcal F$)-indexability). We say that the project is PCL($\mathcal F$)-indexable if:
(i)
for every active set $S \in \mathcal F$, $g_i^S > 0$ for each $i \in N$; and
(ii)
algorithm $\mathrm{AG}_{\mathcal F}^{\mathrm{TD}}$ computes a monotone non-increasing index sequence ($\lambda_{j_1}^* \geq \cdots \geq \lambda_{j_n}^*$); or algorithm $\mathrm{AG}_{\mathcal F}^{\mathrm{BU}}$ computes a monotone non-decreasing index sequence ($\lambda_{i_1}^* \leq \cdots \leq \lambda_{i_n}^*$).
Theorem 1.
Suppose that the project is PCL($\mathcal F$)-indexable. Then, it is $\mathcal F$-indexable and the index $\lambda_i^*$ computed by either adaptive-greedy index algorithm is its Whittle index.
Theorem 1 was introduced and proven, in increasingly general settings, in ([24], Corollary 2), ([3], Theorem 6.3), and ([4], Theorem 4.1). The author's work reviewed in [27] demonstrates the applicability of such a result to a number of relevant restless bandit models.
Regarding the interpretation of the positive marginal resource condition (i) in Definition 3, it is shown in ([3], Proposition 6.2) and in ([4], Lemma 4.3) that it can be reformulated in terms of the resource metric as follows: for $p > 0$,
$$G_p^{S \setminus \{i\}} < G_p^S < G_p^{S \cup \{j\}}, \quad S \in \mathcal F,\ i \in S,\ j \in S^c. \qquad (19)$$
Thus, condition (i) represents a strong form of monotone increasingness of the resource metric $G_p^S$ in its active set $S$ relative to inclusion. Additionally, recalling the definition of $G_p^\pi$ in (7) in terms of the metrics $G_i^\pi$, (19) can, in turn, be reformulated in terms of the latter metrics as follows:
$$G^{S \setminus \{i\}} \lneq G^S \lneq G^{S \cup \{j\}}, \quad S \in \mathcal F,\ i \in S,\ j \in S^c, \qquad (20)$$
where $G^\pi = (G_{i_0}^\pi)_{i_0 \in N}$ and "$\lneq$" means "less than or equal to componentwise, but not equal."
Along with condition (i), condition (ii) in Definition 3 is interpreted in ([3], Section 6.4) in terms of the satisfaction of the law of diminishing marginal returns (to resource usage) in economics, relative to $\mathcal F$-policies. Such an interpretation is based on the result in ([3], Proposition 6.4(a)) that the marginal productivity rates $\lambda_i^S$ in (18) for $S \in \mathcal F$ can be recast in terms of resource and reward metrics as
$$\lambda_i^S = \begin{cases} \dfrac{F_p^S - F_p^{S \setminus \{i\}}}{G_p^S - G_p^{S \setminus \{i\}}}, & i \in S, \\[2ex] \dfrac{F_p^{S \cup \{i\}} - F_p^S}{G_p^{S \cup \{i\}} - G_p^S}, & i \in S^c. \end{cases} \qquad (21)$$
Such a reformulation allows us, in turn, to reformulate the adaptive-greedy algorithms above in a geometrically intuitive form in the resource–reward $(\gamma, \phi)$ plane, as shown in Algorithms 3 and 4. We thus see that the top-down algorithm $\mathrm{AG}_{\mathcal F}^{\mathrm{TD}}$ aims to traverse the upper boundary $\bar\partial \mathcal R_p$ of the achievable resource–reward performance region $\mathcal R_p$ from left to right using only active sets in $\mathcal F$, whereas the bottom-up version $\mathrm{AG}_{\mathcal F}^{\mathrm{BU}}$ seeks to traverse such an upper boundary from right to left, proceeding in a greedy fashion at each step. In such a setting, condition (ii) in Definition 3 means that the function obtained by linear interpolation on the points $(G_p^S, F_p^S)$ produced by either algorithm is concave. The remarkable result in Theorem 1 is that this, along with condition (i), suffices to ensure that such a function characterizes the upper boundary $\bar\partial \mathcal R_p$ and the Whittle index.
In [4], such results are extended to restless projects on the denumerable state space of nonnegative integers, for which the resource metric is increasing along the nested family of active sets induced by the corresponding ordering. Further, in Section 3 of that paper, the $\mathcal F$-indexability of such projects is characterized in terms of the satisfaction of the law of diminishing marginal returns relative to $\mathcal F$-policies. In [25], such results are further extended to continuous-state projects.
Algorithm 3: Reformulated adaptive-greedy index algorithm: top-down version $\mathrm{AG}_{\mathcal F}^{\mathrm{TD}}$.
Output: $\{j_k, \lambda_{j_k}^*\}_{k=1}^n$
$S_0 := \emptyset$
for $k := 1$ to $n$ do
  pick $j_k \in \arg\max \left\{ \dfrac{F_p^{S_{k-1} \cup \{j\}} - F_p^{S_{k-1}}}{G_p^{S_{k-1} \cup \{j\}} - G_p^{S_{k-1}}} : j \in \partial_{\mathcal F}^{\,\mathrm{out}} S_{k-1} \right\}$
  $\lambda_{j_k}^* := \lambda_{j_k}^{S_{k-1}}$;  $S_k := S_{k-1} \cup \{j_k\}$
end { for }
Algorithm 4: Reformulated adaptive-greedy index algorithm: bottom-up version $\mathrm{AG}_{\mathcal F}^{\mathrm{BU}}$.
Output: $\{i_k, \lambda_{i_k}^*\}_{k=1}^n$
$S_0 := N$
for $k := 1$ to $n$ do
  pick $i_k \in \arg\min \left\{ \dfrac{F_p^{S_{k-1}} - F_p^{S_{k-1} \setminus \{i\}}}{G_p^{S_{k-1}} - G_p^{S_{k-1} \setminus \{i\}}} : i \in \partial_{\mathcal F}^{\,\mathrm{in}} S_{k-1} \right\}$
  $\lambda_{i_k}^* := \lambda_{i_k}^{S_{k-1}}$;  $S_k := S_{k-1} \setminus \{i_k\}$
end { for }

4. Laying the Groundwork for an Efficient Implementation of the Adaptive-Greedy Algorithm

This section lays the groundwork for developing an efficient implementation of the adaptive-greedy index algorithm. It draws on LP methods, based on formulating the λ -price problem (1) as a parametric LP problem, and elucidating the structure of its simplex tableaux.

4.1. Optimality Equations and Parametric LP Formulation

While the LP formulation below is well-known in SMDP theory (cf. [51]), for convenience we next outline it, starting from the optimality Equations (8) for (1). The primal LP formulation of such optimality equations is
V_p^*(λ) = min Σ_{j∈N} p_j v_j
subject to (x_i^a):  v_i − Σ_{j∈N} ψ_{ij}^a v_j ≥ r_i^a − λ q_i^a,  (i,a) ∈ N × {0,1}.
Our analyses will instead use the dual problem,
V_p^*(λ) = max Σ_{(j,a)∈N×{0,1}} (r_j^a − λ q_j^a) x_j^a
subject to (v_j):  Σ_{a∈{0,1}} ( x_j^a − Σ_{i∈N} ψ_{ij}^a x_i^a ) = p_j,  j ∈ N;
x_j^a ≥ 0,  (j,a) ∈ N × {0,1}.
It will be convenient to deal with the latter in matrix notation, as 
V_p^*(λ) = max (r^0 − λ c^0) x^0 + (r^1 − λ c^1) x^1
subject to [(I − Ψ^0)^T, (I − Ψ^1)^T] (x^0; x^1) = p,  x^0, x^1 ≥ 0,
where x^a = (x_j^a) is a column vector, r^a = (r_j^a) and c^a = (q_j^a) are row vectors, and ^T denotes transposition.
Dual variables x j a correspond to the project’s discounted state-action occupation measures. For an initial state i, policy π , action a and state j, let
x_{ij}^{a,π} ≜ E_i^π [ Σ_{k=0}^{∞} 1{X(t_k) = j, A(t_k) = a} e^{−α t_k} ]
be the expected discounted number of (j,a)-stages under π starting from i. If the initial state is drawn from p, x_j^a corresponds to the occupation measure x_{pj}^{a,π} ≜ Σ_{i∈N} p_i x_{ij}^{a,π}. Note that the reward and resource metrics are linear functions of such occupation measures: writing x_p^{a,π} = (x_{pj}^{a,π})_{j∈N},
F_p^π = Σ_{(j,a)∈N×{0,1}} r_j^a x_{pj}^{a,π} = r^0 x_p^{0,π} + r^1 x_p^{1,π}  and  G_p^π = Σ_{(j,a)∈N×{0,1}} q_j^a x_{pj}^{a,π} = c^0 x_p^{0,π} + c^1 x_p^{1,π}.
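This linearity is easy to check numerically. The self-contained sketch below assumes a discrete-time project, so that Ψ = βP for a stationary policy with transition matrix P (all names are ours): the reward metric computed from the occupation measures agrees with the value-function computation, and the total occupation mass equals 1/(1 − β).

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 5, 0.8
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)  # S-active transition matrix
r = rng.random(n)                                          # S-active one-stage rewards
p = np.full(n, 1.0 / n)                                    # initial-state distribution
Psi = beta * P                                             # transition-transform matrix

x = np.linalg.solve((np.eye(n) - Psi).T, p)  # occupation measures under the policy
v = np.linalg.solve(np.eye(n) - Psi, r)      # discounted value function
assert np.isclose(r @ x, p @ v)              # F_p via measures = F_p via values
assert np.isclose(x.sum(), 1.0 / (1.0 - beta))
```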

4.2. Bases, Basic Feasible Solutions, and Reduced Costs

We next analyze the LP problem (22), starting by elucidating its basic feasible solutions (BFS). Each BFS is obtained from a basis corresponding to an active set S N , and hence we will refer to the S-active BFS. Yet, note that different S’s might yield the same BFS. For given S, we decompose the above vectors and matrices as
x^a = (x_S^a; x_{S^c}^a),  p = (p_S; p_{S^c}),  Ψ^a = [Ψ_{SS}^a, Ψ_{SS^c}^a; Ψ_{S^cS}^a, Ψ_{S^cS^c}^a],  I = [I_{SS}, 0_{SS^c}; 0_{S^cS}, I_{S^cS^c}],
and introduce the matrices
Ψ^S ≜ [Ψ_{SS}^1, Ψ_{SS^c}^1; Ψ_{S^cS}^0, Ψ_{S^cS^c}^0],  Ψ^{S^c} ≜ [Ψ_{SS}^0, Ψ_{SS^c}^0; Ψ_{S^cS}^1, Ψ_{S^cS^c}^1],  B^S ≜ (I − Ψ^S)^T,  N^S ≜ (I − Ψ^{S^c})^T,  H^S ≜ (B^S)^{−1},  A^S ≜ H^S N^S.
Note that Ψ^S is the transition-transform matrix for the S-active policy. Additionally, B^S is the basis matrix in LP problem (22) for the S-active BFS, which has as basic variables
(x_S^1; x_{S^c}^0),
and N^S is the non-basic matrix of LP problem (22) corresponding to the non-basic variables
(x_S^0; x_{S^c}^1).
We can thus rearrange the constraints in (22), decomposing them as
B^S (x_S^1; x_{S^c}^0) + N^S (x_S^0; x_{S^c}^1) = p.
We next evaluate performance metrics under the S-active policy. The notation x_{pj}^{a,S} below refers to the occupation measure x_{pj}^{a,π} under such a policy. Further, F^S = (F_j^S)_{j∈N}, G^S = (G_j^S)_{j∈N}, f^S = (f_j^S)_{j∈N} and g^S = (g_j^S)_{j∈N} are row vectors.
Lemma 1.
(a)
(x_{pS}^{0,S}; x_{pS^c}^{1,S}) = 0 and (x_{pS}^{1,S}; x_{pS^c}^{0,S}) = H^S p.
(b)
G^S = [c_S^1, c_{S^c}^0] H^S.
(c)
F^S = [r_S^1, r_{S^c}^0] H^S.
(d)
[g_S^S, −g_{S^c}^S] = [c_S^1, c_{S^c}^0] A^S − [c_S^0, c_{S^c}^1].
(e)
[f_S^S, −f_{S^c}^S] = [r_S^1, r_{S^c}^0] A^S − [r_S^0, r_{S^c}^1].
Proof. 
(a) Set the non-basic variables to zero: x_{pS}^{0,S} = 0 and x_{pS^c}^{1,S} = 0. To calculate the values of the basic variables, note that
B^S (x_{pS}^{1,S}; x_{pS^c}^{0,S}) = p, and hence (x_{pS}^{1,S}; x_{pS^c}^{0,S}) = H^S p.
(b) Use part (a) with p = e j (the jth unit coordinate vector) to formulate resource metrics as
G j S = c S 1 c S c 0 x j S 1 , S x j S c 0 , S = c S 1 c S c 0 H S e j G S = c S 1 c S c 0 H S .
(c) Proceed as in (b) to formulate reward metrics as
F j S = r S 1 r S c 0 x j S 1 , S x j S c 0 , S = r S 1 r S c 0 H S e j F S = r S 1 r S c 0 H S .
(d) Represent marginal resource metrics (cf. (16)) as
g S S = G S S c S 0 G S ( Ψ S N 0 ) T and g S c S = c S c 1 + G S ( Ψ S c N 1 ) T G S c S .
Recast now the equalities in (25), using (b), as 
[g_S^S, −g_{S^c}^S] = G^S N^S − [c_S^0, c_{S^c}^1] = [c_S^1, c_{S^c}^0] H^S N^S − [c_S^0, c_{S^c}^1] = [c_S^1, c_{S^c}^0] A^S − [c_S^0, c_{S^c}^1].
(e) This part follows similarly to part (d). □
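Lemma 1 (b) is easy to check numerically. The sketch below (discrete-time project, Ψ^a = βP^a; all names are ours) leaves the state ordering as-is rather than permuting into the (S, S^c) blocks, which only relabels coordinates: the row vector of one-stage resource consumptions under the S-active policy times H^S recovers the resource metrics G_j^S obtained by direct policy evaluation.

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta = 5, 0.8
P0 = rng.random((n, n)); P0 /= P0.sum(axis=1, keepdims=True)
P1 = rng.random((n, n)); P1 /= P1.sum(axis=1, keepdims=True)
q0, q1 = rng.random(n), rng.random(n)

S = np.array([True, True, False, False, True])     # an arbitrary active set
PsiS = np.where(S[:, None], beta * P1, beta * P0)  # S-active transition transform
cS = np.where(S, q1, q0)                           # one-stage resource row vector
H = np.linalg.inv((np.eye(n) - PsiS).T)            # H^S = (B^S)^{-1}
G_lemma = cS @ H                                   # Lemma 1 (b)
G_direct = np.linalg.solve(np.eye(n) - PsiS, cS)   # fixed point G = c_S + Psi^S G
assert np.allclose(G_lemma, G_direct)
```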
The following result gives the reduced costs of LP problem (22) in terms of the marginal resource and reward metrics in (16) and (17), and uses them to represent the LP problem's objective in terms of reduced costs.
Lemma 2.
The reduced costs for non-basic variables corresponding to the S-active BFS for LP problem (22) are given by
[f_S^S − λ g_S^S, −(f_{S^c}^S − λ g_{S^c}^S)],
and hence, the LP problem’s objective can be formulated as
Σ_{(j,a)∈N×{0,1}} (r_j^a − λ q_j^a) x_j^a = F_p^S − λ G_p^S − Σ_{j∈S} (f_j^S − λ g_j^S) x_j^0 + Σ_{j∈S^c} (f_j^S − λ g_j^S) x_j^1.
Proof. 
The results follow from the well-known representation of reduced costs in LP theory, as given by Lemma 1 (d,e), along with the well-known representation of the LP problem’s objective in terms of the value of the current BFS and reduced costs. □
The following result, which follows from Lemma 2, represents the metrics G_p^π, F_p^π and the objective F_p^π − λ G_p^π in terms of their values under the S-active policy. These decomposition identities were first obtained in ([24], Theorem 3) and ([3], Proposition 6.1) via ad hoc arguments.
Lemma 3.
For any policy π Π :
(a)
G_p^π = G_p^S − Σ_{j∈S} g_j^S x_{pj}^{0,π} + Σ_{j∈S^c} g_j^S x_{pj}^{1,π}.
(b)
F_p^π = F_p^S − Σ_{j∈S} f_j^S x_{pj}^{0,π} + Σ_{j∈S^c} f_j^S x_{pj}^{1,π}.
(c)
F_p^π − λ G_p^π = F_p^S − λ G_p^S − Σ_{j∈S} (f_j^S − λ g_j^S) x_{pj}^{0,π} + Σ_{j∈S^c} (f_j^S − λ g_j^S) x_{pj}^{1,π}.
The following result, first established in ([3], Corollary 6.1), elucidates the relationship between resource and reward metrics and the corresponding marginal metrics.
Lemma 4.
(a)
For j ∈ S^c, G_p^{S∪{j}} = G_p^S + g_j^S x_{pj}^{1,S∪{j}} and F_p^{S∪{j}} = F_p^S + f_j^S x_{pj}^{1,S∪{j}}.
(b)
For j ∈ S, G_p^{S\{j}} = G_p^S − g_j^S x_{pj}^{0,S\{j}} and F_p^{S\{j}} = F_p^S − f_j^S x_{pj}^{0,S\{j}}.
Proof. 
To obtain (a), take π = S ∪ {j} in Lemma 3 (a,b). Part (b) follows similarly, taking π = S \ {j}. □
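Lemma 4 (a) can likewise be verified numerically. In the sketch below (discrete-time project, Ψ^a = βP^a, uniform initial distribution; names are ours), g_j^S is computed as the one-step active-versus-passive deviation value of the resource metric, and x is the occupation-measure vector under the S ∪ {j}-active policy.

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta = 5, 0.8
P0 = rng.random((n, n)); P0 /= P0.sum(axis=1, keepdims=True)
P1 = rng.random((n, n)); P1 /= P1.sum(axis=1, keepdims=True)
q0, q1 = rng.random(n), rng.random(n)
p = np.full(n, 1.0 / n)
I = np.eye(n)

def G_vec(S):
    # resource metrics G^S by policy evaluation
    PsiS = np.where(S[:, None], beta * P1, beta * P0)
    return np.linalg.solve(I - PsiS, np.where(S, q1, q0))

S = np.array([True, False, False, True, False])
j = 2                                              # a state outside S
G = G_vec(S)
g_j = (q1[j] + beta * P1[j] @ G) - (q0[j] + beta * P0[j] @ G)  # marginal metric g_j^S
Sj = S.copy(); Sj[j] = True
PsiSj = np.where(Sj[:, None], beta * P1, beta * P0)
x = np.linalg.solve((I - PsiSj).T, p)              # occupation measures under S ∪ {j}
# Lemma 4 (a): G_p^{S∪{j}} = G_p^S + g_j^S x_{pj}^{1,S∪{j}}
assert np.isclose(p @ G_vec(Sj), p @ G + g_j * x[j])
```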

5. A Fast-Pivoting Index Algorithm for PCL ( F ) -Indexable Projects

This section develops an efficient implementation of the adaptive-greedy index algorithm above, focusing on its top-down version, for a project that is PCL ( F ) -indexable.
We start by noting that the index of such a project can be computed by deploying in the LP formulation (22) of the λ-price problem the classic parametric-objective simplex algorithm in [52]. In the present setting, the parametric simplex tableau for the S-active BFS is shown in Table 1. This tableau has the basic variables x_S^1 and x_{S^c}^0 in rows and the non-basic variables x_S^0 and x_{S^c}^1 in columns, and it further has two rows of reduced costs for the non-basic variables. The tableau does not include right-hand sides or objective values, as they are not required in this context. The tableau is displayed just before pivoting on a_{jj}^S, for some j ∈ S^c; that is, just before moving variable x_j^0 out of the basis and bringing x_j^1 into the basis, which corresponds to changing from the S-active to the S∪{j}-active BFS. After the pivot step, the updated tableau is shown in Table 2.
One can readily use such tableaux to implement the top-down adaptive-greedy algorithm AG_F^TD in Algorithm 3, by first constructing the initial tableau for the ∅-active policy, and then carrying out n = |N| pivot steps, at each of which the former BFS active set is augmented by a state. Such a direct approach results in an implementation that we call the Conventional-Pivoting Index (CP) algorithm. An immediate counting argument shows that the n pivot steps of that algorithm perform 2n^3 + O(n^2) arithmetic operations, without considering computation of the initial tableau, which will be addressed in Section 5.1 below.
Note that this algorithm can also be applied to a restless bandit instance to test numerically whether it is indexable. The project is indexable if, and only if, the successive pivot steps, as the price parameter λ decreases from +∞ to −∞ in the parametric-objective simplex algorithm of [52], can each be performed by augmenting the current active set with a single state, thus producing a nested active-set family F_0 = {S_0, S_1, …, S_n} with S_0 = ∅ and S_n = N.
We now consider how to improve the efficiency of the CP algorithm, drawing on the observation that the tableau's rows for basic variables x_S^1 are not used to update the reduced costs. Thus, it is enough to update and store the reduced tableaux shown in Table 3. Table 2 demonstrates that a reduced tableau can be updated without using the deleted rows. The resulting simplification of the CP algorithm yields the Reduced-Pivoting (RP) algorithm. An elementary counting argument shows that the RP algorithm carries out the n pivot steps in n^3 + O(n^2) operations, thus halving the arithmetic operation count of the CP algorithm.
We can exploit the assumption of PCL(F)-indexability to reduce the operation count even further, by updating and storing only the minimal tableaux in Table 4. The minimal tableau for the S∪{j}-active BFS is computed from that for the S-active BFS as displayed in Table 5. This results in the fast-pivoting (FP) adaptive-greedy index algorithm FP_F in Algorithm 5.
Algorithm 5: The fast-pivoting adaptive-greedy index algorithm FP F .
Output:  {j_k, λ*_{j_k}}_{k=1}^{n}
solve [I_{N,N\{j*}} − Ψ^0_{N,N\{j*}}, m̃^0]^T A^(0) = [I_{N,N\{j*}} − Ψ^1_{N,N\{j*}}, m̃^1]^T
(g^(0); f^(0)) := (c^1; r^1) − (c^0; r^0) A^(0);  S_0 := ∅
for k := 1 to n do
  λ_j^(k−1) := f_j^(k−1) / g_j^(k−1),  j ∈ F_out(S_{k−1})
  pick j_k ∈ arg max { λ_j^(k−1) : j ∈ F_out(S_{k−1}) };  λ*_{j_k} := λ_{j_k}^(k−1);  S_k := S_{k−1} ∪ {j_k}
  if k < n then
    A_{j_k S_k^c}^(k) := (1/a_{j_k j_k}^(k−1)) A_{j_k S_k^c}^(k−1);  A_{S_k^c S_k^c}^(k) := A_{S_k^c S_k^c}^(k−1) − A_{S_k^c j_k}^(k−1) A_{j_k S_k^c}^(k)
    g_{S_k^c}^(k) := g_{S_k^c}^(k−1) − g_{j_k}^(k−1) A_{j_k S_k^c}^(k);  f_{S_k^c}^(k) := f_{S_k^c}^(k−1) − f_{j_k}^(k−1) A_{j_k S_k^c}^(k)
  end {if}
end {for}
end { for }
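The following Python sketch implements the FP scheme for a discrete-time discounted bandit with F = 2^N (so that F_out(S) = S^c). It is an illustrative rendering under our own naming, not the author's code: for β < 1 the initial tableau is computed directly from (28) rather than via the numerically robust block system of Section 5.1, and the reduced-cost rows are updated with the normalized pivot row.

```python
import numpy as np

def whittle_index_fp(P0, P1, r0, r1, q0, q1, beta):
    """Fast-pivoting (FP) index computation for a discrete-time
    discounted restless bandit with F = 2^N.  Returns the states in the
    order in which they are added to the active set, together with the
    corresponding index values."""
    n = P0.shape[0]
    Psi0, Psi1 = beta * P0, beta * P1
    I = np.eye(n)
    # initial tableau A^(0) = ((I - Psi^0)^T)^{-1} (I - Psi^1)^T
    A = np.linalg.solve((I - Psi0).T, (I - Psi1).T)
    g = q1 - q0 @ A          # initial marginal resource metrics g^(0)
    f = r1 - r0 @ A          # initial marginal reward metrics f^(0)
    active = np.zeros(n, dtype=bool)
    order, lambdas = [], []
    for _ in range(n):
        rest = np.flatnonzero(~active)
        j = rest[np.argmax(f[rest] / g[rest])]   # largest index ratio
        order.append(int(j))
        lambdas.append(f[j] / g[j])
        active[j] = True
        rest = np.flatnonzero(~active)
        if rest.size:
            row = A[j, rest] / A[j, j]           # normalized pivot row
            A[np.ix_(rest, rest)] -= np.outer(A[rest, j], row)
            g[rest] -= g[j] * row
            f[rest] -= f[j] * row
    return order, lambdas
```

Each pass shrinks the stored tableau by one row and column, so the matrix update at step k costs about 2(n − k)^2 operations, in line with Proposition 1 below.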
The next result evaluates the operation count of algorithm FP_F, showing that it significantly outperforms that of algorithm RP. Note that the complexity of its loop, which performs the n pivot steps, matches that of Gaussian elimination for solving an n × n system of linear equations; it is hence unlikely that this complexity can be improved for general restless bandits.
Proposition 1.
The loop of algorithm FP_F entails (2/3)n^3 + O(n^2) arithmetic operations.
Proof. 
The operation count in the loop is dominated by the update of matrix A_{S_k^c S_k^c}^(k) at each step k, which takes 2(n − k)^2 arithmetic operations. Summing over k = 1, …, n yields the total operation count as stated. □

5.1. Computing the Initial Tableau

We next address computation of the initial tableau, which corresponds to the ∅-active BFS, in a form that is numerically stable and that applies both to the discounted and the average criterion to be addressed in Section 6. The tableaux for the average criterion arise as limits letting the discount rate α vanish in the discounted tableaux.
Note that (cf. (24))
B^∅ = (I − Ψ^0)^T,  N^∅ = (I − Ψ^1)^T,  H^∅ = (B^∅)^{−1},  A^∅ = H^∅ N^∅.
Thus, a straightforward approach to computing A^∅ is to solve the linear equation system
(A^∅)^T (I − Ψ^0) = I − Ψ^1.
However, this has a disadvantage: as α vanishes, the matrices I − Ψ^a become increasingly ill-conditioned, as they are singular for α = 0, because they then converge to I − P^a, where P^a ≜ (p_{ij}^a).
To overcome such a drawback, we draw on the identity (I − Ψ^a) 1 = 1 − ψ^a, where 1 is the all-ones vector and ψ^a = (ψ_i^a)_{i∈N}, which is a consequence of (2). From this and (28), we have
(A^∅)^T (1 − ψ^0) = 1 − ψ^1.
The latter identity gives a useful counterpart as α → 0. Thus, writing ξ_i^a for the length of an (i,a)-stage (cf. Section 3.1), and applying the Maclaurin expansion
ψ_i^a = E[e^{−α ξ_i^a}] = 1 − α m_i^a + O(α^2) as α → 0,
one obtains in the limit
(A^∅)^T m^0 = m^1,
with m_i^a the mean length of an (i,a)-stage and m^a = (m_i^a)_{i∈N}.
We thus obtain the following approach to computing the initial tableau, valid for any α ≥ 0. Letting
m̃_i^a ≜ m_i^a if α = 0, and m̃_i^a ≜ (1 − ψ_i^a)/α if α > 0,
and m̃^a = (m̃_i^a)_{i∈N}, pick any state j* ∈ N and solve the (block) linear equation system
[I_{N,N\{j*}} − Ψ^0_{N,N\{j*}}, m̃^0]^T A^∅ = [I_{N,N\{j*}} − Ψ^1_{N,N\{j*}}, m̃^1]^T
to obtain A^∅. Then, calculate the initial reduced costs using (28) and Lemma 1 (d,e) as
g^∅ = c^1 − c^0 A^∅  and  f^∅ = r^1 − r^0 A^∅.
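A sketch of this initialization (discrete time, Ψ^a = βP^a with α = −log β; our own naming) confirms that the block system, with column j* of I − Ψ^a replaced by m̃^a, recovers the same A^∅ as solving (28) directly when β < 1.

```python
import numpy as np

rng = np.random.default_rng(4)
n, beta = 5, 0.8
alpha = -np.log(beta)                        # discount rate: e^{-alpha} = beta
P0 = rng.random((n, n)); P0 /= P0.sum(axis=1, keepdims=True)
P1 = rng.random((n, n)); P1 /= P1.sum(axis=1, keepdims=True)
Psi0, Psi1 = beta * P0, beta * P1
I = np.eye(n)

jstar = 0
mt0 = (1.0 - Psi0.sum(axis=1)) / alpha       # m~_i^0 = (1 - psi_i^0)/alpha
mt1 = (1.0 - Psi1.sum(axis=1)) / alpha
M0 = I - Psi0; M0[:, jstar] = mt0            # replace column j* by m~^0
M1 = I - Psi1; M1[:, jstar] = mt1
A_block = np.linalg.solve(M0.T, M1.T)        # block system (30)
A_direct = np.linalg.solve((I - Psi0).T, (I - Psi1).T)  # direct system (28)
assert np.allclose(A_block, A_direct)
```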

6. Extension to the Average Criterion

In applications where the (long-run) average criterion is employed, one must consider the version of the λ-price problem (1) based on the average reward and resource metrics given by
F_i^π ≜ lim inf_{T→∞} (1/T) E_i^π [ ∫_0^T R_{X(t)}^{A(t)} dt ] = lim inf_{K→∞} (1/K) E_i^π [ Σ_{k=0}^{K} r_{X_k}^{a_k} ],
and
G_i^π ≜ lim sup_{T→∞} (1/T) E_i^π [ ∫_0^T Q_{X(t)}^{A(t)} dt ] = lim sup_{K→∞} (1/K) E_i^π [ Σ_{k=0}^{K} q_{X_k}^{a_k} ].
As in ([3], Section 6.5), we assume that the embedded process X k is communicating, so each state can be reached from every other state under some stationary policy. Under this assumption, the above metrics do not depend on the initial state under stationary deterministic policies, so one can write F S and G S for active sets S N . This allows a straightforward extension of the above indexability theory to the average criterion.
Regarding the algorithms discussed above, they apply without change to the average criterion, as the results presented in Section 5.1 show that the corresponding tableaux are simply the limits of the discounted tableaux as the discount rate goes to zero, and also outline how to evaluate the initial tableau. To properly extend the results, one must further assume that the active-set family F satisfies that, for every S F , the S-active policy is unichain, so it induces a single recurrent class on the embedded process X k plus a class of transient states, which may be empty.

7. Numerical Experiments

This section discusses results of numerical experiments, based on implementations by the author of the aforementioned algorithms.

Comparing Runtimes of Index Algorithms

The runtime performance of an algorithm depends not only on its arithmetic operation count, but also on its memory-access patterns, which can actually be the dominant factor. Hence, to compare the algorithms considered herein, a numerical study has been conducted, using MATLAB implementations developed by the author. The experiments were run on a PC with an Intel Core i7-8700 CPU at 3.2 GHz with 16 GB of RAM using MATLAB 2020b under Windows 10 Enterprise. For the state-space sizes n = 1000, 2000, …, 15,000, a discrete-time restless bandit instance was randomly constructed. Transition matrices were generated from random matrices with Uniform[0,1] entries, dividing each row by its sum. Active rewards were randomly generated with Uniform[0,1] entries, and passive rewards were zero. The discount factor was β = 0.8.
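The study itself used MATLAB implementations; for reference, an equivalent instance can be generated with a few lines of numpy (a sketch; the helper name and seed are ours).

```python
import numpy as np

def random_instance(n, seed=0):
    """Random restless bandit instance as in the numerical study:
    transition matrices with normalized Uniform[0,1] rows, Uniform[0,1]
    active rewards, zero passive rewards, discount factor 0.8."""
    rng = np.random.default_rng(seed)
    P0 = rng.random((n, n)); P0 /= P0.sum(axis=1, keepdims=True)
    P1 = rng.random((n, n)); P1 /= P1.sum(axis=1, keepdims=True)
    r1 = rng.random(n)          # active rewards
    r0 = np.zeros(n)            # passive rewards
    return P0, P1, r0, r1, 0.8
```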
For each generated instance, the CP algorithm was first used to test for indexability and for PCL-indexability (by checking the signs of marginal resource metrics for the nested active-set family obtained). Since such tests were positive in each instance, the Whittle index was calculated using the CP, RP, and FP algorithms (taking F = 2 N in the latter).
Table 6 records the runtimes of the loop of each algorithm, excluding the initialization stage of computing the initial tableau, while Figure 1 plots them along with cubic least-squares fitting curves. These results highlight that the FP algorithm, whose loop operation count is (2/3)n^3 + O(n^2), is indeed the fastest, followed by the CP and RP algorithms. Recall that 2n^3 + O(n^2) and n^3 + O(n^2) are the loop operation counts of the CP and RP algorithms, respectively.
The observed discrepancies between operation counts and actual runtimes are explained by the algorithms' memory-access patterns. Algorithm CP, which carries out conventional pivot steps, has efficient memory access, because the coefficient matrix A is always updated as a contiguous block of memory. In contrast, algorithms RP and FP reduce the operation count by operating on submatrices of A, which results in time-consuming noncontiguous memory accesses. For the FP algorithm, however, the decrease in arithmetic operations is substantial enough to compensate for this inefficiency, so it comes out in practice as the fastest algorithm.

8. Discussion

This paper has presented a new algorithm for computing the Whittle index of a general finite-state semi-Markov restless bandit, based on an efficient implementation of the adaptive-greedy algorithmic scheme introduced in [3,24] for restless bandits, in which it was not specified how to evaluate certain metrics arising in the algorithm description. The algorithm extends to restless bandits the fast-pivoting implementation developed in [40] for classic (non-restless) bandits, and results from a similar approach, exploiting the structure of parametric simplex tableaux to reduce the operation count to (2/3)n^3 + O(n^2), apart from the effort to compute the initial tableau. It is unlikely that this complexity can be improved upon for general restless bandits, as (2/3)n^3 + O(n^2) is the complexity of Gaussian elimination for solving an n × n linear equation system, and it is shown in [3,24] that computing the Whittle index entails at least solving an equation system of that dimension (albeit with an ordering of the states that is generally not known in advance).
The complexity of the proposed fast-pivoting algorithm is the best that has been reported in the literature. In contrast, the Whittle index algorithm recently presented in [41] for the average criterion has a complexity of O(n^4 2^n), which reduces to O(n^5) for one-dimensional indexable bandits provided it is known that threshold policies are optimal for them.
The new algorithm presented herein, whose implementation is straightforward, will be most useful for computing the Whittle index in complex models with large-scale, multi-dimensional state spaces, for which closed index formulae cannot be derived and an efficient computational approach is needed. Given the explosion of research interest in restless bandit models in the last decade, the proposed algorithm thus has the potential to become a useful tool, allowing researchers to expand the scope of Whittle's index policy to large-scale complex models.
Future research directions include developing efficient implementations with substantially lower complexity by exploiting the special structure of relevant model classes arising in applications, and testing the algorithm in large-scale real world models with real data. Another avenue of research is to develop software implementations that improve the efficiency of the computationally costly block matrix operations required by the fast-pivoting algorithm.

9. Conclusions

To conclude, the findings of this paper can be summarized as follows:
-
A new algorithm to compute the Whittle index of a general n-state semi-Markov restless bandit is presented, which can also be used to test numerically for indexability. After an initialization stage, the algorithm computes the n index values in an n-step loop with a complexity of (2/3)n^3 + O(n^2) arithmetic operations.
-
The algorithm extends to Whittle's index the fast-pivoting (2/3)n^3 + O(n^2) algorithm introduced by the author in [40] for the Gittins index of classic (non-restless) bandits, which also has the lowest complexity reported to date.
-
The proposed algorithm has substantially better complexity than alternative algorithms proposed in the literature.
-
The algorithm will be especially useful for computing the Whittle index in large-scale multi-dimensional models where the index cannot be derived in closed form and alternative algorithms will result in prohibitive computation times.

Funding

This research has been developed over a number of years, and has been funded by the Spanish Government under grants MEC MTM2004-02334 and PID2019-109196GB-I00/AEI/10.13039/501100011033. This research was also funded in part by the Comunidad de Madrid in the setting of the multi-year agreement with Universidad Carlos III de Madrid within the line of activity “Excelencia para el Profesorado Universitario”, in the framework of the V Regional Plan of Scientific Research and Technological Innovation 2016–2020.

Acknowledgments

Preliminary early versions of this work were published by the author in the abridged proceedings [53] of the Second International Workshop on Tools for Solving Structured Markov Chains (SMCtools 2007), Nantes, France, 2007, and in the working paper [54].

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming; Wiley: New York, NY, USA, 1994. [Google Scholar]
  2. Whittle, P. Restless bandits: Activity allocation in a changing world. J. Appl. Probab. 1988, 25A, 287–298. [Google Scholar] [CrossRef]
  3. Niño-Mora, J. Dynamic allocation indices for restless projects and queueing admission control: A polyhedral approach. Math. Program. 2002, 93, 361–413. [Google Scholar] [CrossRef]
  4. Niño-Mora, J. Restless bandit marginal productivity indices, diminishing returns and optimal control of make-to-order/make-to-stock M/G/1 queues. Math. Oper. Res. 2006, 31, 50–84. [Google Scholar] [CrossRef]
  5. Fu, J.; Moran, B.; Guo, J.; Wong, E.W.M.; Zukerman, M. Asymptotically optimal job assignment for energy-efficient processor-sharing server farms. IEEE J. Sel. Areas Commun. 2016, 34, 4008–4023. [Google Scholar] [CrossRef]
  6. Niño-Mora, J. Computing an index policy for bandits with switching penalties. In Proceedings of the 2nd International Conference on Performance Evaluation Methodologies and Tools, Nantes, France, 22–27 October 2007. [Google Scholar] [CrossRef]
  7. Qian, Y.; Zhang, C.; Krishnamachari, B.; Tambe, M. Restless poachers: Handling exploration-exploitation tradeoffs in security domains. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, Singapore, 9–13 May 2016; pp. 123–131. [Google Scholar]
  8. Borkar, V.S.; Pattathil, S. Whittle indexability in egalitarian processor sharing systems. Ann. Oper. Res. 2017, 1–21. [Google Scholar] [CrossRef] [Green Version]
  9. Borkar, V.S.; Ravikumar, K.; Saboo, K. An index policy for dynamic pricing in cloud computing under price commitments. Appl. Math. 2017, 44, 215–245. [Google Scholar] [CrossRef]
  10. Borkar, V.S.; Kasbekar, G.S.; Pattathil, S.; Shetty, P.Y. Opportunistic scheduling as restless bandits. IEEE Trans. Control Netw. Syst. 2018, 5, 1952–1961. [Google Scholar] [CrossRef] [Green Version]
  11. Gerum, P.C.L.; Altay, A.; Baykal-Gursoy, M. Data-driven predictive maintenance scheduling policies for railways. Transp. Res. Part C Emerg. Technol. 2019, 107, 137–154. [Google Scholar] [CrossRef]
  12. Abbou, A.; Makis, V. Group maintenance: A restless bandits approach. INFORMS J. Comput. 2019, 31, 719–731. [Google Scholar] [CrossRef]
  13. Ayer, T.; Zhang, C.; Bonifonte, A.; Spaulding, A.C.; Chhatwal, J. Prioritizing hepatitis C treatment in US prisons. Oper. Res. 2019, 67, 853–873. [Google Scholar] [CrossRef] [Green Version]
  14. Niño-Mora, J. Resource allocation and routing in parallel multi-server queues with abandonments for cloud profit maximization. Comput. Oper. Res. 2019, 103, 221–236. [Google Scholar] [CrossRef]
  15. Fu, J.; Moran, B. Energy-efficient job-assignment policy with asymptotically guaranteed performance deviation. IEEE/ACM Trans. Netw. 2020, 28, 1325–1338. [Google Scholar] [CrossRef] [Green Version]
  16. Hsu, Y.P.; Modiano, E.; Duan, L.J. Scheduling algorithms for minimizing age of information in wireless broadcast networks with random arrivals. IEEE Trans. Mob. Comput. 2020, 19, 2903–2915. [Google Scholar] [CrossRef]
  17. Sun, J.Z.; Jiang, Z.Y.; Krishnamachari, B.; Zhou, S.; Niu, Z.S. Closed-form Whittle’s index-enabled random access for timely status update. IEEE Trans. Commun. 2020, 68, 1538–1551. [Google Scholar] [CrossRef]
  18. Li, D.; Ding, L.; Connor, S. When to switch? Index policies for resource scheduling in emergency response. Prod. Oper. Manag. 2020, 29, 241–262. [Google Scholar] [CrossRef]
  19. Papadimitriou, C.H.; Tsitsiklis, J.N. The complexity of optimal queuing network control. Math. Oper. Res. 1999, 24, 293–305. [Google Scholar] [CrossRef] [Green Version]
  20. Weber, R.R.; Weiss, G. On an index policy for restless bandits. J. Appl. Probab. 1990, 27, 637–648. [Google Scholar] [CrossRef]
  21. Weber, R.R.; Weiss, G. Addendum to ‘On an index policy for restless bandits’. Adv. Appl. Probab. 1991, 23, 429–430. [Google Scholar] [CrossRef] [Green Version]
  22. Ouyang, W.Z.; Eryilmaz, A.; Shroff, N.B. Downlink scheduling over Markovian fading channels. IEEE/ACM Trans. Netw. 2016, 24, 1801–1812. [Google Scholar] [CrossRef] [Green Version]
  23. Verloop, I.M. Asymptotically optimal priority policies for indexable and nonindexable restless bandits. Ann. Appl. Probab. 2016, 26, 1947–1995. [Google Scholar] [CrossRef]
  24. Niño-Mora, J. Restless bandits, partial conservation laws and indexability. Adv. Appl. Probab. 2001, 33, 76–98. [Google Scholar] [CrossRef] [Green Version]
  25. Niño-Mora, J. A verification theorem for threshold-indexability of real-state discounted restless bandits. Math. Oper. Res. 2020, 45, 465–496. [Google Scholar] [CrossRef]
  26. Niño-Mora, J. Marginal productivity index policies for scheduling a multiclass delay-/loss-sensitive queue. Queueing Syst. 2006, 54, 281–312. [Google Scholar] [CrossRef] [Green Version]
  27. Niño-Mora, J. Dynamic priority allocation via restless bandit marginal productivity indices. Top 2007, 15, 161–198. [Google Scholar] [CrossRef]
  28. Cao, J.; Nyberg, C. Linear programming relaxations and marginal productivity index policies for the buffer sharing problem. Queueing Syst. 2008, 60, 247–269. [Google Scholar] [CrossRef]
  29. Huberman, B.A.; Wu, F. The economics of attention: Maximizing user value in information-rich environments. Adv. Complex Syst. 2008, 11, 487–496. [Google Scholar] [CrossRef] [Green Version]
  30. Niño-Mora, J. A faster index algorithm and a computational study for bandits with switching costs. INFORMS J. Comput. 2008, 20, 255–269. [Google Scholar] [CrossRef]
  31. Niño-Mora, J. Admission and routing of soft real-time jobs to multiclusters: Design and comparison of index policies. Comput. Oper. Res. 2012, 39, 3431–3444. [Google Scholar] [CrossRef]
  32. Niño-Mora, J. Towards minimum loss job routing to parallel heterogeneous multiserver queues via index policies. Eur. J. Oper. Res. 2012, 220, 705–715. [Google Scholar] [CrossRef]
  33. He, T.; Chen, S.; Kim, H.; Tong, L.; Lee, K.W. Scheduling parallel tasks onto opportunistically available cloud resources. In Proceedings of the 2012 IEEE Fifth International Conference on Cloud Computing, Honolulu, HI, USA, 24–29 June 2012; pp. 180–187. [Google Scholar]
  34. Menner, M.; Zeilinger, M.N. A user comfort model and index policy for personalizing discrete controller decisions. In Proceedings of the 2018 European Control Conference (ECC), Limassol, Cyprus, 12–15 June 2018; pp. 1759–1765. [Google Scholar]
  35. Dance, C.R.; Silander, T. Optimal policies for observing time series and related restless bandit problems. J. Mach. Learn. Res. 2019, 20, 35. [Google Scholar]
  36. Klimov, G.P. Time-sharing service systems. I. Theory Probab. Appl. 1974, 19, 532–551. [Google Scholar] [CrossRef]
  37. Bertsimas, D.; Niño-Mora, J. Conservation laws, extended polymatroids and multiarmed bandit problems; a polyhedral approach to indexable systems. Math. Oper. Res. 1996, 21, 257–306. [Google Scholar] [CrossRef]
  38. Niño-Mora, J. Conservation laws and related applications. In Wiley Encyclopedia of Operations Research and Management Science; Wiley: New York, NY, USA, 2011. [Google Scholar] [CrossRef]
  39. Niño-Mora, J. Klimov’s model. In Wiley Encyclopedia of Operations Research and Management Science; Cochran, J.J., Ed.; Wiley: New York, NY, USA, 2011. [Google Scholar] [CrossRef]
  40. Niño-Mora, J. A (2/3)n3 fast-pivoting algorithm for the Gittins index and optimal stopping of a Markov chain. INFORMS J. Comput. 2007, 19, 596–606. [Google Scholar] [CrossRef] [Green Version]
  41. Ayesta, U.; Gupta, M.K.; Verloop, I.M. On the computation of Whittle’s index for Markovian restless bandits. Math. Methods Oper. Res. 2020, 1–30. [Google Scholar] [CrossRef]
  42. Zhao, Q. Multi-Armed Bandits: Theory and Applications to Online Learning in Networks; Morgan & Claypool: San Rafael, CA, USA, 2020. [Google Scholar]
  43. Varaiya, P.P.; Walrand, J.C.; Buyukkoc, C. Extensions of the multiarmed bandit problem: The discounted case. IEEE Trans. Automat. Control 1985, 30, 426–439. [Google Scholar] [CrossRef]
  44. Chen, Y.R.; Katehakis, M.N. Linear programming for finite state multi-armed bandit problems. Math. Oper. Res. 1986, 11, 180–183. [Google Scholar] [CrossRef]
  45. Kallenberg, L.C.M. Computation of the Gittins index. Math. Oper. Res. 1986, 11, 184–186. [Google Scholar] [CrossRef]
  46. Katehakis, M.N.; Veinott, A.F., Jr. The multi-armed bandit problem: Decomposition and computation. Math. Oper. Res. 1987, 12, 262–268. [Google Scholar] [CrossRef]
  47. Katta, A.K.; Sethuraman, J. The multi-armed bandit problem: Decomposition and computation. SIAM J. Discret. Math. 2004, 18, 110–113. [Google Scholar] [CrossRef] [Green Version]
  48. Sonin, I.M. A generalized Gittins index for a Markov chain and its recursive calculation. Stat. Probab. Lett. 2008, 78, 1526–1533. [Google Scholar] [CrossRef]
  49. Gittins, J.C. Bandit processes and dynamic allocation indices. J. R. Stat. Soc. Ser. B 1979, 41, 148–177. [Google Scholar] [CrossRef] [Green Version]
  50. Larrañaga, M.; Ayesta, U.; Verloop, I.M. Dynamic control of birth-and-death restless bandits: Application to resource-allocation problems. IEEE/ACM Trans. Netw. 2016, 24, 3812–3825. [Google Scholar] [CrossRef]
  51. Kallenberg, L.C.M. Finite state and action MDPs. In Handbook of Markov Decision Processes; Kluwer: Boston, MA, USA, 2002; pp. 21–87. [Google Scholar]
  52. Gass, S.; Saaty, T. The computational algorithm for the parametric objective function. Nav. Res. Logist. Q. 1955, 2, 39–45. [Google Scholar] [CrossRef]
  53. Niño-Mora, J. Characterization and computation of restless bandit marginal productivity indices. In SMCtools ’07: Proceedings from the Second International Workshop on Tools for Solving Structured Markov Chains; ICST: Brussels, Belgium, 2007. [Google Scholar]
  54. Niño-Mora, J. Characterization and computation of restless bandit marginal productivity indices. Universidad Carlos III de Madrid (UC3M) Working Papers, Statistics and Econometrics 07–11. 2007. Available online: http://hdl.handle.net/10016/796 (accessed on 15 November 2020).
Figure 1. Runtimes of index algorithms vs. number of project states with cubic least-squares fits.
Table 1. Simplex tableau (parametric) for S-active BFS, with pivot a_{jj}^S.
  | (x_S^0)^T | x_j^1 | (x_{S^c\{j}}^1)^T
x_S^1 | A_{SS}^S | A_{Sj}^S | A_{S,S^c\{j}}^S
x_j^0 | A_{jS}^S | a_{jj}^S | A_{j,S^c\{j}}^S
x_{S^c\{j}}^0 | A_{S^c\{j},S}^S | A_{S^c\{j},j}^S | A_{S^c\{j},S^c\{j}}^S
  | g_S^S | −g_j^S | −g_{S^c\{j}}^S
  | f_S^S | −f_j^S | −f_{S^c\{j}}^S
Table 2. Tableau for the S ∪ {j}-active BFS, after pivoting on a_{jj}^S.

|                 | (x_S^0)^T | x_j^0 | (x_{S^c\{j}}^1)^T |
|-----------------|-----------|-------|-------------------|
| x_S^1           | A_{S,S}^S - (a_{jj}^S)^{-1} A_{S,j}^S A_{j,S}^S | -(a_{jj}^S)^{-1} A_{S,j}^S | A_{S,S^c\{j}}^S - (a_{jj}^S)^{-1} A_{S,j}^S A_{j,S^c\{j}}^S |
| x_j^1           | (a_{jj}^S)^{-1} A_{j,S}^S | (a_{jj}^S)^{-1} | (a_{jj}^S)^{-1} A_{j,S^c\{j}}^S |
| x_{S^c\{j}}^0   | A_{S^c\{j},S}^S - (a_{jj}^S)^{-1} A_{S^c\{j},j}^S A_{j,S}^S | -(a_{jj}^S)^{-1} A_{S^c\{j},j}^S | A_{S^c\{j},S^c\{j}}^S - (a_{jj}^S)^{-1} A_{S^c\{j},j}^S A_{j,S^c\{j}}^S |
|                 | g_S^S + g_j^S (a_{jj}^S)^{-1} A_{j,S}^S | g_j^S (a_{jj}^S)^{-1} | g_{S^c\{j}}^S + g_j^S (a_{jj}^S)^{-1} A_{j,S^c\{j}}^S |
|                 | f_S^S + f_j^S (a_{jj}^S)^{-1} A_{j,S}^S | f_j^S (a_{jj}^S)^{-1} | f_{S^c\{j}}^S + f_j^S (a_{jj}^S)^{-1} A_{j,S^c\{j}}^S |
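The updates in Table 2 are those of a standard exchange pivot on a_{jj}^S: a rank-1 (Schur-complement) correction of the non-pivot block, with the pivot row and column rescaled by (a_{jj}^S)^{-1}. The following NumPy sketch illustrates them on arbitrary stand-in data (the names A, g, f mirror the tableau blocks but the numbers are illustrative, not the paper's model quantities):

```python
import numpy as np

# Stand-in 4x4 tableau data; a dominant diagonal keeps pivots nonzero.
rng = np.random.default_rng(0)
n = 4
A = rng.uniform(1.0, 2.0, (n, n)) + n * np.eye(n)
g = rng.uniform(0.5, 1.0, n)
f = rng.uniform(0.5, 1.0, n)

def exchange_pivot(A, g, f, j):
    """Pivot on the diagonal element a_jj as in Table 2: the remaining
    block gets the rank-1 Schur-complement correction, the pivot row
    and column are scaled by 1/a_jj, and the g/f rows are updated with
    the '+' convention shown in the table."""
    a = A[j, j]
    Anew = A - np.outer(A[:, j], A[j, :]) / a   # Schur-complement block
    Anew[j, :] = A[j, :] / a                    # pivot row
    Anew[:, j] = -A[:, j] / a                   # pivot column
    Anew[j, j] = 1.0 / a                        # pivot element
    gnew = g + g[j] * A[j, :] / a
    gnew[j] = g[j] / a
    fnew = f + f[j] * A[j, :] / a
    fnew[j] = f[j] / a
    return Anew, gnew, fnew

A1, g1, f1 = exchange_pivot(A, g, f, 1)
```

A quick sanity check on the matrix-block formulas: pivoting twice on the same element restores the original block, since the exchange pivot is an involution on it.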
Table 3. Reduced tableau for the S-active BFS, with pivot a_{jj}^S.

|                 | (x_S^0)^T        | x_j^1            | (x_{S^c\{j}}^1)^T       |
|-----------------|------------------|------------------|-------------------------|
| x_j^0           | A_{j,S}^S        | a_{jj}^S         | A_{j,S^c\{j}}^S         |
| x_{S^c\{j}}^0   | A_{S^c\{j},S}^S  | A_{S^c\{j},j}^S  | A_{S^c\{j},S^c\{j}}^S   |
|                 | g_S^S            | g_j^S            | g_{S^c\{j}}^S           |
|                 | f_S^S            | f_j^S            | f_{S^c\{j}}^S           |
Table 4. Minimal tableau for the S-active BFS.

|            | (x_{S^c}^1)^T |
|------------|---------------|
| x_{S^c}^0  | A_{S^c,S^c}^S |
|            | g_{S^c}^S     |
|            | f_{S^c}^S     |
Table 5. Minimal tableau for the S ∪ {j}-active BFS, obtained after pivoting on a_{jj}^S.

|                 | (x_{S^c\{j}}^1)^T |
|-----------------|-------------------|
| x_{S^c\{j}}^0   | A_{S^c\{j},S^c\{j}}^S - (a_{jj}^S)^{-1} A_{S^c\{j},j}^S A_{j,S^c\{j}}^S |
|                 | g_{S^c\{j}}^S + g_j^S (a_{jj}^S)^{-1} A_{j,S^c\{j}}^S |
|                 | f_{S^c\{j}}^S + f_j^S (a_{jj}^S)^{-1} A_{j,S^c\{j}}^S |
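Tables 4 and 5 show where the speed-up comes from: each step updates only the surviving S^c block and the g/f rows, a rank-1 correction costing about 2m^2 arithmetic operations on an m-state block, so n such steps total (2/3)n^3 + O(n^2) operations, consistent with the count stated in the abstract. A schematic NumPy sketch of this linear-algebra core follows; note it is only a sketch, since the pivot order is supplied externally here, whereas the actual algorithm selects each pivot adaptively:

```python
import numpy as np

def minimal_tableau_sweep(A, g, f, order):
    """Run the elimination recursion of Tables 4-5 on the minimal
    tableau: pivot on the diagonal entry of state j, apply the rank-1
    (Schur-complement) correction to the surviving block and to the
    g/f rows, then drop j's row and column.  Returns the successive
    pivot elements."""
    A = np.asarray(A, dtype=float).copy()
    g = np.asarray(g, dtype=float).copy()
    f = np.asarray(f, dtype=float).copy()
    active = list(range(A.shape[0]))
    pivots = []
    for j in order:
        k = active.index(j)                    # position of j in the block
        a = A[k, k]
        pivots.append(a)
        rest = [i for i in range(len(active)) if i != k]
        row, col = A[k, rest], A[rest, k]      # pivot row and column
        g = g[rest] + g[k] * row / a           # g row update (Table 5)
        f = f[rest] + f[k] * row / a           # f row update (Table 5)
        A = A[np.ix_(rest, rest)] - np.outer(col, row) / a
        active.pop(k)
    return pivots

# Stand-in data (illustrative, not the paper's model quantities).
rng = np.random.default_rng(1)
n = 6
A = rng.uniform(1.0, 2.0, (n, n)) + n * np.eye(n)
g = rng.uniform(0.0, 1.0, n)
f = rng.uniform(0.0, 1.0, n)
p1 = minimal_tableau_sweep(A, g, f, list(range(n)))
```

Since each step takes the Schur complement of a diagonal entry, the product of the pivots equals det(A) regardless of the elimination order, which gives a convenient correctness check on the block updates.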
Table 6. Runtimes (s) of index algorithms.

| n      | FP     | RP     | CP     |
|--------|--------|--------|--------|
| 1000   | 1.8    | 2.6    | 2.2    |
| 2000   | 14.8   | 23.6   | 19.6   |
| 3000   | 53.3   | 77.1   | 67.8   |
| 4000   | 121.4  | 193.4  | 156.3  |
| 5000   | 227.6  | 342.8  | 295.9  |
| 6000   | 433.2  | 647.9  | 541.8  |
| 7000   | 699.6  | 1034.8 | 862.3  |
| 8000   | 1118.3 | 1531.5 | 1280.9 |
| 9000   | 1530.8 | 2173.6 | 1822.3 |
| 10,000 | 2100.7 | 2919.3 | 2500.9 |
| 11,000 | 2687.2 | 3580.0 | 3277.7 |
| 12,000 | 3575.9 | 5055.7 | 4300.6 |
| 13,000 | 4747.4 | 6610.4 | 5539.2 |
| 14,000 | 5629.5 | 7923.1 | 6829.9 |
| 15,000 | 7254.1 | 8871.0 | 8250.5 |
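The cubic growth behind Figure 1's least-squares fits can be checked directly from Table 6; the short NumPy sketch below fits a cubic to the FP column (measuring n in thousands of states keeps the Vandermonde system well conditioned):

```python
import numpy as np

# FP-column runtimes (seconds) from Table 6, for n = 1000, ..., 15,000.
k = np.arange(1, 16, dtype=float)          # n / 1000
fp = np.array([1.8, 14.8, 53.3, 121.4, 227.6, 433.2, 699.6, 1118.3,
               1530.8, 2100.7, 2687.2, 3575.9, 4747.4, 5629.5, 7254.1])

# Cubic least-squares fit, as used for Figure 1.
coef = np.polyfit(k, fp, 3)                # [c3, c2, c1, c0]
pred = np.polyval(coef, k)
r2 = 1.0 - np.sum((fp - pred) ** 2) / np.sum((fp - fp.mean()) ** 2)
```

A positive leading coefficient and an R² very close to 1 confirm that the measured runtimes are well described by a cubic in n, in line with the algorithm's (2/3)n^3 + O(n^2) operation count.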
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Niño-Mora, J. A Fast-Pivoting Algorithm for Whittle’s Restless Bandit Index. Mathematics 2020, 8, 2226. https://doi.org/10.3390/math8122226