# Markovian Restless Bandits and Index Policies: A Review

## Abstract

## 1. Introduction

## 2. Antecedents: Multi-Armed Bandits and the Gittins Index Policy

## 3. Restless Multi-Armed Bandits and the Whittle Index Policy

## 4. Complexity, Approximation, and Relaxations

## 5. Indexability

## 6. Whittle Index Computation

## 7. Optimality of the Myopic Policy

## 8. Asymptotic Optimality of Index Policies

## 9. Multi-Action Bandits

## 10. Lagrangian Index and Fluid Relaxation Policies

## 11. Reinforcement Learning and Q-Learning Approaches

## 12. Regret-Based Online Learning

## 13. Applications: MDP Models

#### 13.1. Variants of the MABP

#### 13.2. Queueing Models

#### 13.3. Web Crawling

#### 13.4. Public Health Interventions

#### 13.5. Communication Networks

#### 13.6. Miscellaneous Applications

## 14. Applications: POMDP Models

## 15. Conclusions

## Funding

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

| Abbreviation | Definition |
|---|---|
| DP | Dynamic programming |
| LP | Linear programming |
| MDP | Markov decision process |
| POMDP | Partially observable Markov decision process |
| MABP | Multi-armed bandit problem |
| RMABP | Restless multi-armed bandit problem |

## References

- Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming; Wiley: New York, NY, USA, 1994. [Google Scholar]
- Gittins, J.C.; Jones, D.M. A dynamic allocation index for the sequential design of experiments. In Progress in Statistics, Proceedings of the European Meeting of Statisticians, Budapest, Hungary, 31 August–5 September 1972; Colloquia Mathematica Societatis János Bolyai; Gani, J., Sarkadi, K., Vincze, I., Eds.; North-Holland: Amsterdam, The Netherlands, 1974; Volume 9, pp. 241–266. [Google Scholar]
- Whittle, P. Restless bandits: Activity allocation in a changing world. J. Appl. Probab.
**1988**, 25A, 287–298. [Google Scholar] [CrossRef] - Thompson, W.R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika
**1933**, 25, 275–294. [Google Scholar] [CrossRef] - Thompson, W.R. On the theory of apportionment. Am. J. Math.
**1935**, 57, 450–456. [Google Scholar] [CrossRef] - Wald, A. Sequential Analysis; Wiley: New York, NY, USA, 1947. [Google Scholar]
- Robbins, H. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc.
**1952**, 58, 527–535. [Google Scholar] [CrossRef][Green Version] - Bradt, R.N.; Johnson, S.M.; Karlin, S. On sequential designs for maximizing the sum of n observations. Ann. Math. Statist.
**1956**, 27, 1060–1074. [Google Scholar] [CrossRef] - Bellman, R. A problem in the sequential design of experiments. Sankhyā Indian J. Stat.
**1956**, 16, 221–229. [Google Scholar] - Bellman, R. Dynamic Programming; Princeton University Press: Princeton, NJ, USA, 1957. [Google Scholar]
- Howard, R.A. Dynamic Programming and Markov Processes; Wiley: New York, NY, USA, 1960. [Google Scholar]
- Bertsekas, D.P. Dynamic Programming and Optimal Control, 4th ed.; Athena Scientific: Belmont, MA, USA, 2017; Volume I. [Google Scholar]
- Bertsekas, D.P. Dynamic Programming and Optimal Control—Approximate Dynamic Programming, 4th ed.; Athena Scientific: Nashua, NH, USA, 2012; Volume II. [Google Scholar]
- Gittins, J.C. Bandit processes and dynamic allocation indices (with discussion). J. Roy. Statist. Soc. Ser. B
**1979**, 41, 148–177. [Google Scholar] - Gittins, J.C. Multi-Armed Bandit Allocation Indices; Wiley: Chichester, UK, 1989. [Google Scholar]
- Gittins, J.C.; Glazebrook, K.; Weber, R. Multi-Armed Bandit Allocation Indices, 2nd ed.; Wiley: Chichester, UK, 2011. [Google Scholar]
- Whittle, P. Multi-armed bandits and the Gittins index. J. Roy. Statist. Soc. Ser. B
**1980**, 42, 143–149. [Google Scholar] [CrossRef] - Varaiya, P.P.; Walrand, J.C.; Buyukkoc, C. Extensions of the multiarmed bandit problem: The discounted case. IEEE Trans. Automat. Control
**1985**, 30, 426–439. [Google Scholar] [CrossRef] - Weber, R. On the Gittins index for multiarmed bandits. Ann. Appl. Probab.
**1992**, 2, 1024–1033. [Google Scholar] [CrossRef] - Tsitsiklis, J.N. A short proof of the Gittins index theorem. Ann. Appl. Probab.
**1994**, 4, 194–199. [Google Scholar] [CrossRef] - Bertsimas, D.; Niño-Mora, J. Conservation laws, extended polymatroids and multiarmed bandit problems; a polyhedral approach to indexable systems. Math. Oper. Res.
**1996**, 21, 257–306. [Google Scholar] [CrossRef] - Whittle, P. Arm-acquiring bandits. Ann. Probab.
**1981**, 9, 284–292. [Google Scholar] [CrossRef] - Dumitriu, I.; Tetali, P.; Winkler, P. On playing golf with two balls. SIAM J. Discret. Math.
**2003**, 16, 604–615. [Google Scholar] [CrossRef][Green Version] - Bao, W.; Cai, X.; Wu, X. A general theory of multiarmed bandit processes with constrained arm switches. SIAM J. Control Optim.
**2021**, 59, 4666–4688. [Google Scholar] [CrossRef] - Klimov, G.P. Time-sharing service systems. I. Theory Probab. Appl.
**1974**, 19, 532–551. [Google Scholar] [CrossRef] - Meilijson, I.; Weiss, G. Multiple feedback at a single-server station. Stoch. Process. Appl.
**1977**, 5, 195–205. [Google Scholar] [CrossRef][Green Version] - Weiss, G. Branching bandit processes. Probab. Eng. Inform. Sci.
**1988**, 2, 269–278. [Google Scholar] [CrossRef] - Niño-Mora, J. Klimov’s model. In Wiley Encyclopedia of Operations Research and Management Science; Cochran, J.J., Cox, L.A., Jr., Keskinocak, P., Kharoufeh, J.P., Smith, J.C., Eds.; Wiley: New York, NY, USA, 2011. [Google Scholar] [CrossRef]
- O’Flaherty, B. Some results on two-armed bandits when both projects vary. J. Appl. Probab.
**1989**, 26, 655–658. [Google Scholar] [CrossRef] - Papadimitriou, C.H.; Tsitsiklis, J.N. The complexity of optimal queuing network control. Math. Oper. Res.
**1999**, 24, 293–305. [Google Scholar] [CrossRef][Green Version] - Guha, S.; Munagala, K.; Shi, P. Approximation algorithms for restless bandit problems. J. ACM
**2010**, 58, 3. [Google Scholar] [CrossRef] - Liu, K.; Zhao, Q. A restless bandit formulation of opportunistic access: Indexability and index policy. In Proceedings of the 5th IEEE Annual Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks Workshops, San Francisco, CA, USA, 16–20 June 2008; pp. 1–5. [Google Scholar]
- Liu, K.; Zhao, Q. Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access. IEEE Trans. Inform. Theory
**2010**, 56, 5547–5567. [Google Scholar] [CrossRef] - Le Ny, J.; Dahleh, M.; Feron, E. Multi-UAV dynamic routing with partial observations using restless bandit allocation indices. In Proceedings of the American Control Conference (ACC), Seattle, WA, USA, 11–13 June 2008; pp. 4220–4225. [Google Scholar]
- Wan, P.J.; Xu, X.H. Weighted restless bandit and its applications. In Proceedings of the 35th IEEE International Conference on Distributed Computing Systems (ICDCS), Columbus, OH, USA, 29 June–2 July 2015; pp. 507–516. [Google Scholar]
- Xu, X.H.; Song, M. Approximation algorithms for wireless opportunistic spectrum scheduling in cognitive radio networks. In Proceedings of the 35th Annual IEEE International Conference on Computer Communications (INFOCOM), San Francisco, CA, USA, 10–14 April 2016; pp. 1–7. [Google Scholar]
- Xu, X.H.; Wang, L.X. Efficient algorithm for multi-constrained opportunistic wireless scheduling. In Proceedings of the 16th IEEE International Conference on Mobility, Sensing and Networking (MSN), Tokyo, Japan, 17–19 December 2020; pp. 169–173. [Google Scholar]
- Bertsimas, D.; Niño-Mora, J. Restless bandits, linear programming relaxations, and a primal-dual index heuristic. Oper. Res.
**2000**, 48, 80–90. [Google Scholar] [CrossRef][Green Version] - Hawkins, J.T. A Langrangian Decomposition Approach to Weakly Coupled Dynamic Optimization Problems and Its Applications. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2003. [Google Scholar]
- Adelman, D.; Mersereau, A.J. Relaxations of weakly coupled stochastic dynamic programs. Oper. Res.
**2008**, 56, 712–727. [Google Scholar] [CrossRef][Green Version] - Powell, W.B. Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd ed.; Wiley: Hoboken, NJ, USA, 2011. [Google Scholar]
- Brown, D.B.; Zhang, J.W. On the strength of relaxations of weakly coupled stochastic dynamic programs. Oper. Res.
**2022**. [Google Scholar] [CrossRef] - Liu, K.; Weber, R.; Zhao, Q. Indexability and Whittle index for restless bandit problems involving reset processes. In Proceedings of the 50th IEEE Conference on Decision and Control and European Control Conference (CDC-ECC), Orlando, FL, USA, 12–15 December 2011; pp. 7690–7696. [Google Scholar]
- Fryer, R.; Harms, P. Two-armed restless bandits with imperfect information: Stochastic control and indexability. Math. Oper. Res.
**2018**, 43, 399–427. [Google Scholar] [CrossRef][Green Version] - Caro, F.; Yoo, O.S. Indexability of bandit problems with response delays. Probab. Eng. Inform. Sci.
**2010**, 24, 349–374. [Google Scholar] [CrossRef] - Whittle, P. Optimal Control: Basics and Beyond; Wiley: Chichester, UK, 1996. [Google Scholar]
- Veatch, M.H.; Wein, L.M. Scheduling a multiclass make-to-stock queue: Index policies and hedging points. Oper. Res.
**1996**, 44, 634–647. [Google Scholar] [CrossRef][Green Version] - Dance, C.R.; Silander, T. When are Kalman-filter restless bandits indexable? In Proceedings of the Conference on Neural Information Processing Systems (NIPS), Montreal, QB, Canada, 7–12 December 2015; Cortes, C., Lee, D.D., Sugiyama, M., Garnett, R., Eds.; MIT Press: Cambridge, MA, USA, 2015; pp. 1711–1719. [Google Scholar]
- Niño-Mora, J. Restless bandits, partial conservation laws and indexability. Adv. Appl. Probab.
**2001**, 33, 76–98. [Google Scholar] [CrossRef][Green Version] - Niño-Mora, J. Dynamic allocation indices for restless projects and queueing admission control: A polyhedral approach. Math. Program.
**2002**, 93, 361–413. [Google Scholar] [CrossRef] - Niño-Mora, J. Restless bandit marginal productivity indices, diminishing returns and optimal control of make-to-order/make-to-stock M/G/1 queues. Math. Oper. Res.
**2006**, 31, 50–84. [Google Scholar] [CrossRef] - Niño-Mora, J. Dynamic priority allocation via restless bandit marginal productivity indices (with discussion). TOP
**2007**, 15, 161–198. [Google Scholar] [CrossRef] - Niño-Mora, J. A verification theorem for threshold-indexability of real-state discounted restless bandits. Math. Oper. Res.
**2020**, 45, 465–496. [Google Scholar] [CrossRef][Green Version] - Dance, C.R.; Silander, T. Optimal policies for observing time series and related restless bandit problems. J. Mach. Learn. Res.
**2019**, 20, 35. [Google Scholar] - Niño-Mora, J. Characterization and computation of restless bandit marginal productivity indices. In Proceedings of the 1st International ICST Workshop on Tools for solving Structured Markov Chains (SMCTools), Nantes, France, 26 October 2007; Buchholz, P., Dayar, T., Eds.; ICST: Brussels, Belgium, 2007. ACM International Conference Proceeding Series. [Google Scholar] [CrossRef][Green Version]
- Niño-Mora, J. A fast-pivoting algorithm for Whittle’s restless bandit index. Mathematics
**2020**, 8, 2226. [Google Scholar] [CrossRef] - Niño-Mora, J. A (2/3)n
^{3}fast-pivoting algorithm for the Gittins index and optimal stopping of a Markov chain. INFORMS J. Comput.**2007**, 19, 596–606. [Google Scholar] [CrossRef][Green Version] - Qian, Y.; Zhang, C.; Krishnamachari, B.; Tambe, M. Restless poachers: Handling exploration-exploitation tradeoffs in security domains. In Proceedings of the 15th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), Singapore, Singapore, 9–13 May 2016; pp. 123–131. [Google Scholar]
- Akbarzadeh, N.; Mahajan, A. Conditions for indexability of restless bandits and an O(K
^{3}) algorithm to compute Whittle index. Adv. Appl. Probab.**2022**, 54, 1164–1192. [Google Scholar] [CrossRef] - Ehsan, N.; Liu, M. Server allocation with delayed state observation: Sufficient conditions for the optimality of an index policy. IEEE Trans. Wirel. Comm.
**2009**, 8, 1693–1705. [Google Scholar] [CrossRef][Green Version] - Ahmad, S.H.A.; Liu, M.Y.; Javidi, T.; Zhao, Q. Optimality of myopic sensing in multichannel opportunistic access. IEEE Trans. Inform. Theory
**2009**, 55, 4040–4050. [Google Scholar] [CrossRef][Green Version] - Liu, K.; Zhao, Q.; Krishnamachari, B. Dynamic multichannel access with imperfect channel state detection. IEEE Trans. Signal Process.
**2010**, 58, 2795–2808. [Google Scholar] [CrossRef] - Wang, K.H.; Liu, Q.; Chen, L. Optimality of greedy policy for a class of standard reward function of restless multi-armed bandit problem. IET Signal Process.
**2012**, 6, 584–593. [Google Scholar] [CrossRef][Green Version] - Wang, K.H.; Chen, L. On optimality of myopic policy for restless multi-armed bandit problem: An axiomatic approach. IEEE Trans. Signal Proc.
**2012**, 60, 300–309. [Google Scholar] [CrossRef][Green Version] - Wang, K.H.; Chen, L.; Liu, Q. On optimality of myopic policy for opportunistic access with nonidentical channels and imperfect sensing. IEEE Trans. Veh. Tech.
**2014**, 63, 2478–2483. [Google Scholar] [CrossRef] - Wang, K.H.; Chen, L.; Yu, J.H.; Zhang, D.Z. Optimality of myopic policy for multistate channel access. IEEE Comm. Lett.
**2016**, 20, 300–303. [Google Scholar] [CrossRef] - Wang, K.H.; Yu, J.H.; Chen, L.; Zhou, P.; Win, M.Z. Optimal myopic policy for restless bandit: A perspective of eigendecomposition. IEEE J. Sel. Top. Signal Process.
**2022**, 16, 420–433. [Google Scholar] [CrossRef] - Wang, K.H.; Chen, L. Restless Multi-Armed Bandit in Opportunistic Scheduling; Springer: Cham, Switzerland, 2021. [Google Scholar]
- Ouyang, W.Z.; Teneketzis, D. On the optimality of myopic sensing in multi-state channels. IEEE Trans. Inform. Theory
**2014**, 60, 681–696. [Google Scholar] [CrossRef][Green Version] - Blasco, P.; Gündüz, D. Multi-access communications with energy harvesting: A multi-armed bandit model and the optimality of the myopic policy. IEEE J. Sel. Areas Commun.
**2015**, 33, 585–597. [Google Scholar] [CrossRef][Green Version] - Kadota, I.; Sinha, A.; Uysal-Biyikoglu, E.; Singh, R.; Modiano, E. Scheduling policies for minimizing age of information in broadcast wireless networks. IEEE/ACM Trans. Netw.
**2018**, 26, 2637–2650. [Google Scholar] [CrossRef][Green Version] - Weber, R.R.; Weiss, G. On an index policy for restless bandits. J. Appl. Probab.
**1990**, 27, 637–648. [Google Scholar] [CrossRef] - Weber, R.R.; Weiss, G. Addendum to: “On an index policy for restless bandits”. Adv. Appl. Probab.
**1991**, 23, 429–430. [Google Scholar] [CrossRef][Green Version] - Bagheri, S.; Scaglione, A. The restless multi-armed bandit formulation of the cognitive compressive sensing problem. IEEE Trans. Signal Process.
**2015**, 63, 1183–1198. [Google Scholar] [CrossRef] - Larrañaga, M.; Ayesta, U.; Verloop, I.M. Asymptotically optimal index policies for an abandonment queue with convex holding cost. Queueing Syst.
**2015**, 81, 99–169. [Google Scholar] [CrossRef] - Ouyang, W.Z.; Eryilmaz, A.; Shroff, N.B. Downlink scheduling over Markovian fading channels. IEEE/ACM Trans. Netw.
**2016**, 24, 1801–1812. [Google Scholar] [CrossRef][Green Version] - Verloop, I.M. Asymptotically optimal priority policies for indexable and nonindexable restless bandits. Ann. Appl. Probab.
**2016**, 26, 1947–1995. [Google Scholar] [CrossRef] - Fu, J.; Moran, B.; Guo, J.; Wong, E.W.M.; Zukerman, M. Asymptotically optimal job assignment for energy-efficient processor-sharing server farms. IEEE J. Sel. Areas Commun.
**2016**, 34, 4008–4023. [Google Scholar] [CrossRef] - Fu, J.; Moran, B. Energy-efficient job-assignment policy with asymptotically guaranteed performance deviation. IEEE/ACM Trans. Netw.
**2020**, 28, 1325–1338. [Google Scholar] [CrossRef][Green Version] - Hu, W.; Frazier, P.I. An asymptotically optimal index policy for finite-horizon restless bandits. arXiv
**2017**, arXiv:1707.00205. [Google Scholar] - Zayas-Cabán, G.; Jasin, S.; Wang, G. An asymptotically optimal heuristic for general nonstationary finite-horizon restless multi-armed, multi-action bandits. Adv. Appl. Probab.
**2019**, 51, 745–772. [Google Scholar] [CrossRef] - Maatouk, A.; Kriouile, S.; Assaad, M.; Ephremides, A. On the optimality of the Whittle’s index policy for minimizing the age of information. IEEE Trans. Wirel. Comm.
**2021**, 20, 1263–12770. [Google Scholar] [CrossRef] - Kriouile, S.; Assaad, M.; Maatouk, A. On the global optimality of Whittle’s index policy for minimizing the age of information. IEEE Trans. Inf. Theory
**2022**, 68, 572–600. [Google Scholar] [CrossRef] - Brown, D.B.; Smith, J.E. Index policies and performance bounds for dynamic selection problems. Manag. Sci.
**2020**, 66, 3029–3050. [Google Scholar] [CrossRef] - Zhang, X.Y.; Frazier, P.I. Restless bandits with many arms: Beating the Central Limit Theorem. arXiv
**2021**, arXiv:2107.11911. [Google Scholar] - Gast, N.; Gaujal, B.; Yan, Y. LP-based policies for restless bandits: Necessary and sufficient conditions for (exponentially fast) asymptotic optimality. arXiv
**2022**, arXiv:2106.10067. [Google Scholar] - Nash, P. Optimal Allocation of Resources between Research Projects. Ph.D. Thesis, University of Cambridge, Cambridge, UK, 1973. [Google Scholar]
- Brown, D.B.; Smith, J.E. Optimal sequential exploration: Bandits, clairvoyants, and wildcats. Oper. Res.
**2013**, 61, 644–665. [Google Scholar] [CrossRef][Green Version] - Hadfield-Menell, D.; Russell, S. Multitasking: Efficient optimal planning for bandit superprocesses. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence (UAI), Amsterdam, The Netherlands, 12–16 July 2015; Meila, M., Heskes, T., Eds.; AUAI Press: Corvallis, OR, USA, 2015; pp. 345–354. [Google Scholar]
- Weber, R. Comments on: “Dynamic priority allocation via restless bandit marginal productivity indices” [TOP
**15**(2007),161–198] by J. Niño-Mora. TOP**2007**, 15, 211–216. [Google Scholar] [CrossRef] - Niño-Mora, J. An index policy for multiarmed multimode restless bandits. In Proceedings of the 3rd International Conference on Performance Evaluation Methodologies and Tools (ValueTools), Athens, Greece, 20–24 October 2008; Baras, J., Courcoubetis, C., Eds.; ICST: Brussels, Belgium, 2008. ACM International Conference Proceedings Series. [Google Scholar] [CrossRef][Green Version]
- Glazebrook, K.D.; Hodge, D.J.; Kirkbride, C. General notions of indexability for queueing control and asset management. Ann. Appl. Probab.
**2011**, 21, 876–907. [Google Scholar] [CrossRef][Green Version] - Niño-Mora, J. Index-based dynamic energy management in a multimode sensor network. In Proceedings of the 6th International Conference on Network Games, Control and Optimization (NetGCooP), Avignon, France, 28–30 November 2012; pp. 92–95. Available online: https://ieeexplore.ieee.org/document/6486131 (accessed on 1 January 2023).
- Niño-Mora, J. Multi-gear bandits, partial conservation laws, and indexability. Mathematics
**2022**, 10, 2497. [Google Scholar] [CrossRef] - Killian, J.A.; Perrault, A.; Tambe, M. Beyond “to act or not to act”: Fast Lagrangian approaches to general multi-action restless bandits. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), Online, 3–7 May 2021; Endriss, U., Nowé, A., Dignum, F., Lomuscio, A., Eds.; International Foundation for Autonomous Agents and Multiagent Systems: Richland, SC, USA, 2021; pp. 710–718. [Google Scholar]
- Killian, J.A.; Biswas, A.; Shah, S.; Tambe, M. Q-learning Lagrange policies for multi-action restless bandits. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, Singapore, 14–18 August 2021; pp. 871–881. [Google Scholar]
- Xiong, G.J.; Li, J.; Singh, R. Reinforcement learning augmented asymptotically optimal index policy for finite-horizon restless bandits. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, Seoul, Republic of Korea, 22 February–1 March 2022; pp. 8726–8734. [Google Scholar]
- Xiong, G.J.; Wang, S.; Li, J. Learning infinite-horizon average-reward restless multi-action bandits via index awareness. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; Oh, A.H., Agarwal, A., Belgrave, D., Cho, K., Eds.; Neural Information Processing Systems (NIPS): La Jolla, CA, USA, 2022. Advances in Neural Information Processing Systems. [Google Scholar]
- Caro, F.; Gallien, J. Dynamic assortment with demand learning for seasonal consumer goods. Manag. Sci.
**2007**, 53, 276–292. [Google Scholar] [CrossRef][Green Version] - Brown, D.B.; Zhang, J.W. Dynamic programs with shared resources and signals: Dynamic fluid policies and asymptotic optimality. Oper. Res.
**2022**, 70, 3015–3033. [Google Scholar] [CrossRef] - Hao, L.L.; Xu, Y.J.; Tong, L. Asymptotically optimal Lagrangian priority policy for deadline scheduling with processing rate limits. IEEE Trans. Automat. Control
**2022**, 67, 236–250. [Google Scholar] [CrossRef] - Watkins, C.; Dayan, P. Q-learning. Mach. Learn.
**1992**, 8, 279–292. [Google Scholar] [CrossRef] - Powell, W.B. Reinforcement Learning and Stochastic Optimization: A Unified Framework for Sequential Decisions; Wiley: Hoboken, NJ, USA, 2022. [Google Scholar]
- Fu, J.; Nazarathy, Y.; Moka, S.; Taylor, P.G. Towards Q-learning the Whittle index for restless bandits. In Proceedings of the Australian & New Zealand Control Conference (ANZCC), Auckland, New Zealand, 27–29 November 2019; pp. 249–254. [Google Scholar] [CrossRef]
- Wu, S.; Zhao, J.; Tian, G.; Wang, J. State-aware value function approximation with attention mechanism for restless multi-armed bandits. In Proceedings of the 30th International Joint Conference on Artificial Intelligence (IJCAI), Online, 19–26 August 2021; Zhou, Z.H., Ed.; IJCAI: Cape Town, South Africa, 2021; pp. 458–464. [Google Scholar] [CrossRef]
- Li, M.S.; Gao, J.; Zhao, L.; Shen, X.M. Adaptive computing scheduling for edge-assisted autonomous driving. IEEE Trans. Veh. Tech.
**2021**, 70, 5318–5331. [Google Scholar] [CrossRef] - Biswas, A.; Aggarwal, G.; Varakantham, P.; Tambe, M. Learn to intervene: An adaptive learning policy for restless bandits in application to preventive healthcare. In Proceedings of the 30th International Joint Conference on Artificial Intelligence (IJCAI), Online, 19–26 August 2021; Zhou, Z.H., Ed.; pp. 4039–4046. [Google Scholar] [CrossRef]
- Avrachenkov, K.E.; Borkar, V.S. Whittle index based Q-learning for restless bandits with average reward. Automatica
**2022**, 139, 110186. [Google Scholar] [CrossRef] - Nakhleh, K.; Ganji, S.; Hsieh, P.C.; Hou, I.H.; Shakkottai, S. NeurWIN: Neural Whittle index network for restless bandits via deep RL. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS), Online, 6–14 December 2021; Advances in Neural Information Processing Systems. Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S., Wortman Vaughan, J., Eds.; Neural Information Processing Systems (NIPS): La Jolla, CA, USA, 2021; Volume 34. [Google Scholar]
- Nakhleh, K.; Hou, I.H. DeepTOP: Deep threshold-optimal policy for MDPs and RMABs. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; Advances in Neural Information Processing Systems. Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Neural Information Processing Systems (NIPS): La Jolla, CA, USA, 2022; Volume 35. [Google Scholar]
- Lai, T.L.; Robbins, H. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math.
**1985**, 6, 4–22. [Google Scholar] [CrossRef][Green Version] - Anantharam, V.; Varaiya, P.; Walrand, J. Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays. II. Markovian rewards. IEEE Trans. Automat. Control
**1987**, 32, 977–982. [Google Scholar] [CrossRef] - Zhao, Q. Multi-Armed Bandits: Theory and Applications to Online Learning in Networks; Morgan & Claypool: San Rafael, CA, USA, 2020. [Google Scholar]
- Filippi, S.; Cappé, O.; Garivier, A. Optimally sensing a single channel without prior information: The tiling algorithm and regret bounds. IEEE J. Sel. Top. Signal Process.
**2011**, 5, 68–76. [Google Scholar] [CrossRef][Green Version] - Tekin, C.; Liu, M.Y. Online learning of rested and restless bandits. IEEE Trans. Inf. Theory
**2012**, 58, 5588–5611. [Google Scholar] [CrossRef][Green Version] - Ortner, R.; Ryabko, D.; Auer, P.; Munos, R. Regret bounds for restless Markov bandits. Theoret. Comput. Sci.
**2014**, 558, 62–76. [Google Scholar] [CrossRef] - Garivier, A.; Moulines, E. On upper-confidence bound policies for switching bandit problems. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory (ALT), Espoo, Finland, 5–7 October 2011; Lecture Notes in Artificial Intelligence. Kivinen, J., Szepesvári, C., Ukkonen, E., Zeugmann, T., Eds.; Springer: Berlin, Germany, 2011; Volume 6925, pp. 174–188. [Google Scholar]
- Gupta, N.; Granmo, O.C.; Agrawala, A. Thompson sampling for dynamic multi-armed bandits. In Proceedings of the 10th International Conference on Machine Learning and Applications and Workshops, Honolulu, HI, USA, 18–21 December 2011; pp. 484–489. [Google Scholar] [CrossRef]
- Dai, W.H.R.; Gai, Y.; Krishnamachari, B.; Zhao, Q. The non-Bayesian restless multi-armed bandit: A case of near-logarithmic regret. In Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; pp. 2940–2943. [Google Scholar]
- Liu, H.Y.; Liu, K.Q.; Zhao, Q. Learning in a changing world: Restless multiarmed bandit with unknown dynamics. IEEE Trans. Inf. Theory
**2013**, 59, 1902–1916. [Google Scholar] [CrossRef][Green Version] - Modi, N.; Mary, P.; Moy, C. QoS driven channel selection algorithm for cognitive radio network: Multi-user multi-armed bandit approach. IEEE Trans. Cogn. Commun. Netw.
**2017**, 3, 49–66. [Google Scholar] [CrossRef][Green Version] - Grünewälder, S.; Khaleghi, A. Approximations of the restless bandit problem. J. Mach. Learn. Res.
**2019**, 20, 14. [Google Scholar] - Agrawal, H.; Asawa, K. Decentralized learning for opportunistic spectrum access: Multiuser restless multiarmed bandit formulation. IEEE Syst. J.
**2020**, 14, 2485–2496. [Google Scholar] [CrossRef] - Gafni, T.; Cohen, K. Learning in restless multiarmed bandits via adaptive arm sequencing rules. IEEE Trans. Automat. Control
**2021**, 66, 5029–5036. [Google Scholar] [CrossRef] - Xu, J.Y.; Chen, L.J.; Tang, O. An online algorithm for the risk-aware restless bandit. Eur. J. Oper. Res.
**2021**, 290, 622–639. [Google Scholar] [CrossRef] - Gafni, T.; Yemini, M.; Cohen, K. Learning in restless bandits under exogenous global Markov process. IEEE Trans. Signal Process.
**2022**, 70, 5679–5693. [Google Scholar] [CrossRef] - Gafni, T.; Cohen, K. Distributed learning over Markovian fading channels for stable spectrum access. IEEE Access
**2022**, 10, 46652–46669. [Google Scholar] [CrossRef] - Banks, J.S.; Sundaram, R.K. Switching costs and the Gittins index. Econometrica
**1994**, 62, 687–694. [Google Scholar] [CrossRef] - Asawa, M.; Teneketzis, D. Multi-armed bandits with switching penalties. IEEE Trans. Automat. Control
**1996**, 41, 328–348. [Google Scholar] [CrossRef] - Niño-Mora, J. Computing an index policy for bandits with switching penalties. In Proceedings of the 1st International ICST Workshop on Tools for solving Structured Markov Chains (SMCTools), Nantes, France, 26 October 2007; Buchholz, P., Dayar, T., Eds.; ICST: Brussels, Belgium, 2007. ACM International Conference Proceedings Series. [Google Scholar] [CrossRef]
- Niño-Mora, J. A faster index algorithm and a computational study for bandits with switching costs. INFORMS J. Comput.
**2008**, 20, 255–269. [Google Scholar] [CrossRef] - Niño-Mora, J. Fast two-stage computation of an index policy for multi-armed bandits with setup delays. Mathematics
**2021**, 9, 52. [Google Scholar] [CrossRef] - Niño-Mora, J. Marginal productivity index policies for scheduling restless bandits with switching penalties. In Algorithms for Optimization with Incomplete Information; Dagstuhl Seminar Proceedings, 16–21 January 2005; Albers, S., Möhring, R.H., Pflug, G.C., Schultz, R., Eds.; Schloss Dagstuhl—Leibniz-Zentrum für Informatik: Dagstuhl, Germany, 2005; Volume 05031. [Google Scholar] [CrossRef]
- Le Ny, J.; Feron, E. Restless bandits with switching costs: Linear programming relaxations, performance bounds and limited lookahead policies. In Proceedings of the 25th American Control Conference (ACC), Minneapolis, MN, USA, 14–16 June 2006; pp. 1587–1592. [Google Scholar]
- Arlotto, A.; Chick, S.E.; Gans, N. Optimal hiring and retention policies for heterogeneous workers who learn. Manag. Sci.
**2014**, 60, 110–129. [Google Scholar] [CrossRef][Green Version] - Niño-Mora, J. A marginal productivity index policy for the finite-horizon multiarmed bandit problem. In Proceedings of the Joint 44th IEEE Conference on Decision and Control, and European Control Conference (CDC-ECC), Seville, Spain, 12–15 December 2005; pp. 1718–1722. [Google Scholar]
- Niño-Mora, J. Computing a classic index for finite-horizon bandits. INFORMS J. Comput.
**2011**, 23, 173–330. [Google Scholar] [CrossRef][Green Version] - Dayanik, S.; Powell, W.; Yamazaki, K. Index policies for discounted bandit problems with availability constraints. Adv. Appl. Probab.
**2008**, 40, 377–400. [Google Scholar] [CrossRef][Green Version] - Ansell, P.S.; Glazebrook, K.D.; Niño Mora, J.; O’Keeffe, M. Whittle’s index policy for a multi-class queueing system with convex holding costs. Math. Meth. Oper. Res.
**2003**, 57, 21–39. [Google Scholar] - Niño-Mora, J. Marginal productivity index policies for admission control and routing to parallel multi-server loss queues with reneging. In Proceedings of the 1st EuroFGI Conference on Network Control and Optimization (NETCOOP), Avignon, France, 5–7 June 2007; Lecture Notes in Computer Science. Chahed, T., Tuffin, B., Eds.; Springer: Berlin/Heidelberg, Germany, 2007; Volume 4465, pp. 138–149. [Google Scholar]
- Niño-Mora, J. Admission and routing of soft real-time jobs to multiclusters: Design and comparison of index policies. Comput. Oper. Res.
**2012**, 39, 3431–3444. [Google Scholar] [CrossRef] - Niño-Mora, J. Towards minimum loss job routing to parallel heterogeneous multiserver queues via index policies. Eur. J. Oper. Res.
**2012**, 220, 705–715. [Google Scholar] [CrossRef] - Niño-Mora, J. Resource allocation and routing in parallel multi-server queues with abandonments for cloud profit maximization. Comput. Oper. Res.
**2019**, 103, 221–236. [Google Scholar] [CrossRef] - Raissi-Dehkordi, M.; Baras, J.S. Broadcast scheduling in information delivery systems. In Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM), Taipei, Taiwan, 17–21 November 2002; pp. 2935–2939. [Google Scholar]
- Dusonchet, F.; Hongler, M.O. Continuous-time restless bandit and dynamic scheduling for make-to-stock production. IEEE Trans. Robot. Automat.
**2003**, 19, 977–990. [Google Scholar] [CrossRef] - Goyal, M.; Kumar, A.; Sharma, V. A stochastic control approach for scheduling multimedia transmissions over a polled multiaccess fading channel. Wirel. Netw.
**2006**, 12, 605–621. [Google Scholar] [CrossRef] - Niño-Mora, J. Marginal productivity index policies for scheduling a multiclass delay-/loss-sensitive queue. Queueing Syst.
**2006**, 54, 281–312. [Google Scholar] [CrossRef][Green Version] - Cao, J.H.; Nyberg, C. Linear programming relaxations and marginal productivity index policies for the buffer sharing problem. Queueing Syst.
**2008**, 60, 247–269. [Google Scholar] [CrossRef] - Borkar, V.S.; Pattathil, S. Whittle indexability in egalitarian processor sharing systems. Ann. Oper. Res.
**2022**, 317, 417–437. [Google Scholar] [CrossRef][Green Version] - O’Meara, T.; Patel, A. A topic-specific web robot model based on restless bandits. IEEE Internet Comput.
**2001**, 5, 27–35. [Google Scholar] [CrossRef] - Niño-Mora, J. A dynamic page-refresh index policy for web crawlers. In Proceedings of the 21st International Conference on Analytical and Stochastic Modelling Techniques and Applications (ASMTA), Budapest, Hungary, 30 June–2 July 2014; Lecture Notes in Computer Science. Sericola, B., Telek, M., Horváth, G., Eds.; Springer: Berlin/Heidelberg, Germany, 2014; Volume 8499, pp. 46–60. [Google Scholar]
- Avrachenkov, K.E.; Borkar, V.S. Whittle index policy for crawling ephemeral content. IEEE Trans. Control Netw. Syst.
**2018**, 5, 446–455. [Google Scholar] [CrossRef][Green Version] - Deo, S.; Iravani, S.; Jiang, T.T.; Smilowitz, K.; Samuelson, S. Improving health outcomes through better capacity allocation in a community-based chronic care model. Oper. Res.
**2013**, 61, 1277–1294. [Google Scholar] [CrossRef][Green Version] - Ayer, T.; Zhang, C.; Bonifonte, A.; Spaulding, A.C.; Chhatwal, J. Prioritizing hepatitis C treatment in US prisons. Oper. Res.
**2019**, 67, 853–873. [Google Scholar] [CrossRef][Green Version] - Mate, A.; Perrault, A.; Tambe, M. Risk-aware interventions in public health: Planning with restless multi-armed bandits. In Proceedings of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), Online, 3–7 May 2021; Endriss, U., Nowé, A., Dignum, F., Lomuscio, A., Eds.; IFAAMAS: Richland, SC, USA, 2021; pp. 880–888. [Google Scholar]
- Mate, A.; Madaan, L.; Taneja, A.; Madhiwalla, N.; Verma, S.; Singh, G.; Hegde, A.; Varakantham, P.; Tambe, M. A field study in deploying restless multi-armed bandits: Assisting non-profits in improving maternal and child health. In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI), Online, 22 February–1 March 2022; pp. 12017–12025. [Google Scholar]
- Wei, Y.; Yu, F.R.; Song, M. Distributed optimal relay selection in wireless cooperative networks with finite-state Markov channels. IEEE Trans. Veh. Tech.
**2010**, 59, 2149–2158. [Google Scholar] - Wei, X.H.; Neely, M.J. Power-aware wireless file downloading: A Lyapunov indexing approach to a constrained restless bandit problem. IEEE/ACM Trans. Netw.
**2016**, 24, 2264–2277. [Google Scholar] [CrossRef] - Aalto, S.; Lassila, P.; Osti, P. Whittle index approach to size-aware scheduling for time-varying channels with multiple states. Queueing Syst.
**2016**, 83, 195–225. [Google Scholar] [CrossRef] - Borkar, V.S.; Kasbekar, G.S.; Pattathil, S.; Shetty, P.Y. Opportunistic scheduling as restless bandits. IEEE Trans. Control Netw. Syst.
**2018**, 5, 1952–1961. [Google Scholar] [CrossRef][Green Version] - Sun, Y.; Feng, G.; Qin, S.; Sun, S.S. Cell association with user behavior awareness in heterogeneous cellular networks. IEEE Trans. Veh. Tech.
**2018**, 67, 4589–4601. [Google Scholar] [CrossRef][Green Version] - Aalto, S.; Lassila, P.; Taboada, I. Whittle index approach to opportunistic scheduling with partial channel information. Perform. Eval.
**2019**, 136, 102052. [Google Scholar] [CrossRef] - Wang, K.H.; Yu, J.H.; Chen, L.; Zhou, P.; Ge, X.H.; Win, M.Z. Opportunistic scheduling revisited using restless bandits: Indexability and index policy. IEEE Trans. Wirel. Comm.
**2019**, 18, 4997–5010. [Google Scholar] [CrossRef] - Sun, J.Z.; Jiang, Z.Y.; Krishnamachari, B.; Zhou, S.; Niu, Z.S. Closed-form Whittle’s index-enabled random access for timely status update. IEEE Trans. Comm.
**2020**, 68, 1538–1551. [Google Scholar] [CrossRef] - Chen, G.P.; Liew, S.C.; Shao, Y.L. Uncertainty-of-information scheduling: A restless multiarmed bandit framework. IEEE Trans. Inform. Theory
**2022**, 68, 6151–6173. [Google Scholar] [CrossRef] - Singh, S.K.; Borkar, V.S.; Kasbekar, G.S. User association in dense mmWave networks as restless bandits. IEEE Trans. Veh. Tech.
**2022**, 71, 7919–7929. [Google Scholar] [CrossRef] - Huberman, B.A.; Wu, F. The economics of attention: Maximizing user value in information-rich environments. Adv. Complex Syst.
**2008**, 11, 487–496. [Google Scholar] [CrossRef][Green Version] - Glazebrook, K.D.; Niño-Mora, J.; Ansell, P.S. Index policies for a class of discounted restless bandits. Adv. Appl. Probab.
**2002**, 34, 754–774. [Google Scholar] [CrossRef][Green Version] - Kumar, U.D.; Saranga, H. Optimal selection of obsolescence mitigation strategies using a restless bandit model. Eur. J. Oper. Res.
**2010**, 200, 170–180. [Google Scholar] [CrossRef] - Temple, T.; Frazzoli, E. Whittle-indexability of the cow path problem. In Proceedings of the American Control Conference (ACC), Baltimore, MD, USA, 30 June–2 July 2010; pp. 4152–4158. [Google Scholar]
- He, T.; Chen, S.Y.; Kim, H.; Tong, L.; Lee, K.W. Scheduling parallel tasks onto opportunistically available cloud resources. In Proceedings of the IEEE 5th International Conference on Cloud Computing (CLOUD), Honolulu, HI, USA, 24–29 June 2012; pp. 180–187. [Google Scholar]
- Taylor, J.A.; Mathieu, J.L. Index policies for demand response. IEEE Trans. Power Syst.
**2014**, 29, 1287–1295. [Google Scholar] [CrossRef] - Sun, J.; Ma, H. Heterogeneous-belief based incentive schemes for crowd sensing in mobile social networks. J. Netw. Comput. Appl.
**2014**, 42, 189–196. [Google Scholar] [CrossRef] - Lin, S.; Zhang, J.J.; Hauser, J.R. Learning from experience, simply. Mark. Sci.
**2015**, 34, 1–19. [Google Scholar] [CrossRef][Green Version] - Guo, X.Y.; Singh, R.; Kumar, P.R.; Niu, Z.S. A risk-sensitive approach for packet inter-delivery time optimization in networked cyber-physical systems. IEEE/ACM Trans. Netw.
**2018**, 26, 1976–1989. [Google Scholar] [CrossRef] - Yu, Z.; Xu, Y.J.; Tong, L. Deadline scheduling as restless bandits. IEEE Trans. Automat. Control
**2018**, 63, 2343–2358. [Google Scholar] [CrossRef][Green Version] - Avrachenkov, K.E.; Borkar, V.S.; Pattathil, S. Controlling G-AIMD by index policy. In Proceedings of the 56th IEEE Conference on Decision and Control (CDC), Melbourne, Australia, 12–15 December 2017; pp. 120–125. [Google Scholar]
- Borkar, V.S.; Ravikumar, K.; Saboo, K. An index policy for dynamic pricing in cloud computing under price commitments. Appl. Math.
**2017**, 44, 215–245. [Google Scholar] [CrossRef] - Menner, M.; Zeilinger, M.N. A user comfort model and index policy for personalizing discrete controller decisions. In Proceedings of the 16th European Control Conference (ECC), Limassol, Cyprus, 12–15 June 2018; pp. 1759–1765. [Google Scholar]
- Jhunjhunwala, P.R.; Moharir, S.; Manjunath, D.; Gopalan, A. On a class of restless multi-armed bandits with deterministic policies. In Proceedings of the International Conference on Signal Processing and Communications (SPCOM), Bangalore, India, 16–19 July 2018; pp. 487–491. [Google Scholar] [CrossRef]
- Abbou, A.; Makis, V. Group maintenance: A restless bandits approach. INFORMS J. Comput.
**2019**, 31, 719–731. [Google Scholar] [CrossRef] - Gerum, P.C.L.; Altay, A.; Baykal-Gursoy, M. Data-driven predictive maintenance scheduling policies for railways. Transport. Res. Part C Emerg. Technol.
**2019**, 107, 137–154. [Google Scholar] [CrossRef] - Li, D.; Ding, L.; Connor, S. When to switch? Index policies for resource scheduling in emergency response. Prod. Oper. Manag.
**2020**, 29, 241–262. [Google Scholar] [CrossRef] - Fu, J.; Moran, B.; Taylor, P.G. A restless bandit model for resource allocation, competition, and reservation. Oper. Res.
**2022**, 70, 416–431. [Google Scholar] [CrossRef] - Dahiya, A.; Akbarzadeh, N.; Mahajan, A.; Smith, S.L. Scalable operator allocation for multi-robot assistance: A restless bandit approach. IEEE Trans. Control Netw. Syst.
**2022**, 9, 1397–1408. [Google Scholar] [CrossRef] - Ou, H.C.; Siebenbrunner, C.; Killian, J.; Brooks, M.B.; Kempe, D.; Vorobeychik, Y.; Tambe, M. Networked restless multi-armed bandits for mobile interventions. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems (AAMAS), Online, 9–13 May 2022; Faliszewski, P., Mascardi, V., Pelachaud, C., Taylor, M.E., Eds.; International Foundation for Autonomous Agents and Multiagent Systems: London, UK, 2022. [Google Scholar]
- Krishnamurthy, V. Partially Observed Markov Decision Processes: From Filtering to Controlled Sensing; Cambridge University Press: Cambridge, UK, 2016. [Google Scholar]
- La Scala, B.F.; Moran, B. Optimal target tracking with restless bandits. Digit. Signal Process.
**2006**, 16, 479–487. [Google Scholar] [CrossRef] - Le Ny, J.; Feron, E.; Dahleh, M. Scheduling continuous-time Kalman filters. IEEE Trans. Automat. Control
**2011**, 56, 1381–1394. [Google Scholar] [CrossRef] - Akbarzadeh, N.; Mahajan, A. Partially observable restless bandits with restarts: Indexability and computation of Whittle index. In Proceedings of the 61st IEEE Conference on Decision and Control (CDC), Cancún, Mexico, 6–9 December 2022; pp. 4898–4904. [Google Scholar]
- Gan, X.; Chen, B. A novel sensing scheme for dynamic multichannel access. IEEE Trans. Veh. Tech.
**2012**, 61, 208–221. [Google Scholar] [CrossRef] - He, T.; Anandkumar, A.; Agrawal, D. Index-based sampling policies for tracking dynamic networks under sampling constraints. In Proceedings of the 30th IEEE International Conference on Computer Communications (INFOCOM), Shanghai, China, 10–15 April 2011; pp. 1233–1241. [Google Scholar]
- Meshram, R.; Manjunath, D.; Gopalan, A. A restless bandit with no observable states for recommendation systems and communication link scheduling. In Proceedings of the 54th IEEE Conference on Decision and Control (CDC), Osaka, Japan, 15–18 December 2015; pp. 7820–7825. [Google Scholar]
- Ouyang, W.Z.; Murugesan, S.; Eryilmaz, A.; Shroff, N.B. Exploiting channel memory for joint estimation and scheduling in downlink networks—A Whittle’s indexability analysis. IEEE Trans. Inform. Theory
**2015**, 61, 1702–1719. [Google Scholar] [CrossRef] - Taboada, I.; Liberal, F.; Fajardo, J.O.; Blanco, B. An index rule proposal for scheduling in mobile broadband networks with limited channel feedback. Perform. Eval.
**2017**, 117, 130–142. [Google Scholar] [CrossRef] - Meshram, R.; Manjunath, D.; Gopalan, A. On the Whittle index for restless multiarmed hidden Markov bandits. IEEE Trans. Automat. Control
**2018**, 63, 3046–3053. [Google Scholar] [CrossRef][Green Version] - Elmaghraby, H.M.; Liu, K.Q.; Ding, Z. Femtocell scheduling as a restless multiarmed bandit problem using partial channel state observation. In Proceedings of the IEEE International Conference on Communications (ICC), Kansas City, MO, USA, 20–24 May 2018. [Google Scholar]
- Mehta, V.; Meshram, R.; Kaza, K.; Merchant, S.N.; Desai, U.B. Rested and restless bandits with constrained arms and hidden states: Applications in social networks and 5G networks. IEEE Access
**2018**, 6, 56782–56799. [Google Scholar] [CrossRef] - Kaza, K.; Meshram, R.; Mehta, V.; Merchant, S.N. Sequential decision making with limited observation capability: Application to wireless networks. IEEE Trans. Cogn. Commun. Netw.
**2019**, 5, 237–251. [Google Scholar] [CrossRef][Green Version] - Yang, F.; Luo, X. A restless MAB-based index policy for UL pilot allocation in massive MIMO over Gauss–Markov fading channels. IEEE Trans. Veh. Tech.
**2020**, 69, 3034–3047. [Google Scholar] [CrossRef] - Hsu, Y.P.; Modiano, E.; Duan, L.J. Scheduling algorithms for minimizing age of information in wireless broadcast networks with random arrivals. IEEE Trans. Mob. Comput.
**2020**, 19, 2903–2915. [Google Scholar] [CrossRef] - Wang, J.Z.; Ren, X.Q.; Mo, Y.L.; Shi, L. Whittle index policy for dynamic multichannel allocation in remote state estimation. IEEE Trans. Automat. Control
**2020**, 65, 591–603. [Google Scholar] [CrossRef] - Chen, Y.; Ephremides, A. Scheduling to minimize age of incorrect information with imperfect channel state information. Entropy
**2021**, 23, 1572. [Google Scholar] [CrossRef] [PubMed] - Kang, S.; Joo, C. Index-based update policy for minimizing information mismatch with Markovian sources. J. Commun. Netw.
**2021**, 23, 488–498. [Google Scholar] [CrossRef] - Li, D.; Varakantham, P. Efficient resource allocation with fairness constraints in restless multi-armed bandits. In Proceedings of the 38th Conference on Uncertainty in Artificial Intelligence (UAI), Eindhoven, The Netherlands, 1–5 August 2022; Cussens, J., Zhang, K., Eds.; PMLR: Birmingham, UK, 2022; pp. 1158–1167. [Google Scholar]
- Tong, J.W.; Fu, L.Q.; Han, Z. Age-of-information oriented scheduling for multichannel IoT systems with correlated sources. IEEE Trans. Wirel. Comm.
**2022**, 21, 9775–9790. [Google Scholar] [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Niño-Mora, J. Markovian Restless Bandits and Index Policies: A Review. *Mathematics* **2023**, *11*, 1639.
https://doi.org/10.3390/math11071639
