# Non Stationary Multi-Armed Bandit: Empirical Evaluation of a New Concept Drift-Aware Algorithm

## Abstract

## 1. Introduction

## 2. Background

#### 2.1. Multi-Armed Bandit

- ϵ-greedy. Given a small value $\epsilon$, the agent selects a random arm with probability $\epsilon$ and the best arm $a$, where $a \leftarrow \operatorname{argmax}_{k\in\mathcal{K}}\left\{\tilde{\mu}(t)_{k}\right\}$, with probability $1-\epsilon$.
- Upper Confidence Bound. Given a parameter $\alpha$ which controls the exploration-exploitation trade-off, the agent selects the best arm $a$, where $a \leftarrow \operatorname{argmax}_{k\in\mathcal{K}}\left\{\tilde{\mu}(t)_{k} + \alpha \cdot S\left(\tilde{\sigma}(t)_{k}\right)\right\}$, and $S\left(\tilde{\sigma}(t)_{k}\right)$ is proportional to the plug-in standard deviation for the reward distribution of arm $k$ [17,18].
- Thompson Sampling. The agent selects the best arm $a$, where $a \leftarrow \operatorname{argmax}_{k\in\mathcal{K}}\left\{\widehat{\mu}(t)_{k}\right\}$, and $\widehat{\mu}(t)_{k}$ is a sample from the unknown reward distribution with the estimated expected value $\tilde{\mu}(t)_{k}$ [19,20,21].
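The three selection rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the exploration bonus in `ucb` is a UCB1-style term assumed as one possible choice of $S$, and `thompson_sampling` assumes Bernoulli rewards with a uniform Beta(1, 1) prior.

```python
import numpy as np

def epsilon_greedy(mu_hat, epsilon, rng):
    """Pick a random arm with probability epsilon, else the arm with the best estimate."""
    if rng.random() < epsilon:
        return int(rng.integers(len(mu_hat)))
    return int(np.argmax(mu_hat))

def ucb(mu_hat, pulls, t, alpha):
    """Pick the arm maximising estimated mean + alpha * bonus.

    The bonus sqrt(2 ln t / pulls) is an assumed UCB1-style stand-in for S.
    """
    pulls = np.asarray(pulls, dtype=float)
    if np.any(pulls == 0):
        return int(np.argmin(pulls))  # play every arm once before using the bonus
    bonus = np.sqrt(2.0 * np.log(t) / pulls)
    return int(np.argmax(np.asarray(mu_hat, dtype=float) + alpha * bonus))

def thompson_sampling(successes, failures, rng):
    """Beta-Bernoulli TS: draw one sample per arm from its Beta posterior, play the argmax."""
    a = np.asarray(successes, dtype=float) + 1.0  # Beta(1, 1) prior (assumption)
    b = np.asarray(failures, dtype=float) + 1.0
    return int(np.argmax(rng.beta(a, b)))
```

Passing a seeded `np.random.default_rng` as `rng` makes the random choices reproducible.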

#### 2.2. Concept Drift

- Sudden/Abrupt: the switch of concept $p\left(y\right)$ happens abruptly, from time ${t}_{i}$ to the subsequent time ${t}_{i+1}$.
- Incremental: the switch of concept $p\left(y\right)$ happens incrementally (slowly), with many smaller intermediate changes, from time ${t}_{i}$ to time ${t}_{i+n}$.
- Gradual: the switch of concept $p\left(y\right)$ happens gradually, by switching back and forth to the old and new concept, from time ${t}_{i}$ to time ${t}_{i+n}$, before stabilizing to the new one.
- Reoccurring: the switch of concept $p\left(y\right)$ happens abruptly at time ${t}_{i}$, but after time ${t}_{i+n}$ the concept reverts to the old one. It could be cyclic.
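To make the four drift types concrete, the sketch below generates the value of a single class probability over time under each of them. The function name `drift_schedule` and all of its parameters (`start`, `width`, `p_old`, `p_new`) are illustrative assumptions rather than part of the original experimental code; the gradual case is modelled, as one possible reading, by choosing between the old and new concept at random with a linearly increasing probability of the new one.

```python
import numpy as np

def drift_schedule(kind, steps=100, start=40, width=20, p_old=0.2, p_new=0.8, seed=0):
    """Return an array with the value of p(y) at each time step for one drift type."""
    rng = np.random.default_rng(seed)
    t = np.arange(steps)
    p = np.full(steps, p_old, dtype=float)
    if kind == "abrupt":
        # old concept up to `start`, new concept from `start` onwards
        p[t >= start] = p_new
    elif kind == "incremental":
        # many small intermediate changes: linear ramp over `width` steps
        ramp = np.clip((t - start) / width, 0.0, 1.0)
        p = p_old + ramp * (p_new - p_old)
    elif kind == "gradual":
        # switch back and forth between concepts before settling on the new one
        prob_new = np.clip((t - start) / width, 0.0, 1.0)
        p = np.where(rng.random(steps) < prob_new, p_new, p_old)
    elif kind == "reoccurring":
        # abrupt switch at `start`, reverting to the old concept after `width` steps
        p[(t >= start) & (t < start + width)] = p_new
    else:
        raise ValueError(f"unknown drift type: {kind}")
    return p
```

Plotting the four returned arrays against `t` gives curves of the kind described in the list above.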

#### 2.3. Non-Stationary Multi-Armed Bandit

## 3. Methodology

#### 3.1. Problem Definition

#### 3.2. f-Discounted-Sliding-Window Thompson Sampling

**Algorithm 1.** f-Discounted-Sliding-Window TS
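Only the title of the algorithm is available in this section, so the following is a hedged sketch of how an f-dsw TS agent could be put together from the ingredients named elsewhere in the text: a discount factor $\gamma$, a sliding window of size $n$, and an aggregation function $f$ (max, min, or mean). It assumes Bernoulli rewards, a Beta(1, 1) prior, one window per arm, and that $f$ combines one sample from the discounted posterior with one sample from the windowed posterior. The class name `FDswTS` and its methods are hypothetical.

```python
from collections import deque

import numpy as np

class FDswTS:
    """Sketch of an f-discounted-sliding-window Thompson Sampling agent (assumed design)."""

    def __init__(self, n_arms, gamma, n, f=max, seed=None):
        self.gamma = gamma                      # discount factor for the full history
        self.f = f                              # aggregation of the two samples per arm
        self.rng = np.random.default_rng(seed)
        self.succ = np.zeros(n_arms)            # discounted success counts
        self.fail = np.zeros(n_arms)            # discounted failure counts
        self.windows = [deque(maxlen=n) for _ in range(n_arms)]  # last n rewards per arm

    def select_arm(self):
        scores = []
        for k, window in enumerate(self.windows):
            # sample from the posterior built on the discounted history
            hist = self.rng.beta(self.succ[k] + 1.0, self.fail[k] + 1.0)
            # sample from the posterior built on the arm's recent rewards only
            wins = sum(window)
            recent = self.rng.beta(wins + 1.0, len(window) - wins + 1.0)
            scores.append(self.f(hist, recent))
        return int(np.argmax(scores))

    def update(self, arm, reward):
        # discount every arm, then credit the played arm with the binary reward
        self.succ *= self.gamma
        self.fail *= self.gamma
        self.succ[arm] += reward
        self.fail[arm] += 1 - reward
        self.windows[arm].append(reward)
```

With this interface, `f=max` and `f=min` give the optimistic and pessimistic variants, and a mean variant can be obtained with `f=lambda a, b: (a + b) / 2`.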

## 4. Experiments

#### 4.1. Experimental Setup

- Max-dsw TS: f-dsw TS with max as an aggregation function f.
- Min-dsw TS: f-dsw TS with min as an aggregation function f.
- Mean-dsw TS: f-dsw TS with mean as an aggregation function f.
- Discounted TS: the TS enhanced with a discount factor, presented in [36]: the parameter $\gamma $ controls the amount of discount.
- Sliding Window TS: the TS with a global sliding window, presented in [15]. The parameter n controls the size of the sliding window.
- Thompson Sampling: the standard Beta-Bernoulli TS.
- Random: a trivial baseline that selects each arm at random.
- Oracle: an oracle always selecting the best action at time t. It is exploited for regret computation.
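The role of the Oracle and Random baselines in the evaluation can be illustrated with a short sketch. It assumes the environment is described by a matrix `probs[t, k]` holding the reward probability of arm `k` at step `t`, that regret is accumulated on expected rather than realised rewards, and that the relative cumulative reward (RCR) is the ratio between the cumulative reward of the policy and that of the oracle. The function names are hypothetical.

```python
import numpy as np

def random_policy(n_steps, n_arms, rng):
    """Trivial baseline: pick an arm uniformly at random at every step."""
    return rng.integers(n_arms, size=n_steps)

def evaluate(policy_arms, probs):
    """Return (cumulative regret per step, RCR) of a sequence of chosen arms.

    probs[t, k] is the (assumed known) reward probability of arm k at step t.
    """
    probs = np.asarray(probs, dtype=float)
    policy_arms = np.asarray(policy_arms)
    steps = np.arange(len(policy_arms))
    chosen = probs[steps, policy_arms]   # expected reward of the arm actually played
    oracle = probs.max(axis=1)           # expected reward of the best arm at each step
    regret = np.cumsum(oracle - chosen)
    rcr = chosen.sum() / oracle.sum()
    return regret, rcr
```

For example, `evaluate(random_policy(len(probs), probs.shape[1], np.random.default_rng(0)), probs)` scores the random baseline on a given environment.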

#### 4.2. Synthetic Datasets

#### 4.2.1. Random Environments

#### 4.2.2. Custom Environments

#### 4.3. Real Datasets

#### 4.3.1. Baltimore Crime

#### 4.3.2. Insects

#### 4.3.3. Local News

#### 4.3.4. Air Microbes

#### 4.4. Parameters Tuning

## 5. Results

#### 5.1. Random Environments

#### 5.2. Custom Environments

#### 5.3. Real-World Environments

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A

## References

1. Robbins, H. Some aspects of the sequential design of experiments. Bull. Am. Math. Soc. **1952**, 58, 527–535.
2. Berry, D.A.; Fristedt, B. Bandit Problems: Sequential Allocation of Experiments (Monographs on Statistics and Applied Probability); Chapman Hall: London, UK, 1985; Volume 5, p. 7.
3. Kuleshov, V.; Precup, D. Algorithms for multi-armed bandit problems. arXiv **2014**, arXiv:1402.6028.
4. Villar, S.S.; Bowden, J.; Wason, J. Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Stat. Sci. A Rev. J. Inst. Math. Stat. **2015**, 30, 199.
5. Ding, W.; Qin, T.; Zhang, X.D.; Liu, T.Y. Multi-Armed Bandit with Budget Constraint and Variable Costs. In Proceedings of the AAAI Conference on Artificial Intelligence, Bellevue, WA, USA, 14–18 July 2013; Volume 13, pp. 232–238.
6. Schwartz, E.M.; Bradlow, E.T.; Fader, P.S. Customer acquisition via display advertising using multi-armed bandit experiments. Mark. Sci. **2017**, 36, 500–522.
7. Le Ny, J.; Dahleh, M.; Feron, E. Multi-UAV dynamic routing with partial observations using restless bandit allocation indices. In Proceedings of the 2008 American Control Conference, Seattle, WA, USA, 11–13 June 2008; pp. 4220–4225.
8. Scott, S.L. A Modern Bayesian Look at the Multi-Armed Bandit. Appl. Stoch. Model. Bus. Ind. **2010**, 26, 639–658.
9. Chapelle, O.; Li, L. An Empirical Evaluation of Thompson Sampling. In Proceedings of the 24th International Conference on Neural Information Processing Systems (NIPS’11), Granada, Spain, 12–14 December 2011; Curran Associates Inc.: Red Hook, NY, USA, 2011; pp. 2249–2257.
10. Li, L.; Chu, W.; Langford, J.; Schapire, R.E. A Contextual-Bandit Approach to Personalized News Article Recommendation. In Proceedings of the 19th International Conference on World Wide Web (WWW’10), Raleigh, NC, USA, 26–30 April 2010; Association for Computing Machinery: New York, NY, USA, 2010; pp. 661–670.
11. Benedetto, G.D.; Bellini, V.; Zappella, G. A Linear Bandit for Seasonal Environments. arXiv **2020**, arXiv:2004.13576.
12. Besbes, O.; Gur, Y.; Zeevi, A. Stochastic multi-armed-bandit problem with non-stationary rewards. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 199–207.
13. Gama, J.; Žliobaitė, I.; Bifet, A.; Pechenizkiy, M.; Bouchachia, A. A survey on concept drift adaptation. ACM Comput. Surv. CSUR **2014**, 46, 1–37.
14. Zliobaite, I. Learning under Concept Drift: An Overview. arXiv **2010**, arXiv:1010.4784.
15. Trovo, F.; Paladino, S.; Restelli, M.; Gatti, N. Sliding-Window Thompson Sampling for Non-Stationary Settings. J. Artif. Intell. Res. **2020**, 68, 311–364.
16. Slivkins, A. Introduction to Multi-Armed Bandits. arXiv **2019**, arXiv:1904.07272.
17. Agrawal, R. Sample mean based index policies with O (log n) regret for the multi-armed bandit problem. Adv. Appl. Probab. **1995**, 27, 1054–1078.
18. Auer, P. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res. **2002**, 3, 397–422.
19. Thompson, W.R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika **1933**, 25, 285–294.
20. Thompson, W.R. On the theory of apportionment. Am. J. Math. **1935**, 57, 450–456.
21. Russo, D.; Roy, B.V.; Kazerouni, A.; Osband, I.; Wen, Z. A Tutorial on Thompson Sampling. arXiv **2017**, arXiv:1707.02038.
22. Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn. **2002**, 47, 235–256.
23. Kaufmann, E.; Korda, N.; Munos, R. Thompson sampling: An asymptotically optimal finite-time analysis. In Proceedings of the International Conference on Algorithmic Learning Theory, Lyon, France, 29–31 October 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 199–213.
24. Agrawal, S.; Goyal, N. Analysis of thompson sampling for the multi-armed bandit problem. In Proceedings of the Conference on Learning Theory, Edinburgh, UK, 25–27 June 2012.
25. Russo, D.; Van Roy, B. An Information-Theoretic Analysis of Thompson Sampling. J. Mach. Learn. Res. **2016**, 17, 2442–2471.
26. Russo, D.; Van Roy, B. Learning to Optimize via Information-Directed Sampling. Oper. Res. **2018**, 66, 230–252.
27. Chow, S.C.; Chang, M. Adaptive Design Methods in Clinical Trials, 2nd ed.; CRC Press: New York, NY, USA, 2006; Volume 3.
28. Srinivas, N.; Krause, A.; Kakade, S.; Seeger, M. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design; Omnipress: Madison, WI, USA, 2010; pp. 1015–1022.
29. Brochu, E.; Brochu, T.; de Freitas, N. A Bayesian Interactive Optimization Approach to Procedural Animation Design. In Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA’10), Madrid, Spain, 2–4 July 2010; Eurographics Association: Goslar, Germany, 2010; pp. 103–112.
30. Lu, J.; Liu, A.; Dong, F.; Gu, F.; Gama, J.; Zhang, G. Learning under concept drift: A review. IEEE Trans. Knowl. Data Eng. **2018**, 31, 2346–2363.
31. Žliobaitė, I.; Pechenizkiy, M.; Gama, J. An overview of concept drift applications. In Big Data Analysis: New Algorithms for a New Society; Springer: Berlin, Germany, 2016; pp. 91–114.
32. Dries, A.; Rückert, U. Adaptive Concept Drift Detection. Stat. Anal. Data Min. **2009**, 2, 311–327.
33. Klinkenberg, R.; Joachims, T. Detecting Concept Drift with Support Vector Machines. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML’00), Stanford, CA, USA, 29 June–2 July 2000; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2000; pp. 487–494.
34. Nishida, K.; Yamauchi, K. Detecting Concept Drift Using Statistical Testing. In Discovery Science; Corruble, V., Takeda, M., Suzuki, E., Eds.; Springer: Berlin/Heidelberg, Germany, 2007; pp. 264–269.
35. Elwell, R.; Polikar, R. Incremental Learning of Concept Drift in Nonstationary Environments. IEEE Trans. Neural Netw. **2011**, 22, 1517–1531.
36. Raj, V.; Kalyani, S. Taming non-stationary bandits: A Bayesian approach. arXiv **2017**, arXiv:1707.09727.
37. Garivier, A.; Moulines, E. On upper-confidence bound policies for non-stationary bandit problems. arXiv **2008**, arXiv:0805.3415.
38. Fouché, E.; Komiyama, J.; Böhm, K. Scaling multi-armed bandit algorithms. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 1449–1459.
39. Bifet, A.; Gavalda, R. Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM International Conference on Data Mining, Minneapolis, MN, USA, 26–28 April 2007; pp. 443–448.
40. Hartland, C.; Gelly, S.; Baskiotis, N.; Teytaud, O.; Sebag, M. Multi-armed bandit, dynamic environments and meta-bandits. In Proceedings of the NIPS-2006 Workshop, Online Trading between Exploration and Exploitation, Whistler, BC, Canada, 8 December 2006.
41. Kaufmann, E.; Cappé, O.; Garivier, A. On Bayesian upper confidence bounds for bandit problems. In Proceedings of the Artificial Intelligence and Statistics, Canary Islands, Spain, 21–23 April 2012; pp. 592–600.
42. May, B.C.; Korda, N.; Lee, A.; Leslie, D.S. Optimistic Bayesian sampling in contextual-bandit problems. J. Mach. Learn. Res. **2012**, 13, 2069–2106.
43. Mellor, J.; Shapiro, J. Thompson sampling in switching environments with Bayesian online change detection. In Proceedings of the Artificial Intelligence and Statistics, Scottsdale, AZ, USA, 29 April–1 May 2013; pp. 442–450.
44. Liu, F.; Lee, J.; Shroff, N. A change-detection based framework for piecewise-stationary multi-armed bandit problem. arXiv **2017**, arXiv:1711.03539.
45. Besson, L.; Kaufmann, E. The generalized likelihood ratio test meets klucb: An improved algorithm for piece-wise non-stationary bandits. arXiv **2019**, arXiv:1902.01575.
46. KhudaBukhsh, A.R.; Carbonell, J.G. Expertise drift in referral networks. Auton. Agents Multi-Agent Syst. **2019**, 33, 645–671.
47. St-Pierre, D.L.; Liu, J. Differential evolution algorithm applied to non-stationary bandit problem. In Proceedings of the 2014 IEEE Congress on Evolutionary Computation (CEC), Beijing, China, 6–11 July 2014; pp. 2397–2403.
48. Allesiardo, R.; Féraud, R.; Maillard, O.A. The non-stationary stochastic multi-armed bandit problem. Int. J. Data Sci. Anal. **2017**, 3, 267–283.
49. Souza, V.; Reis, D.M.d.; Maletzke, A.G.; Batista, G.E. Challenges in Benchmarking Stream Learning Algorithms with Real-world Data. arXiv **2020**, arXiv:2005.00113.
50. Sottocornola, G.; Symeonidis, P.; Zanker, M. Session-Based News Recommendations. In Companion Proceedings of The Web Conference 2018 (WWW’18), Lyon, France, 23–27 April 2018; International World Wide Web Conferences Steering Committee: Geneva, Switzerland, 2018; pp. 1395–1399.
51. Blei, D.M. Probabilistic Topic Models. Commun. ACM **2012**, 55, 77–84.
52. Gusareva, E.S.; Acerbi, E.; Lau, K.J.; Luhung, I.; Premkrishnan, B.N.; Kolundžija, S.; Purbojati, R.W.; Wong, A.; Houghton, J.N.; Miller, D.; et al. Microbial communities in the tropical air ecosystem follow a precise diel cycle. Proc. Natl. Acad. Sci. USA **2019**, 116, 23299–23308.

**Figure 1.** Examples of the four types of concept drift affecting the probability distribution of Class 3.

**Figure 2.** Reward distributions for custom-decreasing (**a**), custom-increasing (**b**), and custom-stationary (**c**) environments. Every function represents the evolution over time t of an arm reward probability in the custom environment.

**Figure 3.** Monthly frequency of crimes in each district for the Baltimore Crime dataset. Every line represents the evolution of the relative frequency of crimes in each district, through the months.

**Figure 6.** Frequency of the classes of air microbial organisms for the Air Microbes dataset. Every line represents the evolution of the relative frequency of a microbe class every two hours for 20 days.

**Figure 7.** Results of the MAB algorithms in environments with abrupt changes. Line plots represent the performance of each algorithm in terms of RCR, when the probability of drift d varies.

**Figure 8.** Results of the MAB algorithms in random environments with incremental changes. Line plots represent the performance of each algorithm in terms of RCR, when the drift probability d varies.

**Figure 9.** Results of the MAB algorithms in the custom-decreasing environment. Line plots represent the regret (**a**) and the cumulative reward (**b**) through 1000 steps for each algorithm.

**Figure 10.** Results of the MAB algorithms in the custom-increasing environment. Line plots represent the regret (**a**) and the cumulative reward (**b**) through 1000 steps for each algorithm.

**Figure 11.** Results of the MAB algorithms in the custom-stationary environment. Line plots represent the regret (**a**) and the cumulative reward (**b**) through 1000 steps for each algorithm.

**Table 1.** Meta-information for each real-world dataset. We provide the number of classes (i.e., the number of actions/arms in the Multi-Armed Bandit (MAB) setting), the number of instances (i.e., the time-steps in the MAB setting), and the actual time span in which data were collected.

Dataset | Classes | Instances | Time Span |
---|---|---|---|
Baltimore Crime | 9 | 321,147 | 6 years |
Insects-Incremental | 6 | 452,045 | 3 months |
Insects-Abrupt | 6 | 355,276 | 3 months |
Insects-Incremental-gradual | 6 | 143,224 | 3 months |
Insects-Incremental-abrupt-reoccurring | 6 | 452,045 | 3 months |
Insects-Incremental-reoccurring | 6 | 452,045 | 3 months |
Local News | 5 | 13,526 | 1 year |
Air Microbes | 10 | 28,560 | 20 days |

**Table 2.** Best parameters selected after the tuning phase for random-abrupt and random-incremental scenarios.

Algorithm | Random-Abrupt | Random-Incremental |
---|---|---|
Max-dsw TS | $\gamma =0.99$, $n=25$ | $\gamma =0.99$, $n=50$ |
Min-dsw TS | $\gamma =0.95$, $n=100$ | $\gamma =0.95$, $n=75$ |
Mean-dsw TS | $\gamma =0.95$, $n=25$ | $\gamma =0.99$, $n=50$ |
Discounted TS | $\gamma =0.98$ | $\gamma =0.99$ |
Sliding Window TS | $n=100$ | $n=100$ |

**Table 3.** Best parameters selected after the tuning phase for real-world datasets, that is, Baltimore Crime, Insects, Local News and Air Microbes.

Algorithm | Baltimore Crime | Insects | Local News | Air Microbes |
---|---|---|---|---|
Max-dsw TS | $\gamma =0.9999$, $n=800$ | $\gamma =0.999$, $n=800$ | $\gamma =0.999$, $n=800$ | $\gamma =0.9999$, $n=800$ |
Min-dsw TS | $\gamma =0.999$, $n=800$ | $\gamma =0.99$, $n=200$ | $\gamma =0.95$, $n=800$ | $\gamma =0.99$, $n=800$ |
Mean-dsw TS | $\gamma =0.9999$, $n=800$ | $\gamma =0.999$, $n=800$ | $\gamma =0.99$, $n=400$ | $\gamma =0.9999$, $n=800$ |
D-TS | $\gamma =0.9999$ | $\gamma =0.999$ | $\gamma =0.999$ | $\gamma =0.9999$ |
SW-TS | $n=12{,}800$ | $n=3200$ | $n=3200$ | $n=12{,}800$ |

**Table 4.** Comparison of performances (%) of MAB algorithms for real-world datasets. * All tasks are evaluated in terms of classification accuracy, except for Air Microbes which is evaluated in terms of cumulative reward relative to the oracle (RCR). The best performance for each dataset is indicated in bold.

Dataset | Max-dsw | Min-dsw | Mean-dsw | D-TS | SW-TS | TS | Rand |
---|---|---|---|---|---|---|---|
Baltimore Crime | 14.18 | **14.61** | 14.48 | 14.11 | 14.18 | 14.55 | 10.85 |
Insects abrupt | 39.30 | **40.24** | 40.21 | 39.22 | 39.94 | 28.8 | 16.65 |
Insects incremental | 38.94 | **40.47** | 39.94 | 38.82 | 39.72 | 35.35 | 16.63 |
Insects incr-abrupt-reoc | 38.54 | **40.10** | 39.44 | 38.45 | 38.77 | 33.05 | 16.69 |
Insects incremental-reoc | 38.58 | **40.30** | 39.53 | 38.49 | 38.63 | 33.48 | 16.68 |
Local News | 50.94 | **53.70** | 53.17 | 51.46 | 50.79 | 51.49 | 23.69 |
Air Microbes * | 81.13 | **86.06** | 82.57 | 80.93 | 81.42 | 81.94 | 26.95 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Cavenaghi, E.; Sottocornola, G.; Stella, F.; Zanker, M.
Non Stationary Multi-Armed Bandit: Empirical Evaluation of a New Concept Drift-Aware Algorithm. *Entropy* **2021**, *23*, 380.
https://doi.org/10.3390/e23030380
