# Designing Reinforcement Learning Algorithms for Digital Interventions: Pre-Implementation Guidelines


## Abstract


## 1. Introduction

**Framework for Guiding Design Decisions in RL Algorithms for Digital Interventions:** We provide a framework for evaluating the design of an online RL algorithm to increase confidence that the RL algorithm will improve the digital intervention’s effectiveness in real-life implementation and maintain the intervention’s reproducibility and replicability. Specifically, we extend the PCS (predictability, computability, stability) data science framework of Yu [8] to address specific challenges in the development and evaluation of online RL algorithms for personalizing digital interventions.

**Case Study:** This case study concerns the development of an RL algorithm for Oralytics, a mobile health intervention study designed to encourage oral self-care behaviors. The study is planned to go into the field in late 2022. This case study provides a concrete example of implementing the PCS framework to inform the design of an online RL algorithm.

## 2. Review of Online Reinforcement Learning Algorithms

#### 2.1. Decision Times

#### 2.2. State

#### 2.3. Action

#### 2.4. Reward

#### 2.5. Online RL Algorithms

#### 2.6. Update Times

## 3. PCS Framework for Designing RL Algorithms for Digital Intervention Development

#### 3.1. Personalization (P)

**Average of Users’ Average (Across Time) Rewards:** This metric is the average of all N users’ rewards averaged across all T decision times, defined as $\frac{1}{N}{\sum}_{i=1}^{N}\left(\frac{1}{T}{\sum}_{t=1}^{T}{R}_{i,t}\right)$. The metric serves as a global measure of the RL algorithm’s performance.

**The 25th Percentile of Users’ Average (Across Time) Rewards:** To compute this metric, first compute the average reward across time for each user, $\frac{1}{T}{\sum}_{t=1}^{T}{R}_{i,t}$ for each $i=1,2,\cdots ,N$; the metric is the lower 25th percentile of these average rewards across the N users. The metric shows how well an RL algorithm performs for the worst-off users, namely users in the lower quartile of average rewards across time.

**Average Reward For Multiple Time Points:** This metric is the average of users’ rewards through decision time ${t}_{0}$, defined as $\frac{1}{N}{\sum}_{i=1}^{N}\left(\frac{1}{{t}_{0}}{\sum}_{t=1}^{{t}_{0}}{R}_{i,t}\right)$, computed for multiple values of ${t}_{0}\in \{1,2,...,T\}$. These metrics can be used to assess the speed at which the RL algorithm learns across weeks in the trial.
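The three metrics above can be computed directly from a matrix of observed rewards. A minimal sketch, assuming rewards are stored in an `(N, T)` numpy array (the function and variable names here are ours, not the paper's):

```python
import numpy as np

def personalization_metrics(R):
    """R: (N, T) array of rewards R[i, t] for user i at decision time t."""
    user_avgs = R.mean(axis=1)          # each user's average reward across time
    avg_of_avgs = user_avgs.mean()      # global measure of algorithm performance
    p25 = np.percentile(user_avgs, 25)  # performance for the worst-off users
    return avg_of_avgs, p25

def learning_curve(R, checkpoints):
    """Average of users' average rewards up to each decision time t0,
    used to assess how quickly the algorithm learns."""
    return {t0: R[:, :t0].mean() for t0 in checkpoints}
```

For example, `learning_curve(R, [20, 40, 60])` traces performance over the first 20, 40, and 60 decision times.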

#### 3.2. Computability (C)

**Timely Access to Reward and State Information:** The investigators may have an ideal definition of the reward or state features for the algorithm; however, due to delays in communication between sensors, the digital application, and the cloud storage, the investigators’ first choice may not be reliably available. Since RL algorithms for digital interventions must make decisions online, the development team must choose state features that will be reliably available to the algorithm at each decision time. Additionally, the team must also choose rewards that are reliably available to the algorithm at update times.

**Engineering Budget:** One should consider the engineering budget, supporting software needed, and time available to deliver a production-ready algorithm. If there are significant constraints, a simpler algorithm may be preferred over a sophisticated one because it is easier to implement, test, and set up monitoring systems for.

**Off-Policy Evaluation and Causal Inference Considerations:** The investigative team often not only cares about the RL algorithm’s ability to learn but also about being able to use data collected by the RL algorithm to answer scientific questions after the study is over. These scientific questions can include topics such as off-policy evaluation [15,16] and causal inference [17,18]. Thus, the algorithm may be constrained to select actions probabilistically with probabilities that are bounded away from zero and one. This enhances the ability of investigators to use the resulting data to address scientific questions with sufficient power [19].

#### 3.3. Stability (S)

**User Heterogeneity:** There is likely some amount of user heterogeneity in response to actions, even when users are in the same context. User heterogeneity can be partially due to unobserved user traits (e.g., factors that are stable or change slowly over time, like family composition or personality type). The amount of between-user heterogeneity impacts whether an RL algorithm that pools data (partially or using clusters) across users to select actions will lead to improved rewards.

**Non-Stationarity:** Unobserved factors common to all users, such as societal changes (e.g., a new wave of the pandemic) and time-varying unobserved treatment burden (e.g., a user’s response to a digital intervention may depend on how many days the user has experienced the intervention), may make the distribution of the reward appear to vary with time, i.e., non-stationary.

**High-Noise Environments:** Digital interventions typically deliver treatments to users in highly noisy environments. This is in part because digital interventions deliver treatments to users in daily life, where many unobserved factors (e.g., social context, mood, or stress) can affect a user’s responsiveness to an intervention. If unobserved, these factors produce noise. Moreover, the effect of digital prompts on a near-term reward tends to be small due to the nature of the intervention. Therefore, it is important to evaluate the algorithm’s ability to personalize even in highly noisy, low signal-to-noise ratio environments.

#### 3.4. Simulation Environments for PCS Evaluation

## 4. Related Works

#### 4.1. Digital Intervention Case Studies

#### 4.2. Simulation Environments in Reinforcement Learning

#### 4.3. PCS Framework Extensions

## 5. Case Study: Oral Health

- Once the study is initiated, the trial protocol and algorithm cannot be altered without jeopardizing trial validity.
- We are using an online algorithm, so we may not have timely access to certain desirable state features or rewards.
- We have a limited engineering budget.
- We must answer post-study scientific questions that require causal inference or off-policy evaluation.

#### 5.1. Oralytics

#### 5.2. The Oralytics Sequential Decision-Making Problem

#### 5.3. Designing the RL Algorithm Candidates

#### 5.4. Designing the Simulation Environment

- In general, for mobile health digital interventions, we expect the effect (magnitude of weight) of actions to be smaller than (or on the order of) the effect for baseline features, which include time of day and the user’s previous day brushing duration (all features are specified in Appendix A.1).
- The variance in treatment effects (weights representing the effect of actions) across users should be on the order of the variance in the effect of features across users (i.e., variance in parameters of fitted user-specific models).

## 6. Experiment and Results

**BLR vs. ZIP:** We prefer BLR to ZIP. BLR with cluster size $k=N$ results in higher user rewards than all other RL algorithm candidates in all environments in terms of average reward and 25th percentile reward (Table 2) and for average reward across all user decision times (Figure 3). It is interesting to note that BLR with cluster size $k=4$ performs comparably to ZIP for all cluster sizes k (Table 2, Figure 3). Originally, we hypothesized that ZIP would perform better than BLR because the ZIP-based algorithms can better model the zero-inflated nature of the rewards. We believe that the ZIP-based algorithms suffered in performance because they require fitting more parameters and thus require more data to learn effectively. On the other hand, the BLR model trades off bias and variance more effectively in our data-sparse settings.

Beyond considerations of their ability to personalize, we also prefer the BLR-based RL algorithms because they have an easy-to-compute closed-form posterior update (computability and stability). The ZIP-based algorithms involve approximate posterior sampling, which is more computationally intensive and numerically unstable. In addition, BLR with action centering is robust, namely, it is guaranteed to be unbiased even when the baseline reward model is incorrect [1]. BLR with action centering specifically does not require knowledge of the baseline features at decision time (see Appendix C.1.1). This means that baseline features only need to be available at update time, so we can incorporate features that were not available in real time at the decision time.

**Cluster Size:** RL algorithms with larger cluster sizes k perform better overall, especially in simulation environments with population-level treatment effects (rather than heterogeneous treatment effects). At first glance, one might think that algorithms with smaller cluster sizes would perform better because they can learn more personalized policies for users (especially in environments with heterogeneous treatment effects). Interestingly, though, the algorithms with larger cluster sizes performed better across all environments in terms of average reward (average across users and time) and the 25th percentile of the average reward (average over time) across users (Table 2); this means that the RL algorithm candidates with larger cluster sizes performed better both for the average user and for the worst-off users. The better performance of algorithms with larger cluster sizes is likely due to their ability to reduce noise and learn faster by leveraging the data of multiple users. Even though algorithms with larger cluster sizes are less able to model and learn the heterogeneity across users, this cost is outweighed by the substantial benefit of sharing data across users.
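The closed-form posterior update that favors BLR above is the standard conjugate Gaussian update. A minimal sketch, assuming a Gaussian prior $\beta \sim N(\mu_0, \Sigma_0)$ and known noise variance $\sigma^2$ (the function name and these simplifications are ours; the paper's exact model, including action centering, is in Appendix C.1.1):

```python
import numpy as np

def blr_posterior(X, y, mu0, Sigma0, sigma2):
    """Conjugate Bayesian linear regression update.
    Prior: beta ~ N(mu0, Sigma0); likelihood: y ~ N(X beta, sigma2 * I).
    X stacks one feature row per observed (state, action, reward) tuple."""
    Sigma0_inv = np.linalg.inv(Sigma0)
    # Posterior covariance combines prior precision with data precision.
    Sigma_post = np.linalg.inv(Sigma0_inv + X.T @ X / sigma2)
    # Posterior mean is a precision-weighted blend of prior mean and data.
    mu_post = Sigma_post @ (Sigma0_inv @ mu0 + X.T @ y / sigma2)
    return mu_post, Sigma_post
```

At each update time, the algorithm would refit with all rows collected so far; no iterative sampler is needed, which is exactly the computability advantage cited above.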

## 7. Discussion and Future Work

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Abbreviations

Abbreviation | Definition |
---|---|
RL | reinforcement learning |
BLR | Bayesian linear regression |
ZIP | zero-inflated Poisson |
MDP | Markov decision process |
RMSE | root mean squared error |

## Appendix A. Simulation Environments

#### Appendix A.1. Baseline Feature Space of the Environment Base Models

- Bias/Intercept Term $\in \mathbb{R}$
- Time of Day (Morning/Evening) $\in \{0,1\}$
- Prior Day Total Brushing Duration (Normalized) $\in \mathbb{R}$
- Weekend Indicator (Weekday/Weekend) $\in \{0,1\}$
- Proportion of Nonzero Brushing Sessions Over Past 7 Days $\in [0,1]$
- Day in Study (Normalized) $\in [-1,1]$

#### Normalization of State Features

#### Appendix A.2. Environment Base Model

1. Zero-Inflated Poisson Model for Brushing Duration
2. Hurdle Model with Square Root Transform for Brushing Duration
3. Hurdle Model with Log Transform for Brushing Duration
#### Appendix A.3. Fitting the Environment Base Models

#### Selecting the Model Class for Each User

**Table A1.** Definitions of $\widehat{\mathbb{E}}[{D}_{i,t}|{S}_{i,t}]$ for each model class. $\widehat{\mathbb{E}}[{D}_{i,t}|{S}_{i,t}]$ is the mean of the model for user i fitted using data ${\{({S}_{i,t},{D}_{i,t})\}}_{t=1}^{T}$.

Model Class | $\widehat{\mathbb{E}}[{D}_{i,t}|{S}_{i,t}]$ |
---|---|
Zero-Inflated Poisson | $\left[1-\mathrm{sigmoid}\left(g{({S}_{i,t})}^{T}{w}_{i,b}\right)\right]\cdot \exp\left(g{({S}_{i,t})}^{T}{w}_{i,p}\right)$ |
Hurdle (Square Root) | $\left[1-\mathrm{sigmoid}\left(g{({S}_{i,t})}^{T}{w}_{i,b}\right)\right]\cdot \left[{\sigma}_{i,u}^{2}+{\left(g{({S}_{i,t})}^{T}{w}_{i,\mu}\right)}^{2}\right]$ |
Hurdle (Log) | $\left[1-\mathrm{sigmoid}\left(g{({S}_{i,t})}^{T}{w}_{i,b}\right)\right]\cdot \exp\left(g{({S}_{i,t})}^{T}{w}_{i,\mu}+\frac{{\sigma}_{i,u}^{2}}{2}\right)$ |

**Table A2.** Definitions of $\widehat{\mathbb{E}}[{D}_{i,t}|{S}_{i,t},{D}_{i,t}>0]$ and $\widehat{\mathrm{Var}}[{D}_{i,t}|{S}_{i,t},{D}_{i,t}>0]$ for each model class. These are the mean and variance of the nonzero component of the model for user i fitted using data ${\{({S}_{i,t},{D}_{i,t})\}}_{t=1}^{T}$.

Model Class | $\widehat{\mathbb{E}}[{D}_{i,t}|{S}_{i,t},{D}_{i,t}>0]$ |
---|---|
Hurdle (Square Root) | ${\sigma}_{i,u}^{2}+{\left(g{({S}_{i,t})}^{T}{w}_{i,\mu}\right)}^{2}$ |
Hurdle (Log) | $\exp\left(g{({S}_{i,t})}^{T}{w}_{i,\mu}+\frac{{\sigma}_{i,u}^{2}}{2}\right)$ |
Zero-Inflated Poisson | $\frac{\exp\left(g{({S}_{i,t})}^{T}{w}_{i,p}\right)\exp\left(\exp\left(g{({S}_{i,t})}^{T}{w}_{i,p}\right)\right)}{\exp\left(\exp\left(g{({S}_{i,t})}^{T}{w}_{i,p}\right)\right)-1}$ |

Model Class | $\widehat{\mathrm{Var}}[{D}_{i,t}|{S}_{i,t},{D}_{i,t}>0]$ |
---|---|
Hurdle (Square Root) | ${\left(g{({S}_{i,t})}^{T}{w}_{i,\mu}\right)}^{4}+3{\sigma}_{i,u}^{4}+6{\sigma}_{i,u}^{2}{\left(g{({S}_{i,t})}^{T}{w}_{i,\mu}\right)}^{2}-\widehat{\mathbb{E}}{[{D}_{i,t}|{S}_{i,t},{D}_{i,t}>0]}^{2}$ |
Hurdle (Log) | $\left(\exp({\sigma}_{i,u}^{2})-1\right)\cdot \exp\left(2g{({S}_{i,t})}^{T}{w}_{i,\mu}+{\sigma}_{i,u}^{2}\right)$ |
Zero-Inflated Poisson | $\widehat{\mathbb{E}}[{D}_{i,t}|{S}_{i,t},{D}_{i,t}>0]\cdot \left(1+\exp\left(g{({S}_{i,t})}^{T}{w}_{i,p}\right)-\widehat{\mathbb{E}}[{D}_{i,t}|{S}_{i,t},{D}_{i,t}>0]\right)$ |

Model Class | Stationary | Non-Stationary |
---|---|---|
Hurdle with Square Root Transform | 9 | 7 |
Hurdle with Log Transform | 9 | 8 |
Zero-Inflated Poisson | 14 | 17 |

#### Appendix A.4. Checking the Quality of the Simulation Environment Base Model

#### Appendix A.4.1. Checking Moments

- Proportion of Missed Brushing Windows: $$\frac{1}{N}\sum _{i=1}^{N}\frac{1}{T}\sum _{t=1}^{T}\mathbb{I}[{D}_{i,t}=0]$$
- Average Nonzero Brushing Duration: $$\frac{1}{N}\sum _{i=1}^{N}\frac{1}{{\sum}_{t=1}^{T}\mathbb{I}[{D}_{i,t}>0]}\sum _{t=1}^{T}\mathbb{I}[{D}_{i,t}>0]\,{D}_{i,t}$$
- Variance of Nonzero Brushing Durations: Let $\widehat{\mathrm{Var}}({\{{X}_{k}\}}_{k=1}^{K})$ denote the empirical variance of ${X}_{1},{X}_{2},...,{X}_{K}$. $$\widehat{\mathrm{Var}}\left({\left\{{D}_{i,t}:t\in [1:T],{D}_{i,t}>0\right\}}_{i=1}^{N}\right)$$
- Variance of Average User Brushing Durations: This metric measures the degree of between-user variance in average brushing. $$\widehat{\mathrm{Var}}\left({\left\{\frac{1}{T}\sum _{t=1}^{T}{D}_{i,t}\right\}}_{i=1}^{N}\right)$$
- Average of Variances of Within-User Brushing Durations: This metric measures the average amount of within-user variance. $$\frac{1}{N}\sum _{i=1}^{N}\widehat{\mathrm{Var}}\left({\{{D}_{i,t}\}}_{t=1}^{T}\right)$$
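The five moments above can all be computed from an `(N, T)` matrix of brushing durations. A sketch under the same definitions (the function and key names are ours):

```python
import numpy as np

def environment_moments(D):
    """D: (N, T) array of brushing durations D[i, t] (0 = missed window).
    Returns the five moments used to compare base models with the data."""
    nonzero = D[D > 0]  # pooled nonzero durations across all users and times
    # Per-user mean of nonzero durations, then averaged across users.
    per_user_nonzero_mean = np.array([row[row > 0].mean() for row in D])
    return {
        "prop_missed": float(np.mean(D == 0)),
        "avg_nonzero_duration": float(per_user_nonzero_mean.mean()),
        "var_nonzero_duration": float(nonzero.var()),
        "between_user_var": float(D.mean(axis=1).var()),
        "avg_within_user_var": float(D.var(axis=1).mean()),
    }
```

Running this on both the observed data set and data simulated from a base model gives the side-by-side comparison reported in Table A4.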

**Table A4.** Comparing moments between the base models and the ROBAS 2 data set. We use BDs to abbreviate Brushing Durations. Values for the Stationary and Nonstationary base models are averaged across 100 trials.

Metrics | ROBAS 2 | Stationary | Non-Stationary |
---|---|---|---|
Proportion of Missed Brushing Windows | 0.376674 | 0.403114 | 0.397812 |
Average Nonzero BDs | 137.768129 | 131.308445 | 134.676955 |
Variance of Nonzero BDs | 2326.518304 | 2392.955018 | 2253.177853 |
Variance of Average User BDs | 1415.920148 | 1699.126897 | 1399.615330 |
Average of Variances of Within-User BDs | 1160.723506 | 1405.944459 | 1473.239769 |

#### Appendix A.4.2. Measuring If a Base Model Captures the Variance in the Data

**Table A5.** Statistic ${U}_{i}$ for capturing variance in the data. Values are rounded to 3 decimal places.

Metric | Stationary | Non-Stationary |
---|---|---|
Equation (A1) $\overline{U}$ | 0.811 | 0.792 |
Equation (A1) $\overline{{\sigma}_{U}}$ | 0.146 | 0.150 |
Equation (A1) Confidence Interval | (0.760, 0.861) | (0.739, 0.844) |
Equation (A2) $\overline{U}$ | 3.579 | 3.493 |
Equation (A2) $\overline{{\sigma}_{U}}$ | 4.861 | 4.876 |
Equation (A2) Confidence Interval | (1.895, 5.263) | (1.803, 5.182) |

#### Appendix A.5. Imputing Treatment Effect Sizes for Simulation Environments

#### Appendix A.5.1. Treatment Effect Feature Space

1. Bias/Intercept Term $\in \mathbb{R}$
2. Time of Day (Morning/Evening) $\in \{0,1\}$
3. Prior Day Total Brushing Duration (Normalized) $\in \mathbb{R}$
4. Weekend Indicator (Weekday/Weekend) $\in \{0,1\}$
5. Day in Study (Normalized) $\in \mathbb{R}$

#### Appendix A.5.2. Imputation Approach

#### Appendix A.5.3. Heterogeneous versus Population-Level Effect Size

- ${\mathsf{\Delta}}_{B}={\mu}_{B,\mathrm{avg}}$ where ${\mu}_{B,\mathrm{avg}}=\frac{1}{4}{\sum}_{d\in [2:5]}\frac{1}{N}{\sum}_{i=1}^{N}|{w}_{i,b}^{(d)}|$.
- ${\mathsf{\Delta}}_{N}={\mu}_{N,\mathrm{avg}}$ where ${\mu}_{N,\mathrm{avg}}=\frac{1}{4}{\sum}_{d\in [2:5]}\frac{1}{N}{\sum}_{i=1}^{N}|{w}_{i,p}^{(d)}|$.

- ${\mathsf{\Delta}}_{B}={\mu}_{B,\mathrm{avg}}$ where ${\mu}_{B,\mathrm{avg}}=\frac{1}{4}{\sum}_{d\in [2:5]}\frac{1}{N}{\sum}_{i=1}^{N}|{w}_{i,b}^{(d)}|$.
- ${\mathsf{\Delta}}_{N}={\mu}_{N,\mathrm{avg}}$ where ${\mu}_{N,\mathrm{avg}}=\frac{1}{4}{\sum}_{d\in [2:5]}\frac{1}{N}{\sum}_{i=1}^{N}|{w}_{i,\mu}^{(d)}|$.

Above, ${w}_{i,b}^{(d)}$, ${w}_{i,p}^{(d)}$, and ${w}_{i,\mu}^{(d)}$ denote the $d$th dimension of the vectors ${w}_{i,b}$, ${w}_{i,p}$, and ${w}_{i,\mu}$, respectively; we take the minimum over all dimensions excluding $d=1$, which represents the weight for the bias/intercept term.

- ${\sigma}_{B}$ is the empirical standard deviation over ${\{{\mu}_{i,B}\}}_{i=1}^{N}$ where ${\mu}_{i,B}=\frac{1}{4}{\sum}_{d\in [2:5]}|{w}_{i,b}^{(d)}|$.
- ${\sigma}_{N}$ is the empirical standard deviation over ${\{{\mu}_{i,N}\}}_{i=1}^{N}$ where ${\mu}_{i,N}=\frac{1}{4}{\sum}_{d\in [2:5]}|{w}_{i,p}^{(d)}|$.

- ${\sigma}_{B}$ is the empirical standard deviation over ${\{{\mu}_{i,B}\}}_{i=1}^{N}$ where ${\mu}_{i,B}=\frac{1}{4}{\sum}_{d\in [2:5]}|{w}_{i,b}^{(d)}|$.
- ${\sigma}_{N}$ is the empirical standard deviation over ${\{{\mu}_{i,N}\}}_{i=1}^{N}$ where ${\mu}_{i,N}=\frac{1}{4}{\sum}_{d\in [2:5]}|{w}_{i,\mu}^{(d)}|$.
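The quantities above reduce to simple averages over the fitted per-user weight matrices. A sketch for one component, assuming the weights are stored as an `(N, 5)` array whose first column is the intercept ($d=1$); the function name is ours:

```python
import numpy as np

def effect_size_stats(W):
    """W: (N, 5) array of fitted per-user weights; column 0 is the intercept.
    Returns (Delta, sigma): Delta = mu_avg averages |weights| over the four
    non-intercept dimensions (d in [2:5]) and over all N users; sigma is the
    empirical standard deviation across users of the per-user averages
    mu_i = (1/4) * sum_{d in [2:5]} |w_i^{(d)}|."""
    mu_i = np.abs(W[:, 1:]).mean(axis=1)  # per-user average absolute weight
    return float(mu_i.mean()), float(mu_i.std())
```

The same call applied to the Bernoulli weights $w_{i,b}$ yields $({\mathsf{\Delta}}_{B},{\sigma}_{B})$, and applied to $w_{i,p}$ or $w_{i,\mu}$ yields $({\mathsf{\Delta}}_{N},{\sigma}_{N})$ for the corresponding model class.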

**Figure A1.** Effect sizes ${\mathsf{\Delta}}_{i,B}$’s, ${\mathsf{\Delta}}_{i,N}$’s, and ${\mu}_{B},{\mu}_{N}$ for each base model class. The effect sizes are used to generate rewards under action $A=1$ for the simulation environment. (**a**) Bernoulli component (hurdle); (**b**) Nonzero component (square root); (**c**) Nonzero component (log); (**d**) Bernoulli component (ZIP); (**e**) Poisson component (ZIP).

## Appendix B. RL Algorithm Candidates

#### Appendix B.1. Feature Space for the RL Algorithm Candidates

1. Bias/Intercept Term $\in \mathbb{R}$
2. Time of Day (Morning/Evening) $\in \{0,1\}$
3. Prior Day Total Brushing Duration (Normalized) $\in \mathbb{R}$
4. Weekend Indicator (Weekday/Weekend) $\in \{0,1\}$

#### Appendix B.2. Decision 1: Reward Approximating Function

#### Appendix B.2.1. Bayesian Linear Regression Model

#### Appendix B.2.2. Zero-Inflated Poisson Regression Model

#### Appendix B.3. Decision 2: Cluster Size

## Appendix C. RL Algorithm Posterior Updates and Posterior Sampling Action Selection

#### Appendix C.1. Posterior Updates to the RL Algorithm at Update Time

#### Appendix C.1.1. Bayesian Linear Regression Model

#### Appendix C.1.2. Zero-Inflated Poisson Regression Model

- Posterior Density:

- Proposal Distribution:

- Metropolis-Hastings Acceptance Ratio:
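The three ingredients listed above fit together in a standard random-walk Metropolis-Hastings loop. The sketch below is a generic illustration, not the paper's tuned sampler: `log_post` stands in for the (unnormalized) log posterior density of the ZIP weights, and the Gaussian step size is an assumed placeholder. Because the Gaussian random-walk proposal is symmetric, the acceptance ratio reduces to a ratio of posterior densities:

```python
import numpy as np

def metropolis_hastings(log_post, w0, n_steps, step=0.1, rng=None):
    """Random-walk Metropolis-Hastings over a weight vector.
    log_post: unnormalized log posterior density; proposal: Gaussian
    centered at the current state with scale `step` (symmetric)."""
    if rng is None:
        rng = np.random.default_rng()
    w = np.asarray(w0, dtype=float).copy()
    lp = log_post(w)
    samples = []
    for _ in range(n_steps):
        prop = w + step * rng.standard_normal(w.shape)
        lp_prop = log_post(prop)
        # Accept with probability min(1, p(prop) / p(w)).
        if np.log(rng.random()) < lp_prop - lp:
            w, lp = prop, lp_prop
        samples.append(w.copy())
    return np.array(samples)
```

This iterative loop, run at every update time, is the computational cost (and numerical-stability risk) that the closed-form BLR update avoids.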

#### Appendix C.2. Action Selection at Decision Time

#### Appendix C.2.1. Posterior Sampling

#### Appendix C.2.2. Clipping to Form Action Selection Probabilities
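For a Gaussian posterior on the advantage weights, the posterior-sampling probability of selecting action 1 has a closed form, which can then be clipped to probabilities bounded away from zero and one. A minimal sketch: the function name is ours, and `pi_min`/`pi_max` are illustrative bounds rather than the study's actual clipping values:

```python
import math
import numpy as np

def action_probability(f, mu, Sigma, pi_min=0.1, pi_max=0.9):
    """Probability of sending a message under posterior sampling with a
    Gaussian posterior beta ~ N(mu, Sigma) on the advantage weights:
        pi = P(f^T beta > 0) = Phi(f^T mu / sqrt(f^T Sigma f)),
    then clipped to [pi_min, pi_max] so the data supports post-study
    off-policy evaluation and causal inference analyses."""
    mean = float(f @ mu)
    sd = math.sqrt(float(f @ Sigma @ f))
    pi = 0.5 * (1.0 + math.erf(mean / (sd * math.sqrt(2.0))))  # Phi(mean/sd)
    return min(max(pi, pi_min), pi_max)
```

With a totally uninformative posterior the probability is 0.5; as the posterior concentrates on a positive (or negative) advantage, the probability saturates at the clipping bounds instead of at 1 (or 0).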

## References

- Liao, P.; Greenewald, K.H.; Klasnja, P.V.; Murphy, S.A. Personalized HeartSteps: A Reinforcement Learning Algorithm for Optimizing Physical Activity. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. **2020**, 4, 1–22.
- Yom-Tov, E.; Feraru, G.; Kozdoba, M.; Mannor, S.; Tennenholtz, M.; Hochberg, I. Encouraging physical activity in patients with diabetes: Intervention using a reinforcement learning system. J. Med. Internet Res. **2017**, 19, e338.
- Forman, E.M.; Kerrigan, S.G.; Butryn, M.L.; Juarascio, A.S.; Manasse, S.M.; Ontañón, S.; Dallal, D.H.; Crochiere, R.J.; Moskow, D. Can the artificial intelligence technique of reinforcement learning use continuously-monitored digital data to optimize treatment for weight loss? J. Behav. Med. **2019**, 42, 276–290.
- Allen, S. Stanford Computational Policy Lab Pretrial Nudges. 2022. Available online: https://policylab.stanford.edu/projects/nudge.html (accessed on 1 June 2022).
- Cai, W.; Grossman, J.; Lin, Z.J.; Sheng, H.; Wei, J.T.Z.; Williams, J.J.; Goel, S. Bandit algorithms to personalize educational chatbots. Mach. Learn. **2021**, 110, 2389–2418.
- Qi, Y.; Wu, Q.; Wang, H.; Tang, J.; Sun, M. Bandit Learning with Implicit Feedback. In Advances in Neural Information Processing Systems; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31.
- Bezos, J.P. 1997 Letter to Amazon Shareholders. 1997. Available online: https://www.sec.gov/Archives/edgar/data/1018724/000119312516530910/d168744dex991.htm (accessed on 1 June 2022).
- Yu, B.; Kumbier, K. Veridical data science. Proc. Natl. Acad. Sci. USA **2020**, 117, 3920–3929.
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
- den Hengst, F.; Grua, E.M.; el Hassouni, A.; Hoogendoorn, M. Reinforcement learning for personalization: A systematic literature review. Data Sci. **2020**, 3, 107–147.
- Wang, C.C.; Kulkarni, S.R.; Poor, H.V. Bandit problems with side observations. IEEE Trans. Autom. Control **2005**, 50, 338–355.
- Langford, J.; Zhang, T. The epoch-greedy algorithm for contextual multi-armed bandits. Adv. Neural Inf. Process. Syst. **2007**, 20, 96–103.
- Tewari, A.; Murphy, S.A. From ads to interventions: Contextual bandits in mobile health. In Mobile Health; Springer: Berlin/Heidelberg, Germany, 2017; pp. 495–517.
- Fan, H.; Poole, M.S. What is personalization? Perspectives on the design and implementation of personalization in information systems. J. Organ. Comput. Electron. Commer. **2006**, 16, 179–202.
- Thomas, P.; Brunskill, E. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 20–22 June 2016; pp. 2139–2148.
- Levine, S.; Kumar, A.; Tucker, G.; Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv **2020**, arXiv:2005.01643.
- Boruvka, A.; Almirall, D.; Witkiewitz, K.; Murphy, S.A. Assessing time-varying causal effect moderation in mobile health. J. Am. Stat. Assoc. **2018**, 113, 1112–1121.
- Hadad, V.; Hirshberg, D.A.; Zhan, R.; Wager, S.; Athey, S. Confidence intervals for policy evaluation in adaptive experiments. Proc. Natl. Acad. Sci. USA **2021**, 118, e2014602118.
- Yao, J.; Brunskill, E.; Pan, W.; Murphy, S.; Doshi-Velez, F. Power Constrained Bandits. In Proceedings of the 6th Machine Learning for Healthcare Conference, PMLR, Virtual, 6–7 August 2021; Jung, K., Yeung, S., Sendak, M., Sjoding, M., Ranganath, R., Eds.; 2021; Volume 149, pp. 209–259.
- Murnane, E.L.; Huffaker, D.; Kossinets, G. Mobile Health Apps: Adoption, Adherence, and Abandonment. In Adjunct Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2015 ACM International Symposium on Wearable Computers, Osaka, Japan, 7–11 September 2015; Association for Computing Machinery: New York, NY, USA, 2015; pp. 261–264.
- Dennison, L.; Morrison, L.; Conway, G.; Yardley, L. Opportunities and Challenges for Smartphone Applications in Supporting Health Behavior Change: Qualitative Study. J. Med. Internet Res. **2013**, 15, e86.
- Agarwal, A.; Alomar, A.; Alumootil, V.; Shah, D.; Shen, D.; Xu, Z.; Yang, C. PerSim: Data-Efficient Offline Reinforcement Learning with Heterogeneous Agents via Personalized Simulators. arXiv **2021**, arXiv:2102.06961.
- Figueroa, C.A.; Aguilera, A.; Chakraborty, B.; Modiri, A.; Aggarwal, J.; Deliu, N.; Sarkar, U.; Jay Williams, J.; Lyles, C.R. Adaptive learning algorithms to optimize mobile applications for behavioral health: Guidelines for design decisions. J. Am. Med. Inform. Assoc. **2021**, 28, 1225–1234.
- Wei, H.; Chen, C.; Liu, C.; Zheng, G.; Li, Z. Learning to simulate on sparse trajectory data. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Germany, 2020; pp. 530–545.
- Ie, E.; Hsu, C.W.; Mladenov, M.; Jain, V.; Narvekar, S.; Wang, J.; Wu, R.; Boutilier, C. RecSim: A Configurable Simulation Platform for Recommender Systems. arXiv **2019**, arXiv:cs.LG/1909.04847.
- Santana, M.R.O.; Melo, L.C.; Camargo, F.H.F.; Brandão, B.; Soares, A.; Oliveira, R.M.; Caetano, S. MARS-Gym: A Gym framework to model, train, and evaluate Recommender Systems for Marketplaces. arXiv **2020**, arXiv:cs.IR/2010.07035.
- Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym. arXiv **2016**, arXiv:cs.LG/1606.01540.
- Wang, S.; Zhang, C.; Kröse, B.; van Hoof, H. Optimizing Adaptive Notifications in Mobile Health Interventions Systems: Reinforcement Learning from a Data-driven Behavioral Simulator. J. Med. Syst. **2021**, 45, 1–8.
- Singh, A.; Halpern, Y.; Thain, N.; Christakopoulou, K.; Chi, E.; Chen, J.; Beutel, A. Building healthy recommendation sequences for everyone: A safe reinforcement learning approach. In Proceedings of the FAccTRec Workshop, Online, 26–27 September 2020.
- Korzepa, M.; Petersen, M.K.; Larsen, J.E.; Mørup, M. Simulation Environment for Guiding the Design of Contextual Personalization Systems in the Context of Hearing Aids. In Adjunct Publication of the 28th ACM Conference on User Modeling, Adaptation and Personalization, Genoa, Italy, 14–17 July 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 293–298.
- Hassouni, A.E.; Hoogendoorn, M.; van Otterlo, M.; Barbaro, E. Personalization of Health Interventions Using Cluster-Based Reinforcement Learning. In PRIMA 2018: Principles and Practice of Multi-Agent Systems; Miller, T., Oren, N., Sakurai, Y., Noda, I., Savarimuthu, B.T.R., Cao Son, T., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 467–475.
- Hassouni, A.E.; Hoogendoorn, M.; van Otterlo, M.; Eiben, A.E.; Muhonen, V.; Barbaro, E. A clustering-based reinforcement learning approach for tailored personalization of e-Health interventions. arXiv **2018**, arXiv:1804.03592.
- Dwivedi, R.; Tan, Y.S.; Park, B.; Wei, M.; Horgan, K.; Madigan, D.; Yu, B. Stable Discovery of Interpretable Subgroups via Calibration in Causal Studies. Int. Stat. Rev. **2020**, 88, S135–S178.
- Ward, O.G.; Huang, Z.; Davison, A.; Zheng, T. Next waves in veridical network embedding. Stat. Anal. Data Min. ASA Data Sci. J. **2021**, 14, 5–17.
- Margot, V.; Luta, G. A new method to compare the interpretability of rule-based algorithms. AI **2021**, 2, 621–635.
- Shetty, V.; Morrison, D.; Belin, T.; Hnat, T.; Kumar, S. A Scalable System for Passively Monitoring Oral Health Behaviors Using Electronic Toothbrushes in the Home Setting: Development and Feasibility Study. JMIR Mhealth Uhealth **2020**, 8, e17347.
- Jiang, N.; Kulesza, A.; Singh, S.; Lewis, R. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, Istanbul, Turkey, 4–8 May 2015; pp. 1181–1189.
- Russo, D.; Roy, B.V.; Kazerouni, A.; Osband, I. A Tutorial on Thompson Sampling. Available online: http://xxx.lanl.gov/abs/1707.02038 (accessed on 1 June 2022).
- Zhu, F.; Guo, J.; Xu, Z.; Liao, P.; Yang, L.; Huang, J. Group-driven reinforcement learning for personalized mhealth intervention. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2018; pp. 590–598.
- Tomkins, S.; Liao, P.; Klasnja, P.; Murphy, S. IntelligentPooling: Practical Thompson sampling for mHealth. Mach. Learn. **2021**, 110, 2685–2727.
- Deshmukh, A.A.; Dogan, U.; Scott, C. Multi-task learning for contextual bandits. Adv. Neural Inf. Process. Syst. **2017**, 30, 4848–4856.
- Vaswani, S.; Schmidt, M.; Lakshmanan, L. Horde of bandits using gaussian markov random fields. In Proceedings of the Artificial Intelligence and Statistics, PMLR, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 690–699.
- Feng, C.X. A comparison of zero-inflated and hurdle models for modeling zero-inflated count data. J. Stat. Distrib. Appl. **2021**, 8, 1–19.
- Cole, S.R.; Platt, R.W.; Schisterman, E.F.; Chu, H.; Westreich, D.; Richardson, D.; Poole, C. Illustrating bias due to conditioning on a collider. Int. J. Epidemiol. **2010**, 39, 417–420.
- Luque-Fernandez, M.A.; Schomaker, M.; Redondo-Sanchez, D.; Jose Sanchez Perez, M.; Vaidya, A.; Schnitzer, M.E. Educational Note: Paradoxical collider effect in the analysis of non-communicable disease epidemiological data: A reproducible illustration and web application. Int. J. Epidemiol. **2019**, 48, 640–653.
- Elwert, F.; Winship, C. Endogenous selection bias: The problem of conditioning on a collider variable. Annu. Rev. Sociol. **2014**, 40, 31–53.

**Figure 1.** Decision times and actions for the RL algorithm in the Oralytics study. There will be two decision times per day (one in the morning and one in the evening); in total, each user will have 140 decision times, which we index by t. At a given decision time t, the RL algorithm receives state information for each user ${S}_{i,t}$ (this will include information on whether it is a morning or evening decision time and whether it is a weekday or weekend, as well as information on the user’s previous day brushing). Given the state, the RL algorithm decides whether or not to send the user an engagement message (selects action ${A}_{i,t}\in \{0,1\}$). After an action is taken, the RL algorithm receives reward ${R}_{i,t}$, which is the user’s subsequent brushing duration in seconds during the brushing window following the decision time. See Section 5 for a discussion on the challenges of designing an RL algorithm for Oralytics.

**Figure 2.** Histogram of brushing durations in seconds for all user brushing sessions in ROBAS 2. The ROBAS 2 study had 32 users total and each user had 56 brushing windows (2 brushing windows per day for 28 days). If a user did not brush during a brushing window, their brushing duration is recorded as zero seconds. Note in the figure above that across all users and brushing windows, about $40\%$ of brushing sessions had no brushing, that is, a brushing duration of zero seconds. The ROBAS 2 brushing durations are highly zero-inflated.

**Figure 3.**Average User Rewards Over Time. Above, we show simulation results of the six candidate algorithms (BLR and ZIP, each with cluster sizes $k=1$, $k=4$, and $k=N$) across the four simulation environments. The y-axis is the mean and $\pm 1.96\cdot \mathrm{standard}\phantom{\rule{4.pt}{0ex}}\mathrm{error}$ of the average user rewards ($\overline{R}=\frac{1}{72}{\sum}_{i=1}^{72}\frac{1}{{t}_{0}}{\sum}_{s=1}^{{t}_{0}}{R}_{i,s}$) for decision times ${t}_{0}\in \{20,40,60,80,100,120,140\}$ across 100 Monte Carlo simulated trials. The standard error is $\frac{\widehat{\sigma}}{\sqrt{100}}$, where $\widehat{\sigma}$ is the sample standard deviation of the 100 values of $\overline{R}$. (**a**) Stationary Base Model and Heterogeneous Effect Size; (**b**) Nonstationary Base Model and Heterogeneous Effect Size; (**c**) Stationary Base Model and Population Effect Size; (**d**) Nonstationary Base Model and Population Effect Size.
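The quantity plotted on the y-axis can be computed as in this sketch, which uses randomly generated stand-in rewards in place of the simulation environments' output:

```python
import numpy as np

# Hypothetical stand-in: rewards[trial, user, time] for 100 Monte Carlo
# trials, N = 72 users, T = 140 decision times. In the actual evaluation
# these values come from the simulation environments.
rng = np.random.default_rng(0)
rewards = rng.uniform(0.0, 180.0, size=(100, 72, 140))

for t0 in (20, 40, 60, 80, 100, 120, 140):
    # R-bar per trial: average over the 72 users of each user's average
    # reward over the first t0 decision times
    r_bar = rewards[:, :, :t0].mean(axis=(1, 2))  # shape (100,)
    mean = r_bar.mean()
    # Monte Carlo standard error: sample std of the 100 R-bars / sqrt(100)
    se = r_bar.std(ddof=1) / np.sqrt(r_bar.size)
    print(f"t0={t0:3d}: {mean:6.2f} +/- {1.96 * se:.2f}")
```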

**Table 1.**Four Environment Variants. We consider two environment base models (stationary and nonstationary) and two effect sizes (population effect size, heterogeneous effect size).

| S_Pop: Stationary Base Model, Population Effect Size | NS_Pop: Nonstationary Base Model, Population Effect Size |
|---|---|
| S_Het: Stationary Base Model, Heterogeneous Effect Size | NS_Het: Nonstationary Base Model, Heterogeneous Effect Size |

**Table 2.**Average and 25th Percentile Rewards. Average and 25th percentile rewards are defined in Section 3.1. The naming convention for environment variants is given in Table 1; $k$ refers to the cluster size. Average rewards are averaged across time, users, and 100 trials. For the 25th percentile rewards, we average each user’s rewards across time, take the lower 25th percentile across $N=72$ users, and then average that across the 100 trials. The value in parentheses is the standard error of the mean. The best-performing algorithm candidate in each environment variant is bolded; BLR ($k=N$) performs better than the other algorithm candidates across all simulated environments. Notice that the average rewards are lower than the 120-s dentist-recommended brushing duration; this is due to the zero-inflated nature of our setting (i.e., users often do not brush during a brushing window).

**Average Rewards**

| RL Algorithm | S_Het | NS_Het | S_Pop | NS_Pop |
|---|---|---|---|---|
| ZIP $k=1$ | 100.038 (0.597) | 102.566 (0.526) | 107.184 (0.626) | 109.379 (0.552) |
| ZIP $k=4$ | 100.463 (0.586) | 103.035 (0.539) | 108.217 (0.609) | 110.242 (0.562) |
| ZIP $k=N$ | 100.791 (0.596) | 103.391 (0.546) | 108.410 (0.617) | 110.542 (0.554) |
| BLR $k=1$ | 97.196 (0.585) | 99.691 (0.527) | 103.692 (0.615) | 105.590 (0.546) |
| BLR $k=4$ | 99.772 (0.590) | 102.310 (0.547) | 107.568 (0.619) | 109.454 (0.547) |
| BLR $k=N$ | **101.267 (0.590)** | **104.024 (0.542)** | **108.974 (0.610)** | **111.201 (0.546)** |

**25th Percentile Rewards**

| RL Algorithm | S_Het | NS_Het | S_Pop | NS_Pop |
|---|---|---|---|---|
| ZIP $k=1$ | 67.907 (1.150) | 73.830 (0.403) | 74.898 (1.016) | 78.651 (0.556) |
| ZIP $k=4$ | 68.865 (1.067) | 73.836 (0.464) | 75.933 (1.114) | 80.413 (0.629) |
| ZIP $k=N$ | 69.448 (1.201) | 74.580 (0.475) | 76.312 (1.122) | 80.424 (0.648) |
| BLR $k=1$ | 65.600 (1.139) | 70.703 (0.457) | 70.915 (1.024) | 74.782 (0.596) |
| BLR $k=4$ | 68.045 (1.122) | 73.322 (0.505) | 75.766 (1.097) | 79.809 (0.622) |
| BLR $k=N$ | **69.757 (1.171)** | **75.393 (0.427)** | **77.272 (1.096)** | **81.675 (0.583)** |
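The two metrics reported in Table 2 can be computed as in this sketch, again with randomly generated stand-in rewards in place of one algorithm candidate's simulated trajectories:

```python
import numpy as np

# Hypothetical stand-in: rewards[trial, user, time] for one algorithm
# candidate across 100 trials, N = 72 users, T = 140 decision times.
rng = np.random.default_rng(0)
rewards = rng.uniform(0.0, 180.0, size=(100, 72, 140))

# Each user's reward averaged across time, per trial: shape (100, 72)
per_user_avg = rewards.mean(axis=2)

# Average reward: averaged across time, users, and the 100 trials
average_reward = per_user_avg.mean()

# 25th percentile reward: lower quartile across the 72 users within each
# trial, then averaged across the 100 trials
per_trial_p25 = np.percentile(per_user_avg, 25, axis=1)  # shape (100,)
percentile_reward = per_trial_p25.mean()

# Standard error of the mean across the 100 trials (the parenthetical
# values in the table)
se_avg = per_user_avg.mean(axis=1).std(ddof=1) / np.sqrt(100)
se_p25 = per_trial_p25.std(ddof=1) / np.sqrt(100)
```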

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Trella, A.L.; Zhang, K.W.; Nahum-Shani, I.; Shetty, V.; Doshi-Velez, F.; Murphy, S.A.
Designing Reinforcement Learning Algorithms for Digital Interventions: Pre-Implementation Guidelines. *Algorithms* **2022**, *15*, 255.
https://doi.org/10.3390/a15080255
