# Using Tree-Based Machine Learning for Health Studies: Literature Review and Case Series

## Abstract

## 1. Introduction

- (i) variable selection;
- (ii) estimation of causal effects;
- (iii) propensity score weighting;
- (iv) missing data.

## 2. Review of Methods

#### 2.1. CART

#### 2.2. Random Forest

#### 2.3. Boosting

#### 2.4. BART

## 3. Utilities of Tree-Based Methods

#### 3.1. Variable Selection

#### 3.2. Counterfactual Prediction

#### 3.3. Propensity Score Weighting

#### 3.4. Missing Data

## 4. Case Studies of Tree-Based Methods

#### 4.1. Confounder Selection

#### 4.2. Comparative Effectiveness Analysis

#### 4.3. Propensity Score Weight Estimator

#### 4.4. Handling Missing Data

## 5. Discussion

## 6. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest


**Figure 1.** An illustrative classification tree diagram. Y indicates a case and N indicates a non-case.

**Figure 2.** Visualization of the BART variable selection algorithm. The vertical lines are the threshold levels determined from the “null” distributions of the variable inclusion proportions, computed from 100 permuted datasets. Variables whose inclusion proportions in the original (unpermuted) data exceed this threshold are selected and displayed as solid dots. Open dots correspond to variables that are not selected.
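The permutation logic described in the Figure 2 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `inclusion_proportions` is a hypothetical stand-in that uses normalized absolute correlations rather than fitting BART, and the per-variable 95th-percentile cutoff (`alpha=0.05`) is an assumed choice, not a value taken from the caption. Only the number of permutations (100) comes from the caption.

```python
import random


def abs_corr(x, y):
    """Absolute Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def inclusion_proportions(columns, y):
    """Hypothetical stand-in for BART variable inclusion proportions.

    ASSUMPTION: a real analysis would fit BART and count how often each
    variable is used in splitting rules; here we normalize |correlation|.
    """
    scores = [abs_corr(col, y) for col in columns]
    total = sum(scores)
    return [s / total if total > 0 else 0.0 for s in scores]


def permutation_select(columns, y, n_perm=100, alpha=0.05, seed=0):
    """Select variables whose observed proportion exceeds a permutation threshold."""
    rng = random.Random(seed)
    observed = inclusion_proportions(columns, y)
    null = [[] for _ in columns]
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)  # break any association between outcome and covariates
        for j, p in enumerate(inclusion_proportions(columns, y_perm)):
            null[j].append(p)
    thresholds = []
    for draws in null:
        draws.sort()
        k = min(len(draws) - 1, int((1 - alpha) * len(draws)))
        thresholds.append(draws[k])  # approximate (1 - alpha) quantile of the null
    selected = [j for j, (o, t) in enumerate(zip(observed, thresholds)) if o > t]
    return observed, thresholds, selected


# Toy data: only the first covariate drives the outcome.
_rng = random.Random(1)
demo_n = 500
demo_columns = [[_rng.gauss(0, 1) for _ in range(demo_n)] for _ in range(3)]
demo_y = [3 * x0 + _rng.gauss(0, 0.3) for x0 in demo_columns[0]]
demo_observed, demo_thresholds, demo_selected = permutation_select(demo_columns, demo_y)
```

In the figure's terms, `demo_thresholds` plays the role of the vertical lines and `demo_selected` holds the indices that would be drawn as solid dots.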

**Figure 3.** Distributions of the inverse probability of treatment weights estimated by BART, random forest, and XGBoost.
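As a minimal sketch of what Figure 3 summarizes, an inverse probability of treatment weight is the reciprocal of the estimated probability that an individual received the treatment they actually received. The propensity values below are made up for illustration, the function names are hypothetical, and the effective sample size is an added diagnostic that the caption does not mention. In practice the probabilities would come from a fitted model such as BART, random forest, or XGBoost.

```python
def iptw_weights(treatments, propensity):
    """Return 1 / P(A = observed treatment | X) for each individual.

    treatments: list of observed treatment labels.
    propensity: list of dicts mapping each treatment label to its
                estimated probability for that individual.
    """
    weights = []
    for a, probs in zip(treatments, propensity):
        p = probs[a]
        if not 0.0 < p <= 1.0:
            raise ValueError("propensity scores must lie in (0, 1]")
        weights.append(1.0 / p)
    return weights


def effective_sample_size(weights):
    """Kish effective sample size; small values signal a few dominant weights."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


# Illustrative (made-up) propensity scores for three individuals.
demo_treatments = ["RAS", "OT", "VATS"]
demo_propensity = [
    {"RAS": 0.50, "OT": 0.30, "VATS": 0.20},
    {"RAS": 0.25, "OT": 0.25, "VATS": 0.50},
    {"RAS": 0.40, "OT": 0.40, "VATS": 0.20},
]
demo_weights = iptw_weights(demo_treatments, demo_propensity)
demo_ess = effective_sample_size(demo_weights)
```

Comparing the spread of such weights across estimation methods, as the figure does, helps flag extreme weights that can destabilize a weighted estimator.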

**Figure 4.** A comparison of the distributions of total hip bone mineral density and total spine bone mineral density among the imputed values and among the complete cases.

**Table 1.** Variables selected by each method, and the 5-fold cross-validated area under the receiver operating characteristic curve (AUC) for each model fitted with its selected variables.

Methods | Selected Variables | AUC |
---|---|---|
BART | Charlson comorbidity score, gender, married, histology, year of diagnosis | 0.85 |
XGBoost | Age, year of diagnosis | 0.72 |
RF | Charlson comorbidity score, histology | 0.74 |
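The cross-validated AUC reported in Table 1 can be illustrated with a short, self-contained sketch. It assumes a pooled out-of-fold AUC (scores from all held-out folds are combined before computing one AUC), which may differ from how the table's values were obtained. `fit_predict` is a hypothetical callable standing in for any of the three models.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random case outranks a random non-case."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_auc(rows, labels, fit_predict, k=5, seed=0):
    """Pooled out-of-fold AUC.

    fit_predict(train_rows, train_labels, test_rows) is a hypothetical
    callable that trains a model and returns one score per test row.
    """
    oof_scores = [None] * len(rows)
    for test_idx in k_fold_indices(len(rows), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in held_out]
        scores = fit_predict(
            [rows[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [rows[i] for i in test_idx],
        )
        for i, s in zip(test_idx, scores):
            oof_scores[i] = s  # each row is scored by a model that never saw it
    return auc(labels, oof_scores)


# Toy check: a "model" that returns the first feature, which separates the classes.
demo_rows = [[float(i)] for i in range(20)]
demo_labels = [1 if i >= 10 else 0 for i in range(20)]
demo_cv_auc = cross_validated_auc(
    demo_rows, demo_labels, lambda tr_x, tr_y, te_x: [r[0] for r in te_x]
)
```

Restricting `rows` to the variables a method selected, then calling `cross_validated_auc`, mirrors the comparison the table makes across methods.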

**Table 2.** Causal inferences about average treatment effects of three surgical approaches on postoperative respiratory complications based on the relative risk, using the SEER-Medicare lung cancer data. The 95% uncertainty intervals are displayed in parentheses. All 14 potential confounders were used. RAS: robotic-assisted surgery; VATS: video-assisted thoracic surgery; OT: open thoracotomy.

Methods | RAS vs. OT | RAS vs. VATS | OT vs. VATS |
---|---|---|---|
BART | 0.94 (0.72, 1.16) | 1.09 (0.84, 1.34) | 1.12 (0.87, 1.37) |
XGBoost | 0.91 (0.64, 1.13) | 1.04 (0.79, 1.28) | 1.08 (0.84, 1.33) |
RF | 0.90 (0.63, 1.14) | 1.03 (0.78, 1.29) | 1.06 (0.82, 1.35) |
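One way to arrive at a relative risk with an uncertainty interval, as reported in Table 2, is to average predicted outcome probabilities under each treatment and take their ratio, repeating this over posterior or bootstrap draws. The sketch below assumes that workflow. The function names and the toy numbers are hypothetical, and the table's values may have been computed differently.

```python
def relative_risk_draws(risk_a, risk_b):
    """Relative risk per draw.

    risk_a[d][i] is the predicted outcome probability for individual i
    under treatment a in draw d (posterior or bootstrap); likewise risk_b.
    """
    rr = []
    for draw_a, draw_b in zip(risk_a, risk_b):
        mean_a = sum(draw_a) / len(draw_a)
        mean_b = sum(draw_b) / len(draw_b)
        rr.append(mean_a / mean_b)
    return rr


def percentile(values, q):
    """Linear-interpolation quantile for q in [0, 1]."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    frac = pos - lo
    return s[lo] * (1 - frac) + s[hi] * frac


def summarize(rr, level=0.95):
    """Point estimate (mean over draws) and equal-tailed interval."""
    tail = (1 - level) / 2
    return sum(rr) / len(rr), percentile(rr, tail), percentile(rr, 1 - tail)


# Made-up predictions: three draws, two individuals, two treatments.
demo_risk_a = [[0.1, 0.3], [0.2, 0.2], [0.3, 0.5]]
demo_risk_b = [[0.4, 0.4], [0.4, 0.4], [0.4, 0.4]]
demo_rr = relative_risk_draws(demo_risk_a, demo_risk_b)
demo_point, demo_lower, demo_upper = summarize(demo_rr)
```

An interval that contains 1, like every interval in the table, indicates that the data do not clearly favor either approach in that pairwise comparison.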

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Hu, L.; Li, L.
Using Tree-Based Machine Learning for Health Studies: Literature Review and Case Series. *Int. J. Environ. Res. Public Health* **2022**, *19*, 16080.
https://doi.org/10.3390/ijerph192316080
