# Using Machine Learning Techniques for Asserting Cellular Damage Induced by High-LET Particle Radiation


## Abstract


## 1. Introduction

#### 1.1. LQ Model

$S(D)=e^{-(\alpha D+\beta D^{2})}$

where $S$ represents the fraction of cells that survive dose $D$, and $\alpha$ and $\beta$ are the coefficients of the linear and quadratic dose terms, respectively. Figure 1 presents, as an example, a typical response curve of a cell population to ionizing radiation based on the LQ model.
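As a quick numerical illustration of the LQ expression above, the following minimal sketch evaluates the surviving fraction at a few doses. The function name `lq_survival` and the coefficient values are assumptions chosen for illustration only; they are not taken from the dataset used in this work.

```python
import math


def lq_survival(dose_gy: float, alpha: float, beta: float) -> float:
    """Surviving fraction S(D) = exp(-(alpha * D + beta * D**2))."""
    if dose_gy < 0:
        raise ValueError("dose must be non-negative")
    return math.exp(-(alpha * dose_gy + beta * dose_gy ** 2))


# Illustrative coefficients (assumed values, not fitted to any experiment).
ALPHA = 0.3   # Gy^-1
BETA = 0.03   # Gy^-2

for dose in (0, 1, 2, 4, 8):
    print(f"D = {dose} Gy -> S = {lq_survival(dose, ALPHA, BETA):.4f}")
```

On a logarithmic survival axis, the $\alpha D$ term gives the initial slope of the curve and the $\beta D^{2}$ term gives its curvature at higher doses.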

#### 1.2. Radiation Properties

#### 1.3. Biophysical Background

#### 1.4. Machine Learning Approach

## 2. Results

- True vs. predicted graphs show how key statistics evolve in every ML model. These graphs consist of pairs of actual and predicted values from the test set (holdout set), on which the model was not trained. The closer these points lie to the $y=x$ line, the better the predictive performance of the estimator. Additionally, a linear regression line was fitted to the points in order to quantify the deviation from the identity line, with the corresponding equation given on the plot. The p-value is the probability of obtaining a slope at least as extreme as the fitted one if the true slope were zero, i.e., if the two variables were unrelated.
- Mean and standard deviation graphs depict how the mean and the standard deviation (STD) of the error change with the true value. The pairs of true and predicted values were sorted by true value and separated into four groups containing equal numbers of pairs. For each group, the mean and the STD of the difference between the two members of each pair (i.e., the error) were calculated. These graphs examine whether the error of the ML model follows a specific trend. The x-axis corresponds to the four groups of errors, while the y-axis has the same units as the target studied.
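The two diagnostics described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the error is taken as the signed difference (predicted minus true), pairs left over when the sample size is not divisible by four are dropped, and the helper names `fit_line` and `grouped_error_stats` as well as the toy values are hypothetical. The p-value of the slope is not computed here.

```python
from statistics import mean, stdev


def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def grouped_error_stats(y_true, y_pred, n_groups=4):
    """Sort pairs by true value, split them into equal-sized groups and
    return the (mean, standard deviation) of the error in each group."""
    pairs = sorted(zip(y_true, y_pred))          # sorted by true value
    size = len(pairs) // n_groups                # leftover pairs are dropped
    stats = []
    for g in range(n_groups):
        chunk = pairs[g * size:(g + 1) * size]
        errors = [pred - true for true, pred in chunk]  # signed error (assumed)
        stats.append((mean(errors), stdev(errors)))
    return stats


# Toy holdout values (made up for illustration).
y_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0, 6.3, 6.9, 8.2]

slope, intercept = fit_line(y_true, y_pred)
print(f"regression line: y = {slope:.3f} x + {intercept:.3f} (identity line: y = x)")
for i, (m, s) in enumerate(grouped_error_stats(y_true, y_pred), start=1):
    print(f"group {i}: mean error = {m:+.3f}, STD = {s:.3f}")
```

A slope close to 1 with an intercept close to 0 indicates points scattered around the identity line, while group means that drift with the group index would reveal a systematic trend in the error.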

#### 2.1. Predictive Performance

#### 2.1.1. $\alpha $ and $\beta $ Parameters

#### 2.1.2. DNA Damage

#### 2.2. Interpretation

#### 2.2.1. Feature Importance

#### 2.2.2. Partial Dependence

#### 2.2.3. Local Interpretation

#### 2.3. Significance Analysis

## 3. Materials and Methods

#### 3.1. The Data

#### 3.2. Monte Carlo Damage Simulations (MCDS)

- DSB+ (DSB accompanied by one (or more) additional SB within 10 bp separation);
- DSB++ (more than one DSB, whether within the 10 bp separation or further apart);
- SSBc (fraction of complex damage (SSB+ and 2SSB) among SSBs);
- SSBcb (fraction of complex damage (SSB+ and 2SSB) among SSBs; base damage included);
- DSBc (fraction of complex damage (DSB+ and DSB++) among DSBs);
- DSBcb (fraction of complex damage (DSB+ and DSB++) among DSBs; base damage included).
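The complexity fractions listed above are simple ratios, as the following sketch shows. It assumes that the total number of breaks in a class is the sum of the simple and the two complex categories; the function name `complex_fraction` and the counts are hypothetical and are not MCDS output.

```python
def complex_fraction(simple: float, plus: float, multiple: float) -> float:
    """Fraction of complex lesions among all lesions of one class.

    For DSBc:  simple = DSB, plus = DSB+, multiple = DSB++.
    For SSBc:  simple = SSB, plus = SSB+, multiple = 2SSB.
    """
    total = simple + plus + multiple
    if total == 0:
        raise ValueError("no lesions in this class")
    return (plus + multiple) / total


# Made-up yields, for illustration only.
dsbc = complex_fraction(simple=70, plus=20, multiple=10)
print(f"DSBc = {dsbc:.2f}")
```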

#### 3.3. Model Building Process

- Find the oxygen concentration for each study in the PIDE dataset from the literature.
- Perform Monte Carlo simulations to assess DNA damage and complement the dataset with the corresponding metrics.
- Map categorical features to numerical values (categorical encoding) and remove null values.
- Optimize the model, i.e., find the optimal hyper-parameters, by performing 5-fold cross-validation over a grid of possible hyper-parameters.
- Use the Hyperopt algorithm to find the best parameters.
- Split the dataset into train and test subsets at an 80/20 ratio.
- Calculate performance metrics of the optimized model and provide interpretations.
- Fit the model to the train set and compare its predictions with the test set, so as to assess performance and provide interpretation.
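The data-handling steps in the list above (categorical encoding, the 80/20 split and the 5-fold cross-validation folds) can be sketched with the standard library alone. This is a plain-Python stand-in rather than the scikit-learn or Hyperopt calls themselves; the helper names, the ordinal encoding scheme, the random seed and the toy ion list are all assumptions made for illustration. The hyper-parameter search and the model fit are not shown.

```python
import random


def ordinal_encode(values):
    """Map each distinct category to an integer code (sorted for reproducibility)."""
    mapping = {cat: code for code, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def train_test_indices(n_rows, test_ratio=0.2, seed=0):
    """Shuffle row indices and split them into train and test parts."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_test = round(n_rows * test_ratio)
    return idx[n_test:], idx[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


# Toy categorical column (ion species), made up for illustration.
ions = ["4He", "12C", "1H", "12C", "4He", "1H", "12C", "4He", "1H", "12C"]
codes, mapping = ordinal_encode(ions)
train_idx, test_idx = train_test_indices(len(codes))

print("encoding:", mapping)
print("train rows:", len(train_idx), "test rows:", len(test_idx))
for fold, (tr, va) in enumerate(k_fold_indices(train_idx, k=5), start=1):
    print(f"fold {fold}: {len(tr)} training rows, {len(va)} validation rows")
```

In this sketch the folds are drawn from the train subset only, so the test subset stays untouched until the final comparison; that ordering is one plausible reading of the steps, not a statement of the exact procedure used.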

#### 3.4. The Algorithm

#### 3.5. Model Optimization

#### 3.6. Categorical Encoding

#### 3.7. Interpretation Frameworks

#### 3.7.1. Feature Importance

#### 3.7.2. Partial Dependence

#### 3.7.3. Local Interpretations

#### 3.7.4. Bootstrap Confidence Intervals

#### 3.7.5. Statistical Significance

## 4. Discussion

## Supplementary Materials

## Author Contributions

## Funding

## Conflicts of Interest

## Abbreviations

Abbreviation | Definition |
---|---|
ML | Machine Learning |
IR | Ionizing Radiation |
LQ | Linear Quadratic |
LET | Linear Energy Transfer |
bp | Base Pairs |
PIDE | Particle Irradiation Data Ensemble |
RRMSE | Relative Root Mean Squared Error |
RMSE | Root Mean Squared Error |
STD | Standard Deviation |
DSB | Double Strand Break |
GLM | Generalized Linear Model |
LIME | Local Interpretable Model-Agnostic Explanations |


**Figure 2.** (**A**) α predictive performance; (**B**) evolution of the difference between true and predicted values.

**Figure 3.** (**A**) β predictive performance; (**B**) evolution of the difference between true and predicted values.

**Figure 4.** (**A**) All clusters hypoxic (All Clusters_hy) predictive performance; (**B**) evolution of the difference between true and predicted values. Hypoxic refers to 3% oxygen conditions.

**Figure 5.** (**A**) All clusters oxic (All Clusters_ox) predictive performance; (**B**) evolution of the difference between true and predicted values. Oxic refers to normal 20% oxygen conditions.

**Figure 6.** Predictive performance and difference between true and predicted values. (**A**,**B**) Double-strand breaks (DSBs) hypoxic (DSBs_hy) and (**C**,**D**) oxic (DSBs_ox). Notably, the uncertainty decreases for larger values of both DSB targets, meaning that DSBs become more predictable and exhibit less stochastic behavior as they increase.

**Figure 8.** (**A**) Feature importance for all clusters (3% oxygen). (**B**) Feature importance for all clusters (20% oxygen).

**Figure 12.** Two-way partial dependence (energy and LET). (**A**) α, (**B**) all clusters (hypoxic, 3% oxygen), (**C**) all clusters (oxic, 20% oxygen).

**Figure 13.** Local interpretations. (**A**) α, (**B**) DSBs oxic, (**C**) all clusters (hypoxic, 3% oxygen), (**D**) all clusters (oxic, 20% oxygen). Each panel explains a single prediction, with the impact of each feature depicted as part of the predicted value. Each bar in the charts displays the value that the feature takes in the specific sample or, for continuous variables, the value range that the local interpretable model-agnostic explanations (LIME) algorithm has identified as important for the predicted outcome. The feature weights sum to the value predicted by the linear regression model that was trained in the neighborhood of the sample instance. The local prediction is usually very close to the value predicted by the global random forest model.

Targets | Lower Percentile | Median | Upper Percentile |
---|---|---|---|
a_paper | 0.13 | 0.18 | 0.31 |
b_paper | 0.41 | 0.75 | 1.0 |
a_fit | 0.21 | 0.33 | 0.5 |
b_fit | 0.63 | 0.85 | 1.0 |

Targets | Lower Percentile | Median | Upper Percentile |
---|---|---|---|
All Clusters (hypoxic) | 0.076 | 0.089 | 0.100 |
All Clusters (oxic) | 0.0005 | 0.0013 | 0.0029 |
DSBs (hypoxic) | 0.059 | 0.066 | 0.073 |
DSBs (oxic) | 0.031 | 0.034 | 0.036 |

Targets | Prediction | True Value |
---|---|---|
$\alpha$ parameter | 1.92 | 2.2 |
DSBs oxic | 13.7 | 13.7 |
All Clusters (hypoxic cells) | 231.8 | 231.0 |
All Clusters (oxic cells) | 367.1 | 367.8 |

**Table 4.** Statistical significance. H0: the mean of the error distribution produced by the ML model does not differ from that of a classic statistical model.

Targets | p-Value | Result |
---|---|---|
$\alpha$ parameter | $1.8 \times 10^{-5}$ | reject H0 |
$\beta$ parameter | $9.4 \times 10^{-1}$ | fail to reject H0 |
All Clusters (hypoxic cells) | $1.5 \times 10^{-7}$ | reject H0 |
All Clusters (oxic cells) | $4.2 \times 10^{-11}$ | reject H0 |
DSBs (hypoxic cells) | $5.4 \times 10^{-6}$ | reject H0 |
DSBs (oxic cells) | $1.0 \times 10^{-9}$ | reject H0 |

Features:

Cell | Type | Origin | Phase | Genl | Ion | Charge | Irmods | LET | E |
---|---|---|---|---|---|---|---|---|---|
HF19 | n | h | a | 6.0 | 4He | 2.0 | m | 20.0 | 8.80 |
R-1 | t | r | a | 5.6 | 12C | 6.0 | s | 11.0 | 389.00 |
V79 | n | r | a | 5.6 | 1H | 1.0 | m | 31.0 | 0.76 |

Targets:

$\alpha$ | $\beta$ | All Clusters Hypoxic (Gy$^{-1}$ Gbp$^{-1}$) | All Clusters Oxic (Gy$^{-1}$ Gbp$^{-1}$) |
---|---|---|---|
1.090 | 0.000 | 354.5 | 355.9 |
0.402 | 0.054 | 407.5 | 412.6 |
1.030 | 0.000 | 394.0 | 398.0 |

Var_Name | Description | Categorical/Numeric |
---|---|---|
Cells | Name of cell line | categorical |
Type | Tumor cells (t) or normal cells (n) | categorical |
Origin | Human (h) or rodent (r) cells | categorical |
Phase | Cell cycle phase (phases are given explicitly, or a for asynchronous) | categorical |
Genl | Genomic length of diploid cells (in $10^{9}$ bp; 5.6 for rodent and 6 for human cells) | numeric |
Ion | Ion species | categorical |
Charge | Charge of ions | numeric |
Irrmods | Irradiation modalities: monoenergetic (m) or spread-out Bragg peak (s) | categorical |
LET | Linear energy transfer in water (in keV/$\mathsf{\mu}$m; for irradiation in a spread-out Bragg peak, dose-mean or track-averaged LET) | numeric |
E | Specific energy of ions (in MeV/u), evaluated at the target | numeric |


© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Papakonstantinou, D.; Zanni, V.; Nikitaki, Z.; Vasileiou, C.; Kousouris, K.; Georgakilas, A.G.
Using Machine Learning Techniques for Asserting Cellular Damage Induced by High-LET Particle Radiation. *Radiation* **2021**, *1*, 45-64.
https://doi.org/10.3390/radiation1010005
