# On the Development of Descriptor-Based Machine Learning Models for Thermodynamic Properties: Part 2—Applicability Domain and Outliers


## Abstract


## 1. Introduction

**A condition for a point to belong to AD in leverage method:**

$$h_{ii} < h^{*} = \frac{3\left(p+1\right)}{n}$$

where $p$ is the number of descriptors and $n$ the number of training molecules.

## 2. Overview of the Methods for AD Definition/Outlier Detection in Descriptor-Space and Their Compatibility with High-Dimensional Data

#### 2.1. Classical Approaches

**Hat matrix $H$ and leverage values ${h}_{ii}$, ${h}_{new}$:**

$$H = X\left(X^{T}X\right)^{-1}X^{T}, \qquad h_{ii} = x_{i}^{T}\left(X^{T}X\right)^{-1}x_{i}, \qquad h_{new} = x_{new}^{T}\left(X^{T}X\right)^{-1}x_{new}$$

**Standardized residuals:**

$$r_{i} = \frac{y_{i} - \hat{y}_{i}}{s}$$

where $s$ is the standard deviation of the residuals of the training set.
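
These quantities can be sketched in a few lines of NumPy. This is an illustrative implementation of the standard formulas above, not the authors' code; the warning leverage $h^{*}=3(p+1)/n$ is the usual threshold choice.

```python
import numpy as np

# Illustrative sketch of the leverage-based AD quantities for a linear
# model y ≈ X w (standard formulas; not the authors' code).
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))                       # n molecules, p descriptors
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

# Hat matrix H = X (X^T X)^{-1} X^T; leverages h_ii are its diagonal.
XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T
h = np.diag(H)

# Warning leverage h* = 3(p + 1)/n; training points with h_ii < h* are
# considered inside the AD.
h_star = 3 * (p + 1) / n
inside_ad = h < h_star

# Leverage of a new (query) molecule.
x_new = rng.normal(size=p)
h_new = x_new @ XtX_inv @ x_new

# Standardized residuals of the fitted model; |r_i| >= 3 is a common
# Model-outlier threshold.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
r = (y - X @ w) / (y - X @ w).std()
```

Note that the diagonal of a projection matrix always lies in $[0, 1]$ and sums to $p$, which gives a quick sanity check on the implementation.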

#### 2.2. Approaches for High-Dimensional Data

## 3. Dataset and Methods

#### 3.1. Dataset and ML/QSPR Models

#### 3.2. AD Visualization

#### 3.3. AD Definition as a Data Preprocessing Method (Substudy 1)

- Layout 1: simultaneous elimination of X-outliers and Model-outliers.
- Layout 2: elimination of Model-outliers followed by elimination of X-outliers.
- Layout 3: elimination of X-outliers followed by elimination of Model-outliers.
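
The three layouts differ only in the order of the two elimination steps. Below is a minimal sketch of Layout 3 (X-outliers first, then Model-outliers). For brevity, a simple k-nearest-neighbor distance score stands in for the paper's X-outlier detectors (iForest, RF confidence, tSNE2D/kNN), and an ordinary least-squares fit stands in for the QSPR model; both are assumptions for illustration.

```python
import numpy as np

def knn_score(X, k=5):
    # Mean Euclidean distance to the k nearest neighbors: a simple
    # stand-in for the X-outlier anomaly scores used in the paper.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, 1:k + 1].mean(axis=1)      # column 0 is the self-distance

def layout3(X, y, keep_quantile=0.9, r_max=3.0):
    # Layout 3: eliminate X-outliers first, then Model-outliers.
    s = knn_score(X)
    keep = s <= np.quantile(s, keep_quantile)
    X1, y1 = X[keep], y[keep]
    # Refit the (stand-in) model on the X-cleaned data, then drop the
    # molecules with large absolute standardized residuals.
    w, *_ = np.linalg.lstsq(X1, y1, rcond=None)
    r = y1 - X1 @ w
    r = r / r.std()
    keep2 = np.abs(r) < r_max
    return X1[keep2], y1[keep2]
```

Layouts 1 and 2 are obtained by scoring both filters on the original fit (simultaneous) or by swapping the two steps.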

#### 3.4. AD Definition during ML Model Construction (Substudy 2)

#### 3.5. AD Definition during ML Model Deployment (Substudy 3)

## 4. Results

#### 4.1. AD Definition as a Data Preprocessing Method (Substudy 1)

#### 4.1.1. Outlier Detection

#### 4.1.2. Effects of X-outliers on the Model Performance

#### 4.1.3. Effects of Model-Outliers on the Model Performance

#### 4.1.4. Effects of X-outliers and Model-outliers on the Model Performance

#### 4.1.5. Effects of y-outliers on the Model Performance

#### 4.2. AD Definition during ML Model Construction (Substudy 2)

#### 4.2.1. Dimensionality Reduction on the Preprocessed Data without Outliers

#### 4.2.2. AD Visualization and Evaluation

#### 4.2.3. On the Understanding of the High Prediction Errors for Test Molecules

#### 4.3. AD Definition during ML Model Deployment (Substudy 3)

## 5. Conclusions and Perspectives

## Supplementary Materials

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
|---|---|
| AD | Applicability domain |
| DIPPR | Design institute for physical properties |
| DFT | Density functional theory |
| H | Enthalpy for ideal gas at 298.15 K and 1 bar |
| IQR | Interquartile range |
| GA | Genetic algorithm |
| GNN | Graph neural network |
| GP | Gaussian processes |
| iForest | Isolation forest |
| kNN | k-nearest neighbors |
| Lasso | Least absolute shrinkage and selection operator |
| $MAE$ | Mean absolute error |
| ML | Machine learning |
| OECD | Organisation for economic co-operation and development |
| PCs | Principal components |
| PCA | Principal component analysis |
| QSAR | Quantitative structure–activity relationship |
| QSPR | Quantitative structure–property relationship |
| ${R}^{2}$ | Coefficient of determination |
| REACH | Registration, Evaluation, Authorization and Restriction of CHemicals |
| RF | Random forest |
| RF confidence | Random forest prediction confidence |
| $RMSE$ | Root mean square error |
| S | Absolute entropy of ideal gas at 298.15 K and 1 bar |
| tSNE | t-distributed stochastic neighbor embedding |

## Appendix A. Identification of Model-outliers and X-outliers in the Preprocessed Data for Enthalpy

| Absolute Standardized Residual ${r}_{i}$ | Chemid | Family |
|---|---|---|
| ${r}_{i}\ge 10$ | 3608 | Halogen Compounds |
| | 2628 | Halogen Compounds |
| | 3862 | Polyfunctional Compounds |
| | 7886 | Polyfunctional Compounds |
| | 7887 | Polyfunctional Compounds |
| $5\le {r}_{i}<10$ | 1840 | Sulfur Compounds |
| | 2254 | Organic Acids |
| | 1950 | Inorganic Compounds |
| | 1969 | Silicon Compounds |
| $3\le {r}_{i}<5$ | 2283 | Organic Acids |
| | 3931 | Silicon Compounds |
| | 3948 | Silicon Compounds |
| | 3929 | Silicon Compounds |
| | 4994 | Silicon Compounds |
| | 9858 | Organic Salts |
| | 2619 | Halogen Compounds |
| | 6851 | Other Compounds |
| | 2995 | Silicon Compounds |
| | 3991 | Silicon Compounds |
| | 3958 | Silicon Compounds |
| | 2370 | Esters/Ethers |
| | 2653 | Halogen Compounds |
| | 2877 | Polyfunctional Compounds |
| | 3974 | Silicon Compounds |
| | 3898 | Inorganic Compounds |
| | 6850 | Organic Compounds |
| | 9866 | Nitrogen Compounds |

| N° | iForest Chemid | iForest Family | RF Confidence Chemid | RF Confidence Family | tSNE2D/kNN Chemid | tSNE2D/kNN Family |
|---|---|---|---|---|---|---|
| 1 | 3933 | Silicon Compounds | 2624 | Halogen Compounds | 3977 | Silicon Compounds |
| 2 | 3932 | Silicon Compounds | 3933 | Silicon Compounds | 1509 | Halogen Compounds |
| 3 | 2624 | Halogen Compounds | 1627 | Halogen Compounds | 1866 | Halogen Compounds |
| 4 | 3931 | Silicon Compounds | 1631 | Halogen Compounds | 9879 | Nitrogen Compounds |
| 5 | 2991 | Silicon Compounds | 1626 | Halogen Compounds | 9877 | Nitrogen Compounds |
| 6 | 1631 | Halogen Compounds | 3881 | Polyfunctional Compounds | 90 | Alkanes |
| 7 | 3877 | Other Compounds | 1930 | Inorganic Compounds | 5878 | Organic Acids |
| 8 | 2995 | Silicon Compounds | 1864 | Halogen Compounds | 1097 | Ketones/Aldehydes |
| 9 | 3348 | Esters/Ethers | 3932 | Silicon Compounds | 3056 | Ketones/Aldehydes |
| 10 | 1627 | Halogen Compounds | 834 | Alkanes | 7883 | Nitrogen Compounds |

## Appendix B. Most Represented Families in the Eliminated Outliers (Layout 3) for Enthalpy and Entropy

| Enthalpy: Family | % | Entropy: Family | % |
|---|---|---|---|
| Halogen Compounds | 24% | Esters/Ethers | 20% |
| Silicon Compounds | 16% | Silicon Compounds | 13% |
| Esters/Ethers | 10% | Polyfunctional Compounds | 9% |
| Nitrogen Compounds | 9% | Nitrogen Compounds | 9% |
| Inorganic Compounds | 7% | Aromatics | 7% |
| Total | 66% | Total | 58% |

## Appendix C. AD Definition for Entropy During Model Construction

**Figure A1.** Parity plots for **entropy** after outlier elimination (Layout 3) and dimensionality reduction (GA).

**Figure A2.** Visualization of the AD for **entropy** after outlier elimination (Layout 3) and dimensionality reduction (GA).

**Figure A3.** tSNE 2D representations for **entropy** after outlier elimination (Layout 3) and dimensionality reduction (GA), **splits 2 and 4**.

**Figure A4.** tSNE 2D representations for **entropy** after outlier elimination (Layout 3) and dimensionality reduction (GA), **split 5**.

**Figure A6.** Analysis of the causes of the high prediction errors for **entropy** after outlier elimination (Layout 3) and dimensionality reduction (GA).

## References


**Figure 1.** Classical methods for AD definition: (**a**) chemical-based methods, (**b**) range-based and geometric methods, (**c**) distance-based methods/k-nearest neighbors, (**d**) probability distribution-based methods.

**Figure 2.** AD plot employed in this study. The anomaly score on the x-axis is computed via different methods: iForest, RF confidence or tSNE2D/kNN.
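
The AD plot of Figure 2 cross-tabulates each molecule's anomaly score (x-axis) against its absolute standardized residual (y-axis), yielding four regions. A minimal sketch of that four-way classification; the function and threshold names are assumptions for illustration, not the authors' code:

```python
import numpy as np

def ad_plot_labels(anomaly_score, std_residual, score_max, r_max=3.0):
    # Quadrants of the AD plot: low score / low residual = inside AD;
    # high score = X-outlier; high residual = Model-outlier;
    # both high = X- and Model-outlier.
    high_score = np.asarray(anomaly_score) > score_max
    high_resid = np.abs(np.asarray(std_residual)) > r_max
    return np.select(
        [~high_score & ~high_resid,
         high_score & ~high_resid,
         ~high_score & high_resid],
        ["inside AD", "X-outlier", "Model-outlier"],
        default="X- and Model-outlier",
    )
```

For example, `ad_plot_labels([0, 10], [0, 0], score_max=5)` labels the first molecule "inside AD" and the second "X-outlier".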

**Figure 3.** Comparison of three X-outlier detection methods on the preprocessed data for **enthalpy**: (**a**) iForest, (**b**) tSNE2D/kNN, (**c**) RF confidence, (**d**) RF confidence combined with H values. y-outliers are identified via the IQR method.
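
Of the three scores compared in Figure 3, "RF confidence" rates a molecule by the spread of the per-tree predictions of a random forest: the larger the disagreement among trees, the further the molecule from the training data. The mechanics can be sketched with a bagged ensemble of linear models standing in for the forest (an assumption for brevity; the paper uses an actual RF):

```python
import numpy as np

def ensemble_spread(X, y, X_query, n_models=50, seed=0):
    # "RF confidence"-style score: train ensemble members on bootstrap
    # resamples of (X, y) and report the standard deviation of the
    # per-member predictions at each query point.
    rng = np.random.default_rng(seed)
    n = len(X)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, n)              # bootstrap resample
        w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        preds.append(X_query @ w)
    return np.std(preds, axis=0)
```

Because the score is a spread of predictions, it is expressed in the units of the property (kJ/mol for enthalpy), which is why the thresholds in the study carry those units.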

**Figure 5.** Evolution of Lasso model performance in predicting the **enthalpy** under different proportions of X-outliers removed from the preprocessed data: (**a**) iForest, (**b**) RF confidence, (**c**) tSNE2D/kNN.

**Figure 6.** Comparison of Lasso model performance in predicting the **enthalpy** under different proportions of X-outliers removed from the preprocessed data via the three tested methods on the (**a**) train and (**b**) test datasets.

**Figure 7.** Comparison of the dataset (**a**) before and (**b**) after the removal of X-outliers via the RF confidence method for **enthalpy** (the molecules with anomaly scores above 200 kJ/mol in (**a**) are removed). In red, the molecules that were not outliers in (**a**) but become outliers in (**b**).

**Figure 8.** Comparison of the dataset (**a**) before and (**b**) after the removal of Model-outliers for **enthalpy** (elimination of the molecules with absolute standardized residuals above 3 in (**a**)). In red, the molecules that were not outliers in (**a**) but become outliers in (**b**).

**Figure 9.** Evolution of Lasso model performance in predicting the **enthalpy** under different thresholds for Model-outlier elimination in the preprocessed data.

**Figure 11.** Effect of different scenarios of outlier elimination on Lasso model performance in predicting the **enthalpy**.

**Figure 13.** Parity plots for **enthalpy** after outlier elimination (Layout 3) and dimensionality reduction (GA).

**Figure 14.** Visualization of the AD for **enthalpy** after outlier elimination (Layout 3) and dimensionality reduction (GA).

**Figure 15.** tSNE 2D representations for **enthalpy** after outlier elimination (Layout 3) and dimensionality reduction (GA), **splits 2 and 3**. Subfigures (**a**) and (**d**) show the original data, distinguishing train and test sets. Subfigures (**b**) and (**e**) show the same respective data, distinguishing chemical families (color codes given in Figure A5). Subfigures (**c**) and (**f**) show the same respective data, distinguishing the reference H values from the DIPPR.

**Figure 16.** tSNE 2D representations for **enthalpy** after outlier elimination (Layout 3) and dimensionality reduction (GA), **splits 4 and 5**. Subfigures (**a**) and (**d**) show the original data, distinguishing train and test sets. Subfigures (**b**) and (**e**) show the same respective data, distinguishing chemical families (color codes given in Figure A5). Subfigures (**c**) and (**f**) show the same respective data, distinguishing the reference H values from the DIPPR.

**Figure 17.** Analysis of the causes of the high prediction errors for **enthalpy** after outlier elimination (Layout 3) and dimensionality reduction (GA).

**Figure 20.** Visualization of the new species in **Cat. A and B** with respect to the AD of the **enthalpy** model for Split 1. Subfigures (**a**) and (**d**) show the standardized residuals vs. the RF confidence anomaly scores, distinguishing training data and new species. The other subfigures show the tSNE 2D representation of the same data, distinguishing chemical families (color codes given in Figure A5) for subfigures (**b**) and (**e**), or distinguishing the reference H values for subfigures (**c**) and (**f**).

**Figure 21.** Visualization of the new species in **Cat. C and D** with respect to the AD of the **enthalpy** model for Split 1. Subfigures (**a**) and (**d**) show the standardized residuals vs. the RF confidence anomaly scores, distinguishing training data and new species. The other subfigures show the tSNE 2D representation of the same data, distinguishing chemical families (color codes given in Figure A5) for subfigures (**b**) and (**e**), or distinguishing the reference H values for subfigures (**c**) and (**f**).

**Figure 22.** Visualization of the new species in **Cat. A and B** with respect to the AD of the **entropy** model for Split 1. Subfigures (**a**) and (**d**) show the standardized residuals vs. the RF confidence anomaly scores, distinguishing training data and new species. The other subfigures show the tSNE 2D representation of the same data, distinguishing chemical families (color codes given in Figure A5) for subfigures (**b**) and (**e**), or distinguishing the reference S values for subfigures (**c**) and (**f**).

**Figure 23.** Visualization of the new species in **Cat. C and D** with respect to the AD of the **entropy** model for Split 1. Subfigures (**a**) and (**d**) show the standardized residuals vs. the RF confidence anomaly scores, distinguishing training data and new species. The other subfigures show the tSNE 2D representation of the same data, distinguishing chemical families (color codes given in Figure A5) for subfigures (**b**) and (**e**), or distinguishing the reference S values for subfigures (**c**) and (**f**).

**Table 1.** Configurations for outlier elimination (Substudy 1).

| Configuration | Methods | Thresholds |
|---|---|---|
| 1. Effect of X-outlier elimination | iForest, RF confidence, tSNE2D/kNN | Elimination of 0%, 10%, 30%, 50% of the molecules with the highest anomaly scores |
| 2. Effect of Model-outlier elimination | Standardized residuals | Elimination of the molecules with absolute standardized residuals above 1, 2, 3, 5 and 10 |
| 3. Effect of X-outlier and Model-outlier elimination | Layouts: 1 = simultaneous elimination; 2 = Model-outliers then X-outliers; 3 = X-outliers then Model-outliers (RF confidence for X-outliers, standardized residuals for Model-outliers) | RF confidence: 150 kJ/mol for enthalpy, 50 J/mol/K for entropy. Absolute standardized residuals: 2 |
| 4. Effect of y-outlier elimination | IQR | Elimination of the molecules with y-values outside the thresholds of $Q1-1.5\times IQR$ and $Q3+1.5\times IQR$ |
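
The IQR rule for y-outliers (configuration 4 above) amounts to the standard Tukey fences; a short sketch, not the authors' code:

```python
import numpy as np

def iqr_mask(y, k=1.5):
    # Keep molecules whose property value lies inside the Tukey fences
    # [Q1 - k*IQR, Q3 + k*IQR]; the rest are y-outliers.
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1
    return (y >= q1 - k * iqr) & (y <= q3 + k * iqr)
```

For example, `iqr_mask(np.array([1.0, 2.0, 3.0, 4.0, 100.0]))` flags only the last value as a y-outlier.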

**Table 2.** Comparison of the performance of the models in the absence and in the presence of y-outliers. MAE and RMSE are in kJ/mol and J/mol/K for the **enthalpy** and the **entropy**, respectively.

| Property | Outlier Elimination | Mol. | Desc. | ${R}^{2}$ Train | ${R}^{2}$ Test | $MAE$ Train | $MAE$ Test | $RMSE$ Train | $RMSE$ Test |
|---|---|---|---|---|---|---|---|---|---|
| H | Layout 3, with y-outliers | 1531 | 2506 | 0.998 | 0.995 | 9.51 | 11.88 | 12.72 | 19.41 |
| H | Layout 3, without y-outliers | 1525 | 2506 | 0.998 | 0.996 | 9.39 | 11.69 | 12.61 | 17.56 |
| S | Layout 3, with y-outliers | 1514 | 2479 | 0.996 | 0.995 | 7.49 | 8.19 | 9.90 | 11.40 |
| S | Layout 3, without y-outliers | 1431 | 2479 | 0.991 | 0.988 | 7.56 | 8.28 | 10.00 | 11.49 |

**Table 3.** Effects of outlier elimination (Layout 3) and dimensionality reduction (GA) on the model performance in predicting **enthalpy** (kJ/mol).

| Configuration | Outlier Elimination | Dimensionality Reduction | Mol. | Desc. | ${R}^{2}$ Train | ${R}^{2}$ Test | $MAE$ Train | $MAE$ Test | $RMSE$ Train | $RMSE$ Test |
|---|---|---|---|---|---|---|---|---|---|---|
| A | No | No | 1785 | 2506 | 0.997 | 0.978 | 16.08 | 26.96 | 30.90 | 71.76 |
| B | Yes | No | 1531 | 2506 | 0.998 | 0.995 | 9.51 | 11.88 | 12.72 | 19.41 |
| C | No | Yes | 1785 | 100 | 0.995 | 0.978 | 15.45 | 24.16 | 36.90 | 70.77 |
| D | Yes | Yes | 1531 | 100 | 0.998 | 0.994 | 9.27 | 11.89 | 14.09 | 21.50 |

**Table 4.** Effects of outlier elimination (Layout 3) and dimensionality reduction (GA) on the model performance in predicting **entropy** (J/mol/K).

| Configuration | Outlier Elimination | Dimensionality Reduction | Mol. | Desc. | ${R}^{2}$ Train | ${R}^{2}$ Test | $MAE$ Train | $MAE$ Test | $RMSE$ Train | $RMSE$ Test |
|---|---|---|---|---|---|---|---|---|---|---|
| A | No | No | 1747 | 2479 | 0.983 | 0.970 | 14.87 | 18.71 | 26.94 | 35.17 |
| B | Yes | No | 1514 | 2479 | 0.996 | 0.995 | 7.49 | 8.19 | 9.90 | 11.40 |
| C | No | Yes | 1747 | 100 | 0.980 | 0.969 | 14.09 | 17.37 | 29.23 | 35.49 |
| D | Yes | Yes | 1514 | 100 | 0.996 | 0.995 | 7.09 | 8.01 | 9.78 | 11.49 |

**Table 5.**Number of non-computable descriptors (among the 100 descriptors selected by GA) for each category of new species.

| Property | Category of Species | Split 1 | Split 2 | Split 3 | Split 4 | Split 5 |
|---|---|---|---|---|---|---|
| H | A | 1 | 0 | 0 | 1 | 2 |
| H | B | 7 | 8 | 5 | 4 | 4 |
| H | C | 18 | 9 | 13 | 12 | 21 |
| H | D | 12 | 16 | 11 | 12 | 9 |
| S | A | 9 | 6 | 5 | 9 | 11 |
| S | B | 0 | 0 | 0 | 1 | 1 |
| S | C | 26 | 23 | 26 | 23 | 33 |
| S | D | 25 | 19 | 20 | 20 | 24 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Trinh, C.; Lasala, S.; Herbinet, O.; Meimaroglou, D.
On the Development of Descriptor-Based Machine Learning Models for Thermodynamic Properties: Part 2—Applicability Domain and Outliers. *Algorithms* **2023**, *16*, 573.
https://doi.org/10.3390/a16120573
