# Large-Scale Predictions of Compound Potency with Original and Modified Activity Classes Reveal General Prediction Characteristics and Intrinsic Limitations of Conventional Benchmarking Calculations


## Abstract


## 1. Introduction

## 2. Results

#### 2.1. Study Concept

#### 2.2. Large-Scale Predictions

#### 2.3. Potency Range Balancing

#### 2.4. Removal of Nearest Neighbors

#### 2.5. Analog Series-Based Data Partitioning

## 3. Discussion

## 4. Materials and Methods

#### 4.1. Compound Activity Data

… IC50) and a numerically specified potency value (standard relation ‘=’) were retrieved. Potency values were recorded as the negative decadic logarithm (pIC50). Only compounds with direct interactions (target relationship type: “D”) against human proteins at the highest confidence level (target confidence score: 9) and pIC50 values ranging from 5 to 11 were considered. Measurements labeled “potential transcription error” or “potential author error” were removed, as were potential assay interference compounds, which were identified using public filters and tools [14,15,16].
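The selection criteria above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the record layout and all field names (`ic50_nm`, `relation`, `relationship_type`, `confidence_score`, `organism`, `validity_comment`) are hypothetical, IC50 values are assumed to be given in nanomolar, and the assay interference filters are not covered.

```python
import math

# Labels that disqualify a measurement (compared case-insensitively).
FLAGGED = {"potential transcription error", "potential author error"}


def pic50_from_nm(ic50_nm: float) -> float:
    """Negative decadic logarithm of a molar IC50 given in nanomolar."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 M, hence -log10(x * 1e-9) = 9 - log10(x).
    return 9.0 - math.log10(ic50_nm)


def keep_record(rec: dict) -> bool:
    """Apply the criteria from the text to one (hypothetical) activity record."""
    if rec["relation"] != "=":
        return False  # only exactly specified potency values
    if rec["relationship_type"] != "D" or rec["confidence_score"] != 9:
        return False  # direct interaction at the highest confidence level
    if rec["organism"] != "Homo sapiens":
        return False  # human proteins only
    if (rec.get("validity_comment") or "").lower() in FLAGGED:
        return False  # flagged as potentially erroneous
    return 5.0 <= pic50_from_nm(rec["ic50_nm"]) <= 11.0


example = {
    "ic50_nm": 10.0,
    "relation": "=",
    "relationship_type": "D",
    "confidence_score": 9,
    "organism": "Homo sapiens",
    "validity_comment": None,
}
```

Under these assumptions, `keep_record(example)` accepts the record (a 10 nM measurement corresponds to a pIC50 of about 8), whereas the same record with an IC50 of 50,000 nM would fall below the lower limit of 5 and be discarded.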

#### 4.2. Compound Sets with Balanced Potency Distribution

#### 4.3. Model Building and Implementation

#### 4.3.1. Support Vector Regression

#### 4.3.2. k-Nearest Neighbor Regression

#### 4.3.3. Median Regression

#### 4.3.4. Hyperparameter Optimization

#### 4.4. Molecular Representation

#### 4.5. Performance Metric

#### 4.6. Statistical Significance Testing

## 5. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Lewis, R.A.; Wood, D. Modern 2D QSAR for Drug Discovery. WIREs Comput. Mol. Sci. **2014**, 4, 505–522.
2. Guedes, I.A.; Pereira, F.S.S.; Dardenne, L.E. Empirical Scoring Functions for Structure-Based Virtual Screening: Applications, Critical Aspects, and Challenges. Front. Pharmacol. **2018**, 9, e1089.
3. Williams-Noonan, B.J.; Yuriev, E.; Chalmers, D.K. Free Energy Methods in Drug Design: Prospects of “Alchemical Perturbation” in Medicinal Chemistry. J. Med. Chem. **2018**, 61, 638–649.
4. Gleeson, M.P.; Gleeson, D. QM/MM Calculations in Drug Discovery: A Useful Method for Studying Binding Phenomena? J. Chem. Inf. Model. **2009**, 49, 670–677.
5. Vamathevan, J.; Clark, D.; Czodrowski, P.; Dunham, I.; Ferran, E.; Lee, G.; Li, B.; Madabhushi, A.; Shah, P.; Spitzer, M.; et al. Applications of Machine Learning in Drug Discovery and Development. Nat. Rev. Drug. Discov. **2019**, 18, 463–477.
6. Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J.C.; Sheridan, R.P.; Feuston, B.P. Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Comput. Sci. **2003**, 43, 1947–1958.
7. Drucker, H.; Burges, C. Support Vector Regression Machines. Adv. Neural Inform. Proc. Syst. **1997**, 9, 155–161.
8. Smola, A.J.; Schölkopf, B. A Tutorial on Support Vector Regression. Stat. Comput. **2004**, 14, 199–222.
9. Hou, F.; Wu, Z.; Hu, Z.; Xiao, Z.; Wang, L.; Zhang, X.; Li, G. Comparison Study on the Prediction of Multiple Molecular Properties by Various Neural Networks. J. Phys. Chem. A **2018**, 122, 9128–9134.
10. Feinberg, E.N.; Sur, D.; Wu, Z.; Husic, B.E.; Mai, H.; Li, Y.; Sun, S.; Yang, J.; Ramsundar, B.; Pande, V.S. PotentialNet for Molecular Property Prediction. ACS Cent. Sci. **2018**, 4, 1520–1530.
11. Walters, W.P.; Barzilay, R. Applications of Deep Learning in Molecule Generation and Molecular Property Prediction. Acc. Chem. Res. **2020**, 54, 263–270.
12. Janela, T.; Bajorath, J. Simple Nearest Neighbor Analysis Meets the Accuracy of Compound Potency Predictions Using Complex Machine Learning Models. Nat. Mach. Intell. **2022**, 4, 1246–1255.
13. Bento, A.P.; Gaulton, A.; Hersey, A.; Bellis, L.J.; Chambers, J.; Davies, M.; Krüger, F.A.; Light, Y.; Mak, L.; McGlinchey, S.; et al. The ChEMBL Bioactivity Database: An Update. Nucleic Acids Res. **2014**, 42, D1083–D1090.
14. Baell, J.B.; Holloway, G.A. New Substructure Filters for Removal of Pan Assay Interference Compounds (PAINS) from Screening Libraries and for their Exclusion in Bioassays. J. Med. Chem. **2010**, 53, 2719–2740.
15. Bruns, R.F.; Watson, I.A. Rules for Identifying Potentially Reactive or Promiscuous Compounds. J. Med. Chem. **2012**, 55, 9763–9772.
16. Irwin, J.J.; Duan, D.; Torosyan, H.; Doak, A.K.; Ziebart, K.T.; Sterling, T.; Tumanian, G.; Shoichet, B.K. An Aggregation Advisor for Ligand Discovery. J. Med. Chem. **2015**, 58, 7076–7087.
17. Naveja, J.J.; Vogt, M.; Stumpfe, D.; Medina-Franco, J.L.; Bajorath, J. Systematic Extraction of Analogue Series from Large Compound Collections Using a New Computational Compound-Core Relationship Method. ACS Omega **2019**, 4, 1027–1032.
18. Ralaivola, L.; Swamidass, S.J.; Saigo, H.; Baldi, P. Graph Kernels for Chemical Informatics. Neural Netw. **2005**, 18, 1093–1110.
19. Altman, N.S. An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression. Am. Stat. **1992**, 46, 175–185.
20. Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. **2010**, 50, 742–754.
21. RDKit: Cheminformatics and Machine Learning Software. 2013. Available online: http://www.rdkit.org (accessed on 1 July 2022).
22. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. **2011**, 12, 2825–2830.
23. Conover, W.J. On Methods of Handling Ties in the Wilcoxon Signed-Rank Test. J. Am. Stat. Assoc. **1973**, 68, 985–988.

**Figure 1.** Compounds from selected activity classes. For eight activity classes, exemplary compounds are shown with their logarithmic potency values (pIC50). For each class, the target name and ChEMBL ID (in parentheses) are provided.

**Figure 2.** Prediction accuracy. Boxplots report the distribution of MAE values for potency predictions over 10 independent trials on the eight activity classes in Figure 1 using 1NN, kNN, SVR, and MR models (applying an 80:20 training/test set compound split). In boxplots, the upper and lower whiskers indicate maximum and minimum values, the boundaries of the box represent the upper and lower quartiles, values classified as statistical outliers are shown as diamonds, and the median value is indicated by a horizontal line.

**Figure 3.** Prediction accuracy for activity classes with balanced potency value distributions. Boxplots report the distribution of MAE values over 10 independent trials for the eight activity classes after balancing their potency value distributions. As a control, results are reported for the original data sets that were reduced by random compound removal to the same size as the balanced sets. In boxplots, the upper and lower whiskers indicate maximum and minimum values, the boundaries of the box represent the upper and lower quartiles, values classified as statistical outliers are shown as diamonds, and the median value is indicated by a horizontal line.

**Figure 4.** Prediction accuracy after removal of nearest neighbor relationships. Boxplots report the distribution of MAE values over 10 independent trials for the eight activity classes after removal of 50% of nearest neighbors and for control data sets after random removal of 50% of the compounds. In boxplots, the upper and lower whiskers indicate maximum and minimum values, the boundaries of the box represent the upper and lower quartiles, values classified as statistical outliers are shown as diamonds, and the median value is indicated by a horizontal line.

**Figure 5.** Prediction accuracy after analog series partitioning. Boxplots report the distribution of MAE values over 10 independent trials for the eight activity classes using training and test sets (~80:20 compound split) consisting of distinct analog series. As a control, results are reported for original training and test sets of exactly the same size. In boxplots, the upper and lower whiskers indicate maximum and minimum values, the boundaries of the box represent the upper and lower quartiles, values classified as statistical outliers are shown as diamonds, and the median value is indicated by a horizontal line.
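The evaluation protocol named in the captions (MAE over 10 independent trials with an 80:20 training/test split) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the split is purely random, and the MR control is assumed here to predict the median training set potency for every test compound, which the captions do not confirm.

```python
import random
import statistics


def mae(y_true, y_pred):
    """Mean absolute error between measured and predicted pIC50 values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def median_baseline_trials(potencies, n_trials=10, test_fraction=0.2, seed=0):
    """MAE of a median predictor over repeated random train/test splits.

    Assumption: the baseline assigns the median training potency to every
    test compound. Requires at least two potency values.
    """
    rng = random.Random(seed)
    n_test = max(1, round(len(potencies) * test_fraction))
    scores = []
    for _ in range(n_trials):
        shuffled = list(potencies)
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        prediction = statistics.median(train)
        scores.append(mae(test, [prediction] * len(test)))
    return scores


# Hypothetical pIC50 values for one activity class.
toy_class = [5.2, 5.9, 6.4, 6.8, 7.1, 7.5, 7.9, 8.3, 8.8, 9.4]
trial_maes = median_baseline_trials(toy_class)
```

The resulting list of 10 MAE values is the kind of per-class distribution a boxplot would summarize. Other models would replace the median prediction inside the loop while the splitting and scoring stay the same.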


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Janela, T.; Bajorath, J.
Large-Scale Predictions of Compound Potency with Original and Modified Activity Classes Reveal General Prediction Characteristics and Intrinsic Limitations of Conventional Benchmarking Calculations. *Pharmaceuticals* **2023**, *16*, 530.
https://doi.org/10.3390/ph16040530
