Antimalarial Drug Predictions Using Molecular Descriptors and Machine Learning against Plasmodium Falciparum

Mswahili, Medard Edmund; Martin, Gati Lother; Woo, Jiyoung; Choi, Guang J.; Jeong, Young-Seob

doi:10.3390/biom11121750

by Medard Edmund Mswahili ¹, Gati Lother Martin ¹, Jiyoung Woo ¹,*, Guang J. Choi ² and Young-Seob Jeong ³,*

¹ Department of ICT Convergence, Soonchunhyang University, Asan-si 31538, Korea
² Department of Pharmaceutical Engineering, Soonchunhyang University, Asan-si 31538, Korea
³ Department of Computer Engineering, Chungbuk National University, Cheongju 28644, Korea
* Authors to whom correspondence should be addressed.

Biomolecules 2021, 11(12), 1750; https://doi.org/10.3390/biom11121750

Submission received: 20 October 2021 / Revised: 12 November 2021 / Accepted: 17 November 2021 / Published: 24 November 2021

(This article belongs to the Special Issue Translational Approaches for the Detection and Treatment of Malaria)

Abstract

Malaria, caused by the Plasmodium falciparum parasite, remains one of the most threatening and dangerous illnesses. Chloroquine (CQ) and first-line artemisinin-based combination treatment (ACT) have long been the drugs of choice for treating and controlling malaria; however, CQ-resistant and artemisinin-resistant parasites are now present in most areas where malaria is endemic. In this work, we developed five machine learning models to predict the antimalarial bioactivity of a drug against Plasmodium falciparum from features (i.e., molecular descriptor values) that PaDEL software computes from the SMILES of compounds, and we compared the models experimentally on our collected data of 4794 instances. We found that three of the five models, namely artificial neural network (ANN), extreme gradient boost (XGB), and random forest (RF), outperform the others in terms of accuracy, and that the five models achieve comparable performance when using only roughly a quarter of the descriptors, picked by the feature selection algorithm. We also investigated the contribution of all molecular descriptors by comparing the rank values assigned by the feature selection algorithm, and found that the most relevant descriptors, which come from the ‘Autocorrelation’ module, contributed the most, while those from the ‘Atom type electrotopological state’ module contributed the least to the model.

Keywords:

antimalarial drug; machine learning; Plasmodium falciparum; molecular descriptor; drug discovery; feature selection; PaDEL

1. Introduction

Although COVID-19 is by far the most serious current global pandemic, with hundreds of millions of confirmed cases and millions of deaths reported to the World Health Organization (WHO) in 2021, millions of people, especially Africans, still died of malaria, tuberculosis, and HIV-related illnesses. These three diseases can be prevented or treated with timely access to appropriate and affordable medicines, vaccines, and other health services. However, less than 2% of drugs consumed in Africa are produced on the continent, meaning that a huge number of patients do not have access to locally produced drugs and may not be able to afford imported ones. Without reliable access to medicines, more people, especially in Africa and a few parts of Asia, are susceptible to the three big killer diseases on their respective continents. Globally, 50% of children under five who die of pneumonia, diarrhea, measles, HIV, tuberculosis, and malaria are in Africa, according to the WHO. Although the organization continues to work on making medicines more accessible, for example by keeping them continuously available and inexpensive at designated and authorized health facilities located within a reasonable distance of the people, malaria remains by far the most threatening and dangerous illness due to its profoundly negative impact on global communities in terms of social, political, and economic growth, particularly in developing countries [1,2].

Malaria is a life-threatening disease caused by Plasmodium parasites that are transmitted to people through the bites of infected female Anopheles mosquitoes, called malaria vectors. Five parasite species are known to cause malaria in humans [3]. Among them, Plasmodium falciparum causes the most severe form of the disease: those who contract it have a higher risk of death, so the majority of malaria deaths are caused by Plasmodium falciparum [4,5,6,7], and it is susceptible to naturally acquired host immunity. Notably, the main burden of malaria falls on young children [7]. Despite the organization’s current elimination efforts, which take into account all possible control measures, the effectiveness of malaria prevention, control, and treatment depends on the sustained clinical efficacy of first-line artemisinin-based combination treatment (ACT), which is constantly threatened by the emergence and spread of drug resistance [8,9].

Chloroquine (CQ) has long been the drug of choice for the treatment of malaria; however, CQ-resistant parasites are now present in most areas where malaria is endemic [10,11]. Moreover, recent alarming reports observed the emergence of artemisinin-resistant parasites in Southeast Asia [12,13], which could derail the current elimination/eradication efforts and again foster an increase in malaria cases and deaths [14,15,16]. One study indicated the emergence of artemisinin resistance in Plasmodium falciparum not only in Southeast Asia but also in Sub-Saharan Africa, with Tanzania as the case studied [17]. Resistance has emerged to all classes of antimalarial drugs, which have lost their clinical effectiveness [11,18,19,20,21]. Resistance to these gold-standard drugs represents a serious threat to malaria eradication and causes a tremendous increase in the number of deaths annually, with excess medical costs and productivity losses of about 146 and 385 million US$ per year, respectively [15,22]. In addition, drug discovery and development is extremely time-consuming, costly, and complex because of the challenges and obstacles that emerge during the process, and failures cause enormous financial damage. It is an inefficient process that typically costs about 2.6 billion US dollars and takes an average of 10 to 15 years from essential pre-clinical testing to market approval, with clinical trials being by far the most expensive factor [23].

To tackle the task of drug discovery, various approaches have been proposed. Quantitative structure–activity relationship (QSAR) modeling is a computational or mathematical method that reveals relationships between the physicochemical properties of chemical substances and their biological activities, in order to obtain a reliable statistical model for predicting the activities of new chemical entities. The underlying principle is that variations in structural properties cause different biological activities [24], where structural properties refer to physicochemical properties, and biological activities correspond to pharmacokinetic properties such as absorption, distribution, metabolism, excretion, and toxicity. High-throughput screening (HTS) is another experimental approach used especially in drug discovery; it uses automated equipment to rapidly test thousands to millions of samples for biological activity at the model organism, cellular, pathway, or molecular level in order to identify potential drug candidates [25,26,27]. Because screening chemical libraries with traditional methods such as HTS is expensive and time-consuming, QSAR modeling is an essential alternative that can assist in the selection of lead molecules by using information from reference active and inactive compounds during model development [28].

Machine learning (ML) models have emerged in recent years as a promising tool for data-driven predictions in pharmaceutical science research, such as quantitative structure–activity/property relationships (QSAR/QSPR), drug–drug interactions, drug repurposing, and pharmacogenomics [29]; hence, drug discovery is one of the sectors that stands to benefit greatly from the success of ML [30]. For example, Ref. [31] addressed fundamental problems that hinder the drug development process (i.e., poor solubility, bioavailability, and efficacy of drugs) by improving specific physicochemical and biopharmaceutical properties of active pharmaceutical ingredients (APIs). It applied ML models to predict, from a set of chemical experiments between APIs and coformers, which API–coformer pair will successfully form a new cocrystal that may eventually become a new drug after Food and Drug Administration (FDA) approval; screening suitable coformers for an API is the essential and difficult phase of cocrystal production, a state-of-the-art auxiliary approach to enhancing drug development. Danishuddin et al. [9] developed and rigorously validated antimalarial predictive models using machine learning approaches and ultimately achieved an accuracy of ∼85.00%. Egieyeh et al. [6] achieved an accuracy of 85.94% with the support vector machine (SVM), where the dataset was a combination of molecular descriptors and fingerprints of natural products with antiplasmodial activity (NAA). Liu et al. [32] used general regression neural networks (GRNN) to predict antimalarial activity against Plasmodium falciparum and achieved an accuracy of 88.90%. Notably, they built on the work of [9], the only difference being the number of features (i.e., molecular descriptors). The aforementioned studies have shown successful findings, but they all have a common flaw: they only compared model performance, such as accuracy, without looking closely at feature relevance.

This study focuses on the development of machine learning models for predicting anti-malaria drugs. The problem is a binary classification over two labels (i.e., ‘active’ and ‘inactive’), and we use a dataset of anti-malaria activity against Plasmodium falciparum. To generate feature vectors, we use PaDEL-Descriptor software [33], a widely used calculator of molecular descriptors (MD) and fingerprints; it extracts descriptor values from simplified molecular-input line-entry system (SMILES) strings of the experimentally verified anti-malaria drug compounds, which were obtained from two databases: the ChEMBL database [34] and the PubChem database [35].

The contributions of this paper can be summarized as follows. To begin, we not only extract descriptor values for compounds, but also analyze which descriptors are more significant, demonstrating that we can achieve decent results even if only a small subset of the descriptors is used. Next, we conduct experiments to compare ML models and find that three of the implemented models achieve comparable performance. Last but not least, we make our dataset available online (https://sites.google.com/view/medardemswahili/ (accessed on 8 August 2021)) in the hope of assisting other researchers, as a benchmark for developing improved models.

2. Materials and Methods

We tackle a binary classification problem by building ML models that predict a label (i.e., ‘active’ or ‘inactive’) for a given experimentally verified antimalarial drug candidate from public chemical databases. The class label ‘active’ implies that the drug candidate compound would successfully react against the Plasmodium falciparum parasite species, while the label ‘inactive’ implies that there would be no such reaction. First, we obtain attributes (i.e., features) of the experimental antimalarial drug candidate compounds, as depicted in Figure 1, from SMILES strings that were derived from their respective synonyms and Substance IDs (SID). Then, using feature selection algorithms, we choose some promising features, which are fed into the models that discover patterns behind the drug candidate compounds.

2.1. Materials

2.1.1. Data

The verified antimalarial drug candidate compounds were downloaded from public chemical databases ChEMBL [34] and PubChem [35] in synonyms and SID format. We converted them into their respective and appropriate SMILES using the PubChem Identifier Exchange Service [36] as depicted in Figure 1.

Compounds were classified as active or inactive according to their antiplasmodial activities, with an $IC_{50}$ of $10\,\mu\mathrm{M}$ as a threshold. In general, compounds having $IC_{50} \leq 10\,\mu\mathrm{M}$ will likely be ‘active’, implying that there will be a high number of active molecules. However, no experimental platform could possibly produce such a high percentage of active molecules [9]. As a result, the best model should discover molecules with an affinity $> 10\,\mu\mathrm{M}$ in order to make the most of expensive experimental validation. The decision boundary for active compounds was set at $IC_{50} \leq 1\,\mu\mathrm{M}$ [9]: compounds with $IC_{50} \leq 1\,\mu\mathrm{M}$ were labeled ‘active’, and those with $IC_{50} > 1\,\mu\mathrm{M}$ were labeled ‘inactive’. The active instances are experimentally verified as active antimalarial drug candidates, whereas the inactive instances are experimentally verified as unsuccessful candidates. After filtering out duplicated records, we obtained a total of 4794 antimalarial drug candidate compounds, consisting of 2070 active and 2724 inactive instances. The dataset is a $|D| \times 4$ matrix, where $|D|$ is the total number of instances. We converted the labels into numerical form (i.e., ‘active’ = 1 and ‘inactive’ = 0); Table 1 shows a few samples. As a SMILES string (e.g., ‘Canonical_Isomeric_SMILES’ in Table 1) is just text, it is converted into a real-numbered feature vector using a descriptor calculator before it is fed into the models.
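The labeling rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the toy records, and the choice to treat two records as duplicates when their SMILES strings match are all assumptions made here.

```python
ACTIVE_THRESHOLD_UM = 1.0  # decision boundary from the text: IC50 <= 1 uM is 'active'

def label_from_ic50(ic50_um: float) -> int:
    """Return 1 ('active') if IC50 <= 1 uM, else 0 ('inactive')."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    return 1 if ic50_um <= ACTIVE_THRESHOLD_UM else 0

def deduplicate(records):
    """Keep the first record seen for each SMILES string (assumed duplicate rule)."""
    seen = set()
    unique = []
    for smiles, ic50_um in records:
        if smiles not in seen:
            seen.add(smiles)
            unique.append((smiles, ic50_um))
    return unique

# Hypothetical toy records: (SMILES, IC50 in uM); the IC50 values are made up.
records = [("CCO", 0.4), ("c1ccccc1", 12.0), ("CCO", 0.4), ("CC(=O)O", 1.0)]
labeled = [(smiles, label_from_ic50(ic50)) for smiles, ic50 in deduplicate(records)]
```

Note that a compound sitting exactly on the boundary (1 μM) is labeled ‘active’, matching the "≤" in the text.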

2.1.2. Molecular Descriptors

Quantitative structure–property relationship (QSPR) models are frequently developed using molecular descriptors, and PaDEL is among the well-known tools for extracting descriptors [33]. Various other tools are used in cheminformatics [31], such as Mordred [37], PyDPI [38], Rcpi [39], Dragon [40], and cinfony [41], which is a wrapper around other libraries such as Open Babel [42], RDKit [31] (http://www.rdkit.org (accessed on 22 June 2021)), and the Chemistry Development Kit (CDK) [43]. We chose PaDEL because of its advantages: it provides approximately 1875 molecular descriptors within a brief execution time, and it is simple to install and use. The process of generating molecular descriptors is as follows. First, we prepare canonical and isomeric SMILES strings for each antimalarial drug compound, which are downloadable from the PubChem Identifier Exchange Service. Second, we use the selected tool to obtain the features, as shown in the middle of Figure 1. After obtaining an $F_{ALL}$-dimensional real-numbered feature vector for each antimalarial compound, we add a label column, which results in $|D|$ feature vectors of dimension $F_{ALL} + 1$. Notably, only 1D and 2D descriptors were obtained and used in this study, so $F_{ALL} = 1444$.

2.2. Methods

As the dataset shown in Table 2 is balanced, we performed 10-fold cross validation while maintaining the class ratio; for each fold, we had around 4314 and 480 instances for training and testing, respectively. We denote the size of the training dataset as $|D_{train}|$ and the size of the test dataset as $|D_{test}|$, where $|D| = |D_{train}| + |D_{test}|$. We report averaged accuracy, precision, recall, and F1 scores throughout all experimental findings.
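To make the fold sizes concrete, the following sketch splits 2070 active and 2724 inactive labels into 10 folds while preserving the class ratio. It is a plain-Python illustration written for this section; the function name and the round-robin dealing scheme are assumptions, and scikit-learn's `StratifiedKFold` offers a ready-made equivalent.

```python
import random

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each instance index to a fold, preserving the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        # Deal indices round-robin, continuing where the previous class stopped,
        # so that fold sizes differ by at most one instance.
        for i in members:
            folds[position % n_folds].append(i)
            position += 1
    return folds

# Class counts taken from the text: 2070 active (1) and 2724 inactive (0).
labels = [1] * 2070 + [0] * 2724
folds = stratified_folds(labels)
test_sizes = [len(fold) for fold in folds]
train_sizes = [len(labels) - size for size in test_sizes]
```

Each test fold holds 479 or 480 instances, leaving 4315 or 4314 for training, which agrees with the approximate sizes quoted above.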

Before passing the $|D_{train}| \times (F_{ALL} + 1)$ real-numbered matrix to the machine learning models, we scale the feature values. We tried both scaling methods (i.e., standardization and normalization) and compared the results using the ANN; the performance obtained with standardized data was superior to that obtained with normalized data (i.e., values scaled to 0–1), so we adopted standardization, implemented with scikit-learn [44,45]. Only the training data are used in this process: the mean $\mu$ and standard deviation $\sigma$ are derived from the training data alone. The ML models are designed to give labels $y \in \{0, 1\}^{|D_{train}|}$, where ‘active’ = 1 and ‘inactive’ = 0, based on the standardized matrix $X \in \mathbb{R}^{|D_{train}| \times F_{ALL}}$.
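The point that the mean and standard deviation come from the training data alone can be illustrated with a short plain-Python sketch. The function names and toy values are illustrative assumptions; in practice scikit-learn's `StandardScaler` (fit on the training set, then applied to the test set) plays this role.

```python
import statistics

def fit_standardizer(train_rows):
    """Compute per-feature mean and standard deviation from training rows only."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; constant columns fall back to 1.0
    # to avoid division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def standardize(rows, means, stds):
    """Apply z = (x - mean) / std using statistics fitted elsewhere."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Toy data (made up): two features, the second one constant in training.
train = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
test = [[4.0, 12.0]]

means, stds = fit_standardizer(train)      # fitted on training data only
train_z = standardize(train, means, stds)
test_z = standardize(test, means, stds)    # test data reuse the training statistics
```

Fitting the statistics on the training fold and reusing them on the test fold keeps information about the test instances out of the preprocessing step.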

We implemented five ML models: artificial neural network (ANN), support vector machine (SVM) [46], random forest (RF) [47], extreme gradient boost (XGB) [48], and logistic regression (LR) [49]. The ANN is recognized to be useful in a variety of research fields, including image analysis, natural language processing, and speech recognition; if it has a deep structure (i.e., multiple hidden layers), it is a deep learning model [31]. The SVM is known to be successful in many classification tasks [50], and it identifies a decision boundary based on boundary instances (i.e., support vectors). The RF and XGB are both common ensemble techniques, although the RF employs a bagging strategy while the XGB uses a boosting strategy [31]. The LR is a model built on the sigmoid function, which statisticians have often used to describe population growth in ecology: it rises quickly and levels off at the carrying capacity of the environment.
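For reference, the sigmoid form on which logistic regression is built can be written as follows. This is the standard textbook formulation rather than something stated in the paper; the weight vector $\mathbf{w}$ and bias $b$ are notation introduced here.

```latex
P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^{\top}\mathbf{x} + b)}}
```

The output lies between 0 and 1, so it can be read as the estimated probability of the ‘active’ label for a standardized feature vector $\mathbf{x}$.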

Although previous studies have used molecular descriptors as features to train ML models [6,9], most of them simply provided the descriptors to the models without critically analyzing them. The performance of ML models strongly depends on the feature definition; wisely chosen molecular descriptors may give good performance even with a much smaller number of features. In this study, feature selection methods are employed to determine the importance of descriptors, and we then use the group of promising ones that we identified.

We denote the number of selected features as $F_S$, as illustrated in the middle of Figure 1. Two feature selection algorithms are employed: Recursive Feature Elimination (RFE) and the K-best algorithm. K-best is a filter-based algorithm that selects features according to a scoring function $\sigma(f, c)$, where $f$ and $c$ are a feature and a label, respectively. RFE is a wrapper-based algorithm that treats feature selection as a search problem [31] and repeatedly eliminates unpromising features until only the desired number of features remains. The ANN model was used as the estimator of the RFE algorithm, and the ANOVA F-value was taken as the function $\sigma$.
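The two selection strategies can be sketched in plain Python as below. This is a simplified illustration under stated assumptions: the function names and toy data are hypothetical, and the `importance` callable passed to `rfe` merely stands in for the importance that a refitted estimator would supply in a real wrapper method. scikit-learn provides `SelectKBest` with `f_classif` and `RFE` for practical use.

```python
import statistics

def anova_f(values, labels):
    """One-way ANOVA F-value of a single feature across the label groups."""
    groups = {}
    for x, y in zip(values, labels):
        groups.setdefault(y, []).append(x)
    grand_mean = statistics.fmean(values)
    between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2
                  for g in groups.values())
    within = sum((x - statistics.fmean(g)) ** 2
                 for g in groups.values() for x in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (between / df_between) / (within / df_within)

def select_k_best(rows, labels, k):
    """Filter method: keep the k features with the highest ANOVA F-value."""
    scores = [anova_f(col, labels) for col in zip(*rows)]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

def rfe(rows, labels, n_keep, importance, step=1):
    """Wrapper method: repeatedly drop the `step` least important features."""
    kept = list(range(len(rows[0])))
    while len(kept) > n_keep:
        subset = [[row[j] for j in kept] for row in rows]
        scores = importance(subset, labels)  # one score per remaining feature
        n_drop = min(step, len(kept) - n_keep)
        worst = sorted(range(len(kept)), key=lambda j: scores[j])[:n_drop]
        kept = [f for pos, f in enumerate(kept) if pos not in worst]
    return kept

# Toy data (made up): feature 0 separates the classes, features 1 and 2 do not.
rows = [[1.0, 5.0, 0.2], [1.2, 4.0, 0.9], [3.0, 5.5, 0.1], [3.1, 4.5, 0.8]]
labels = [0, 0, 1, 1]

best_two = select_k_best(rows, labels, k=2)
# Stand-in importance: reuse the F-value; a real wrapper would refit a model here.
kept_one = rfe(rows, labels, n_keep=1,
               importance=lambda sub, y: [anova_f(c, y) for c in zip(*sub)])
```

Because the stand-in importance is univariate, both methods agree on this toy data; with a model-based importance, RFE re-scores the remaining features after every elimination and can therefore reach a different subset.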

3. Results

Before comparing several well-known ML models experimentally, we first compare the feature selection algorithms and identify the more promising one. The comparison of models is fair only if all models use the same features; the models are therefore compared on the features chosen by the best feature selection algorithm.

3.1. Feature Selection Algorithms

The two feature selection algorithms (i.e., RFE and K-best) were compared by averaged test set accuracy while varying the number of features $F_S$. The results are shown in Figure 2, with $F_S$ ranging from 50 to 1200 and the ANN as the classifier. With greater $F_S$, the K-best algorithm generally seems to have slightly greater accuracy than the RFE approach; otherwise, RFE performs better. As a result, we may say that the RFE algorithm is preferable if we seek efficiency (e.g., fewer parameters). In terms of feature dimension, $F_S = 300 \sim 400$ may be a viable choice, because this is only about a fifth to a quarter of the total dimension and the accuracy is equivalent.

3.2. Model Comparison

We merged the datasets after downloading them from the aforementioned public databases, resulting in a single dataset $D$ with $|D| = 4794$. Some machine learning models (e.g., artificial neural networks with random initialization) are known to behave differently even when trained on the same dataset, so we randomly shuffled all instances of $D$ and obtained five different datasets of the same size $|D|$. During shuffling, we took care to avoid data leakage, keeping track of all the steps performed to ensure that the total numbers of instances and features remained the same. We then performed 10-fold cross validation for each dataset and computed averaged test set accuracy, precision, recall, and F1 scores; all experimental results are averaged across the five datasets. A grid search that uses a small portion (e.g., 10%) of the training set as a validation set is used to find the optimal parameter settings for the ML models.
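The evaluation protocol (five shuffled copies of the dataset, 10-fold cross validation on each, scores averaged over all runs) can be sketched as follows. The helper names, the majority-class baseline used as a placeholder model, and the toy data are assumptions made for illustration; the folds here are simple contiguous blocks rather than the stratified folds used in the study.

```python
import random
import statistics

def k_fold_indices(n, n_folds):
    """Split range(n) into n_folds contiguous test blocks of near-equal size."""
    bounds = [round(i * n / n_folds) for i in range(n_folds + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(n_folds)]

def repeated_cv(evaluate, data, n_shuffles=5, n_folds=10, seed=0):
    """Average evaluate(train, test) over n_shuffles shuffles x n_folds folds."""
    scores = []
    for s in range(n_shuffles):
        shuffled = list(data)
        random.Random(seed + s).shuffle(shuffled)  # one shuffled copy per seed
        for test_idx in k_fold_indices(len(shuffled), n_folds):
            held_out = set(test_idx)
            test = [shuffled[i] for i in test_idx]
            train = [x for i, x in enumerate(shuffled) if i not in held_out]
            scores.append(evaluate(train, test))
    return statistics.fmean(scores), len(scores)

def majority_baseline(train, test):
    """Placeholder model: predict the most frequent training label."""
    train_labels = [y for _, y in train]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(1 for _, y in test if y == majority) / len(test)

# Toy dataset (made up): 60 inactive (0) and 40 active (1) instances.
data = [(i, 0) for i in range(60)] + [(i, 1) for i in range(60, 100)]
mean_accuracy, n_runs = repeated_cv(majority_baseline, data)
```

With five shuffles and ten folds, each reported score is the mean of 50 train/test runs; any classifier can be substituted for the baseline by passing a different `evaluate` callable.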

The parameter settings are summarized in Table 3. The ANN has two hidden layers of 100 nodes, since we observed that it performs better than more complex structures with more layers and nodes, all of which were tested using the same standardized data; the reason could be the limited size of the dataset, which could lead to over-fitting when the model is highly complex.

Table 4 summarizes the test set accuracy of the ML models. Note that the main focus of this section is the comparison of the models’ experimental outcomes, not the feature selection techniques. The accuracy values are calculated by averaging the results over the aforementioned independent datasets. The XGB delivers the best accuracy (0.8303) among the implemented models, but the RF performed better when the number of features was ≤ 160. The ANN and RF are comparable to the XGB, which is the best when $F_S = 361$ and $F_S = 1000$. Because models run faster when the feature dimension is small, the XGB and RF may be preferable if we desire more efficiency without losing much accuracy.

One could argue that a model is useless if its sensitivity is not high enough. Table 5 and Table 6 show the per-label test set precision and recall, respectively. We consider $F_S = 361$, since all models performed notably better with this set of features. The XGB gives the best test set recall for the ‘active’ label (0.8068) without greatly losing precision (0.8477), followed by the ANN. In terms of precision, the RF appears the best, with a precision of 0.8583 for the ‘active’ label, while the ANN and XGB may be preferred if we want to find as many potential chemical compound candidates as possible.

Table 7 shows the test set F1 scores for each label; the ANN, RF, and XGB were shown to be the best of the implemented models. This is a realistic outcome because the best models (e.g., ANN) are known to be successful at detecting underlying patterns and to significantly improve classification performance in a variety of classification tasks (e.g., malware detection [53], chatbot intent prediction [54]). We believe that collecting more qualified data will boost performance even further.
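The per-label precision, recall, and F1 scores reported in the tables follow the usual definitions, which the sketch below spells out in plain Python. The function names and the toy predictions are illustrative assumptions.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def per_label_scores(y_true, y_pred, label):
    """Precision, recall, and F1 for one label treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)

# Toy predictions (made up): 1 = 'active', 0 = 'inactive'.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

active_scores = per_label_scores(y_true, y_pred, label=1)
inactive_scores = per_label_scores(y_true, y_pred, label=0)
```

As a rough consistency check, plugging the XGB precision (0.8477) and recall (0.8068) for the ‘active’ label into `f1_from` gives about 0.827. An F1 score averaged over folds need not equal the F1 of the averaged precision and recall, so this is only indicative.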

4. Discussion

In addition to the performance of the ML models, we investigated the best and worst features (i.e., molecular descriptors) selected by the RFE algorithm, as shown in Table 8. The features estimated to be most relevant are assigned rank 1 [55,56], as shown in the table, so greater rank values imply worse features. All molecular descriptors in PaDEL are grouped into modules; for example, the molecular descriptor ‘nAcid’ belongs to the ‘acidic group count’ module, as shown in the upper left corner of the table.

When the number of selected molecular descriptors was 361, as shown in Figure 3, all models implemented in this research achieved comparable performance, with an accuracy above 81%, and the majority of the selected descriptors came from the ‘Autocorrelation’ module. The ‘Autocorrelation’ module generates atom type autocorrelation descriptor values; autocorrelation descriptors encode both the molecular structure and the physicochemical properties of a molecule [57,58,59,60], as well as numerical properties attributed to atoms [59,61]. These descriptors are calculated by the Moreau–Broto (ATS), Moran (MATS), and Geary (GATS) algorithms from lag 1 to lag 8 for four different weighting schemes [60,61,62]. The descriptors from this module describe how a considered property is distributed over the topological molecular structure, and they have a crucial influence on antimalarial activity prediction [9]. This finding is consistent with previous studies [59,63,64,65,66,67], which discussed the influence of such descriptors on antimalarial activity prediction for drug development. It should be noted that the least relevant descriptors come from the ‘Atom type electrotopological state’ module, but this does not mean that these descriptors are detrimental to performance. It means only that the descriptors from the ‘Atom type electrotopological state’ module contributed the least to the model compared to the others, so it is reasonable to conclude that they have less influence on the discovery and development of antimalarial drugs.
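To give a feel for what an autocorrelation descriptor measures, the sketch below computes a Moreau–Broto style value on a tiny molecular graph: the sum of products of atomic weights over atom pairs that are a given number of bonds (the lag) apart. It is a simplified illustration, not PaDEL's implementation; the function names, the unordered-pair convention, the hydrogen-suppressed graph, and the use of atomic mass as the weighting scheme are assumptions, and PaDEL's exact weighting and normalization may differ.

```python
from collections import deque

def topological_distances(adjacency, source):
    """Breadth-first search: number of bonds from `source` to every atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def moreau_broto(adjacency, weights, lag):
    """Sum of w_i * w_j over unordered atom pairs exactly `lag` bonds apart."""
    total = 0.0
    atoms = sorted(adjacency)
    for i in atoms:
        dist = topological_distances(adjacency, i)
        for j in atoms:
            if j > i and dist.get(j) == lag:
                total += weights[i] * weights[j]
    return total

# Heavy-atom graph of ethanol (SMILES "CCO"): C0 - C1 - O2.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
# Assumed weighting scheme: standard atomic masses of carbon and oxygen.
weights = {0: 12.011, 1: 12.011, 2: 15.999}

ats_lag1 = moreau_broto(adjacency, weights, lag=1)  # pairs C0-C1 and C1-O2
ats_lag2 = moreau_broto(adjacency, weights, lag=2)  # pair C0-O2
```

Repeating this for lags 1 to 8 and for several atomic properties yields a family of values that describes how a property is distributed across the molecular topology, which is the kind of information the ‘Autocorrelation’ module contributes.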

Accordingly, the small set of 361 selected descriptors may be prioritized for more expensive in vitro antimalarial bioactivity screening and testing. This would assist pharmaceutical chemists in screening and formulating a novel anti-malaria drug against Plasmodium falciparum by considering only the few most promising chemical features (i.e., molecular descriptors) from a large pool of features.

It is worth noting that, in Table 9, the work of Egieyeh et al. reports slightly higher accuracy than ours. This is because the amount of data relative to the number of features was genuinely modest. Furthermore, we employed the same test dataset for all implemented ML models, including the SVM used by Egieyeh et al., although its performance was not superior to that of the other models deployed in this research.

5. Conclusions

In this study, we used machine learning techniques to build various antimalarial predictive models that predict the bioactivity class of a drug against the Plasmodium falciparum parasite. To address this anti-malaria drug prediction problem, we employed PaDEL, a well-known cheminformatics tool, to extract the descriptor values, followed by preprocessing. Experiments on the molecular descriptor values of the anti-malaria drug compounds in our collected data revealed that the ANN and XGB models outperformed the other deployed ML models. In particular, the XGB had the best recall for the ‘active’ label (0.81) and the best F1 score (0.83), followed by the ANN with a recall for the ‘active’ label of 0.79 and an F1 score of 0.80. This implies that the XGB and ANN find about 81% and 79%, respectively, of the active anti-malaria drug candidates, both without losing too much precision. We believe that this research will assist in the discovery and development of anti-malaria drugs. We will look into collecting additional data in the near future, as having adequate data is essential for developing better ML models.

Author Contributions

Conceptualization, M.E.M., G.L.M. and G.J.C.; Methodology, M.E.M. and Y.-S.J.; Software, G.L.M.; Validation, J.W., Y.-S.J. and G.J.C.; Formal Analysis, M.E.M., J.W., G.J.C. and Y.-S.J.; Investigation, M.E.M.; Resources, G.J.C.; Data Curation, M.E.M. and Y.-S.J.; Writing—original draft preparation, M.E.M. and Y.-S.J.; writing—review and editing, M.E.M., J.W. and Y.-S.J.; Supervision, Y.-S.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-01108, Big data-based development of novel solid forms for P-CAB drugs and establishment of dedicated AI platforms). This work was supported by the Soonchunhyang University Research Fund.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in the website at http://sites.google.com/view/medardemswahili/ (accessed on 8 August 2021).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

WHO	World Health Organization
HTS	High-Throughput Screening
ACT	Artemisinin-based Combination Treatment
CQ	Chloroquine
APIs	Active Pharmaceutical Ingredients
QSAR	Quantitative Structure–Activity relationships
QSPR	Quantitative Structure–Property relationships
RFE	Recursive Feature Elimination
SMILES	Simplified Molecular-Input Line-Entry System
ANN	Artificial Neural Network
SVM	Support Vector Machine
RF	Random Forest
XGB	Extreme Gradient Boost
LR	Logistic Regression
SIDs	Substance Identifier
CIDs	Compound Identifier
InChIs	International Chemical Identifier
InChIKeys	International Chemical Identifier Keys
IUPAC	International Union of Pure and Applied Chemistry
SMO	Sequential Minimization Optimization
BLR	Binary Logistic Regression

References

1. Nur, E.T. The Impact of Malaria on Labour Use and Efficiency in the Sudan. Soc. Sci. Med. 1993, 37, 1115–1119.
2. Russell, S. The economic burden of illness for households in developing countries: A review of studies focusing on Malaria, Tuberculosis, and Human Immunodeficiency virus/acquired immunodeficiency syndrome. Am. J. Trop. Med. Hyg. 2004, 71, 147–155.
3. World Health Organization. Available online: https://www.who.int/news-room/fact-sheets/detail/malaria (accessed on 1 April 2021).
4. Gupta, S.; Hill, A.V.; Kwiatkowski, D.; Greenwood, A.M.; Greenwood, B.M.; Day, K.P. Parasite Virulence and Disease Patterns in Plasmodium falciparum Malaria. Proc. Natl. Acad. Sci. USA 1994, 91, 3715–3719.
5. Kaestli, M.; Cockburn, I.A.; Cortés, A.; Baea, K.; Rowe, J.A.; Beck, H.P. Virulence of Malaria Is Associated with Differential Expression of Plasmodium falciparum var Gene Subgroups in a Case-Control Study. J. Infect. Dis. 2006, 193, 1567–1574.
6. Egieyeh, S.; Syce, J.; Malan, S.F.; Christoffels, A. Predictive classifier models built from natural products with antimalarial bioactivity using machine learning approach. PLoS ONE 2018, 13, e0204644.
7. Bull, P.C.; Marsh, K. The role of antibodies to Plasmodium falciparum-infected-erythrocyte surface antigens in naturally acquired immunity to malaria. Trends Microbiol. 2002, 10, 55–58.
8. Mayo Clinic. Available online: https://www.mayoclinic.org/diseases-conditions/malaria/symptoms-causes/syc-20351184 (accessed on 3 February 2021).
9. Danishuddin; Madhukar, G.; Malik, M.Z.; Subbarao, N. Development and rigorous validation of antimalarial predictive models using machine learning approaches. SAR QSAR Environ. Res. 2019, 30, 543–560.
10. Peters, W. Drug Resistance in Malaria Parasites of Animals and Man. Adv. Parasitol. 1998, 41, 1–62.
11. Blasco, B.; Leroy, D.; Fidock, D.A. Antimalarial drug resistance: Linking Plasmodium falciparum parasite biology to the clinic. Nat. Med. 2017, 23, 917–928.
12. Witkowski, B.; Sokunmalis, K.; Kim, S.; Pheaktra, C.; Sopheakvatey, K.; Kloeung, N.; Khim, N.; Duong, S.; Leang, R.; Ringwald, P.; et al. In vitro phenotype of reduced susceptibility to artemisinin in Plasmodium falciparum isolates from western Cambodia. Int. J. Infect. Dis. 2012, 16, e178.
13. Witkowski, B.; Khim, N.; Chim, P.; Kim, S.; Ke, S.; Kloeung, N.; Chy, S.; Duong, S.; Leang, R.; Ringwald, P.; et al. Reduced Artemisinin Susceptibility of Plasmodium falciparum Ring Stages in Western Cambodia. Antimicrob. Agents Chemother. 2012, 57, 914–923.
14. Dondorp, A.M.; Nosten, F.; Yi, P.; Das, D.; Phyo, A.P.; Tarning, J.; Lwin, K.M.; Ariey, F.; Hanpithakpong, W.; Lee, S.J.; et al. Artemisinin Resistance in Plasmodium falciparum Malaria. N. Engl. J. Med. 2009, 361, 455–467.
15. Lubell, Y.; Dondorp, A.; Guérin, P.J.; Drake, T.; Meek, S.; Ashley, E.; Day, N.P.; White, N.J.; White, L.J. Artemisinin resistance—Modelling the potential human and economic costs. Malar. J. 2014, 13, 452.
16. Ashley, E.A.; Dhorda, M.; Fairhurst, R.M.; Amaratunga, C.; Lim, P.; Suon, S.; Sreng, S.; Anderson, J.M.; Mao, S.; Sam, B.; et al. Tracking Resistance to Artemisinin Collaboration (TRAC). Spread of Artemisinin Resistance in Plasmodium falciparum Malaria. N. Engl. J. Med. 2014, 371, 411–423.
17. Mugittu, K.; Genton, B.; Mshinda, H.; Beck, H.P. Molecular monitoring of Plasmodium falciparum resistance to artemisinin in Tanzania. Malar. J. 2006, 5, 126.
18. Imwong, M.; Dondorp, A.M.; Nosten, F.; Yi, P.; Mungthin, M.; Hanchana, S.; Das, D.; Phyo, A.P.; Lwin, K.M.; Pukrittayakamee, S.; et al. Exploring the Contribution of Candidate Genes to Artemisinin Resistance in Plasmodium falciparum. Antimicrob. Agents Chemother. 2010, 54, 2886–2892.
19. Klein, E.Y. Antimalarial drug resistance: A review of the biology and strategies to delay emergence and spread. Int. J. Antimicrob. Agents 2013, 41, 311–317.
20. Burgess, D. Evolution: Taking advantage of drug resistance. Nat. Rev. Genet. 2014, 15, 147.
21. Severini, C.; Menegon, M. Resistance to antimalarial drugs: An endless world war against Plasmodium that we risk losing. J. Glob. Antimicrob. Resist. 2015, 3, 58–63.
22. Paloque, L.; Ramadani, A.P.; Mercereau-Puijalon, O.; Augereau, J.M.; Benoit-Vical, F. Plasmodium falciparum: Multifaceted resistance to artemisinins. Malar. J. 2016, 15, 149.
23. Phrma [Brochure]. (n.d.). Available online: http://phrma-docs.phrma.org/sites/default/files/pdf/rd_brochure_022307.pdf (accessed on 20 May 2021).
24. Verma, J.; Khedkar, V.M.; Coutinho, E.C. 3D-QSAR in Drug Design—A Review. Curr. Top. Med. Chem. 2010, 10, 95–115.
25. Inglese, J.; Auld, D. High Throughput Screening (HTS) Techniques: Applications in Chemical Biology. Wiley Encycl. Chem. Biol. 2008, 2, 260–274.
26. Macarron, R.; Banks, M.N.; Bojanic, D.; Burns, D.J.; Cirovic, D.A.; Garyantes, T.; Green, D.V.; Hertzberg, R.P.; Janzen, W.P.; Paslay, J.W.; et al. Impact of high-throughput screening in biomedical research. Nat. Rev. Drug Discov. 2011, 10, 188–195.
27. Wikipedia. The Free Encyclopedia. Available online: https://en.wikipedia.org/w/index.php?title=High-throughput_screening&oldid=1022548030 (accessed on 14 May 2021).
28. Danishuddin; Khan, A.U. Descriptors and their selection methods in QSAR analysis: Paradigm for drug design. Drug Discov. Today 2016, 21, 1291–1302.
29. Yang, X.; Wang, Y.; Byrne, R.; Schneider, G.; Yang, S. Concepts of Artificial Intelligence for Computer-Assisted Drug Discovery. Chem. Rev. 2019, 119, 10520–10594.
30. Towards Data Science. Available online: https://towardsdatascience.com/review-deep-learning-in-drug-discovery-f4c89e3321e1 (accessed on 25 May 2020).
31. Mswahili, M.E.; Lee, M.-J.; Martin, G.L.; Kim, J.; Kim, P.; Choi, G.J.; Jeong, Y.-S. Cocrystal Prediction Using Machine Learning Models and Descriptors. Appl. Sci. 2021, 11, 1323.
32. Liu, Q.; Deng, J.; Liu, M. Classification models for predicting the antimalarial activity against Plasmodium falciparum. SAR QSAR Environ. Res. 2020, 31, 313–324.
33. Yap, C.W. PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 2011, 32, 1466–1474.
34. Gaulton, A.; Bellis, L.J.; Bento, A.P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; et al. ChEMBL: A Large-scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2011, 40, D1100–D1107.
35. PubChem—Wikipedia. The Free Encyclopedia. Available online: https://en.wikipedia.org/w/index.php?title=PubChem&oldid=1015726474 (accessed on 29 May 2021).
36. PubChem IES. Available online: https://pubchem.ncbi.nlm.nih.gov/idexchange.cgi (accessed on 11 June 2021).
37. Moriwaki, H.; Tian, Y.S.; Kawashita, N.; Takagi, T. Mordred: A molecular descriptor calculator. J. Cheminform. 2018, 10, 4.
38. Cao, D.S.; Liang, Y.Z.; Yan, J.; Tan, G.S.; Xu, Q.S.; Liu, S. PyDPI: Freely Available Python Package for Chemoinformatics, Bioinformatics, and Chemogenomics Studies. J. Chem. Inf. Model. 2013, 53, 3086–3096.
39. Cao, D.S.; Xiao, N.; Xu, Q.S.; Chen, A.F. Rcpi: R/Bioconductor package to generate various descriptors of proteins, compounds and their interactions. Bioinformatics 2015, 31, 279–281.
40. Mauri, A.; Consonni, V.; Pavan, M.; Todeschini, R. Dragon Software: An Easy Approach to molecular descriptor calculations. Match Commun. Math. Comput. Chem. 2006, 56, 237–248.
41. O’Boyle, N.M.; Hutchison, G.R. Cinfony—Combining Open Source cheminformatics toolkits behind a common interface. Chem. Cent. J. 2008, 2, 1–10.
42. O’Boyle, N.M.; Banck, M.; James, C.A.; Morley, C.; Vandermeersch, T.; Hutchison, G.R. Open Babel: An open chemical toolbox. J. Cheminform. 2011, 3, 1–14.
43. Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; Willighagen, E. The Chemistry Development Kit (CDK): An Open-Source Java Library for Chemo- and Bioinformatics. J. Chem. Inf. Comput. Sci. 2003, 43, 493–500.
44. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
45. Buitinck, L.; Louppe, G.; Blondel, M.; Pedregosa, F.; Mueller, A.; Grisel, O.; Niculae, V.; Prettenhofer, P.; Gramfort, A.; Grobler, J.; et al. API design for machine learning software: Experiences from the scikit-learn project. arXiv 2013, arXiv:1309.0238.
46. Burges, C.J.C. A Tutorial on Support Vector Machines for Pattern Recognition. Data Min. Knowl. Discov. 1998, 2, 121–167.
47. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
48. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
49. Machine Learning Mastery. Available online: https://machinelearningmastery.com/logistic-regression-for-machine-learning/ (accessed on 25 May 2021).
50. Guyon, I.; Weston, J.; Barnhill, S.; Vapnik, V. Gene Selection for Cancer Classification Using Support Vector Machines. Mach. Learn. 2002, 46, 389–422.
51. Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010; pp. 807–814.
52. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015; pp. 1–15.
53. Jeong, Y.S.; Woo, J.; Lee, S.; Kang, A.R. Malware Detection of Hangul Word Processor Files Using Spatial Pyramid Average Pooling. Sensors 2020, 20, 5265.
54. Goo, C.W.; Gao, G.; Hsu, Y.K.; Huo, C.L.; Chen, T.C.; Hsu, K.W.; Chen, Y.N. Slot-Gated Modeling for Joint Slot Filling and Intent Prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, Louisiana, 1–6 June 2018; pp. 753–757.
55. Chen, Q.; Meng, Z.; Liu, X.; Jin, Q.; Su, R. Decision Variants for the Automatic Determination of Optimal Feature Subset in RF-RFE. Genes 2018, 9, 301.
56. Sklearn.feature_selection.RFE. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html (accessed on 25 May 2021).
57. Todeschini, R.; Consonni, V. Handbook of Molecular Descriptors; John Wiley & Sons: Hoboken, NJ, USA, 2008; Volume 11.
58. Navarro-Reig, M.; Ortiz-Villanueva, E.; Tauler, R.; Jaumot, J. Modelling of Hydrophilic Interaction Liquid Chromatography Stationary Phases Using Chemometric Approaches. Metabolites 2017, 7, 54.
59. Hollas, B. An Analysis of the Autocorrelation Descriptor for Molecules. J. Math. Chem. 2003, 33, 91–101.
60. Broto, P.; Moreau, G.; Vandycke, C. Molecular structures: Perception, autocorrelation descriptor and sar studies. Autocorrelation descriptor. Eur. J. Med. Chem. 1984, 19, 66–70.
61. Moreau, G.; Broto, P. The Autocorrelation of a topological structure: A new molecular descriptor. Nouv. J. Chim. 1980, 4, 359–360.
62. Todeschini, R.; Consonni, V. Molecular Descriptors for Chemoinformatics: Volume I: Alphabetical Listing/Volume II: Appendices, References; John Wiley & Sons: Hoboken, NJ, USA, 2009; Volume 41.
63. Hou, X.; Chen, X.; Zhang, M.; Yan, A. QSAR study on the antimalarial activity of Plasmodium falciparum dihydroorotate dehydrogenase (PfDHODH) inhibitors. SAR QSAR Environ. Res. 2016, 27, 101–124.
64. Ibrahim, Z.Y.; Uzairu, A.; Shallangwa, G.; Abechi, S. QSAR and molecular docking based design of some indolyl-3-ethanone-α-thioethers derivatives as Plasmodium falciparum dihydroorotate dehydrogenase (PfDHODH) inhibitors. SN Appl. Sci. 2020, 2, 1170.
65. Aliyu, W.M.; Gideon, A.S.; Adamu, U. QSAR and molecular docking studies of 1,3-dioxoisoindoline-4-aminoquinolines as potent antiplasmodium hybrid compounds. Heliyon 2020, 6, e034494.
66. Kalra, S.; Joshi, G.; Kumar, R.; Munshi, A. Role of 2Dimensional Autocorrelation Descriptors in Predicting Antimalarial Activity of Artemisinin and its Analogues: A QSAR Study. Curr. Top. Med. Chem. 2018, 18, 2720–2730.
67. Kalra, S.; Kumar, A.; Gupta, M. Modeling of antitubercular activity of biphenyl analogs of 2-nitroimidazo[2,1-b][1,3]oxazine to rationalize their activity profile. Med. Chem. Res. 2012, 22, 3444–3451.

Figure 1. The development process for antimalarial drug prediction, from data gathering through ML model deployment.

Figure 2. Averaged test set accuracy comparison using feature selection algorithms, against the number of selected features, $F_{S}$.

Figure 3. All ML models’ test set accuracies.

Table 1. A sample of the unprocessed data.

Service	ChEMBL_synonyms_PubChem_SID	Canonical_Isomeric_SMILES (Sources: PubChem_ChEMBL_and_EMBL-EBI)	Label
ChEMBL_&_PubChem	CHEMBL219517	C1CSCN(C1=O)CCCNC2=C3C=CC(=CC3=NC=C2)Cl	0
	380797	CC(C1=CC=CC=C1)NC(=O)C2=CC=CC=C2N=CC3=C(C=CC4=CC=CC=C43)O	0
	591362	C1=CC=C(C(=C1)C(=O)NC2=NC(=CS2)C3=CC=CC=N3)Br	0
	465546	C[C@@]1(CC[C@@H]2[C@]3(CC[C@@H](C([C@@H]3CC[C@]2(C1)O)(C)C)O)C)C=C	0
	341638	CCN(CC)CCCCSC1=C2C=CC(=CC2=NC=C1)Cl	0
	SID_381881704	CC1CN(CC(O1)C)C(=O)C2=C(C3=CC=CC=C3S2)OCC4=CC(=C(C=C4)F)F	1
	381885288	CC1CN(CC(O1)C)C(=O)C2=C(C3=CC=CC=C3S2)Cl	1
	381885327	CC1CN(CC(O1)C)C(=O)C2=C(C3=C(S2)C=C(C=C3)F)Cl	1
	381886215	CC1CN(CC(O1)C)C(=O)C2=NC3=CC=CC=C3S2	0
	381886674	CC1CN(CC(O1)C)C(=O)C2=C(C3=CC=CC=C3S2)OCC4=CC=CC=C4	1
	381886749	CC1CN(CC(O1)C)C(=O)C2=C(C3=C(S2)C=C(C=C3)C)Cl	1

Table 2. Data statistics.

	All Labels	Label ‘Active’	Label ‘Inactive’
# of data	4794	2070	2724
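
As a reading aid for the accuracies reported later, the class counts in Table 2 imply a simple reference point: a classifier that always predicts the larger class ("Inactive") would be right about 57% of the time. The short sketch below reproduces that arithmetic; the variable names are illustrative and not taken from the paper.

```python
# Class counts copied from Table 2.
n_active = 2070
n_inactive = 2724
n_total = n_active + n_inactive

# Share of compounds labelled 'Active'.
active_fraction = n_active / n_total

# Accuracy of a trivial classifier that always predicts the larger class.
majority_baseline = max(n_active, n_inactive) / n_total

print(f"total compounds:   {n_total}")
print(f"active fraction:   {active_fraction:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```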

Table 3. Parameter settings of ML models.

Model	Setting
Random Forest (RF)	Number of estimators = 100
	No limitation of depth
	Minimum samples for splitting = 2
Support Vector Machine (SVM)	Kernel = Linear
	C = 1.0
Extreme Gradient Boosting (XGB)	Number of estimators = 100
	Learning rate = 0.3
Logistic Regression (LR)	Penalty = l2
	C = 1e5
	Class weight = None
	Multi_class = auto
Artificial Neural Network (ANN)	# of hidden layers = 2
	# of nodes of each hidden layer = 100
	Activation function = ReLU [51]
	Optimizer = Adam [52]
	Learning rate = 0.0001
	# of epochs = 50 with early stopping
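
The settings in Table 3 can be written down as keyword arguments, as in the minimal sketch below. It assumes scikit-learn estimators for RF, SVM, and LR, the `xgboost` package for XGB, and scikit-learn's `MLPClassifier` as a stand-in for the ANN; the table does not name the framework used for the ANN, so that mapping is an assumption. `SETTINGS` and `build_models` are hypothetical names. `Multi_class = auto` is left out because it matches the library default and the parameter is deprecated in newer scikit-learn releases.

```python
# Keyword arguments mirroring Table 3. Keys are the paper's abbreviations;
# argument names follow scikit-learn / xgboost conventions (an assumption).
SETTINGS = {
    "RF": {"n_estimators": 100, "max_depth": None, "min_samples_split": 2},
    "SVM": {"kernel": "linear", "C": 1.0},
    "XGB": {"n_estimators": 100, "learning_rate": 0.3},
    "LR": {"penalty": "l2", "C": 1e5, "class_weight": None},
    "ANN": {
        "hidden_layer_sizes": (100, 100),  # 2 hidden layers, 100 nodes each
        "activation": "relu",
        "solver": "adam",
        "learning_rate_init": 0.0001,
        "max_iter": 50,                    # epochs for the adam solver
        "early_stopping": True,
    },
}


def build_models(settings=SETTINGS):
    """Instantiate one estimator per row of Table 3.

    Imports are deferred so the settings can be inspected even when
    scikit-learn or xgboost is not installed.
    """
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC
    from xgboost import XGBClassifier

    return {
        "RF": RandomForestClassifier(**settings["RF"]),
        "SVM": SVC(**settings["SVM"]),
        "XGB": XGBClassifier(**settings["XGB"]),
        "LR": LogisticRegression(**settings["LR"]),
        "ANN": MLPClassifier(**settings["ANN"]),
    }
```

Calling `build_models()` returns a dictionary of unfitted estimators keyed by the abbreviations used in the tables.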

Table 4. Averaged test set accuracy of ML models, where $F_{ALL}$ is the number of all features, and $F_{S}$ is the number of features selected using the RFE algorithm.

Model	$F_{ALL}$ = 1444	$F_{S}$ = 1000	$F_{S}$ = 722	$F_{S}$ = 361	$F_{S}$ = 160	$F_{S}$ = 100
RF	0.8294	0.8280	0.8256	0.8250	0.8284	0.8258
SVM	0.7850	0.7920	0.7964	0.8126	0.7931	0.7695
XGB	0.8318	0.8283	0.8342	0.8287	0.8230	0.8177
LR	0.7795	0.7828	0.7952	0.8111	0.7910	0.7682
ANN	0.8223	0.8269	0.8210	0.8283	0.8185	0.8100
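
To make the pattern in Table 4 easier to see, the sketch below copies the reported accuracies and finds, for each model, the feature count with the highest averaged test accuracy. It only re-reads the published numbers; `FEATURE_COUNTS`, `ACCURACY`, and `best_setting` are illustrative names.

```python
# Feature counts (column headers) and accuracies copied from Table 4.
FEATURE_COUNTS = [1444, 1000, 722, 361, 160, 100]
ACCURACY = {
    "RF":  [0.8294, 0.8280, 0.8256, 0.8250, 0.8284, 0.8258],
    "SVM": [0.7850, 0.7920, 0.7964, 0.8126, 0.7931, 0.7695],
    "XGB": [0.8318, 0.8283, 0.8342, 0.8287, 0.8230, 0.8177],
    "LR":  [0.7795, 0.7828, 0.7952, 0.8111, 0.7910, 0.7682],
    "ANN": [0.8223, 0.8269, 0.8210, 0.8283, 0.8185, 0.8100],
}


def best_setting(model):
    """Return (feature count, accuracy) of the best column for a model."""
    scores = ACCURACY[model]
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return FEATURE_COUNTS[best_idx], scores[best_idx]


for model, scores in ACCURACY.items():
    n_features, acc = best_setting(model)
    change = acc - scores[0]  # difference from using all 1444 features
    print(f"{model}: best at {n_features} features "
          f"(accuracy {acc:.4f}, change vs. all features {change:+.4f})")
```

In these figures, SVM, LR, and ANN peak at 361 features, which is one quarter of the 1444 descriptors, while RF and XGB change little across the columns.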

Table 5. Per-label averaged test set precision of ML models, where $F_{ALL}$ is the number of all features, $F_{S}$