Machine Learning Prediction of the Redox Activity of Quinones

Kichev, Ilia; Borislavov, Lyuben; Tadjer, Alia; Stoyanova, Radostina

doi:10.3390/ma16206687

¹ Institute of General and Inorganic Chemistry, Bulgarian Academy of Sciences, 1113 Sofia, Bulgaria

² Faculty of Chemistry and Pharmacy, University of Sofia, 1164 Sofia, Bulgaria

* Authors to whom correspondence should be addressed.

Materials 2023, 16(20), 6687; https://doi.org/10.3390/ma16206687

Submission received: 21 September 2023 / Revised: 9 October 2023 / Accepted: 11 October 2023 / Published: 14 October 2023

(This article belongs to the Special Issue Size-Dependent Effects in Materials for Environmental Protection and Energy Application)

Abstract

The redox properties of quinones underlie their unique characteristics as organic battery components that outperform the conventional inorganic ones. Furthermore, these redox properties could be precisely tuned by using different substituent groups. Machine learning and statistics, on the other hand, have proven to be very powerful approaches for the efficient in silico design of novel materials. Herein, we demonstrated the machine learning approach for the prediction of the redox activity of quinones that potentially can serve as organic battery components. For the needs of the present study, a database of small quinone-derived molecules was created. A large number of quantum chemical and chemometric descriptors were generated for each molecule and, subsequently, different statistical approaches were applied to select the descriptors that most prominently characterized the relationship between the structure and the redox potential. Various machine learning methods for the screening of prospective organic battery electrode materials were deployed to select the most trustworthy strategy for the machine learning-aided design of organic redox materials. It was found that Ridge regression models perform better than Regression decision trees and Decision tree-based ensemble algorithms.

Keywords: quinones; machine learning; ridge regression; decision tree; ensemble methods; density functional theory; organic electrode materials

1. Introduction

In recent years, the global demand for effective energy-storage materials has constantly grown [1]. Traditionally, the widely used electrode materials in metal-ion batteries are inorganic compounds capable of reversible redox transformations [2,3]. Organic electrode materials, on the other hand, have some advantageous properties, such as structural diversity and flexibility, synthetic tunability, lower price, and harmless recyclability [4,5,6,7]. Among the organic compounds considered as battery electrode materials, quinones have attracted the greatest interest and the most extensive investigation. Quinones are a class of organic compounds derived from aromatic diols, whose redox capacity makes them interesting for designing novel organic electrode materials [8]. Quinones with a low molecular weight, such as 1,4-benzoquinone, have a relatively high redox potential [9] and, if the two-electron redox reaction of benzoquinone takes place, a high capacity could be expected. However, due to the sublimation and dissolution of benzoquinone in the organic electrolyte solvents, a poor capacity is observed in practice [10]. These problems can be overcome by immobilizing benzoquinone on nanoparticles [11], by using various polymers containing benzoquinone fragments [12,13,14], or by introducing different substituent groups [15]. The redox potential of the quinones is dependent on the substituent type; electron-withdrawing substituents, such as carbonyl, nitro, and carboxylate groups, make quinones stronger oxidants, while electron-donating groups, such as amine, hydroxyl, and alkoxy groups, turn quinones into weaker oxidants [16]. In the present study, a dataset of quinones with electron-withdrawing substituents was constructed, since this class of materials exhibits a fairly high redox potential.

Machine learning and statistics approaches have successfully been applied for capturing the complex relationships between material structures and different properties of interest [17]. This kind of approach has also effectively been employed in the design of novel energy-storage materials: Joshi et al. [18] demonstrated that deep neural networks (DNNs), support vector regression (SVR), and kernel ridge regression (KRR) can be used to predict the redox potential of inorganic electrode materials extracted from the Materials Project Database; Zhang et al. [19] used a Crystal Graph Convolutional Neural Network (CGCNN) to build an interpretable deep learning model that predicts redox potential based on inorganic crystal structures. Machine learning algorithms have also been productively applied in the design of organic electrode materials: Allam et al. [20] developed a pre-screening procedure that relies on the density functional theory to compute both the redox potential of organic electrode materials and molecular descriptors, such as the electron affinity and the gap between the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO), to be used as input features of artificial neural networks (ANNs), gradient boosting regression (GBR) and KRR. A major disadvantage is that the density functional theory, which is comparatively computationally expensive, is used for descriptor computation. Tuttle et al. [21] proposed a Hammett-like approach to model quinone solubility in organic electrolytes that are typically used in lithium-ion batteries (the organic electrode materials must have low solubility in the battery electrolyte). Machine learning screening has also been applied for the design of quinone electrolytes for redox flow batteries [22]: Wang et al. created a dataset by generating various disubstituted quinones, replacing hydrogens in different quinone backbones with a predefined set of substituents, and, subsequently, utilized the extreme gradient boosting algorithm to build a model for screening the HOMO–LUMO gap and the free energy of solvation. In the current study, different linear and nonlinear regression models were built to predict the electrode potential of substituted quinones.

Dataset construction plays a central role in any data-driven study. In this report, two tactics for the creation of application-specific datasets were combined. Firstly, a top-down approach was used: molecular structures that satisfy some application-specific conditions (i.e., contain a quinone fragment) were extracted from PubChem [23] (a large, publicly available database). Next, a bottom-up approach was applied: the dataset of molecular structures produced in the first step was expanded via inclusion of the systematically generated derivatives of the already-selected species. This strategy guarantees that the final dataset created is structurally consistent.

2. Materials and Methods

2.1. Dataset Construction

2.1.1. Molecular Structure Generation

To construct the dataset, 100 benzoquinone derivatives were extracted from the PubChem database [23] as simplified molecular-input line-entry system (SMILES) strings. The SMILES strings were converted into 3D structures using the OpenBabel software package (version 3.1.1) [24] and, subsequently, the DerGen software (version 0.1) [25] was used to generate all possible derivatives of those compounds containing a -CN or a -C≡CMe group. In total, 494 structures were produced. This dataset construction procedure guarantees that the generated molecules are structurally similar, and hence makes it easier to establish the structure–electrode potential relationship for a quinone series with electron-withdrawing substituents—a group of compounds that is particularly interesting for the design of organic energy-storage materials.

2.1.2. Dataset Splitting

The dataset was shuffled and split into a training set (395 compounds, 80% of the whole dataset) and a test set (99 compounds, 20% of the whole dataset). To avoid data leakage [26], the descriptor selection and hyperparameter optimization were performed on the training set. An average R² metric over 5-fold cross-validation was used for model performance assessment during the descriptor selection and hyperparameter optimization.
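The split described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' code: the function names `shuffle_split` and `kfold_indices`, the random seed, and the use of integer indices in place of molecules are all hypothetical choices made for the example.

```python
import math
import random


def shuffle_split(n_items, test_fraction=0.2, seed=0):
    """Shuffle item indices and split them into train and test index lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_fraction * n_items)  # 20% of 494 rounds up to 99
    return indices[n_test:], indices[:n_test]


def kfold_indices(train_indices, k=5):
    """Yield (fit, validation) index lists for k-fold cross-validation."""
    folds = [train_indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation


train_idx, test_idx = shuffle_split(494)          # 395 training, 99 test items
cv_folds = list(kfold_indices(train_idx, k=5))    # folds drawn from training set only
```

Because the folds are built from `train_idx` alone, descriptor selection and hyperparameter tuning never see the test molecules, which is the point of the leakage precaution.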

2.2. Molecular Descriptors

Representing molecular structures in an unambiguous machine-readable format is not a trivial task. Many different molecular representations have been developed [27]. Molecular structures can be represented as the following:

Strings: for example, the SMILES representation, which contains information about atom types and connectivity [28];
Connection table formats [29]: tabular formats that provide information about atom counts, atom types, connectivity matrix, bonded pairs of atoms, chirality, etc.; an example for such molecular representation format is the MDL molfile;
Vectors of features: a molecule can be represented either as a vector of molecular properties (descriptors) such as molecular weight, molecular volume, numbers of certain atom types, topology, etc., [30] or as a molecular fingerprint: a bitstring (can be regarded as vector of ones and zeros) is derived from the molecular structure according to a predefined set of rules [31]—among the most employed fingerprints are the extended-connectivity fingerprints (ECFPs) based on Morgan’s algorithm [32], since they are specially designed for establishing structure–property relationships [33];
Computer-learned representations: in recent years, a large number of machine learning-based molecular representations were developed—those methods rely on convolutional neural networks (CNNs) and/or recurrent neural networks (RNNs) to transform a molecule represented as a SMILES string or as 3D Cartesian atom coordinates to a low dimensional latent space [34,35] that can be used both for property prediction and for the generation of new molecular structures [36].

In the current study, the PaDEL [37] software package was employed to generate a multitude (750 descriptors per molecule) of cheminformatics-based molecular descriptors, and the MOPAC program suite [38] was used to produce semi-empirical descriptors such as HOMO and LUMO energies and the dipole moments of the reducible compounds.

Descriptor Selection

Feature selection is a key step in any data-driven study [39]. The objective of the feature selection procedure is to select features that have a strong correlation with the target variable. In the current work, the following steps were taken:

Low-variance descriptors were removed: descriptors whose value equalled the descriptor mode for 60% or more of the molecules in the dataset were discarded;
Descriptors that had a weak correlation with the target value were discarded. Correlations with covariance between the normalized descriptors and normalized target values of less than 0.25 were considered as weak correlations. The normalization was performed as follows:

$V_{norm} = \frac{V - V_{mean}}{\sigma_{V}},$

where V_norm is the normalized value, V is the unnormalized value, V_mean is the mean of V in the dataset, and σ_V is the standard deviation of V in the dataset.
Strongly mutually correlated descriptors (covariance between normalized descriptors greater than 0.7) were discarded. After this operation, 52 descriptors were left;
Backward stepwise regression [40] was used for further descriptor reduction (Figure 1). Finally, 32 descriptors remained (Table 1).
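The first three filtering steps can be sketched as follows. Note that the covariance of two variables normalized as above equals their Pearson correlation coefficient. The sketch rests on several assumptions that the text does not state: the absolute value of the covariance is compared with the thresholds, the population standard deviation is used, and within a strongly correlated pair the descriptor encountered later is the one discarded. The function names and the toy descriptor values are hypothetical, and backward stepwise regression is not included.

```python
from collections import Counter
from statistics import mean, pstdev


def zscore(values):
    """Normalize values to zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def normalized_covariance(a, b):
    """Covariance of two z-scored lists (equal to Pearson's r)."""
    za, zb = zscore(a), zscore(b)
    return sum(x * y for x, y in zip(za, zb)) / len(a)


def select_descriptors(descriptors, target, mode_share=0.6, weak=0.25, strong=0.7):
    """Filter a dict {name: values}; returns the names that survive."""
    kept = []
    for name, values in descriptors.items():
        # Step 1: drop descriptors dominated by their most common value.
        top_count = Counter(values).most_common(1)[0][1]
        if top_count / len(values) >= mode_share:
            continue
        # Step 2: drop descriptors weakly correlated with the target.
        if abs(normalized_covariance(values, target)) < weak:
            continue
        # Step 3: drop descriptors strongly correlated with one already kept.
        if any(abs(normalized_covariance(values, descriptors[k])) > strong for k in kept):
            continue
        kept.append(name)
    return kept


# Toy data, invented purely for illustration.
toy_target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
toy_descriptors = {
    "mostly_constant": [0, 0, 0, 0, 1, 2],
    "informative": [1.1, 1.9, 3.2, 3.9, 5.1, 6.2],
    "redundant": [2.0, 4.1, 6.0, 8.2, 10.0, 12.1],
    "noise": [1, -1, -1, 1, 1, -1],
}
selected = select_descriptors(toy_descriptors, toy_target)
```

In this toy case only `informative` survives: one descriptor fails the mode test, one is nearly uncorrelated with the target, and one duplicates a descriptor that was already kept.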

2.3. Redox Potential Calculation

The redox potential was calculated with the density functional theory (DFT) for the redox reaction:

Geometry optimization was performed on all quinone derivatives in the dataset (i) and their respective reduced forms (ii) using the B3LYP functional in combination with the 6-311++G(2df,2p) basis set, as implemented in the Gaussian 16 software package (version EM64L-G16RevB.01, 20 December 2017) [48]. This protocol was chosen as a trade-off between precision and computational time.

The electrode potential was calculated using the Nernst equation:

\Delta E = \frac{-\Delta G}{n F},

(1)

where n is the number of exchanged electrons, F is Faraday’s constant, and the reaction free energy, ΔG, is calculated as follows:

\Delta G = G_{ii} - G_{i} - 2 G_{Li}.

(2)

G_ii and G_i were obtained from the B3LYP/6-311++G(2df,2p) calculation, as follows [49]:

H_{X} = E_{0} + \mathrm{ZPE} + H_{trans} + H_{rot} + H_{vib} + RT

(3a)

S_{X} = S_{trans} + S_{rot} + S_{vib} + S_{el}

(3b)

G_{X} = H_{X} - T S_{X},

(3c)

where E₀ is the total electronic energy, ZPE is the unscaled zero-point energy, H_trans, H_rot, and H_vib are, correspondingly, the translational, rotational, and vibrational shares in the enthalpy, and S_trans, S_rot, S_vib, and S_el are, respectively, the translational, rotational, vibrational, and electronic motion contributions to the entropy. RT represents the work term converting the internal energy into enthalpy (T = 298 K). G_Li is the free energy of lithium in the gas phase. A comparison of calculated and experimental values of electrode potentials showed that when the free energy change in the redox reaction is estimated as the difference of the free energies of the reacting molecules in the gas phase, then the gas phase free energy for lithium should be considered as well (see Supplementary Information in [50]).
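As a worked illustration of Equations (1), (2), and (3c), the conversion from free energies to an electrode potential can be sketched as below. The free-energy values are invented for the example and are not taken from the dataset. The function names are hypothetical, and the assumption that the free energies are supplied in Hartree reflects the usual output units of quantum chemistry packages rather than anything stated in the text.

```python
FARADAY = 96485.33212                  # Faraday's constant, C/mol
HARTREE_TO_J_PER_MOL = 2625.4996e3     # 1 Hartree expressed in J/mol
N_ELECTRONS = 2                        # two-electron reduction, as in Equation (2)


def gibbs_free_energy(enthalpy, entropy, temperature=298.0):
    """Equation (3c): G = H - T*S, with H and T*S in the same energy units."""
    return enthalpy - temperature * entropy


def electrode_potential(g_reduced, g_quinone, g_li):
    """Equations (1) and (2): free energies in Hartree, result in volts."""
    delta_g = g_reduced - g_quinone - 2 * g_li           # Equation (2)
    delta_g_joule = delta_g * HARTREE_TO_J_PER_MOL
    return -delta_g_joule / (N_ELECTRONS * FARADAY)      # Equation (1)


# Hypothetical free energies (Hartree), chosen only to show the arithmetic.
example_potential = electrode_potential(g_reduced=-515.13, g_quinone=-500.00, g_li=-7.49)
```

With these made-up inputs, ΔG is about -0.15 Hartree, which corresponds to a potential of roughly 2.04 V.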

2.4. Machine Learning Methods Used

Different machine learning methods were deployed to investigate the relationship between the molecular structure and the electrode potential.

2.4.1. Ridge Regression

Ridge regression is a method for estimating the coefficients of l2-regularized multiple linear regression models:

X \beta = y,

(4)

where, for a dataset consisting of n molecules, each molecule is represented as an m-dimensional vector; X is an n × (m + 1) matrix of n-dimensional column vectors x_j (x₁ is [1 1 … 1]^T, while x₂, x₃, …, x_{m+1} are the values of the corresponding descriptors), known as explanatory variables; β is an (m + 1)-dimensional vector of parameters, where β₁ is the intercept term; and y is the vector of the observed values (redox potentials in the current study). The ridge estimator of β is given by the following equation [50]:

\beta = (X^{T} X + \lambda I)^{-1} X^{T} y,

(5)

where λ is a regularization coefficient and I is the identity matrix. Ridge regression is known to perform better than linear regression in cases of mutually correlated explanatory variables (molecular descriptors in our case) [51].
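Equation (5) can be evaluated directly for small problems. The sketch below is a plain-Python illustration with hypothetical function names; it applies λ to every diagonal element, intercept included, exactly as Equation (5) is written. Library implementations may treat the intercept differently (scikit-learn's `Ridge`, for instance, does not penalize it), so the numbers need not match those of a library fit.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ridge_fit(descriptors, y, lam):
    """Return beta = (X^T X + lam*I)^(-1) X^T y with a leading column of ones."""
    X = [[1.0] + list(row) for row in descriptors]
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Toy data lying exactly on y = 1 + 2x.
beta_ols = ridge_fit([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0], lam=0.0)
beta_ridge = ridge_fit([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0], lam=1.0)
```

With λ = 0 the fit recovers the intercept 1 and slope 2; with λ = 1 both coefficients shrink towards zero, which is the regularizing effect of the λI term.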

2.4.2. Decision Tree

First introduced in 1987 [52], decision trees are hierarchical supervised machine learning models that logically combine a sequence of decisions, based on simple tests, and their possible outcomes. This is achieved by optimizing the simple test condition threshold during the training process [53]. In the course of training, all possible data splits are considered:

Q_{m}^{l} = \{(x, y) \mid x_{j} < t_{m}\}

(6a)

Q_{m}^{r} = Q_{m} \setminus Q_{m}^{l},

(6b)

where Q_m is the data at node m, Q_m^l and Q_m^r are the candidate splits, x is a training data vector, x_j is the value of its j-th feature, t_m is the candidate threshold, and y is the corresponding value of the target variable. The threshold condition is optimized by comparing the quality of the splits using an appropriate cost function. For regression decision trees, the mean square error (MSE, Equation (8a)) or the Poisson deviance (Equation (8b)) can be used as cost functions [52]:

{\bar{y}}_{m} = \frac{1}{n_{m}} \sum_{y \in Q_{m}} y

(7)

H (Q_{m}) = \frac{1}{n_{m}} \sum_{y \in Q_{m}} {(y - {\bar{y}}_{m})}^{2}

(8a)

H (Q_{m}) = \frac{1}{n_{m}} \sum_{y \in Q_{m}} \left( y \log \frac{y}{{\bar{y}}_{m}} - y + {\bar{y}}_{m} \right)

(8b)

This splitting operation is performed for all the features, and the feature split that leads to the largest decrease in the cost function is kept at node m. This allows for the estimation of the feature importance—the more efficiently a feature split decreases the cost function, the more important the feature.
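The threshold search for a single descriptor can be sketched as follows, using the MSE cost of Equation (8a). The sketch makes two assumptions that are conventional for CART-style trees but are not spelled out in the text: candidate thresholds are placed midway between consecutive distinct values, and the quality of a split is the size-weighted average of the costs of the two child nodes. The function names are hypothetical.

```python
def mse(targets):
    """Equation (8a): mean squared deviation from the node mean."""
    node_mean = sum(targets) / len(targets)              # Equation (7)
    return sum((t - node_mean) ** 2 for t in targets) / len(targets)


def best_split(feature_values, targets):
    """Return (threshold, cost) of the best split on one descriptor."""
    pairs = sorted(zip(feature_values, targets))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for x, t in pairs if x < threshold]     # Equation (6a)
        right = [t for x, t in pairs if x >= threshold]   # Equation (6b)
        cost = (len(left) * mse(left) + len(right) * mse(right)) / n
        if cost < best[1]:
            best = (threshold, cost)
    return best


toy_threshold, toy_cost = best_split([1, 2, 3, 10, 11, 12],
                                     [1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
```

Repeating this search over every descriptor and keeping the descriptor with the lowest cost gives the split stored at the node. The drop in cost relative to the unsplit node is the quantity that feeds the feature importance.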

It should be noted that, due to their structure of sequential simple tests, decision trees are able to capture nonlinear dependencies between the explanatory variables and the measured property. Decision trees have been successfully utilized to solve both classification and regression problems [54,55,56]. There exist numerous algorithms for decision tree construction: ID3, C4.5, CART, MARS, and CHAID [57]. In the present study, the CART (classification and regression tree) algorithm with a mean square error cost function was used.

2.4.3. Random Forest

Random forests are ensemble machine learning algorithms that can be used for classification and regression. Multiple decision trees are constructed using randomly selected explanatory variables (molecular descriptors in our case) and each tree is trained on different bootstrapped samples (sampling, allowing for multiple selection of the same items) of the training set. When a prediction is made, the average result of all trees is returned [58].
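The two ingredients named above, bootstrapped training samples and averaged predictions, can be sketched in a few lines. The fitted trees are represented by arbitrary callables, and all names are hypothetical; the random selection of descriptors and the tree fitting itself are left out.

```python
import random


def bootstrap_indices(n_items, rng):
    """Draw n_items indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n_items) for _ in range(n_items)]


def ensemble_predict(trees, x):
    """Average the predictions of all fitted trees for one input."""
    return sum(tree(x) for tree in trees) / len(trees)


rng = random.Random(0)
sample = bootstrap_indices(395, rng)   # same size as the training set

# Stand-ins for fitted trees: each callable returns a constant prediction.
toy_trees = [lambda x: 1.0, lambda x: 2.0, lambda x: 3.0]
toy_average = ensemble_predict(toy_trees, x=None)
```

Because sampling is done with replacement, some training items appear several times in `sample` and others not at all, so each tree sees a slightly different dataset.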

2.4.4. Extra Trees

The extra trees algorithm [59] is similar to the random forest algorithm: a multitude of decision trees are used. However, the individual decision trees are by default trained on the whole training set rather than on bootstrapped samples. Another important difference is that, in the extra trees algorithm, the cut point for each candidate descriptor is drawn at random, while in the random forest algorithm, the optimal cut point is searched for. Training on the whole set tends to lower the bias, and the randomized cut points tend to lower the variance. The random choice of a cut point also makes the algorithm faster (in the random forest algorithm, the optimal split is found by computing some impurity metric for all possible splits).
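The random choice of a cut point can be illustrated with a short sketch. It assumes, as in the original extra trees formulation, that the threshold is drawn uniformly between the smallest and largest value of the descriptor at the node. The function name is hypothetical.

```python
import random


def random_cut_point(feature_values, rng):
    """Draw a threshold uniformly between the minimum and maximum value."""
    low, high = min(feature_values), max(feature_values)
    return rng.uniform(low, high)


rng = random.Random(0)
toy_values = [0.4, 1.2, 1.9, 2.7]
toy_cut = random_cut_point(toy_values, rng)
```

No impurity metric is evaluated while the threshold is drawn, which is where the speed advantage over an exhaustive threshold search comes from.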

2.4.5. Gradient Boosting

Gradient boosting relies on the fitting of a sequence of weak prediction models (decision trees in this case) on repeatedly altered versions of training data [60]. The predictions of all individual weak predictors are combined as a sum:

{\hat{y}}_{i} = F_{M} (x_{i}) = \sum_{m = 1}^{M} h_{m} (x_{i}),

(9)

where ŷ_i is the model prediction, x_i is a vector of all features that describes the i-th object in the dataset (in our case, all descriptors used to represent a molecule), M is the number of weak estimators, and h_m(x_i) is the prediction of the m-th weak estimator. From Equation (9), it follows that

F_{m} (x_{i}) = F_{m - 1} (x_{i}) + h_{m} (x_{i}) .

(10)

The weak predictor h_m(x_i) in Equation (10) is fitted to minimize a sum of the cost functions, C_m:

h_{m} = \operatorname{arg\,min}_{h} (C_{m}) = \operatorname{arg\,min}_{h} \left( \sum_{i = 1}^{n} c (y_{i}, F_{m - 1} (x_{i}) + h (x_{i})) \right),

(11)

where n is the number of training entries and c(y_i, F(x_i)) is a cost function, such as the mean square error (MSE, Equation (8a)).

Friedman [59] proposed a regularization strategy that scales the contribution of each new weak predictor by a learning rate (γ):

F_{m} (x_{i}) = F_{m - 1} (x_{i}) + γ h_{m} (x_{i}) .

(12)

It has been demonstrated [61] that, in many cases, gradient boosting outperforms other ensemble methods such as random forests and extra trees.
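For the MSE cost, the procedure of Equations (9)–(12) reduces to repeatedly fitting a weak predictor to the current residuals and adding a scaled copy of it to the model. The sketch below illustrates this on a single descriptor with depth-one trees as weak predictors. It is a simplified illustration with hypothetical names, not the scikit-learn implementation, and it assumes that the model is initialized with the mean of the training targets.

```python
def fit_stump(xs, residuals):
    """Fit a depth-one regression tree: one threshold and two leaf means."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: lm if x < threshold else rm


def gradient_boost(xs, ys, n_estimators=50, learning_rate=0.05):
    """Boost stumps on squared-error residuals and return a predictor."""
    base = sum(ys) / len(ys)            # constant initial model (assumption)
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Equation (12): F_m = F_{m-1} + gamma * h_m
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def training_mse(model, xs, ys):
    """Mean squared error of a predictor on the given data."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


toy_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [0.5, 0.7, 0.9, 1.8, 2.0, 2.2]
toy_mean = sum(toy_y) / len(toy_y)
mse_before = training_mse(lambda x: toy_mean, toy_x, toy_y)
mse_after = training_mse(gradient_boost(toy_x, toy_y), toy_x, toy_y)
```

A small learning rate makes each stump correct only a fraction of the remaining error, so more estimators are needed, but the fit follows the noise in the training data less closely.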

In the present study, all machine learning algorithms were used as implemented in the scikit-learn library [62].

3. Results and Discussion

The redox potential distribution (Figure 2) over the entire dataset (494 compounds) shows that the redox potential spans the range of 0.3–2.8 V. The distribution plot has an asymmetric bell-like shape, with the majority of compounds having potentials between 0.75 V and 1.60 V.

In order to find an optimal approach for the machine learning modelling of structure–redox potential relationships, the following algorithms were tested: ridge regression, decision tree, random forest, extra trees, and gradient boosting. Artificial neural networks were not considered, since they are prone to overfitting, especially when trained on an insufficient amount of data [63].

In order to attain maximal predictive ability, the hyperparameters (parameters that control the learning process) of each of the machine learning models were optimized using a grid search. The model performance was evaluated based on the averaged coefficient of determination (R²) [64] of the five-fold cross-validation over the training set. The training R² was also taken into account, since the difference between the validation and training R² can be used to judge whether the model is overfitted.
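The scoring used for the grid search can be sketched as follows. The coefficient of determination is computed from its usual definition, R² = 1 − SS_res/SS_tot. The grid search is reduced to picking the candidate with the highest score, and the scoring callable `cv_score` is a hypothetical stand-in for "fit the model on four folds, compute R² on the fifth, and average over the five folds".

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def grid_search(candidates, cv_score):
    """Return the candidate hyperparameter value with the highest score."""
    return max(candidates, key=cv_score)


perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
baseline = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])

# Toy scoring function peaking at 0.1, invented for illustration only.
toy_best = grid_search([0.01, 0.1, 1.0], cv_score=lambda lam: -abs(lam - 0.1))
```

A model that always predicts the mean of the observed values scores R² = 0, so positive values indicate an improvement over that baseline.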

The l2-regularization value (λ in Equation (5)) in ridge regression does not have a significant impact on the model performance (Figure 3a); increasing the l2-regularization value leads to a decrease (by an almost equal amount) of the training and validation R². It should be noted that the difference between the training and validation R² reached a minimum at λ = 0.1, and hence, this value of lambda results in an optimal (neither underfitted, nor overfitted) ridge regression model.

The decision tree maximal allowed depth plays a central role in determining whether the decision tree underfits or overfits the training data: a larger maximal allowed depth results in a deeper tree that fits the training data better; however, if a tree is too deep, the noise in the training data is also learned, i.e., the decision tree overfits. In the present work, the maximal tree depth was varied from two to twenty (Figure 3b). Optimal algorithm performance was attained when the maximal tree depth was three. A notable advantage of decision trees is that the learned sequence of decisions can be visualized (Figure 4). Furthermore, the decision tree algorithm enables the examination of the descriptor significance. It was found that the most significant descriptors, MAXDN2, LUMO, SaasC, SHdsCH, and BCUTc-1H (see Table 1), are all related to the electronic structure of the molecules: quinones that contain more CN and C≡C-Me groups (lower LUMO, large MAXDN2 due to CN groups) exhibit a larger redox potential.

To examine the predictive ability of random forest regression and extra trees regression, the maximal depth of the decision tree estimators was set to three (since we established that this value ensures the best learning performance), and the number of estimators was optimized to achieve the maximal coefficient of determination over the validation set (Figure 3c,d): 10 and 15 estimators were chosen for random forest and extra trees, respectively.

It was found that the extra trees algorithm is less prone to overfitting: the R² value over the validation set is closer to the R² value over the test set. The random forest and extra trees algorithms can also be used to estimate the descriptors’ importance: the most significant descriptors for the decision tree (described above) are found among the ten most significant descriptors of both algorithms, which confirms that the descriptors related to the electronic structure, such as the LUMO energy, and descriptors derived from electronegativity, such as SaasC, SHdsCH, MAXDN2, and meanI, are important for the machine learning prediction of the redox potential of organic energy-storage materials. As expected, we found that the gradient boosting regression exhibits a better predictive ability and is less prone to overfitting than the other ensemble methods used (random forest regression and extra trees regression). The optimal learning rate (γ in Equation (12)) and number of weak predictors, 0.05 and 50, respectively, were found via grid searching (Figure 3e).

The prediction models’ performance, as evaluated based on the average coefficient of determination over a five-fold cross-validation (R²_CV), increases in the following order: regression decision tree (R²_CV = 0.632), random forest regression (R²_CV = 0.705), extra trees regression (R²_CV = 0.715), gradient boosting regression (R²_CV = 0.756), and ridge regression (R²_CV = 0.832).

All machine learning algorithms were evaluated on the test set. To visualize the model performance, scatter plots of the redox potential calculated based on the density functional theory versus the redox potential estimated using the corresponding machine learning algorithms were drawn (Figure 5). Linear regression was used to construct a trendline in the (E_model, E_DFT) space (Figure 5, red lines), and the slope, intercept, and coefficient of determination (R²) of the trendline were calculated. When the model fits the data ideally, the trendline should have a slope of one and an intercept of zero, and the R² value should be close to one. It was found that the models’ performance on the test set does not differ significantly from the models’ performance observed upon the five-fold cross-validation, which means that the models fit the data fairly well (i.e., the models are not significantly overfitted or underfitted). All models tend to give worse predictions for large voltages: a possible explanation is that, in the training set, there are fewer molecules exhibiting a high redox potential. The dataset and the machine learning models’ source code are publicly available: https://github.com/carim2020/org-redox-dataset (accessed on 12 October 2023).
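The trendline statistics can be computed with an ordinary least-squares fit, sketched below. The sketch assumes that the model estimate is placed on the horizontal axis and the DFT value on the vertical axis. The function name and the toy potentials are hypothetical.

```python
def trendline(e_model, e_dft):
    """Least-squares line e_dft = slope * e_model + intercept, plus R^2."""
    n = len(e_model)
    mean_x, mean_y = sum(e_model) / n, sum(e_dft) / n
    sxx = sum((x - mean_x) ** 2 for x in e_model)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(e_model, e_dft))
    syy = sum((y - mean_y) ** 2 for y in e_dft)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)   # squared Pearson correlation
    return slope, intercept, r_squared


# Toy potentials in volts: a model that reproduces the reference exactly.
toy_slope, toy_intercept, toy_r2 = trendline([0.5, 1.0, 1.5, 2.0],
                                             [0.5, 1.0, 1.5, 2.0])
```

For a single-variable linear fit, the R² of the trendline equals the squared correlation between the two sets of potentials, so it measures scatter about the line, while the slope and intercept reveal systematic over- or underestimation.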

4. Conclusions

We have constructed a dataset of 494 potential organic electrode materials through the automated generation of derivatives of 100 quinones, extracted from a general-purpose public database (PubChem). A descriptor selection procedure that combines low-variance descriptor removal with covariance matrix analysis and stepwise linear regression for finding uncorrelated descriptors, on which the redox potential of the molecules in the dataset depends, was devised. Due to the comparatively small dataset size, deep learning approaches were deemed inappropriate and were not deployed, since they are prone to overfitting when trained on small amounts of data. Five different supervised machine learning models for regression that tend to give better results for smaller datasets were built. The hyperparameters of all those models were tuned to attain the maximal electrode potential predictive ability. The models’ performance was evaluated on a test set containing molecules that are completely unknown to the model. It was established that the model performance increases in the following order: regression decision tree < random forest regression < extra trees regression < gradient boosting regression < ridge regression. It turned out that the linear model, i.e., the ridge regression, outperforms the decision tree-based algorithms, known to be able to capture nonlinear dependencies between the descriptors and the target variable. This implies that the relationship between the electrode potential and some chemical properties is most probably linear. In particular, it was found that descriptors related to the electronic structure (LUMO and E-state descriptors) have a large significance. In addition, ridge regression is an excellent method for the screening of databases, as it is a very fast and computationally inexpensive approach, compared to other machine learning algorithms.

Author Contributions

Conceptualization, A.T. and R.S.; methodology, I.K. and L.B.; software, I.K. and L.B.; validation, L.B.; formal analysis, L.B.; resources, L.B.; data curation, I.K.; writing—original draft preparation, I.K. and L.B.; writing—review and editing, A.T.; visualization, L.B.; supervision, R.S. and A.T.; project administration, R.S.; funding acquisition, R.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by CARIM-VIHREN, grant number КП-06-ДВ-6/2019, and the APC was funded by European Twinning on Materials Chemistry Enabling Clean Technologies (TwinTeam), grant number D01-272/10.2020.

Informed Consent Statement

Not applicable.

Data Availability Statement

The computed data are available from the authors on request.

Acknowledgments

The authors acknowledge the European Regional Development Fund within the Operational Programme “Science and Education for Smart Growth 2014–2020” under the Project CoE “National center of mechatronics and clean technologies” (BG05M2OP001-1.001-0008) for computational facilities. I.K. is grateful to Pascal Friederich for internship hosting. Thanks are due to Nina Markova and Yanislav Danchovsky for the initial dataset (100 compounds) construction.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Poizot, P.; Dolhem, F. Clean energy new deal for a sustainable world: From non-CO₂ generating energy sources to greener electrochemical storage devices. Energy Environ. Sci. 2011, 4, 2003–2019.
2. Larcher, D.; Tarascon, J.M. Towards greener and more sustainable batteries for electrical energy storage. Nat. Chem. 2015, 7, 19–29.
3. Poizot, P.; Gaubicher, J.; Renault, S.; Dubois, L.; Liang, Y.; Yao, Y. Opportunities and Challenges for Organic Electrodes in Electrochemical Energy Storage. Chem. Rev. 2020, 120, 6490–6557.
4. Schon, T.B.; McAllister, B.T.; Li, P.-F.; Seferos, D.S. The rise of organic electrode materials for energy storage. Chem. Soc. Rev. 2016, 45, 6345–6404.
5. Lu, Y.; Zhang, Q.; Li, L.; Niu, Z.; Chen, J. Design Strategies toward Enhancing the Performance of Organic Electrode Materials in Metal-Ion Batteries. Chemistry 2018, 4, 2786–2813.
6. Lu, Y.; Chen, J. Prospects of organic electrode materials for practical lithium batteries. Nat. Rev. Chem. 2020, 4, 127–142.
7. Esser, B.; Dolhem, F.; Becuwe, M.; Poizot, P.; Vlad, A.; Brandell, D. A perspective on organic electrode materials and technologies for next generation batteries. J. Power Sources 2021, 482, 228814.
8. Yan, L.; Zhao, C.; Sha, Y.; Li, Z.; Liu, T.; Ling, M.; Zhou, S.; Liang, C. Electrochemical redox behavior of organic quinone compounds in aqueous metal ion electrolytes. Nano Energy 2020, 73, 10476.
9. Tobishima, S.; Yamaki, J.; Yamaji, A. Cathode Characteristics of Organic Electron Acceptors for Lithium Batteries. J. Electrochem. Soc. 1984, 131, 57–63.
10. Senoh, H.; Yao, M.; Sakaebe, H.; Yasuda, K.; Siroma, Z. A two-compartment cell for using soluble benzoquinone derivatives as active materials in lithium secondary batteries. Electrochim. Acta 2011, 56, 10145–10150.
11. Genorio, B.; Pirnat, K.; Cerc-Korosec, R.; Dominko, R.; Gaberscek, M. Electroactive Organic Molecules Immobilized onto Solid Nanoparticles as a Cathode Material for Lithium-Ion Batteries. Angew. Chem. Int. Ed. 2010, 49, 7222–7224.
12. Foos, J.S.; Erker, S.M.; Rembetsy, L.M. Synthesis and Characterization of Semiconductive Poly-1,4-Dimethoxybenzene and Its Derived Polyquinone. J. Electrochem. Soc. 1986, 133, 836–840.
13. Häringer, D.; Novák, P.; Haas, O.; Piro, B.; Pham, M.-C. Poly(5-amino-1,4-naphthoquinone), a Novel Lithium-Inserting Electroactive Polymer with High Specific Charge. J. Electrochem. Soc. 1999, 146, 2393–2396.
14. Gall, T.L.; Reiman, K.H.; Grossel, M.C.; Owen, J.R. Poly(2,5-dihydroxy-1,4-benzoquinone-3,6-methylene): A new organic polymer as positive electrode material for rechargeable lithium batteries. J. Power Sources 2003, 119–121, 316–320.
15. Son, E.J.; Kim, J.H.; Kim, K.; Park, C.B. Quinone and its derivatives for energy harvesting and storage materials. J. Mat. Chem. A 2016, 4, 11179–11202.
16. Chambers, J.Q. Quinonoid Compounds, 1st ed.; John Wiley & Sons Ltd.: Hoboken, NJ, USA, 2010.
17. Mueller, T.; Kusne, A.G.; Ramprasad, R. Machine learning in materials science: Recent progress and emerging applications. In Reviews in Computational Chemistry; Parrill, A.L., Lipkowitz, B.K., Eds.; John Wiley & Sons, Inc.: Indianapolis, IN, USA, 2016; Volume 29, pp. 186–273.
18. Joshi, R.P.; Eickholt, L.; Li, L.; Fornari, M.; Barone, V.; Peralta, J.E. Machine Learning the Voltage of Electrode Materials in Metal-ion Batteries. ACS Appl. Mater. Interfaces 2019, 11, 18494–18503.
19. Zhang, X.; Zhou, J.; Lu, J.; Shen, L. Interpretable learning of voltage for electrode design of multivalent metal-ion batteries. NPJ Comput. Mater. 2022, 8, 175.
20. Allam, O.; Kuramshin, R.; Stoichev, Z.; Cho, B.W.; Lee, S.W.; Jang, S.S. Molecular structure–redox potential relationship for organic electrode materials: Density functional theory–Machine learning approach. Mater. Today Energy 2020, 17, 100482.
21. Tuttle, M.R.; Brackman, E.M.; Sorourifar, F.; Paulson, J.; Zhang, S. Predicting the Solubility of Organic Energy Storage Materials Based on Functional Group Identity and Substitution Pattern. J. Phys. Chem. Lett. 2023, 14, 1318–1325.
22. Wang, F.; Li, J.; Liu, Z.; Qiu, T.; Wu, J.; Lu, D. Computational design of quinone electrolytes for redox flow batteries using high-throughput machine learning and theoretical calculations. Front. Chem. Eng. 2022, 4, 1086412.
23. Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B.A.; Thiessen, P.A.; Yu, B.; et al. PubChem 2023 update. Nucleic Acids Res. 2023, 51, D1373–D1380.
24. Open Babel Development Team. Open Babel. 2016. Available online: http://openbabel.org/wiki/Main_Page (accessed on 12 October 2023).
25. Kichev, I.; Borislavov, L.; Tadjer, A. Automated generation of molecular derivatives – DerGen software package. Mater. Today Proc. 2022, 61, 1287–1291.
26. Nayak, S.K.; Ojha, A.C. Data Leakage Detection and Prevention: Review and Research Directions. In Machine Learning and Information Processing. Advances in Intelligent Systems and Computing; Swain, D., Pattnaik, P., Gupta, P., Eds.; Springer: Singapore, 2020; Volume 1101, pp. 203–212.
Wigh, D.S.; Goodman, J.M.; Lapki, A.A. A review of molecular representation in the age of machine learning. WIREs Comput. Mol. Sci. 2022, 12, e1603. [Google Scholar] [CrossRef]
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36. [Google Scholar] [CrossRef]
Dalby, A.; Nourse, J.G.; Hounshell, W.G.; Gushurst, A.K.I.; Grier, D.L.; Leland, B.A.; Laufer, J. Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited. J. Chem. Inf. Comput. Sci. 1992, 32, 244–255.
Todeschini, R.; Consonni, V. Handbook of Molecular Descriptors. In Methods and Principles in Medicinal Chemistry; WILEY-VCH Verlag GmbH: Weinheim, Germany, 2000.
Cereto-Massagué, A.; Ojeda, M.J.; Valls, C.; Mulero, M.; Garcia-Vallvé, S.; Pujadas, G. Molecular fingerprint similarity search in virtual screening. Methods 2015, 71, 58–63.
Morgan, H.L. The Generation of a Unique Machine Description for Chemical Structures-A Technique Developed at Chemical Abstracts Service. J. Chem. Doc. 1965, 5, 107–113.
Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754.
Elton, D.C.; Boukouvalas, Z.; Fuge, M.D.; Chung, P.W. Deep learning for molecular design—A review of the state of the art. Mol. Syst. Des. Eng. 2019, 4, 828–849.
Kuzminykh, D.; Polykovskiy, D.; Kadurin, A.; Zhebrak, A.; Baskov, I.; Nikolenko, S.; Shayakhmetov, R.; Zhavoronkov, A. 3D Molecular Representations Based on the Wave Transform for Convolutional Neural Networks. Mol. Pharm. 2018, 15, 4378–4385.
Skalic, M.; Jiménez Luna, J.; Sabbadin, D.; De Fabritiis, G. Shape-Based Generative Modeling for de-novo Drug Design. J. Chem. Inf. Model. 2019, 59, 1205–1214.
Yap, C.W. PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 2011, 32, 1466–1474.
Stewart, J.J.P. MOPAC2016; Stewart Computational Chemistry: Colorado Springs, CO, USA, 2016; Available online: http://OpenMOPAC.net (accessed on 12 October 2023).
Pudjihartono, N.; Fadason, T.; Kempa-Liehr, A.W.; O’Sullivan, J.M. A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction. Front. Bioinform. 2022, 2, 927312.
Hocking, R.R. The Analysis and Selection of Variables in Linear Regression. Biometrics 1976, 32, 1–49.
Burden, F.R. Molecular identification number for substructure searches. J. Chem. Inf. Comput. Sci. 1989, 29, 225–227.
Sharma, V.; Goswami, R.; Madan, A.K. Eccentric Connectivity Index: A Novel Highly Discriminating Topological Descriptor for Structure-Property and Structure-Activity Studies. J. Chem. Inf. Comput. Sci. 1997, 37, 273–282.
Hall, L.H.; Kier, L.B. Electrotopological state indices for atom types: A novel combination of electronic, topological, and valence state information. J. Chem. Inf. Comput. Sci. 1995, 35, 1039–1045.
Gramatica, P.; Corradi, M.; Consonni, V. Modelling and prediction of soil sorption coefficients of non-ionic organic pesticides by molecular descriptors. Chemosphere 2000, 41, 763–777.
Nilakantan, R.; Nunn, D.S.; Greenblatt, L.; Walker, G.; Haraki, K.; Mobilio, D. A family of ring system-based structural fragments for use in structure-activity studies: Database mining and recursive partitioning. J. Chem. Inf. Model. 2006, 46, 1069–1077.
Todeschini, R.; Consonni, V. Molecular Descriptors for Chemoinformatics; WILEY-VCH Verlag GmbH: Weinheim, Germany, 2009; pp. 809–812.
Zhao, Y.H.; Abraham, M.H.; Zissimos, A.M. Fast Calculation of van der Waals Volume as a Sum of Atomic and Bond Contributions and Its Application to Drug Compounds. J. Org. Chem. 2003, 68, 7368–7373.
Frisch, M.J.; Trucks, G.W.; Schlegel, H.B.; Scuseria, G.E.; Robb, M.A.; Cheeseman, J.R.; Scalmani, G.; Barone, V.; Petersson, G.A.; Nakatsuji, H.; et al. Gaussian 16, Revision C.01; Gaussian, Inc.: Wallingford, CT, USA, 2016.
Ochterski, J.W. Thermochemistry in Gaussian; Gaussian, Inc.: Wallingford, CT, USA, 2000.
Danchovski, Y.; Rasheev, H.; Stoyanova, R.; Tadjer, A. Molecular Engineering of Quinone-Based Nickel Complexes and Polymers for All-Organic Li-Ion Batteries. Molecules 2022, 27, 6805.
Hoerl, A.E.; Kennard, R.W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 1970, 12, 55–67.
Breiman, L. Classification and Regression Trees, 1st ed.; Chapman & Hall/CRC: Boca Raton, FL, USA, 1984.
Kotsiantis, S.B. Decision trees: A recent overview. Artif. Intell. Rev. 2013, 39, 261–283.
Klekota, J.; Roth, F.P. Chemical substructures that enrich for biological activity. Bioinformatics 2008, 24, 2518–2525.
Hou, T.; Wang, J.; Li, Y. ADME evaluation in drug discovery. The prediction of human intestinal absorption by a support vector machine. J. Chem. Inf. Model. 2007, 47, 2408–2415.
Lamanna, C.; Bellini, M.; Padova, A.; Westerberg, G.; Maccari, L. Straightforward recursive partitioning model for discarding insoluble compounds in the drug discovery process. J. Med. Chem. 2008, 51, 2891–2897.
Patel, H.H.; Prajapati, P. Study and Analysis of Decision Tree Based Classification Algorithms. Int. J. Comput. Sci. Eng. 2018, 6, 74–78.
Ho, T.K. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 832–844.
Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42.
Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
Piryonesi, S.M.; El-Diraby, T.E. Data Analytics in Asset Management: Cost-Effective Prediction of the Pavement Condition Index. J. Infrastruct. Syst. 2020, 26, 04019036.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
Bejani, M.M.; Ghatee, M. A systematic review on overfitting control in shallow and deep neural networks. Artif. Intell. Rev. 2021, 54, 6391–6438.
Wright, S. Correlation and causation. J. Agric. Res. 1921, 20, 557–585.

Figure 1. Results of the backward stepwise regression for descriptor selection.

Figure 2. Redox potential distribution histogram.

Figure 3. Model tuning via grid search for (a) the optimal learning rate in ridge regression; (b) the optimal maximal decision tree depth; (c,d) the optimal number of decision tree estimators in random forest regression and extra trees regression, respectively; (e) the optimal number of decision tree estimators and the learning rate (LR) in gradient boosting. The optimal hyperparameter value is marked by a dotted red line.

Figure 4. Regression decision tree chart with maximal depth of three.

Figure 5. Models’ performance on the test set.

Table 1. Molecular descriptors’ names and descriptions.

Description	Descriptor Name
Lowest partial charge weighted BCUTS [41]	BCUTc-1l
Highest partial charge weighted BCUTS [41]	BCUTc-1h
Total number of double bonds (excluding aromatic bonds)	nBondsD2
Triply bound carbon bound to another carbon	C1SP1
Doubly bound carbon bound to three other carbons	C3SP2
A topological descriptor combining distance and adjacency information [42]	ECCEN
Count of atom-type H E-State: H on aaCH, dCH2 or dsCH * [43]	nHother
Count of atom-type E-State: =C< [43]	ndssC
Count of atom-type E-State: aaC- [43]	naasC
Count of atom-type E-State: N≡ [43]	ntN
Sum of E-States for weak hydrogen bond acceptors [43]	SwHBa
Sum of atom-type H E-State: =CH- [43]	SHdsCH
Sum of atom-type H E-State: H on aaCH, dCH2 or dsCH [43]	SHother
Sum of atom-type E-State: =C< [43]	SdssC
Sum of atom-type E-State: aaC- [43]	SaasC
Sum of atom-type E-State: N≡ [43]	StN
Minimum atom-type H E-State: H on aaCH, dCH2 or dsCH [43]	minHother
Minimum atom-type E-State: aaC- [43]	minaasC
Minimum atom-type E-State: =O [43]	mindO
Maximum atom-type H E-State: H on aaCH, dCH2 or dsCH [43]	maxHother
Maximum atom-type E-State: aaC- [43]	maxaasC
Mean intrinsic state values I [43]	meanI
Maximum negative intrinsic state difference in the molecule (related to the nucleophilicity of the molecule) [44]	MAXDN2
Maximum positive intrinsic state difference in the molecule (related to the electrophilicity of the molecule) [44]	MAXDP2
Complexity of the system [45]	fragC
Number of rings	nRing
Topological diameter (maximum atom eccentricity)	topoDiameter
Mean topological charge index of order two [46]	JGI2
Topological polar surface area	TopoPSA
Van der Waals volume calculated using the method of Zhao et al. [47]	VABC
Molecular weight	MW
Energy of the lowest unoccupied molecular orbital estimated by PM6 [eV]	LUMO

* a = aromatic; s = single; d = double.
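
The naming convention behind the E-State entries in Table 1 can be illustrated with a short sketch. It is not part of PaDEL or of the authors' workflow: the function name `decode_estate_name` is hypothetical, the statistic prefixes (n, S, min, max) are inferred from the table descriptions, and "t = triple" is an assumption based on the N≡ rows, since the footnote only defines a, s and d. Names that do not follow the simple "prefix + bond letters + element" pattern (for example nHother or LUMO) are deliberately left undecoded.

```python
import re

# Bond-type letters: a, s, d come from the Table 1 footnote; t = triple is an
# assumption inferred from the "N≡" rows (ntN, StN).
BOND_LETTERS = {"a": "aromatic", "s": "single", "d": "double", "t": "triple"}

# Statistic prefixes, inferred from the table descriptions (not an official list).
PREFIXES = (("min", "minimum"), ("max", "maximum"), ("n", "count"), ("S", "sum"))

# One or more bond letters followed by an element symbol, e.g. "dssC" or "dO".
ATOM_TYPE = re.compile(r"([asdt]+)([A-Z][a-z]?)")


def decode_estate_name(name):
    """Return (statistic, element, bond types), or None if the name does not fit."""
    for prefix, statistic in PREFIXES:
        if name.startswith(prefix):
            match = ATOM_TYPE.fullmatch(name[len(prefix):])
            if match:
                bonds = tuple(BOND_LETTERS[c] for c in match.group(1))
                return statistic, match.group(2), bonds
    return None


for example in ("ndssC", "SaasC", "mindO", "ntN", "nRing"):
    print(example, "->", decode_estate_name(example))
```

Read this way, ndssC is the count of carbon atoms with one double and two single bonds (=C<), which matches its row in the table, while nRing falls outside the pattern and is returned as None.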

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
