Article

Computational Prediction of Compound–Protein Interactions for Orphan Targets Using CGBVS

1
Data Science Division, INTAGE Healthcare Inc., 2F NREG Midosuji Bldg., 3-5-7 Kawara-Machi, Chuo-ku, Osaka 541-0048, Japan
2
Business Development Division, Advanced Technology Department, INTAGE Inc., Akihabara Building, 3 Kanda-Neribeicho, Chiyoda-ku, Tokyo 101-8201, Japan
3
Institute for Theoretical Medicine Inc., 26-1 Muraoka-Higashi 2-Chome, Fujisawa 251-0012, Japan
*
Authors to whom correspondence should be addressed.
Molecules 2021, 26(17), 5131; https://doi.org/10.3390/molecules26175131
Submission received: 20 July 2021 / Revised: 16 August 2021 / Accepted: 23 August 2021 / Published: 24 August 2021
(This article belongs to the Section Computational and Theoretical Chemistry)

Abstract

A variety of Artificial Intelligence (AI)-based machine learning techniques have been developed for in silico prediction of Compound–Protein interactions (CPI), one of which is a technique we refer to as chemical genomics-based virtual screening (CGBVS). The main feature of CGBVS is prediction with a pairwise kernel-based support vector machine (SVM), which gives high prediction accuracy with simple implementation and easy handling. We studied whether the CGBVS technique can identify ligands for targets without ligand information (orphan targets) using data from G protein-coupled receptor (GPCR) families. As the validation method, we tested whether the ligand prediction was correct for a virtual orphan GPCR in which all ligand information for one selected target was omitted from the training data. We expressed the results of this study as an applicability index and developed a method to determine whether CGBVS can be used to predict GPCR ligands. Validation results showed that the prediction accuracy differed greatly among GPCRs, but models using Multiple Sequence Alignment (MSA) as the protein descriptor performed well in terms of overall prediction accuracy. We also found that the type of compound descriptor had a smaller effect on the prediction accuracy than the type of protein descriptor. Furthermore, we found that the accuracy of the ligand prediction depends on the amount of ligand information available for GPCRs related to the target: the prediction accuracy tends to be high if a large amount of ligand information for related proteins is used in the training.

1. Introduction

Post-genome research has been providing a large amount of omics data on genes and proteins, including genomes, transcriptomes, and proteomes. At the same time, the development of technologies such as high-throughput screening has led to the accumulation of bioactivity information on a vast number of compounds and drugs. This information is published in public databases such as ChEMBL [1,2,3] and PubChem [4] and is freely available to use. Such bioactivity information between compounds and proteins is also referred to as drug–target interaction (DTI) and, in a broad context, simply Compound–Protein interaction (CPI). Research utilizing such data has been one of the major hot topics in the field of drug discovery. Many drugs affect the human body in the form of drug effects or side effects through interactions with biomolecules such as target proteins. This is why the identification of CPIs is an important issue in drug discovery research. However, accurate and comprehensive identification of CPIs by experiment is almost impossible due to the enormous costs involved. In recent years, various Artificial Intelligence (AI) technologies have been developed to predict CPIs (or DTIs) on a large scale by effectively utilizing the vast amount of bioactivity data that has been accumulated [5,6,7,8,9,10,11,12,13,14].
In the early stages of drug discovery, ligands that act on target proteins are often insufficiently identified or not identified at all. In addition, we cannot expect much information on the three-dimensional structure of target proteins in these cases. In silico approaches generally do not work well in situations where known active ligands and protein structural information are extremely limited. Nevertheless, in order to move a drug discovery project forward, in silico approaches are still worth trying when a ligand is needed, even a weakly active one. In such cases, AI technology for CPI prediction is promising as an in silico approach.
One of the many AI-based methods for predicting CPIs is called the CGBVS technique. This technique is theoretically simpler than other methods, can be implemented without difficulty, and gives sufficient accuracy of prediction. Hamanaka et al. [14] have implemented CGBVS with a deep neural network (CGBVS-DNN) that enabled training of over a million CPIs. Wassermann et al. [7], using a machine learning approach similar to CGBVS, used a limited number of protease targets as an example to test the prediction accuracy for orphan targets. They have concluded that ligand information of nearest neighbors is essential for a good prediction of ligands of orphan targets.
We studied the following two aspects when using the CGBVS technique. The first aspect is how accurate the ligand prediction is for orphan targets. In this study, we focused on the G protein-coupled receptor (GPCR) family, which is an important and data-rich target in drug discovery. Out of the available 243 possible targets, we randomly selected 52 GPCRs. We created 52 machine learning models of CGBVS, omitting all the ligand information for one particular target GPCR per model. That is, we created a virtual orphan GPCR per model and tested whether the model could predict the ligands for that virtual orphan GPCR. We also investigated how the accuracy of ligand prediction for 52 selected virtual orphan GPCRs is affected by the combination of compound and protein descriptors used in the machine learning process.
The second aspect is to examine the conditions and applicability of high prediction accuracy. Here, we first introduced an applicability index which helped us determine whether it is possible to apply CGBVS to true orphan GPCRs.

2. Materials and Methods

2.1. CGBVS

For the purpose of investigating the relationship between the applicability of the CGBVS method to ligand prediction of orphan GPCRs and the protein kernel, we used SVM instead of a deep neural network. The CGBVS technique we used is mostly implemented according to the method studied by Yabuuchi et al. [8], but the Support Vector Machine (SVM) [15] formulation is slightly different from the original CGBVS technique: the kernel function of the SVM is explicitly divided into a compound-derived part and a protein-derived part for each calculation. This SVM calculation is the same as the method used in the work of Wassermann et al. [7]. Letting $c$ be the compound vector and $p$ be the protein vector, we can express the Compound–Protein interaction vector (CPI vector) $x$ as their tensor product $x = c \otimes p$. The SVM kernel for two CPI vectors $x$ and $x'$ can then be expressed by [16]:
$$K(x, x') = K_C(c, c') \cdot K_P(p, p'), \tag{1}$$
where $K_C$ and $K_P$ are the compound and protein kernels, respectively. Since the compound and protein kernels can be calculated independently, there is no need to explicitly calculate the matrix representation of the tensor product as a representation of the actual CPI vector.
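The following minimal sketch illustrates why the tensor product never has to be built explicitly. It assumes linear kernels for both parts (the kernels actually used per descriptor are those listed in Table 2), and the toy vectors and function names are hypothetical.

```python
from itertools import product

def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def tensor(c, p):
    """Flattened tensor (outer) product of a compound and a protein vector."""
    return [ci * pj for ci, pj in product(c, p)]

def pairwise_kernel(c1, p1, c2, p2):
    """Compound kernel times protein kernel (both assumed linear here)."""
    return dot(c1, c2) * dot(p1, p2)

# Hypothetical toy descriptors, for illustration only.
c1, p1 = [1.0, 2.0], [0.5, 1.0, 0.0]
c2, p2 = [0.0, 3.0], [2.0, 1.0, 4.0]

explicit = dot(tensor(c1, p1), tensor(c2, p2))  # kernel on explicit CPI vectors
implicit = pairwise_kernel(c1, p1, c2, p2)      # product of the two small kernels
print(explicit, implicit)
```

Both routes give the same value, but the implicit one only touches vectors of the original compound and protein dimensions.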
A schematic diagram of the calculation procedure of our CGBVS technique is shown in Figure 1. The first step is to prepare the structural formula data set of the compounds and the amino acid sequence data set of the proteins for machine learning. From a compound structural formula, descriptors such as physicochemical parameters and fingerprints are calculated and converted into a compound vector. From an amino acid sequence, the descriptors associated with the strings are calculated and converted into a protein vector. The two vectors created are combined according to bioactivity data to create the CPI vector. If the activity value of the data is higher than the set threshold, it is a positive CPI vector; otherwise, it is a negative CPI vector. The CGBVS model is created by machine learning via SVM of positive and negative CPI vectors based on aforementioned kernels. This CGBVS model allows us to predict the activity of unknown Compound–Protein combinations. The usual SVM score is the value of the distance from the discriminative surface (decision function), but this value is sigmoidally transformed to perform probability estimation [17].
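The sigmoid transformation of the SVM score mentioned above can be sketched as follows. This assumes the standard form of Platt scaling [17], P = 1 / (1 + exp(A·f + B)); the parameter values below are hypothetical placeholders, whereas in practice A and B are fitted on held-out decision values.

```python
import math

def platt_probability(decision_value, a, b):
    """Map an SVM decision value f to P(active) = 1 / (1 + exp(a*f + b))."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))

# Hypothetical fitted parameters; a is negative so that a larger
# distance on the positive side of the decision surface gives a higher probability.
A_FIT, B_FIT = -2.0, 0.0

probs = [platt_probability(f, A_FIT, B_FIT) for f in (-1.5, 0.0, 1.5)]
print(probs)
```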

2.2. Virtual Orphan GPCR Model

The CGBVS model developed in this study is based on the GPCR-related activity data from the ChEMBL 25 database [1,2,3]. The total number of GPCRs was 243, and the number of associated compounds was 280,648. A Compound–Protein pair was regarded as active if 50% activity was reached at ≤ 1 μM and as inactive if it was reached only at ≥ 3 μM. No distinction was made between agonists and antagonists, and data such as the inhibition rate at a single concentration were not used. Under these conditions, the number of CPIs for positive samples was 165,877 and the number of CPIs for negative samples was 233,272.
To create a virtual orphan GPCR model, we select one GPCR and delete the CPI data for that GPCR from the training data set (see Figure 2). Fifty-two GPCRs with data for 100 or more active ligands were randomly selected as virtual orphan GPCRs in this study (Table 1).
The CPI data for only one target are deleted per CGBVS model, which leaves CPIs for 242 of the available 243 targets. As control models for comparing the prediction performance of the virtual orphan GPCR models, we also built models that included only half of the original ligand data for each target GPCR in the training set and retained the other half as a test set. We refer to these as half-sampled GPCR models (Tables S1 and S2).
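The activity criterion can be written as a small labeling function. This is a sketch only: the function name is hypothetical, and the treatment of values between 1 μM and 3 μM (excluded from both classes) is an assumption, since the text does not state what happens to them.

```python
def label_cpi(activity_um):
    """Label a 50%-activity value given in micromolar.

    Returns 1 (positive CPI), 0 (negative CPI) or None (not used).
    """
    if activity_um <= 1.0:
        return 1
    if activity_um >= 3.0:
        return 0
    return None  # assumption: the 1-3 uM gap is left out of training

labels = [label_cpi(v) for v in (0.05, 1.0, 2.0, 3.0, 25.0)]
print(labels)
```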
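The two data splits described above can be sketched as follows. The record layout `(compound_id, target_id, label)`, the function names, and the toy records are hypothetical, and the random half-split is only one plausible way to realize a half-sampled model.

```python
import random

def virtual_orphan_split(cpi_records, orphan_target):
    """Drop every CPI record of the orphan target from training; hold them out."""
    train = [r for r in cpi_records if r[1] != orphan_target]
    held_out = [r for r in cpi_records if r[1] == orphan_target]
    return train, held_out

def half_sampled_split(cpi_records, target, seed=0):
    """Keep a random half of the target's records in training; hold out the rest."""
    rng = random.Random(seed)
    target_rows = [r for r in cpi_records if r[1] == target]
    others = [r for r in cpi_records if r[1] != target]
    rng.shuffle(target_rows)
    half = len(target_rows) // 2
    return others + target_rows[:half], target_rows[half:]

# Hypothetical records: (compound_id, target_id, label)
records = [("c1", "ADRB2", 1), ("c2", "ADRB2", 0), ("c3", "CHRM3", 1),
           ("c4", "CHRM3", 1), ("c5", "HRH3", 0), ("c6", "ADRB2", 1)]

vo_train, vo_test = virtual_orphan_split(records, "ADRB2")
hs_train, hs_test = half_sampled_split(records, "ADRB2")
print(len(vo_train), len(vo_test), len(hs_train), len(hs_test))
```

In the virtual orphan split the model never sees a ligand of the target, whereas the half-sampled control still trains on part of the target's own ligand data.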
The two types of compound descriptors used in this study are descriptors calculated using alvaDesc [18] and ECFP [19]. Using alvaDesc, software developed by Alvascience, 941 non-fingerprint 2D descriptors were calculated, while 2048-bit Extended Connectivity Fingerprints with a radius of 2 (ECFP4) were calculated using RDKit [20]. For proteins, three types of descriptors were used: PROFEAT 2016 [21], ProtVec [22], and Multiple Sequence Alignment (MSA). PROFEAT descriptors were generated using the web service [23], and we calculated 1437 descriptors using the default settings. ProtVec descriptors were generated using a free tool called BioVec [24] to calculate 1500 descriptors. For MSA descriptors, the number of descriptors generated is equal to the number of target proteins employed in machine learning. Some techniques use pairwise sequence alignment as the SVM kernel [25,26]. In our case, we developed a technique to create descriptors from multiple sequence alignment. To calculate MSA descriptors, the GPCR amino acid sequences are first prepared in FASTA format. The identity matrix (the matrix of pairwise sequence identities) is then generated after performing Multiple Sequence Alignment using Clustal Omega. Then, the identity matrix $S$ is eigen-decomposed as in the equation
$$S = U \Lambda U^{T} = \left(\Lambda^{1/2} U^{T}\right)^{T} \left(\Lambda^{1/2} U^{T}\right) = X^{T} X, \qquad X = \Lambda^{1/2} U^{T}, \tag{2}$$
where $\Lambda$ is a diagonal matrix of eigenvalues and $U$ is a unitary matrix made from eigenvectors. Each column of the matrix $X$ in Equation (2) can then be taken as the column feature vector of the corresponding protein. In rare cases, several small negative eigenvalues may be found. In such cases, the negative eigenvalues and their eigenvectors are removed before the matrix $X$ is calculated. When a linear kernel is chosen as the SVM kernel, the MSA feature vectors computed above approximately reproduce the matrix elements of the identity matrix $S$. Since the negative eigenvalues and eigenvectors have been removed, this kernel matrix is a positive semidefinite symmetric matrix.
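As a small worked check of Equation (2), consider a hypothetical pair of proteins with 50% sequence identity (the numbers are illustrative and not taken from the study). Taking $X = \Lambda^{1/2} U^{T}$, that is, scaling by the square roots of the eigenvalues, the columns of $X$ reproduce $S$ under a linear kernel:

```latex
\begin{aligned}
S &= \begin{pmatrix} 1 & \tfrac{1}{2} \\ \tfrac{1}{2} & 1 \end{pmatrix},
\qquad
\Lambda = \begin{pmatrix} \tfrac{3}{2} & 0 \\ 0 & \tfrac{1}{2} \end{pmatrix},
\qquad
U = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \\[4pt]
X &= \Lambda^{1/2} U^{T}
   = \begin{pmatrix} \tfrac{\sqrt{3}}{2} & \tfrac{\sqrt{3}}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} \end{pmatrix},
\qquad
x_1 = \begin{pmatrix} \tfrac{\sqrt{3}}{2} \\ \tfrac{1}{2} \end{pmatrix},
\quad
x_2 = \begin{pmatrix} \tfrac{\sqrt{3}}{2} \\ -\tfrac{1}{2} \end{pmatrix}, \\[4pt]
x_1^{T} x_1 &= \tfrac{3}{4} + \tfrac{1}{4} = 1 = S_{11},
\qquad
x_1^{T} x_2 = \tfrac{3}{4} - \tfrac{1}{4} = \tfrac{1}{2} = S_{12},
\qquad
x_2^{T} x_2 = 1 = S_{22}.
\end{aligned}
```

Each protein thus receives as many MSA descriptors as there are proteins in the alignment (two here), consistent with the statement that the number of MSA descriptors equals the number of target proteins.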
The compound and protein descriptors described above are high dimensional vectors, thus we used principal component analysis to perform dimensionality reduction. We took the cumulative contribution of the principal components up to 99%.
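The 99% cumulative contribution cut-off can be sketched as below. The function name and the variance values are hypothetical; in practice the per-component variances would come from whichever PCA implementation is used.

```python
def n_components_for(explained_variances, threshold=0.99):
    """Smallest number of leading components whose cumulative
    variance ratio reaches the threshold."""
    total = sum(explained_variances)
    cumulative = 0.0
    ordered = sorted(explained_variances, reverse=True)
    for k, var in enumerate(ordered, start=1):
        cumulative += var
        if cumulative / total >= threshold:
            return k
    return len(ordered)

# Hypothetical per-component variances, for illustration only.
variances = [50.0, 30.0, 15.0, 4.5, 0.4, 0.1]
k = n_components_for(variances)
print(k)
```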
The kernel function computed in SVM machine learning depends on the feature vector created from each descriptor. Table 2 shows the correspondence between descriptors and kernel functions. In this study, we created virtual orphan GPCR models for all combinations of two compound and three protein descriptors (six combinations) for each of the 52 GPCRs. To avoid overfitting of the SVM model, we have set the appropriate hyperparameters to maximize accuracy via 5-fold cross validation. Machine learning calculations with SVM have been performed using a proprietary tool named CzeekS [27].

2.3. Model Validation

As a method of confirming the prediction performance of virtual orphan GPCR models, a set of compounds for validation was screened against the virtual orphan GPCR. The set of compounds used for validation was composed of 280,648 GPCR-related compounds from the ChEMBL 25 database. These compounds are identical to those used to create the CGBVS models, but since each model is created after deleting the data of the GPCR to be tested, they are not considered problematic as validation compounds for prediction performance. Validation of the half-sampled (HS) GPCR models was performed in the same way as for the virtual orphan GPCR models; however, the ligand data of the target GPCR included in the training set were omitted from the test set. The area under the receiver operating characteristic curve (AUROC) and the Enrichment Factor (EF) were adopted as measures of predictive performance. In these calculations, compounds whose interaction data with the target do not exist in the ChEMBL database were treated as having no activity. AUROCs are calculated using scikit-learn by sorting in order of increasing CGBVS score. On the other hand, $\mathrm{EF}_{1\%}$ is calculated as
$$\mathrm{EF}_{1\%} = \frac{A_{\mathrm{found}} / N_{\mathrm{subset}}}{A_{\mathrm{total}} / N_{\mathrm{total}}}. \tag{3}$$
$N_{\mathrm{total}}$ is the total number of compounds screened and $N_{\mathrm{subset}}$ is the number of compounds selected from the top scores. In addition, $A_{\mathrm{total}}$ is the total number of active compounds for the target GPCR, and $A_{\mathrm{found}}$ is the number of active compounds for the target GPCR found among the top-scoring compounds selected.
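Both measures can be sketched in plain Python. The study itself computed AUROC with scikit-learn; the rank-based function below is a stand-in that gives the same quantity on small inputs, and the toy scores and labels are hypothetical.

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """EF = (A_found / N_subset) / (A_total / N_total) for the top fraction."""
    n_total = len(scores)
    n_subset = max(1, int(n_total * fraction))
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    a_found = sum(label for _, label in ranked[:n_subset])
    a_total = sum(labels)
    return (a_found / n_subset) / (a_total / n_total)

def auroc(scores, labels):
    """Probability that a random active outscores a random inactive
    (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical screen: 200 compounds, score rises with index, 4 actives.
scores = [i / 200 for i in range(200)]
labels = [1 if i in (50, 100, 198, 199) else 0 for i in range(200)]

ef = enrichment_factor(scores, labels)  # top 1% = 2 compounds
auc = auroc(scores, labels)
print(ef, auc)
```

Here the top 1% contains two of the four actives, so the enrichment factor is 50, while the two poorly ranked actives pull the AUROC down to about 0.69. The pairwise AUROC loop is quadratic in the number of compounds and is meant for illustration only.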

2.4. Applicability Index

We considered the applicability index $A(p_i)$ of CGBVS to the target GPCR $p_i$ to be proportional to the sum of the numbers of active ligands $N_j$ of the neighboring GPCRs $p_j$ of the target virtual orphan GPCR. Thus, we defined it as
$$A(p_i) = \sum_{j} w\left(K_P(p_i, p_j)\right) N_j. \tag{4}$$
Here, $w$ is a weight function whose argument is the value of the protein kernel function $K_P$, and its functional form is the sigmoid function
$$w(x) = \frac{1}{1 + \exp\left(-\alpha (x - r)\right)}. \tag{5}$$
The two parameters of the sigmoid function, $\alpha$ and $r$, are determined to maximize Spearman's correlation coefficient between AUROC and $\log A$. The AUROC is calculated using the procedure described in Section 2.3. Bayesian optimization was used to perform the optimization of the correlation coefficient.
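A minimal sketch of the applicability index follows. It assumes the sigmoid has the form w(x) = 1 / (1 + exp(−α(x − r))), so that the weight increases with protein similarity, and that log A is a natural logarithm; both are assumptions, and the kernel values, ligand counts, and parameter values are hypothetical.

```python
import math

def weight(kernel_value, alpha, r):
    """Sigmoid weight w(x) = 1 / (1 + exp(-alpha * (x - r)))."""
    return 1.0 / (1.0 + math.exp(-alpha * (kernel_value - r)))

def applicability_index(kernel_row, n_active, alpha, r):
    """A(p_i) = sum_j w(K_P(p_i, p_j)) * N_j over the other GPCRs p_j."""
    return sum(weight(k, alpha, r) * n for k, n in zip(kernel_row, n_active))

# Hypothetical protein-kernel values between the orphan target and three
# other GPCRs, and the number of active ligands known for each of them.
kernel_row = [0.9, 0.5, 0.1]
n_active = [1000, 200, 5000]

a_index = applicability_index(kernel_row, n_active, alpha=20.0, r=0.5)
log_a = math.log(a_index)  # assumption: natural logarithm
print(a_index, log_a)
```

With these values the distant GPCR contributes almost nothing despite its 5000 ligands, which is the behavior the weight function is meant to capture. A smaller α flattens the sigmoid and lets less similar GPCRs contribute.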

3. Results and Discussion

3.1. Analysis of Prediction Accuracy

The $\mathrm{EF}_{1\%}$ values calculated in this study are for the top 1% (2806 compounds) of the screened compounds ranked by score. In this case, $\mathrm{EF}_{1\%}$ takes a value from 0 to 100, and a value of 1 corresponds to random screening. A model is not useful for screening unless its $\mathrm{EF}_{1\%}$ exceeds 1. The enrichment factors for the 52 GPCRs used as virtual orphan targets are shown in Figure 3. Additionally, a table comparing the values of $\mathrm{EF}_{1\%}$ for the virtual orphan and half-sampled GPCR models is provided as Supplementary Data (Table S1). All possible combinations of compound and protein descriptors were tested for each GPCR target, and the results showed large variations in $\mathrm{EF}_{1\%}$ among GPCRs. The same was observed in $\mathrm{EF}_{1\%}$ among descriptor combinations for the same GPCR. This may indicate that each GPCR target requires a different combination of descriptors for accurate prediction. Figure 3 shows that the combination of alvaDesc and MSA (red bars) has good $\mathrm{EF}_{1\%}$ values for most GPCRs, indicating that it could be the best descriptor combination. The next best descriptor combination is alvaDesc-PROFEAT (blue bars), followed by ECFP-MSA (purple bars).
In order to simplify the $\mathrm{EF}_{1\%}$ results for each GPCR and highlight the effect of descriptor combinations, the frequency distribution of $\mathrm{EF}_{1\%}$ for each descriptor combination is summarized in Table 3. For all combinations of descriptors, more than half of the GPCRs (more than 26) had $\mathrm{EF}_{1\%}$ greater than 1, and some of them had $\mathrm{EF}_{1\%}$ greater than 30, which can be considered accurate and better than our expectations. Some of them were comparable to the $\mathrm{EF}_{1\%}$ of the half-sampled GPCR models, and, surprisingly, there were six $\mathrm{EF}_{1\%}$ values that exceeded those of the half-sampled models. Looking at the variations in $\mathrm{EF}_{1\%}$ value and the number of GPCRs for each descriptor, it can be seen that the difference in the protein descriptor has a greater impact on the $\mathrm{EF}_{1\%}$ than the difference in the compound descriptor. We found that, when MSA is used as the protein descriptor, more GPCRs have good $\mathrm{EF}_{1\%}$ values than when other descriptors are used. In particular, for the combination of alvaDesc and MSA, there were nine GPCRs with $\mathrm{EF}_{1\%}$ greater than 30, making it the most suitable combination for the construction of the GPCR model.
$\mathrm{EF}_{1\%}$ is commonly used as a performance indicator for screening. Since the number of active compounds differs for each protein ($A_{\mathrm{total}}$ in Equation (3)), care must be taken in the simple comparison of prediction performance between different GPCRs. Therefore, in this study, we calculated AUROC (area under the ROC curve) as another predictive performance indicator. The calculation results of AUROC for 52 GPCRs are shown in Figure 4. As in the case of $\mathrm{EF}_{1\%}$, a table comparing the AUROC with the half-sampled GPCR models is shown in Table S2. For all combinations of GPCRs and descriptors, the values of AUROC for the half-sampled GPCR models are higher than those for the virtual orphan GPCR models. Figure 5 shows four representative ROC curves from the 52 GPCRs. Each one has been chosen for its particular characteristic. CHRM3 is a case where all combinations of descriptors have high predictive performance, while ADRB2 and HCRTR2 are cases where the six curves are scattered. HRH3 is a case where the predictive performance for all combinations of descriptors is low.
For the three GPCRs in Figure 5 that result in high prediction performance, it can be seen that the prediction performance is good in two descriptor combinations: alvaDesc-MSA (red) and ECFP-MSA (purple). The trend is roughly the same for other GPCRs. The number of GPCRs with AUROC greater than 0.8 was the largest in alvaDesc-MSA with 29, followed by ECFP-MSA with 27. The number for other descriptor combinations was 18–24. These results suggest that the best prediction results are obtained when MSA is used as the protein descriptor. It is interesting to note that the number of GPCRs with AUROC of 0.8 or higher in the half sampled GPCR models is 49 for all six combinations of descriptors (see Table S2). This means that the difference in prediction accuracy due to the difference in descriptors is small in the conventional method of accuracy comparison such as cross-validation but is clearly bigger when using virtual orphan GPCR models. Furthermore, in the case of virtual orphan GPCR models, prediction is greatly influenced by the combination of compound and protein descriptor used.

3.2. Applicability of CGBVS for Orphan Targets

In actual drug discovery research, when trying to search for active compounds of true orphan GPCRs using CGBVS, it is necessary to have an indicator of whether CGBVS will work or not. According to the results of the performance evaluation of CGBVS models by $\mathrm{EF}_{1\%}$ and AUROC, the prediction performance is high for widely studied GPCRs, such as adrenergic and muscarinic receptors, for which there is abundant data on known ligands. This is thought to be because the prediction performance does not deteriorate even if all the ligand data of the target orphan GPCR is deleted, since abundant ligand data of GPCRs related to the target GPCR can be included in the training data. This can be understood from the fact that the applicability domain of a machine learning model is often set to the region around a dense area of training data [28,29,30]. Therefore, we defined the applicability index $A(p_i)$ of CGBVS to be proportional to the sum of the numbers of active ligands $N_j$ of the GPCRs $p_j$ related to the target GPCR, as in Equation (4).
The results of calculating the applicability index as described above for six different combinations of compound and protein descriptors are shown in Figure 6 as scatter plots. Values of Spearman's correlation between $\log A$ and AUROC are summarized in Table 4. As with $\mathrm{EF}_{1\%}$, the variation in the correlation coefficients is larger among protein descriptors than among compound descriptors. In addition, the protein descriptor with the highest correlation coefficient is MSA, for which $\log A$ and AUROC are highly correlated with both alvaDesc and ECFP as the compound descriptor. As for the other two protein descriptors, PROFEAT showed a weak correlation and ProtVec showed no correlation at all. Therefore, when attempting to find the ligand for a true orphan GPCR using CGBVS, the use of MSA as the protein descriptor can provide some estimate of whether the ligand search will be successful or not. To test this, virtual orphan GPCRs with AUROC greater than 0.8 were treated as positive and those with AUROC less than 0.8 as negative, and each was predicted to be positive or negative based on whether its $\log A$ exceeded a threshold. The threshold of $\log A$ was determined to maximize the accuracy of the prediction. Table 5 summarizes the results of validating the prediction accuracy for virtual orphan GPCRs with AUROC greater than 0.8. The accuracy and positive predictive value (PPV) for MSA were close to 0.9, indicating high prediction accuracy compared to other protein descriptors. In the case of PROFEAT, the accuracy was reasonable at over 0.7, but the PPV was somewhat low at 0.65 when the compound descriptor was alvaDesc.
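The threshold selection on log A can be sketched as follows. The function names and the toy (log A, AUROC) pairs are hypothetical, and the simple scan over observed values is only one possible way to maximize accuracy.

```python
def accuracy_and_ppv(log_a_values, aurocs, threshold, auroc_cutoff=0.8):
    """Predict 'positive' when log A >= threshold; truth is AUROC > cutoff."""
    tp = fp = tn = fn = 0
    for la, auc in zip(log_a_values, aurocs):
        predicted, actual = la >= threshold, auc > auroc_cutoff
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return accuracy, ppv

def best_threshold(log_a_values, aurocs):
    """Scan observed log A values; return the one with the highest accuracy."""
    candidates = sorted(set(log_a_values))
    return max(candidates,
               key=lambda t: accuracy_and_ppv(log_a_values, aurocs, t)[0])

# Hypothetical (log A, AUROC) pairs for six virtual orphan GPCRs.
log_a_values = [7.5, 8.0, 8.8, 9.2, 9.5, 8.5]
aurocs = [0.60, 0.75, 0.85, 0.90, 0.95, 0.70]

threshold = best_threshold(log_a_values, aurocs)
accuracy, ppv = accuracy_and_ppv(log_a_values, aurocs, threshold)
print(threshold, accuracy, ppv)
```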
One of the features of MSA protein descriptor is that the values of the parameters r and α of the applicability index weight function are smaller than those of the other two types of protein descriptors. This means that the weight function is looser in shape than the other protein descriptors, and the active ligands of GPCRs with small protein kernel values (small similarity) also contribute to the applicability index. This can be understood from the fact that the threshold of log A is the largest for MSA. In using MSA, although machine learned target proteins are less similar to the virtual orphan target, we were able to construct a machine learning model in which the ligand information exert influences on each other. This leads to high applicability of CGBVS to identifying ligands for orphan targets. The opposite can be said for PROFEAT which requires high similarity between machine learned and orphan targets in order to have high applicability. On the other hand, the weight function in the case of PROFEAT is a parameter with a shape that changes more rapidly at the position where the protein kernel value is larger than MSA. This means that only the ligand information of GPCRs that are very close to the virtual orphan GPCRs affects the prediction accuracy. In a study by Wassermann et al., the accuracy of ligand prediction for orphan targets was found to be greatly influenced by the ligand information of related targets. Our results are consistent with theirs, and Equation (4) gives a more generalized interpretation.

4. Conclusions

We tested the prediction accuracy of the CGBVS technique for 52 virtual orphan GPCRs. Machine learning with the CGBVS method was performed for all possible combinations of two types of compound descriptors and three types of protein descriptors. In the prediction of the ligands of virtual orphan GPCRs, it was shown that the protein descriptor had a greater impact on the prediction accuracy than the compound descriptor. Of the three types of protein descriptors validated in this study, MSA had the best accuracy, with the highest number of GPCRs exceeding the reference values ($\mathrm{EF}_{1\%} > 10$, AUROC $> 0.8$) for both the $\mathrm{EF}_{1\%}$ and AUROC indices. On the other hand, for compound descriptors, alvaDesc had slightly more GPCRs with better prediction accuracy than ECFP, but with only small differences between actual values.
We also examined the conditions under which ligand search for virtual orphan GPCRs was possible using CGBVS. The simple applicability index we defined in Equation (4) was shown to correlate well with AUROC when an MSA descriptor was used. There is a weak correlation for PROFEAT and almost no correlation at all for ProtVec. By using an MSA descriptor, we can, therefore, determine whether CGBVS can be applied to an unknown orphan target by the value of log A . In this case, if log A is 8.7 or higher when using alvaDesc as compound descriptor, and if log A is 9.0 or higher when using ECFP, a high success rate can be expected.

Supplementary Materials

The following are available online. Table S1: Enrichment factors for 52 GPCRs which were calculated from the results of screening using virtual orphan (VO) and half-sampled (HS) GPCR models, Table S2: AUROC for 52 GPCRs which were calculated from the results of screening using virtual orphan (VO) and half-sampled (HS) GPCR models.

Author Contributions

Conceptualization, C.K.; methodology, C.K.; validation, C.K. and E.K.; formal analysis, C.K., E.K. and Y.M.; writing—original draft preparation, C.K. and E.K.; writing—review and editing, C.K., E.K. and A.Y.; project administration, R.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not Applicable.

Data Availability Statement

The data presented in this study are available in this article and related supplementary material (Tables S1 and S2).

Acknowledgments

Computation time was provided by the Supercomputing Services, Human Genome Center, the Institute of Medical Science, the University of Tokyo.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Davies, M.; Nowotka, M.; Papadatos, G.; Dedman, N.; Gaulton, A.; Atkinson, F.; Bellis, L.; Overington, J.P. ChEMBL web services: Streamlining access to drug discovery data and utilities. Nucleic Acids Res. 2015, 43, W612–W620.
  2. Gaulton, A.; Hersey, A.; Nowotka, M.; Bento, A.P.; Chambers, J.; Mendez, D.; Mutowo, P.; Atkinson, F.; Bellis, L.J.; Cibrián-Uhalte, E.; et al. The ChEMBL database in 2017. Nucleic Acids Res. 2017, 45, D945–D954.
  3. Mendez, D.; Gaulton, A.; Bento, A.P.; Chambers, J.; De Veij, M.; Félix, E.; Magariños, M.P.; Mosquera, J.F.; Mutowo, P.; Nowotka, M.; et al. ChEMBL: Towards direct deposition of bioassay data. Nucleic Acids Res. 2019, 47, D930–D940.
  4. Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B.A.; Thiessen, P.A.; Yu, B.; et al. PubChem in 2021: New data content and improved web interfaces. Nucleic Acids Res. 2021, 49, D1388–D1395.
  5. Yamanishi, Y.; Araki, M.; Gutteridge, A.; Honda, W.; Kanehisa, M. Prediction of drug–target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 2008, 24, i232–i240.
  6. Jacob, L.; Vert, J.P. Protein-ligand interaction prediction: An improved chemogenomics approach. Bioinformatics 2008, 24, 2149–2156.
  7. Wassermann, A.M.; Geppert, H.; Bajorath, J. Ligand prediction for orphan targets using support vector machines and various target-ligand kernels is dominated by nearest neighbor effects. J. Chem. Inf. Model. 2009, 49, 2155–2167.
  8. Yabuuchi, H.; Niijima, S.; Takematsu, H.; Ida, T.; Hirokawa, T.; Hara, T.; Ogawa, T.; Minowa, Y.; Tsujimoto, G.; Okuno, Y. Analysis of multiple Compound–Protein interactions reveals novel bioactive molecules. Mol. Syst. Biol. 2014, 7, 472.
  9. Brown, J.; Okuno, Y. Systems biology and systems chemistry: New directions for drug discovery. Chem. Biol. 2012, 19, 23–28.
  10. Gönen, M. Predicting drug–target interactions from chemical and genomic kernels using Bayesian matrix factorization. Bioinformatics 2012, 28, 2304–2310.
  11. Shiraishi, A.; Niijima, S.; Brown, J.; Nakatsui, M.; Okuno, Y. Chemical genomics approach for gpcr–ligand interaction prediction and extraction of ligand binding determinants. J. Chem. Inf. Model. 2013, 53, 1253–1262.
  12. Van Laarhoven, T.; Marchiori, E. Predicting drug–target interactions for new drug compounds using a weighted nearest neighbor profile. PLoS ONE 2013, 8, e66952.
  13. Liu, Y.; Wu, M.; Miao, C.; Zhao, P.; Li, X.L. Neighborhood regularized logistic matrix factorization for drug–target interaction prediction. PLoS Comput. Biol. 2016, 12, e1004760.
  14. Hamanaka, M.; Taneishi, K.; Iwata, H.; Ye, J.; Pei, J.; Hou, J.; Okuno, Y. CGBVS-DNN: Prediction of Compound–Protein Interactions Based on Deep Learning. Mol. Inform. 2017, 36, 1600045.
  15. Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
  16. Hardoon, D.R.; Shawe-Taylor, J. Decomposing the tensor kernel support vector machine for neuroscience data with structured labels. Mach. Learn. 2010, 79, 29–46.
  17. Platt, J. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv. Large Margin Classif. 1999, 10, 61–74.
  18. alvascience Srl. alvaDesc Version 1.0.8. Available online: https://www.alvascience.com (accessed on 14 July 2021).
  19. Rogers, D.; Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754.
  20. RDKit: Open-Source Cheminformatics Software. Available online: https://www.rdkit.org (accessed on 14 July 2021).
  21. Zhang, P.; Tao, L.; Zeng, X.; Qin, C.; Chen, S.Y.; Zhu, F.; Yang, S.Y.; Li, Z.R.; Chen, W.P.; Chen, Y.Z. PROFEAT Update: A Protein Features Web Server with Added Facility to Compute Network Descriptors for Studying Omics-Derived Networks. J. Mol. Biol. 2017, 429, 416–425.
  22. Asgari, E.; Mofrad, M.R. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE 2015, 10, e0141287.
  23. Chen, Y. PROFEAT 2016. Available online: http://bidd.group/cgi-bin/profeat2016/main.cgi (accessed on 14 July 2021).
  24. BioVec. Available online: https://github.com/kyu999/biovec (accessed on 14 July 2021).
  25. Liao, L.; Noble, W.S. Combining pairwise sequence similarity and support vector machines for remote protein homology detection. In Proceedings of the Sixth Annual International Conference on Computational Biology, Washington, DC, USA, 18–21 April 2002; pp. 225–232.
  26. Saigo, H.; Vert, J.P.; Ueda, N.; Akutsu, T. Protein homology detection using string alignment kernels. Bioinformatics 2004, 20, 1682–1689.
  27. INTAGE Healthcare Inc. CzeekS. Available online: https://www.intage-healthcare.co.jp/service/data-science/insilico/czeeks/ (accessed on 14 July 2021).
  28. Schroeter, T.S.; Schwaighofer, A.; Mika, S.; Ter Laak, A.; Suelzle, D.; Ganzer, U.; Heinrich, N.; Müller, K.R. Estimating the domain of applicability for machine learning QSAR models: A study on aqueous solubility of drug discovery molecules. J. Comput. Aided Mol. Des. 2007, 21, 485–498.
  29. Fechner, N.; Jahn, A.; Hinselmann, G.; Zell, A. Estimation of the applicability domain of kernel-based machine learning models for virtual screening. J. Cheminform. 2010, 2, 2.
  30. Kaneko, H.; Funatsu, K. Applicability domain based on ensemble learning in classification and regression analyses. J. Chem. Inf. Model. 2014, 54, 2469–2482.
Figure 1. Schematic diagram of how CGBVS is calculated. The feature vector for a compound is obtained by calculating descriptors from the compound’s structural formula. Feature vectors for proteins are calculated from amino acid sequences. The CPI vector is created by taking the tensor product of the compound vector and the protein vector, and is labeled as binding or non-binding based on the activity data in the ChEMBL database. The CGBVS model is generated by training an SVM on the CPI vectors.
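As a rough illustration of the tensor-product construction described in the Figure 1 caption, the following Python sketch builds CPI vectors for two toy compound–protein pairs and checks that their inner product equals the product of the compound and protein inner products. This identity is what allows a pairwise kernel to be evaluated without forming the CPI vector explicitly. The descriptor values and helper names are invented for illustration and are not taken from the paper.

```python
from itertools import product


def tensor_product(compound_vec, protein_vec):
    """Flattened outer product of a compound vector and a protein vector."""
    return [c * p for c, p in product(compound_vec, protein_vec)]


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy descriptors (hypothetical values, for illustration only).
c1, p1 = [1.0, 0.0, 2.0], [0.5, 1.5]
c2, p2 = [0.0, 1.0, 1.0], [2.0, 1.0]

# Inner product of the explicit CPI vectors ...
explicit = dot(tensor_product(c1, p1), tensor_product(c2, p2))
# ... equals the product of the compound and protein inner products.
factored = dot(c1, c2) * dot(p1, p2)

print(explicit, factored)
```

The same factorization holds when the two inner products are replaced by valid kernel functions, so the compound and protein kernels can be chosen independently.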
Figure 2. Schematic diagram of the creation of a virtual orphan GPCR model. Solid and dotted lines between compounds and proteins both indicate known activities confirmed from the ChEMBL database. Solid lines mark the Compound–Protein combinations that are used as input to machine learning, and dotted lines mark the combinations that are withheld from it. A protein connected only by dotted lines is a virtual orphan target. The prediction accuracy is verified by screening known GPCR-associated compounds.
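The virtual orphan setup in Figure 2 amounts to a leave-one-target-out split. A minimal sketch in Python, assuming interaction records are stored as (compound, gene, label) tuples; the compound identifiers, labels, and function name are hypothetical and only the gene names come from Table 1:

```python
def virtual_orphan_split(interactions, orphan_gene):
    """Withhold every record of one target so it looks like an orphan in training."""
    training = [rec for rec in interactions if rec[1] != orphan_gene]
    held_out = [rec for rec in interactions if rec[1] == orphan_gene]
    return training, held_out


# Toy interaction records: (compound id, gene name, 1 = binding / 0 = non-binding).
interactions = [
    ("cpd_1", "ADRB1", 1),
    ("cpd_2", "ADRB2", 1),
    ("cpd_3", "ADRB2", 0),
    ("cpd_1", "CHRM3", 0),
]

# Treat ADRB2 as the virtual orphan: its records are used only for evaluation.
training, held_out = virtual_orphan_split(interactions, "ADRB2")
print(len(training), len(held_out))
```

Repeating the split once per target gives one model and one evaluation set for each GPCR.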
Figure 3. EF1% of the screening calculation results using the virtual orphan GPCR models. Red, blue, and green bars indicate the combination of alvaDesc with MSA, PROFEAT, and ProtVec, respectively. Purple, orange, and light blue bars indicate the combination of ECFP with MSA, PROFEAT, and ProtVec, respectively.
Figure 4. AUROC of the screening calculation results using virtual orphan GPCR models. Red, blue, and green bars indicate the combination of alvaDesc with MSA, PROFEAT, and ProtVec, respectively. Purple, orange, and light blue bars indicate the combination of ECFP with MSA, PROFEAT, and ProtVec, respectively.
Figure 5. ROC curves for four representative GPCRs with characteristic results, obtained by screening the designated compound dataset with virtual orphan GPCR models. Red, blue, and green lines indicate the combination of alvaDesc with MSA, PROFEAT, and ProtVec, respectively. Purple, orange, and light blue lines indicate the combination of ECFP with MSA, PROFEAT, and ProtVec, respectively. (a) CHRM3; (b) ADRB2; (c) HRH3; (d) HCRTR2.
Figure 6. Relationship between applicability index and AUROC for six combinations of compound and protein descriptors. A horizontal red line indicates the AUROC value at 0.8. A vertical red line indicates the threshold value for the applicability index. (a,b) PROFEAT; (c,d) MSA; (e,f) ProtVec.
Table 1. List of 52 GPCRs that were selected as virtual orphan targets in this study. The Active column shows the number of ligands that are active (≤1 μM), and the Inactive column shows the number of ligands that are inactive (≥3 μM).

| Gene Name | Accession | Active | Inactive | Protein Name |
|---|---|---|---|---|
| ADRA1A | P35348 | 1800 | 244 | Alpha-1A adrenergic receptor |
| ADRA1B | P35368 | 1425 | 302 | Alpha-1B adrenergic receptor |
| ADRA1D | P25100 | 1369 | 248 | Alpha-1D adrenergic receptor |
| ADRB1 | P08588 | 1021 | 539 | Beta-1 adrenergic receptor |
| ADRB2 | P07550 | 1542 | 1832 | Beta-2 adrenergic receptor |
| ADRB3 | P13945 | 1472 | 215 | Beta-3 adrenergic receptor |
| AGTR1 | P30556 | 1167 | 599 | Type-1 angiotensin II receptor |
| AGTR2 | P50052 | 900 | 113 | Type-2 angiotensin II receptor |
| CCKBR | P32239 | 1014 | 516 | Gastrin/cholecystokinin type B receptor |
| CCR2 | P41597 | 1379 | 287 | C-C chemokine receptor type 2 |
| CCR5 | P51681 | 1749 | 333 | C-C chemokine receptor type 5 |
| CHRM1 | P11229 | 1768 | 1088 | Muscarinic acetylcholine receptor M1 |
| CHRM2 | P08172 | 1493 | 663 | Muscarinic acetylcholine receptor M2 |
| CHRM3 | P20309 | 1666 | 605 | Muscarinic acetylcholine receptor M3 |
| CHRM4 | P08173 | 751 | 522 | Muscarinic acetylcholine receptor M4 |
| CHRM5 | P08912 | 510 | 651 | Muscarinic acetylcholine receptor M5 |
| CRHR1 | P34998 | 1648 | 239 | Corticotropin-releasing factor receptor 1 |
| CXCR3 | P49682 | 1099 | 184 | C-X-C chemokine receptor type 3 |
| EDNRA | P25101 | 1195 | 257 | Endothelin-1 receptor |
| FFAR1 | O14842 | 774 | 300 | Free fatty acid receptor 1 |
| GHSR | Q92847 | 1541 | 191 | Growth hormone secretagogue receptor type 1 |
| GLP1R | P43220 | 345 | 294,774 | Glucagon-like peptide 1 receptor |
| GNRHR | P30968 | 1217 | 96 | Gonadotropin-releasing hormone receptor |
| GPR119 | Q8TDV5 | 1234 | 110 | Glucose-dependent insulinotropic receptor |
| GPR55 | Q9Y2T6 | 153 | 553 | G-protein coupled receptor 55 |
| HCRTR1 | O43613 | 2200 | 783 | Orexin receptor type 1 |
| HCRTR2 | O43614 | 2611 | 725 | Orexin receptor type 2 |
| HRH1 | P35367 | 999 | 406 | Histamine H1 receptor |
| HRH3 | Q9Y5N1 | 3395 | 212 | Histamine H3 receptor |
| HRH4 | Q9H3N8 | 903 | 318 | Histamine H4 receptor |
| HTR1A | P08908 | 3532 | 480 | 5-hydroxytryptamine receptor 1A |
| HTR1B | P28222 | 932 | 190 | 5-hydroxytryptamine receptor 1B |
| HTR1D | P28221 | 1078 | 133 | 5-hydroxytryptamine receptor 1D |
| HTR2A | P28223 | 3540 | 676 | 5-hydroxytryptamine receptor 2A |
| HTR2B | P41595 | 1337 | 381 | 5-hydroxytryptamine receptor 2B |
| HTR2C | P28335 | 2588 | 756 | 5-hydroxytryptamine receptor 2C |
| HTR6 | P50406 | 2925 | 306 | 5-hydroxytryptamine receptor 6 |
| HTR7 | P34969 | 1532 | 248 | 5-hydroxytryptamine receptor 7 |
| MC4R | P32245 | 2311 | 857 | Melanocortin receptor 4 |
| MCHR1 | Q99705 | 3116 | 524 | Melanin-concentrating hormone receptor 1 |
| NPY5R | Q15761 | 1038 | 100 | Neuropeptide Y receptor type 5 |
| OPRD1 | P41143 | 3180 | 2086 | Delta-type opioid receptor |
| OPRK1 | P41145 | 3743 | 1197 | Kappa-type opioid receptor |
| OPRL1 | P41146 | 1305 | 128 | Nociceptin receptor |
| OPRM1 | P35372 | 3797 | 2033 | Mu-type opioid receptor |
| P2RY12 | Q9H244 | 912 | 237 | P2Y purinoceptor 12 |
| PTGDR2 | Q9Y5Y4 | 2541 | 143 | Prostaglandin D2 receptor 2 |
| S1PR1 | P21453 | 2165 | 379 | Sphingosine 1-phosphate receptor 1 |
| TACR1 | P25103 | 2334 | 223 | Substance-P receptor |
| TACR2 | P21452 | 794 | 227 | Substance-K receptor |
| TACR3 | P29371 | 788 | 143 | Neuromedin-K receptor |
| TSHR | P16473 | 1140 | 15,271 | Thyrotropin receptor |
Table 2. Correspondence between compound or protein descriptors and SVM kernel functions.

| Descriptor | Class | SVM Kernel | Equation |
|---|---|---|---|
| alvaDesc | compound | Gaussian | $K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$ |
| ECFP | compound | Tanimoto | $K(x_i, x_j) = \dfrac{x_i \cdot x_j}{\lVert x_i \rVert^2 + \lVert x_j \rVert^2 - x_i \cdot x_j}$ |
| PROFEAT2016 | protein | Gaussian | $K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$ |
| ProtVec | protein | Gaussian | $K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$ |
| MSA | protein | linear | $K(x_i, x_j) = x_i \cdot x_j$ |
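The three kernel forms in Table 2 can be written directly in a few lines of Python. This is a minimal sketch rather than the implementation used in the study: the function names, the toy fingerprints, the value of `gamma`, and the handling of two all-zero vectors in the Tanimoto kernel are assumptions made for illustration.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def tanimoto_kernel(x, y):
    """K(x, y) = x.y / (||x||^2 + ||y||^2 - x.y)."""
    xy = dot(x, y)
    denominator = dot(x, x) + dot(y, y) - xy
    # Assumption: two all-zero vectors get similarity 0.0 to avoid 0/0.
    return xy / denominator if denominator else 0.0


def linear_kernel(x, y):
    """K(x, y) = x.y."""
    return dot(x, y)


# Toy binary fingerprints: 2 bits shared, 3 bits set in at least one of them.
fp_a = [1, 1, 0, 1]
fp_b = [1, 0, 0, 1]
tanimoto_ab = tanimoto_kernel(fp_a, fp_b)

# Identical inputs have zero distance, so the Gaussian kernel is exactly 1.
gaussian_same = gaussian_kernel([0.2, 0.4], [0.2, 0.4], gamma=0.5)

linear_ab = linear_kernel([1.0, 2.0], [3.0, 4.0])
print(tanimoto_ab, gaussian_same, linear_ab)
```

For binary fingerprints the Tanimoto kernel reduces to the familiar ratio of shared bits to bits set in either fingerprint.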
Table 3. Enrichment factor values and the number of GPCRs for each combination of descriptors. The compound and protein columns indicate the compound descriptor and protein descriptors used, respectively. The remaining columns give the number of GPCRs whose EF1% falls in each range.

| Compound | Protein | EF1% 0–1 | EF1% 1–10 | EF1% 10–30 | EF1% 30–50 |
|---|---|---|---|---|---|
| alvaDesc | PROFEAT | 13 | 23 | 14 | 2 |
| ECFP | PROFEAT | 14 | 28 | 9 | 1 |
| alvaDesc | MSA | 10 | 14 | 19 | 9 |
| ECFP | MSA | 8 | 19 | 22 | 3 |
| alvaDesc | ProtVec | 17 | 20 | 14 | 1 |
| ECFP | ProtVec | 15 | 26 | 11 | 0 |
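For readers unfamiliar with the metric binned in Table 3, the sketch below computes an enrichment factor under the commonly used definition: the hit rate among the top-ranked fraction of compounds divided by the hit rate in the whole screened set. Whether the study used exactly this definition and rounding rule is an assumption, and the scores and labels are synthetic.

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """Hit rate in the top-scoring fraction divided by the overall hit rate."""
    n = len(scores)
    total_actives = sum(labels)
    if n == 0 or total_actives == 0:
        raise ValueError("need at least one compound and one active")
    # Assumption: the cutoff is rounded to the nearest compound, minimum one.
    n_top = max(1, int(round(n * fraction)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    return (hits_top / n_top) / (total_actives / n)


# Synthetic screen: 1000 compounds, 20 actives, 5 of them ranked in the top 10.
labels = [1, 0] * 5 + [1] * 15 + [0] * 975
scores = [len(labels) - i for i in range(len(labels))]  # already in rank order

ef_1_percent = enrichment_factor(scores, labels, fraction=0.01)
print(ef_1_percent)
```

With this definition a random ranking gives a value near 1, and the upper bound at a 1% cutoff is 100.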
Table 4. Spearman’s correlation between AUROC and log A and the parameters α and r of the weight function for six combinations of compound and protein descriptors.

| Compound | Protein | Spearman’s Corr. | α | r |
|---|---|---|---|---|
| alvaDesc | PROFEAT | 0.4466 | 79.73 | 0.6264 |
| ECFP | PROFEAT | 0.3949 | 94.18 | 0.6385 |
| alvaDesc | MSA | 0.7792 | 12.77 | 0.4365 |
| ECFP | MSA | 0.8047 | 11.04 | 0.4564 |
| alvaDesc | ProtVec | −0.0362 | 39.87 | 0.5801 |
| ECFP | ProtVec | 0.1759 | 89.28 | 0.8000 |
Table 5. Accuracy of predicting GPCRs with an AUROC of 0.8 or higher using the applicability index for six different combinations of compound and protein descriptors. The log A column is the threshold of the applicability index, PPV is the positive predictive value, Accuracy is the prediction accuracy, and p-value is the result of Fisher’s exact test.

| Compound | Protein | log A | PPV | Accuracy | p-Value |
|---|---|---|---|---|---|
| alvaDesc | PROFEAT | 6.490 | 0.6521 | 0.7115 | $4.580 \times 10^{-3}$ |
| ECFP | PROFEAT | 7.486 | 0.7272 | 0.7500 | $4.812 \times 10^{-3}$ |
| alvaDesc | MSA | 8.702 | 0.8710 | 0.8846 | $1.899 \times 10^{-8}$ |
| ECFP | MSA | 8.999 | 0.8889 | 0.8846 | $2.549 \times 10^{-8}$ |
| alvaDesc | ProtVec | 7.671 | 0.8333 | 0.6153 | $8.357 \times 10^{-2}$ |
| ECFP | ProtVec | −10.79 | 0.8000 | 0.6731 | $1.542 \times 10^{-2}$ |
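The quantities reported in Table 5 can all be derived from a 2×2 contingency table that crosses "applicability index above the threshold" with "AUROC of 0.8 or higher". The sketch below shows one way to compute them with the Python standard library. The counts are hypothetical, and the two-sided form of Fisher’s exact test is an assumption, since the table does not state which variant was used.

```python
from math import comb


def ppv_and_accuracy(tp, fp, fn, tn):
    """Positive predictive value and overall accuracy from a 2x2 table."""
    return tp / (tp + fp), (tp + tn) / (tp + fp + fn + tn)


def fisher_exact_two_sided(tp, fp, fn, tn):
    """Two-sided Fisher's exact test: sum over tables no more likely than the observed one."""
    predicted_pos, actual_pos, n = tp + fp, tp + fn, tp + fp + fn + tn

    def weight(a):
        # Unnormalized hypergeometric probability of seeing `a` true positives.
        return comb(actual_pos, a) * comb(n - actual_pos, predicted_pos - a)

    lowest = max(0, predicted_pos - (n - actual_pos))
    highest = min(predicted_pos, actual_pos)
    observed = weight(tp)
    extreme = sum(weight(a) for a in range(lowest, highest + 1) if weight(a) <= observed)
    return extreme / comb(n, predicted_pos)


# Hypothetical counts for 20 targets (not taken from the paper).
tp, fp, fn, tn = 8, 2, 1, 9

ppv, accuracy = ppv_and_accuracy(tp, fp, fn, tn)
p_value = fisher_exact_two_sided(tp, fp, fn, tn)
print(ppv, accuracy, p_value)
```

Integer weights are compared instead of floating-point probabilities so that tables exactly as likely as the observed one are always counted.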
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Kanai, C.; Kawasaki, E.; Murakami, R.; Morita, Y.; Yoshimori, A. Computational Prediction of Compound–Protein Interactions for Orphan Targets Using CGBVS. Molecules 2021, 26, 5131. https://doi.org/10.3390/molecules26175131