Article

When Protein Structure Embedding Meets Large Language Models

Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA
*
Author to whom correspondence should be addressed.
Genes 2024, 15(1), 25; https://doi.org/10.3390/genes15010025
Submission received: 6 November 2023 / Revised: 16 December 2023 / Accepted: 21 December 2023 / Published: 23 December 2023
(This article belongs to the Special Issue When Genes Meet Artificial Intelligence and Machine Learning)

Abstract

Protein structure analysis is essential in various bioinformatics domains such as drug discovery, disease diagnosis, and evolutionary studies. Within structural biology, the classification of protein structures is pivotal, employing machine learning algorithms to categorize structures based on data from databases like the Protein Data Bank (PDB). To predict protein functions, embeddings based on protein sequences have been employed. Creating numerical embeddings that preserve vital information while considering protein structure and sequence presents several challenges. The existing literature lacks a comprehensive and effective approach that combines structural and sequence-based features to achieve efficient protein classification. While large language models (LLMs) have exhibited promising outcomes for protein function prediction, their focus primarily lies on protein sequences, disregarding the 3D structures of proteins. The quality of embeddings heavily relies on how well the geometry of the embedding space aligns with the underlying data structure, posing a critical research question. Traditionally, Euclidean space has served as a widely utilized framework for embeddings. In this study, we propose a novel method for designing numerical embeddings in Euclidean space for proteins by leveraging 3D structure information, specifically employing the concept of contact maps. These embeddings are synergistically combined with features extracted from LLMs and traditional feature engineering techniques to enhance the performance of embeddings in supervised protein analysis. Experimental results on benchmark datasets, including PDB Bind and STCRDAB, demonstrate the superior performance of the proposed method for protein function prediction.

1. Introduction

Supervised analysis for proteins is a well-established field in bioinformatics and biochemistry, focusing on the relationship between sequence, structure, and function. While protein sequences have traditionally been the main input for classification [1], modern approaches incorporate additional data such as secondary structure, solvent accessibility, disorder propensity, and multiple sequence alignments (MSAs). Understanding protein features and properties is crucial for comprehending their function and interactions. Databases like the Protein Data Bank (PDB) provide valuable resources of protein structural information, facilitating detailed analysis and exploration.
Analysis of proteins is a prominent research field within computational biology, offering numerous applications such as enzyme design [2], protein–protein interactions [3,4], and facilitating drug discovery strategies [5]. The Protein Data Bank (PDB) [6] has played a pivotal role in providing a vast repository of protein structures, enabling comprehensive studies on protein structure and function. Protein function prediction supports the understanding of biological processes, drug discovery, the treatment of illnesses, and many other applications [7,8,9], all of which depend on our ability to predict the actions of proteins. Although the amino acids in a protein sequence may appear random, they in fact exhibit patterns [10]. Sequences are therefore highly informative about protein function, and several sequence-based approaches are popular [11,12], covering methods such as motif/domain identification [13], homology modeling [14,15], and sequence alignment [16]. Another frequently employed approach is structure-based methodology, which examines the three-dimensional structure of the protein to infer its function [17]; this entails methods including docking simulations [18], structure–function relationship analysis, and protein structure modeling [19]. Researchers have recently begun to use machine learning in bioinformatics to analyze large datasets, extract patterns, and predict protein functions based on known features such as sequence, structure, or functional annotations [20,21], making use of statistical models and computational algorithms. The complexity and diversity of protein functions, along with the ongoing discovery of new proteins with unique functions, make this a highly challenging area, and research continues on increasing prediction accuracy and understanding how proteins function in biological systems [22,23].
A greater comprehension of proteins and their functions in biological processes can be attained by considering both the protein sequence and its associated structure. Using an integrated approach, we can better investigate and understand the complexities of protein biology.
To predict a protein's function, information from its structure or sequence can be extracted with the aid of language models [18,24,25]. By examining the contextual information contained in sequences or structures, these models can reveal patterns or relationships that are difficult to identify with more conventional techniques [26]. However, language models do not have direct access to the intricate three-dimensional structures of proteins; they rely on textual data, while protein activity frequently depends on complex structural information that cannot be fully understood from text alone [14]. Because proteins have diverse structures and activities, it is difficult for language models to represent the subtleties of each protein's operation using textual patterns alone [27]. Furthermore, despite their ability to capture language patterns, language models may lack the biological background needed to interpret the context of protein function. Accurate prediction requires an understanding of biological interactions, processes, and metabolic pathways, and it can be difficult to properly interpret and validate predictions made by language models since they are not inherently equipped with this domain-specific knowledge [28]. Large language models exhibit potential across multiple domains, but predicting protein function is a challenging problem spanning several interdisciplinary fields, including structural biology, molecular biology, and bioinformatics [29]. While language-centric AI models can help with data analysis and pattern recognition, a thorough understanding of protein interactions, structures, and functions frequently necessitates specific training in these scientific fields, and accurate, trustworthy predictions still depend on integrating knowledge from several domains [30].
Achieving precise and dependable protein function prediction still requires integrating massive language models with domain-specific information and experimental confirmation.
Protein classification has shifted from knowledge-based statistical reasoning (which relies on pre-existing knowledge and domain expertise) to the integration of machine learning techniques (data-driven approaches), including neural networks [31,32,33] and SVMs [34]. Recent studies have explored both alignment-based [35] and alignment-free [36,37] methods for protein sequence analysis. However, sequence-only approaches like SeqVec [37] and ProteinBERT [38] have limitations in generalization due to the complexity of protein sequences. It is crucial to incorporate structural information and other sequence properties to overcome these limitations, enabling the development of robust and practical protein classification methods. Proteins, composed of amino acids or polypeptides, serve as essential building blocks in biological systems. The primary structure represents the linear arrangement of amino acids, while the secondary structure describes local folding patterns such as beta-pleated sheets and alpha helices along the polypeptide backbone [39]. The tertiary structure encompasses the overall three-dimensional arrangement achieved through the folding of the polypeptide chain. Even minor changes in the primary structure can significantly impact a protein's structure and function, underscoring the importance of understanding biomolecular structure in various health- and disease-related contexts.
The contact map-based embedding design utilizes the three-dimensional (3D) structure of proteins to create numerical representations. The contact map is a method that encodes the spatial proximity between amino acid residues in a protein. By leveraging the information from the contact map, the proposed method constructs embeddings that capture the structural characteristics of proteins. This approach takes into account the physical interactions and folding patterns of the protein, providing a more comprehensive representation compared to sequence-based embeddings. By incorporating the 3D structure of proteins, the contact map-based embedding design enhances the ability to capture crucial structural features and enables more accurate protein classification and function prediction. Our contributions in this paper are as follows:
  • We propose a contact map-based method to encode the 3D protein structure into a fixed-dimensional numerical representation, which can be used for efficient protein function prediction.
  • We incorporate extra features within our contact map-based embeddings using the features extracted from large language models for protein sequences, which enhance the overall predictive performance of the proposed model.
  • We also incorporate the features designed from protein sequences within our 3D structure-based embeddings to further improve the classification accuracy for protein function prediction.
  • The in-depth analysis of the proposed embedding model on two benchmark datasets shows superior predictive performance for the proposed method compared to recent baselines.
The organization of the manuscript is as follows: Section 2 provides a previous research overview, Section 3 discusses the proposed approach, Section 4 shows the experimental setup, Section 5 presents the results, and Section 6 concludes the paper.

2. Related Work

The analysis of biological sequences is a popular area of scientific research. Understanding the behavior, functions, and interactions of proteins within biological systems is crucial for determining their functional and structural characteristics. Protein analysis [40] reveals how a protein interacts with other molecules, how it functions in different pathways, and its potential associations with diseases. Moreover, understanding the structural characteristics of proteins aids in comprehending their functional roles, as structure often dictates function in biology [17,19]. Protein function and structure prediction is an essential component of biomedical research, allowing scientists to understand protein mechanisms and to develop targeted therapies and treatments for a wide range of diseases [2]. Protein analyses can also aid in the understanding of diseases and the development of preventative measures such as drug discovery [5,41].
Traditionally, these methods relied on a mixture of physics-based energy functions, knowledge-based statistical reasoning, and heuristic algorithms [42,43], such as homology-based methods [14,15], which search sequence databases for homologous sequences. Every day, new amino acid and nucleotide sequences are added to publicly available international databases, increasing the likelihood of discovering meaningful homologies. These databases can be searched for close homologs using a variety of tools such as BLAST [44,45], all of which calculate sequence similarity to uncover significant biological relationships. Over the past decade, however, researchers have increasingly adopted machine learning: work on protein structural classification has relied on supervised ML algorithms such as neural networks [31,32,33] and support vector machines (SVMs) [34].
For biological sequence analysis, several feature engineering-centric approaches have been proposed. Among them is one-hot encoding (OHE) [35], which offers a straightforward mechanism for mapping sequences into numerical vectors. For machine learning (ML) tasks like classification and clustering, alignment-based [35,46] and alignment-free [47] embedding techniques have gained popularity. These methods, however, suffer from scaling problems because of the extraordinarily high dimensionality of their feature vectors. In metagenomics, k-mer-based approaches are also used for sequence analysis [36,48], but the inherent sparsity of the resulting vectors limits their usefulness.
For metagenomic data, the authors of [49] recommend using minimizers: because metagenomic data consist of short reads, a single minimizer (m-mer) can fully describe each read. However, all these methods consider only the primary structure of the protein, i.e., the arrangement of its amino acids, without accounting for the three-dimensional form of the protein. A protein's structure contains a multitude of physiochemical properties that are not fully explored in the literature. The proteins that make up multiple sequence alignments (MSAs) are evolutionarily related for every structure and can be a crucial source of evolutionary information for contemporary protein structure prediction [50]; however, creating MSAs can be computationally expensive. Kernel-based techniques for sequence classification have also been proposed in the literature [51], but their memory consumption is high, and the biggest drawback of all these methods is that they incorporate no biochemical features and focus on sequences only. Although SeqVec is effective at describing and encoding biochemical features, it cannot infer crucial information about, for example, the activities of proteins [52].

3. Proposed Approach

In this section, we begin by giving a high-level overview of the proposed approach and then discuss in detail the process of extracting sequences from Protein Data Bank (PDB) files, followed by a discussion of the embedding method.
Figure 1 shows a high-level overview of the proposed approach. The PDB file is used as input to extract the sequences and structural information, as shown in Figure 1a–c. Using the structural information, we generate contact map-based embeddings, as shown in Figure 1d, whereas extracted sequences are used to generate LLM-based SeqVec embeddings and Spike2Vec embedding, as shown in Figure 1e,f. We evaluate these embeddings and their concatenated combinations to generate a feature vector to provide input for machine learning classifiers. Each step is discussed in detail below.

3.1. Sequence Extraction

Given the vital roles that proteins play across scientific disciplines, understanding their structure and function is essential. To obtain protein sequences, we leverage the Protein Data Bank (PDB), a large repository of three-dimensional protein structures. The extraction process parses each PDB file with a dedicated PDB parser, which carefully traverses the file and identifies the alpha-carbon atom of every amino acid residue. These alpha-carbon atoms define the fundamental protein backbone, a structural framework that makes it possible to identify the spatial arrangement of the amino acids comprising the protein's architecture. After this separation of the backbone, the amino acid residues are methodically mapped to their respective one-letter codes, usually via a dictionary that associates each residue with its assigned one-letter representation. The outcome is a set of concise but meaningful sequences that capture the subtleties of protein structures and enable in-depth examinations and large-scale research across numerous scientific fields, including medicine, biochemistry, and structural biology.
The result of this process is a concise yet comprehensive representation of the protein sequence, condensing the complicated structural features of the protein into a manageable form. This simplified form eases the comprehension and exchange of intricate protein structures, enabling in-depth analyses across a variety of scientific fields, and is an essential first step toward understanding the structural complexity and functional importance of proteins in biological systems.
One-letter codes are widely employed in molecular biology as a means to represent amino acid sequences. Each code corresponds to a specific amino acid, facilitating rapid and convenient identification of protein sequences by researchers. This process of sequence extraction is applied to all PDB files, generating a comprehensive collection of sequences that can be utilized for subsequent analysis. The extraction and analysis of protein sequences offer valuable insights into the structure and function of proteins, with implications spanning diverse fields including medicine, biochemistry, and biotechnology. By exploring these sequences, researchers can gain a deeper understanding of protein properties, enabling advancements in various scientific disciplines.
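As a hedged illustration of the extraction step described above, the minimal Python sketch below parses the ATOM records of a PDB file, keeps one alpha-carbon (CA) atom per residue, and maps each residue to its one-letter code. It stands in for a full-featured PDB parser such as Biopython's; the fixed-column offsets are standard PDB conventions, but the function and its name are our own.

```python
# Minimal sketch: extract a one-letter amino-acid sequence from the
# C-alpha (CA) ATOM records of a PDB file. Illustrative only; a real
# pipeline would use a dedicated PDB parser.

# Standard three-letter -> one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def extract_sequence(pdb_lines):
    """Keep one CA atom per residue and map residues to one-letter codes."""
    sequence = []
    for line in pdb_lines:
        # PDB fixed columns: atom name in cols 13-16, residue name in 18-20.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            res_name = line[17:20].strip()
            sequence.append(THREE_TO_ONE.get(res_name, "X"))  # "X" if non-standard
    return "".join(sequence)
```

The returned string can then be fed directly to the sequence-based embedding methods described later.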

3.2. Contact Map-Based Embedding Generation

To generate embeddings from protein structures, we use the idea of a contact map. The algorithmic pseudocode for generating the embeddings from PDB files is given in Algorithm 1. The main goal of the approach is to create a protein embedding representation based on contact maps derived from the spatial correlations between C-alpha atoms. The procedure begins by extracting the relevant structural information from the given Protein Data Bank (PDB) file, with particular emphasis on the spatial locations of the C-alpha atoms. Once these coordinates are obtained, the method computes the pairwise distances between every pair of C-alpha atoms, producing a distance matrix that records the distance between each pair of atoms in the protein structure. A threshold distance, a configurable hyperparameter, is then applied to delineate meaningful atom interactions: an element of the contact map corresponding to a distance below the threshold is marked 1, indicating spatial proximity (a contact) between the corresponding pair of C-alpha atoms, while an element corresponding to a distance at or above the threshold is marked 0, indicating no contact. In this way, the structural information contained in the protein's atomic arrangement is converted into an informative binary contact map that accurately depicts the critical spatial interactions between C-alpha atoms.
Because the threshold distance parameter controls which atom interactions count as significant, adjusting it makes the protein embedding representation flexible and adaptable.
Algorithm 1 Generating Contact Map-Based Embedding
1:  function GenerateContactMap(pdb_file)
2:      structure ← getStructure(protein, pdb_file)
3:      model ← getPolyPeptideChain(structure)
4:      c_alpha_atoms ← []
5:      for each chain in model do
6:          for each residue in chain do
7:              c_alpha_atoms.append(residue[CA].get_coord())
8:          end for
9:      end for
10:     n_atoms ← len(c_alpha_atoms)
11:     distances ← initializeDistanceMatrix(n_atoms, n_atoms)
12:     for i ← 0 to n_atoms − 1 do
13:         for j ← i + 1 to n_atoms − 1 do
14:             distances[i, j] ← Norm(c_alpha_atoms[i] − c_alpha_atoms[j])
15:             distances[j, i] ← distances[i, j]
16:         end for
17:     end for
18:     threshold ← 8.0                                        ▹ hyperparameter
19:     contact_map ← filterContactMap(distances < threshold, 1, 0)
20:     pca ← PCA(n_components = 0.99)
21:     contact_map_1d ← pca.fit_transform(contact_map)
22:     contact_map_1d ← contact_map_1d.flatten()
23:     return contact_map_1d
24: end function
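The steps of Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative re-implementation under stated assumptions, not the authors' exact code: it computes pairwise C-alpha distances, thresholds them at 8 Å, and substitutes an SVD-based projection for scikit-learn's PCA(n_components=0.99) to keep the sketch dependency-light.

```python
# Sketch of Algorithm 1 in NumPy: pairwise C-alpha distances, a binary
# contact map at an 8 angstrom threshold, and a variance-preserving
# reduction. The SVD-based projection stands in for scikit-learn's
# PCA(n_components=0.99); coordinates would come from a parsed PDB file.
import numpy as np

def contact_map_embedding(c_alpha_coords, threshold=8.0, var_kept=0.99):
    coords = np.asarray(c_alpha_coords, dtype=float)
    # Distance matrix between all C-alpha atom pairs.
    diff = coords[:, None, :] - coords[None, :, :]
    distances = np.linalg.norm(diff, axis=-1)
    # Binary contact map: 1 where the distance is below the threshold.
    contact_map = (distances < threshold).astype(float)
    # PCA via SVD on the mean-centered map: keep enough components to
    # explain `var_kept` of the variance, then flatten to 1D.
    centered = contact_map - contact_map.mean(axis=0)
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    var_ratio = (S ** 2) / np.sum(S ** 2)
    k = int(np.searchsorted(np.cumsum(var_ratio), var_kept)) + 1
    reduced = U[:, :k] * S[:k]  # projected coordinates of each residue
    return reduced.flatten()
```

The 8 Å threshold mirrors the hyperparameter value in step 18 of Algorithm 1.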
The technique incorporates principal component analysis (PCA) into its workflow to reduce the dimensionality of the contact map while preserving its essential structural information. PCA transforms the contact map by extracting principal components, i.e., linear combinations of the original features that capture the largest variance in the data; retaining the components that explain the most variance reduces dimensionality without sacrificing the critical structural properties of the contact map. The transformed contact map is then flattened into a one-dimensional array, yielding the contact map-based embedding: a concise yet informative depiction of the protein's structural characteristics that captures the important spatial interactions between its constituent C-alpha atoms. By condensing extensive structural information into a simplified form, these embeddings provide the foundation for classification and downstream analysis tasks, enabling researchers to better understand protein structures and build models for a range of biological and computational applications.
The contact map-based embedding is concatenated with a large language model (LLM)-based embedding method such as SeqVec [37] and a feature engineering-based method such as Spike2Vec [47], which are designed for sequence-only embedding (without considering structural information), to enhance the performance of the final embedding representation for the proteins. The details for SeqVec and Spike2Vec are given in Section 4.3.
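The fusion step itself is a plain concatenation. The sketch below uses placeholder vectors with illustrative dimensionalities, which are our assumptions for the example: 1024 for a SeqVec-style embedding, 21^3 = 9261 for Spike2Vec with k = 3 (per Section 4.3), and an arbitrary length for the contact map embedding, which varies per protein after PCA.

```python
# Placeholder vectors illustrating the concatenation; real embeddings
# would come from Algorithm 1, SeqVec, and Spike2Vec respectively.
import numpy as np

contact_map_emb = np.zeros(50)    # structure-based (length varies per protein)
seqvec_emb = np.zeros(1024)       # SeqVec / ELMo-style sequence embedding
spike2vec_emb = np.zeros(9261)    # k-mer spectrum, |Sigma| ** k = 21 ** 3

# Final feature vector fed to the ML classifiers.
feature_vector = np.concatenate([contact_map_emb, seqvec_emb, spike2vec_emb])
```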

4. Experimental Setup

In this section, we present an overview of the datasets, the machine learning classifiers, and the evaluation metrics. The experiments were conducted on an Ubuntu 64-bit OS (16.04.7 LTS Xenial Xerus) system with an Intel(R) Xeon(R) CPU E7-4850 v4 @ 2.10GHz processor and 3023 GB of memory. We use a 70–30% train–test split of data, with 10% of the training data reserved for hyperparameter tuning. The experiments were repeated five times using random splits to ensure reliable and consistent results, and the average and standard deviation of the outcomes were evaluated. For classification, we use SVM, naive Bayes (NB), multi-layer perceptron (MLP), KNN, random forest (RF), logistic regression (LR), and decision tree (DT). For evaluation, we use average accuracy, precision, recall, weighted F1, macro F1, ROC AUC, and training runtime. In cases where the metrics were originally designed for binary classification, we utilized the one-vs.-rest approach to adapt them for multi-class classification scenarios.
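The splitting protocol above can be sketched as follows. This is an illustrative, stdlib-only version (the function name is ours), with classifier training and metric computation omitted.

```python
# Stdlib-only sketch of the evaluation protocol: a 70-30 train-test
# split with 10% of the training portion held out for hyperparameter
# tuning, repeated over several seeds for mean/std reporting.
import random

def make_split(n_samples, seed, test_frac=0.30, val_frac=0.10):
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fresh random split per seed
    n_test = round(n_samples * test_frac)
    test, train_full = indices[:n_test], indices[n_test:]
    n_val = round(len(train_full) * val_frac)  # 10% of training data
    val, train = train_full[:n_val], train_full[n_val:]
    return train, val, test

# Five repeats with different seeds, as in the evaluation protocol.
splits = [make_split(1000, seed) for seed in range(5)]
```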
In our study, we use two well-established benchmark datasets. The preprocessed data, necessary for reproducing our results along with code, is publicly available online (https://github.com/pchourasia1/PDB_Plus_LLM_Contact_Map, accessed on 20 December 2023). We use the following datasets:

4.1. STCRDAB

The STCRDAB (Structural T-Cell Receptor Database) [53] dataset is a meticulously curated collection of T-cell receptor structural data sourced from the Protein Data Bank (PDB). It consists of a total of 512 protein structures, downloaded as of 27 May 2021. For our experiment, we selected 480 PDB files from this dataset (after pre-processing), where the protein structures are classified into two classes: “Humans” (total 325 PDB files) and “Mouse” (total 155 PDB files), also shown in Table 1. Thus, the classification problem is binary. The minimum, maximum, and average lengths of sequences extracted from PDB files in the STCRDAB dataset are 109, 5415, and 1074.38, respectively.

4.2. PDB Bind

For the PDB Bind dataset, we obtained version 2020 [54] from the official source. The initial dataset consisted of a total of 14,127 PDB structures. After preprocessing, we selected 3792 structures for our analysis. The target labels used in this dataset correspond to the protein names, as presented in Table 2. The minimum, maximum, and average lengths of sequences extracted from PDB files in the PDB Bind dataset are 33, 3292, and 403.60, respectively.

4.3. Baseline Models

We use the Spike2Vec, SeqVec, and Unsupervised Protein Embeddings (UPE) as baseline models. The details for the baseline models are below:

4.3.1. Spike2Vec [47]

It extracts features from protein sequences using the concept of k-mers, where a k-mer is a contiguous substring of length k within a sequence. For this study, we used k = 3 to obtain the embeddings, chosen through standard validation; this choice ensures computational efficiency while capturing sequence characteristics effectively. The length of the Spike2Vec-based embedding depends on the alphabet Σ of unique amino acid characters, here ACDEFGHIKLMNPQRSTVWXY: the embedding length is |Σ|^k, providing a representation that encompasses diverse amino acid properties and making it a promising tool for computational biology applications.
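A Spike2Vec-style k-mer spectrum can be sketched as follows. This is our own minimal illustration, not the reference implementation: every possible k-mer over the 21-letter alphabet gets a fixed position, giving an embedding of length |Σ|^k = 21^3 = 9261 for k = 3.

```python
# Minimal k-mer spectrum; helper names are ours, not Spike2Vec's API.
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWXY"  # 21 symbols, as listed in the text

def kmer_spectrum(sequence, k=3):
    # Fixed coordinate for every possible k-mer over the alphabet.
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    vector = [0] * len(index)  # length is |alphabet| ** k = 9261 for k = 3
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in index:  # skip k-mers containing unexpected symbols
            vector[index[kmer]] += 1
    return vector

emb = kmer_spectrum("MKTAYIAKQR")  # toy sequence with 8 overlapping 3-mers
```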

4.3.2. SeqVec [37]

It represents protein sequences as continuous vectors using the ELMo (Embeddings from Language Models) language model [55]. ELMo leverages the biophysical characteristics derived from unlabeled data from UniRef50 to generate embeddings (hence it is considered a large language model-based approach). This process, known as SeqVec (Sequence-to-Vector), assigns embeddings to individual words while taking into account their contextual information. By employing ELMo, it effectively captures the complex properties and relationships within protein sequences, enabling more comprehensive analysis and interpretation.

4.3.3. Unsupervised Protein Embeddings (UPE) [56]

It is an unsupervised deep learning approach for generating protein embeddings that considers both sequence and structural information. It uses a technique from [37] to generate initial embeddings from sequences. For structural features, it utilizes one-hot encoding of secondary structure angles derived from the protein’s 3D structure. The final protein representation is obtained by combining sequence and structural features. Unlike our proposed contact map-based approach, their method does not use one-hot encoding for embedding 3D structural information due to issues with dimensionality and information preservation [36].
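For illustration, one-hot encoding of per-residue structural states might look like the sketch below. Note this is an assumption for clarity: the cited work encodes secondary-structure angles, while we show the common three-state helix/strand/coil (H/E/C) labels.

```python
# Three-state one-hot encoding of a secondary-structure string.
STATES = "HEC"  # helix, strand (E), coil

def one_hot_ss(ss_string):
    # One row per residue, one column per structural state.
    return [[1 if s == state else 0 for state in STATES] for s in ss_string]

features = one_hot_ss("HHEC")  # toy secondary-structure assignment
```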

5. Results and Discussion

In this section, we present the results of our proposed method under various settings and compare its performance with baseline approaches on two datasets using different evaluation metrics.
The classification results are summarized in Table 3 and Table 4. When considering sequence-only embedding methods, we observe that the majority of cases achieve a predictive performance of over 90% for both datasets, surpassing the results obtained with structure-only approaches. The runtime is reduced significantly when SeqVec + Spike2Vec are used together for both datasets, but accuracy deteriorates significantly, as can be seen in Table 3; the same holds for the PDB Bind data in Table 4. This can be attributed to the fact that the functional regions of protein sequences are often more conserved across different proteins compared to their 3D structures, making them easier to identify and predict. Consequently, sequence-based models demonstrate greater effectiveness in protein function prediction and classification. Additionally, sequence-based models are simpler than 3D structure-based autoencoder models, as they do not need to account for the complexities of protein folding and interactions; this simplicity makes them more manageable to train and interpret, resulting in improved performance. Overall, our findings indicate that the Spike2Vec embedding with the LR classifier outperforms all other classifiers for the STCRDAB dataset, and a similar trend is observed for the PDB Bind dataset. The superior performance of Spike2Vec can be attributed to the fact that SeqVec, the large language model (LLM), is trained on diverse protein sequences from the UniRef50 dataset, which may not generalize effectively to the sequences extracted from the PDB files in our benchmark datasets. It is worth noting that the PDB Bind dataset is widely acknowledged as a challenging benchmark for structure-based prediction methods; consequently, we observed a relatively lower predictive performance when using structure-based embeddings on this dataset.
When we combine sequences and structure embeddings (i.e., contact map + Spike2Vec, contact map + SeqVec, and contact map + SeqVec + Spike2Vec), we can observe that the predictive performance for all classifiers increases. Eventually, the contact map + SeqVec + Spike2Vec outperforms all other methods. This is because when combining structure and sequence embeddings, we are incorporating more information about the protein and its environment, which can help improve the accuracy. The sequence embeddings capture the amino acid composition and ordering of the protein sequence, while the structure-based embeddings capture the 3D spatial arrangement of the atoms in the protein structure. By combining these two sources of information, we can leverage the strengths of both methods and obtain a more comprehensive representation of the protein. Moreover, the proposed sequence + structural method outperforms the baseline UPE [56] for all evaluation metrics and both datasets.
Our study demonstrates that combining structure and sequence information in protein analysis improves predictive performance compared to using either type of information alone. While the proposed method achieves reasonable performance when considering structure information alone, higher performance is observed when using sequence information alone, likely because functional regions are conserved across different protein sequences, making them easier to identify and predict. By combining both structure and sequence information, we obtain a comprehensive representation of the protein that accounts for both structural features and sequence variations. This combination leads to almost perfect predictive performance, highlighting the complementary nature of structure- and sequence-based embeddings in protein classification. Incorporating both types of information allows for a holistic understanding of protein function and interactions, resulting in improved classification outcomes.
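In practice, the combined representation can be obtained by simply concatenating the per-protein vectors before classification. A sketch with hypothetical dimensionalities follows (1024 matches SeqVec's usual per-residue embedding size; the other sizes are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-protein embeddings (dimensions chosen for illustration only):
contact_map_emb = rng.random(100)    # pooled contact-map (structure) features
seqvec_emb = rng.random(1024)        # SeqVec LLM embedding, averaged over residues
spike2vec_emb = rng.random(400)      # k-mer frequency (Spike2Vec) vector

# Contact Map + SeqVec + Spike2Vec: a single feature vector per protein,
# ready for any off-the-shelf classifier (SVM, LR, ...).
combined = np.concatenate([contact_map_emb, seqvec_emb, spike2vec_emb])
```

Concatenation keeps the two modalities independent in feature space, leaving it to the downstream classifier to weight structural versus sequence features.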
We ensured the reliability and consistency of the classification results through a statistical analysis using p-values, based on the averages and standard deviations of five experimental runs for both datasets. The p-values determined the statistical significance of comparisons between the proposed model and the baselines: these comparisons had p-values below 0.05, indicating statistically significant performance differences. For the training runtime metric, however, some p-values exceeded 0.05 due to greater variability in runtime values; factors such as processor performance and other active processes during training can affect runtime. It is important to note that our analysis focused primarily on the predictive performance metrics rather than training runtime.
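As an illustration of such a significance check over repeated runs, a distribution-free permutation test can be sketched in a few lines. The per-run accuracy values below are hypothetical, and the paper's analysis may instead use a parametric test on the reported means and standard deviations.

```python
import random
from statistics import mean

def permutation_p_value(a, b, n_perm=10000, seed=0):
    """Two-sided permutation test for a difference in means between two
    small samples (e.g., per-run accuracies of two models)."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        x, y = pooled[:len(a)], pooled[len(a):]
        # Count permutations at least as extreme as the observed difference.
        if abs(mean(x) - mean(y)) >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical accuracies over five runs for a proposed model and a baseline:
proposed = [0.991, 0.988, 0.990, 0.986, 0.989]
baseline = [0.916, 0.921, 0.897, 0.915, 0.894]
p = permutation_p_value(proposed, baseline)  # p < 0.05 indicates significance
```

With only five runs per model, a permutation test avoids the normality assumption a t-test would make on such small samples.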

6. Conclusions

This research explores the interplay between protein sequences and 3D structural data, leveraging large language model (LLM) techniques to improve protein classification. Our analysis highlights the synergy obtained by combining these distinct information sources and demonstrates their complementary nature: a hybrid method achieves a substantial performance gain over either data modality alone. Empirical evidence from our experiments contrasts the performance obtained when relying solely on 3D structural information with that obtained from protein sequences alone. Structure-only embeddings perform noticeably worse within the classification framework, underscoring the drawbacks of depending only on structural information, whereas protein sequences in isolation yield notably stronger performance metrics, showcasing their efficacy as a standalone information source. Overall, our study shows that combining structure and sequence information in protein analysis improves predictive performance compared to using either type of information alone.
In the future, we plan to develop deep learning systems designed to seamlessly combine structural and sequence data, with the goal of a model that maximizes the synergies between modalities to optimize classification accuracy and precision. In addition, we will investigate graph-based models, which have the potential to change how complex 3D structural data are embedded and used in classification algorithms. Our upcoming work also includes a thorough assessment that goes beyond the limitations of the datasets used here: testing the proposed model on a wider variety of datasets would allow a more complete evaluation of its interpretability, robustness, and scalability. This rigorous evaluation strategy is intended to validate the generalizability and applicability of our methodologies beyond specific datasets, thereby strengthening the credibility and utility of our approach within the broader scientific community.

Author Contributions

S.A. and P.C. worked towards data gathering and preprocessing, algorithm design, implementation, results computation, and manuscript writing. M.P. worked toward algorithm design, supervision, and manuscript writing. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by Molecular Basis of Diseases (MBD).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. AlQuraishi, M. Machine learning in protein structure prediction. Curr. Opin. Chem. Biol. 2021, 65, 1–8. [Google Scholar] [CrossRef] [PubMed]
  2. Kubinyi, H. Structure-based design of enzyme inhibitors and receptor ligands. Curr. Opin. Drug Discov. Dev. 1998, 1, 4–15. [Google Scholar]
  3. Zou, L.; Chen, L.; Lu, Y. Top-k subgraph matching query in a large graph. In Proceedings of the ACM First Ph.D. Workshop in CIKM, Lisbon, Portugal, 9 November 2007; pp. 139–146. [Google Scholar]
  4. Licheri, N.; Amparone, E.; Bonnici, V.; Giugno, R.; Beccuti, M. An Entropy Heuristic to Optimize Decision Diagrams for Index-driven Search in Biological Graph Databases. In Proceedings of the CIKM Workshops, Virtual, 1–5 November 2021. [Google Scholar]
  5. Batool, M.; Ahmad, B.; Choi, S. A structure-based drug discovery paradigm. Int. J. Mol. Sci. 2019, 20, 2783. [Google Scholar] [CrossRef] [PubMed]
  6. Burley, S.K.; Berman, H.M.; Kleywegt, G.J.; Markley, J.L.; Nakamura, H.; Velankar, S. Protein Data Bank (PDB): The single global macromolecular structure archive. In Protein Crystallography: Methods and Protocols; Springer: Berlin, Germany, 2017; pp. 627–641. [Google Scholar]
  7. Kmiecik, S.; Gront, D.; Kolinski, M.; Wieteska, L.; Dawid, A.E.; Kolinski, A. Coarse-grained protein models and their applications. Chem. Rev. 2016, 116, 7898–7936. [Google Scholar] [CrossRef] [PubMed]
  8. Schmidt, T.; Bergner, A.; Schwede, T. Modelling three-dimensional protein structures for applications in drug design. Drug Discov. Today 2014, 19, 890–897. [Google Scholar] [CrossRef] [PubMed]
  9. Lounnas, V.; Ritschel, T.; Kelder, J.; McGuire, R.; Bywater, R.P.; Foloppe, N. Current progress in structure-based rational drug design marks a new mindset in drug discovery. Comput. Struct. Biotechnol. J. 2013, 5, e201302011. [Google Scholar] [CrossRef]
  10. De Lucrezia, D.; Slanzi, D.; Poli, I.; Polticelli, F.; Minervini, G. Do natural proteins differ from random sequences polypeptides? Natural vs. random proteins classification using an evolutionary neural network. PLoS ONE 2012, 7, e36634. [Google Scholar] [CrossRef]
  11. Clark, W.T.; Radivojac, P. Analysis of protein function and its prediction from amino acid sequence. Proteins Struct. Funct. Bioinform. 2011, 79, 2086–2096. [Google Scholar] [CrossRef]
  12. Radivojac, P.; Clark, W.T.; Oron, T.R.; Schnoes, A.M.; Wittkop, T.; Sokolov, A.; Graim, K.; Funk, C.; Verspoor, K.; Ben-Hur, A.; et al. A large-scale evaluation of computational protein function prediction. Nat. Methods 2013, 10, 221–227. [Google Scholar] [CrossRef]
  13. Bailey, T.L.; Williams, N.; Misleh, C.; Li, W.W. MEME: Discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res. 2006, 34, W369–W373. [Google Scholar] [CrossRef]
  14. Rives, A.; Meier, J.; Sercu, T.; Goyal, S.; Lin, Z.; Liu, J.; Guo, D.; Ott, M.; Zitnick, C.L.; Ma, J.; et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. USA 2021, 118, e2016239118. [Google Scholar] [CrossRef] [PubMed]
  15. Cavasotto, C.N.; Phatak, S.S. Homology modeling in drug discovery: Current trends and applications. Drug Discov. Today 2009, 14, 676–683. [Google Scholar] [CrossRef] [PubMed]
  16. Li, H.; Homer, N. A survey of sequence alignment algorithms for next-generation sequencing. Brief. Bioinform. 2010, 11, 473–483. [Google Scholar] [CrossRef] [PubMed]
  17. Amitai, G.; Shemesh, A.; Sitbon, E.; Shklar, M.; Netanely, D.; Venger, I.; Pietrokovski, S. Network analysis of protein structures identifies functional residues. J. Mol. Biol. 2004, 344, 1135–1146. [Google Scholar] [CrossRef] [PubMed]
  18. Jing, B.; Eismann, S.; Suriana, P.; Townshend, R.J.; Dror, R. Learning from protein structure with geometric vector perceptrons. arXiv 2020, arXiv:2009.01411. [Google Scholar]
  19. Haas, J.; Roth, S.; Arnold, K.; Kiefer, F.; Schmidt, T.; Bordoli, L.; Schwede, T. The Protein Model Portal—a comprehensive resource for protein structure and model information. Database 2013, 2013, bat031. [Google Scholar] [CrossRef]
  20. Yan, T.C.; Yue, Z.X.; Xu, H.Q.; Liu, Y.H.; Hong, Y.F.; Chen, G.X.; Tao, L.; Xie, T. A systematic review of state-of-the-art strategies for machine learning-based protein function prediction. Comput. Biol. Med. 2022, 154, 106446. [Google Scholar] [CrossRef]
  21. Bonetta, R.; Valentino, G. Machine learning techniques for protein function prediction. Proteins Struct. Funct. Bioinform. 2020, 88, 397–413. [Google Scholar] [CrossRef]
  22. Liu, X. Deep recurrent neural network for protein function prediction from sequence. arXiv 2017, arXiv:1701.08318. [Google Scholar]
  23. Kuhlman, B.; Bradley, P. Advances in protein structure prediction and design. Nat. Rev. Mol. Cell Biol. 2019, 20, 681–697. [Google Scholar] [CrossRef]
  24. Madani, A.; Krause, B.; Greene, E.R.; Subramanian, S.; Mohr, B.P.; Holton, J.M.; Olmos, J.L., Jr.; Xiong, C.; Sun, Z.Z.; Socher, R.; et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 2023, 41, 1099–1106. [Google Scholar] [CrossRef]
  25. Quintana, F.; Treangen, T.; Kavraki, L. Leveraging Large Language Models for Predicting Microbial Virulence from Protein Structure and Sequence. In Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, Houston, TX, USA, 3–6 September 2023; pp. 1–6. [Google Scholar]
  26. Lin, Z.; Akin, H.; Rao, R.; Hie, B.; Zhu, Z.; Lu, W.; Smetanin, N.; Verkuil, R.; Kabeli, O.; Shmueli, Y.; et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023, 379, 1123–1130. [Google Scholar] [CrossRef] [PubMed]
  27. Ofer, D.; Brandes, N.; Linial, M. The language of proteins: NLP, machine learning & protein sequences. Comput. Struct. Biotechnol. J. 2021, 19, 1750–1758. [Google Scholar] [PubMed]
  28. Lin, Z.; Akin, H.; Rao, R.; Hie, B.; Zhu, Z.; Lu, W.; dos Santos Costa, A.; Fazel-Zarandi, M.; Sercu, T.; Candido, S.; et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv 2022, 2022, 500902. [Google Scholar]
  29. Forslund, K.; Sonnhammer, E.L. Predicting protein function from domain content. Bioinformatics 2008, 24, 1681–1687. [Google Scholar] [CrossRef] [PubMed]
  30. Pan, X.; Shen, H.B. RNA-protein binding motifs mining with a new hybrid deep learning based cross-domain knowledge integration approach. BMC Bioinform. 2017, 18, 1–14. [Google Scholar] [CrossRef] [PubMed]
  31. Klein, P.; Delisi, C. Prediction of protein structural class from the amino acid sequence. Biopolym. Orig. Res. Biomol. 1986, 25, 1659–1672. [Google Scholar] [CrossRef] [PubMed]
  32. Vinga, S.; Gouveia-Oliveira, R.; Almeida, J.S. Comparative evaluation of word composition distances for the recognition of SCOP relationships. Bioinformatics 2004, 20, 206–215. [Google Scholar] [CrossRef]
  33. Ie, E.; Weston, J.; Noble, W.S.; Leslie, C. Multi-class protein fold recognition using adaptive codes. In Proceedings of the International Conference on Machine Learning, Bonn, Germany, 7–11 August 2005; pp. 329–336. [Google Scholar]
  34. Shamim, M.T.A.; Anwaruddin, M.; Nagarajaram, H.A. Support vector machine-based classification of protein folds using the structural properties of amino acid residues and amino acid residue pairs. Bioinformatics 2007, 23, 3320–3327. [Google Scholar] [CrossRef]
  35. Kuzmin, K.; Adeniyi, A.E.; DaSouza, A.K., Jr.; Lim, D.; Nguyen, H.; Molina, N.R.; Xiong, L.; Weber, I.T.; Harrison, R.W. Machine learning methods accurately predict host specificity of coronaviruses based on spike sequences alone. Biochem. Biophys. Res. Commun. 2020, 533, 553–558. [Google Scholar] [CrossRef]
  36. Ali, S.; Sahoo, B.; Ullah, N.; Zelikovskiy, A.; Patterson, M.; Khan, I. A k-mer based approach for SARS-CoV-2 variant identification. In Proceedings of the International Symposium on Bioinformatics Research and Applications, Shenzhen, China, 26–28 November 2021; pp. 153–164. [Google Scholar]
  37. Heinzinger, M.; Elnaggar, A.; Wang, Y.; Dallago, C.; Nechaev, D.; Matthes, F.; Rost, B. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinform. 2019, 20, 1–17. [Google Scholar] [CrossRef] [PubMed]
  38. Brandes, N.; Ofer, D.; Peleg, Y.; Rappoport, N.; Linial, M. ProteinBERT: A universal deep-learning model of protein sequence and function. Bioinformatics 2022, 38, 2102–2110. [Google Scholar] [CrossRef]
  39. Sofi, M.A.; Wani, M.A. Improving Prediction of Protein Secondary Structures using Attention-enhanced Deep Neural Networks. In Proceedings of the 2022 9th International Conference on Computing for Sustainable Global Development, New Delhi, India, 23–25 March 2022; pp. 664–668. [Google Scholar]
  40. Buchan, D.W.; Jones, D.T. The PSIPRED protein analysis workbench: 20 years on. Nucleic Acids Res. 2019, 47, W402–W407. [Google Scholar] [CrossRef]
  41. Rozemberczki, B.; Gogleva, A.; Nilsson, S.; Edwards, G.; Nikolov, A.; Papa, E. MOOMIN: Deep Molecular Omics Network for Anti-Cancer Drug Combination Therapy. In Proceedings of the International Conference on Information & Knowledge Management (CIKM), Atlanta, GA, USA, 17–21 October 2022; pp. 3472–3483. [Google Scholar]
  42. Apeltsin, L.; Morris, J.H.; Babbitt, P.C.; Ferrin, T.E. Improving the quality of protein similarity network clustering algorithms using the network edge weight distribution. Bioinformatics 2011, 27, 326–333. [Google Scholar] [CrossRef] [PubMed]
  43. Altschul, S.F.; Madden, T.L.; Schäffer, A.A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D.J. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997, 25, 3389–3402. [Google Scholar] [CrossRef] [PubMed]
  44. Altschul, S.F.; Gish, W.; Miller, W.; Myers, E.W.; Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 1990, 215, 403–410. [Google Scholar] [CrossRef]
  45. Altschul, S.F.; Wootton, J.C.; Gertz, E.M.; Agarwala, R.; Morgulis, A.; Schäffer, A.A.; Yu, Y.K. Protein database searches using compositionally adjusted substitution matrices. FEBS J. 2005, 272, 5101–5109. [Google Scholar] [CrossRef]
  46. Ali, S.; Bello, B.; Chourasia, P.; Punathil, R.T.; Zhou, Y.; Patterson, M. PWM2Vec: An Efficient Embedding Approach for Viral Host Specification from Coronavirus Spike Sequences. Biology 2022, 11, 418. [Google Scholar] [CrossRef]
  47. Ali, S.; Patterson, M. Spike2vec: An efficient and scalable embedding approach for COVID-19 spike sequences. In Proceedings of the IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 15–18 December 2021; pp. 1533–1540. [Google Scholar]
  48. Wood, D.; Salzberg, S. Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014, 15, 10. [Google Scholar] [CrossRef]
  49. Girotto, S.; Pizzi, C.; Comin, M. MetaProb: Accurate metagenomic reads binning based on probabilistic sequence signatures. Bioinformatics 2016, 32, i567–i575. [Google Scholar] [CrossRef]
  50. De Oliveira, S.; Deane, C. Co-evolution techniques are reshaping the way we do structural bioinformatics. F1000Research 2017, 6, 1224. [Google Scholar] [CrossRef] [PubMed]
  51. Kuksa, P.; Khan, I.; Pavlovic, V. Generalized Similarity Kernels for Efficient Sequence Classification. In Proceedings of the SIAM International Conference on Data Mining (SDM), Anaheim, CA, USA, 26–28 April 2012; pp. 873–882. [Google Scholar]
  52. Kané, H.; Coulibali, M.K.; Ajanoh, P.; Abdallah, A. Augmenting protein network embeddings with sequence information. bioRxiv 2019. [Google Scholar] [CrossRef]
  53. Leem, J.; de Oliveira, S.H.P.; Krawczyk, K.; Deane, C.M. STCRDab: The structural T-cell receptor database. Nucleic Acids Res. 2018, 46, D406–D412. [Google Scholar] [CrossRef] [PubMed]
  54. Liu, Z.; Li, Y.; Han, L.; Li, J.; Liu, J.; Zhao, Z.; Nie, W.; Liu, Y.; Wang, R. PDB-wide collection of binding data: Current status of the PDBbind database. Bioinformatics 2015, 31, 405–412. [Google Scholar] [CrossRef]
  55. Sarzynska-Wawer, J.; Wawer, A.; Pawlak, A.; Szymanowska, J.; Stefaniak, I.; Jarkiewicz, M.; Okruszek, L. Detecting formal thought disorder by deep contextualized word representations. Psychiatry Res. 2021, 304, 114135. [Google Scholar] [CrossRef]
  56. Villegas-Morcillo, A.; Makrodimitris, S.; van Ham, R.C.; Gomez, A.M.; Sanchez, V.; Reinders, M.J. Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function. Bioinformatics 2021, 37, 162–170. [Google Scholar] [CrossRef]
Figure 1. Workflow of the proposed approach. We provide PDB files as input (a), then extract the structural information (b). In parallel, we extract the sequences from these files (c). A contact map embedding is generated from the structural information (d), while the sequences are used to generate SeqVec embeddings (e) and Spike2Vec embeddings (f). Finally, these embeddings are concatenated to generate the feature vectors (g).
Table 1. Class/target label statistics for STCRDAB dataset.

Species   Count
Human     325
Mouse     155
Total     480
Table 2. Class/target label statistics for PDB Bind dataset.

Target Name                            Count   Target Name        Count
SERINE/THREONINE—PROTEIN               404     PROTEIN            138
TYROSINE—PROTEIN                       381     E3                 138
MITOGEN—ACTIVATED                      325     CYCLIN-DEPENDENT   128
BETA—SECRETASE                         299     GLUTAMATE          112
BETA—LACTAMASE                         220     DUAL               111
BROMODOMAIN—CONTAINING                 174     HEAT               110
HIV-1                                  164     PROTEASOME         110
CARBONIC                               159     TANKYRASE-2        108
CELL                                   157     LYSINE-SPECIFIC    105
GLYCOGEN                               144     DNA                104
PHOSPHATIDYLINOSITOL-45-BISPHOSPHATE   100     COAGULATION        101
Total                                  3792
Table 3. Average classification results (of 5 runs) for different methods on the STCRDAB dataset. The best values are shown in bold. An up arrow (↑) indicates that a higher value is better; a down arrow (↓) indicates that a lower value is better.
Algo.   Acc. ↑  Prec. ↑  Recall ↑  F1 (Weig.) ↑  F1 (Macro) ↑  ROC-AUC ↑  Train Time (s) ↓

Sequence Only (Baselines): Spike2Vec [47]
  SVM   0.976   0.977   0.976   0.976   0.972   0.967   1.824
  NB    0.978   0.978   0.978   0.978   0.974   0.967   0.189
  MLP   0.983   0.984   0.983   0.983   0.981   0.982   5.145
  KNN   0.963   0.963   0.963   0.962   0.956   0.948   0.087
  RF    0.975   0.975   0.975   0.975   0.971   0.967   0.462
  LR    0.988   0.988   0.988   0.987   0.985   0.986   0.119
  DT    0.957   0.957   0.957   0.957   0.950   0.948   0.204

Sequence Only (Baselines): SeqVec [37]
  SVM   0.794   0.795   0.794   0.783   0.750   0.737   0.025
  NB    0.743   0.741   0.743   0.739   0.708   0.702   0.004
  MLP   0.726   0.734   0.726   0.727   0.700   0.705   3.297
  KNN   0.829   0.830   0.829   0.820   0.793   0.778   0.047
  RF    0.812   0.854   0.812   0.788   0.747   0.726   0.590
  LR    0.886   0.886   0.886   0.884   0.871   0.864   0.030
  DT    0.790   0.787   0.790   0.786   0.757   0.749   0.114

Sequence Only (Baselines): SeqVec + Spike2Vec
  SVM   0.882   0.886   0.882   0.876   0.857   0.840   0.051
  NB    0.829   0.828   0.829   0.827   0.803   0.795   0.002
  MLP   0.767   0.769   0.767   0.768   0.739   0.741   0.651
  KNN   0.926   0.933   0.926   0.924   0.912   0.893   0.033
  RF    0.913   0.917   0.913   0.910   0.895   0.877   0.331
  LR    0.982   0.982   0.982   0.982   0.980   0.980   0.019
  DT    0.897   0.897   0.897   0.897   0.884   0.881   0.053

Sequence + Structure (Baseline): UPE [56]
  SVM   0.916   0.989   0.916   0.988   0.909   0.907   0.961
  NB    0.897   0.908   0.897   0.895   0.896   0.911   0.975
  MLP   0.915   0.929   0.915   0.928   0.983   0.971   1.097
  KNN   0.921   0.928   0.921   0.929   0.981   0.979   0.452
  RF    0.894   0.885   0.894   0.892   0.881   0.893   0.813
  LR    0.957   0.942   0.957   0.954   0.975   0.963   0.128
  DT    0.901   0.899   0.901   0.900   0.921   0.943   0.042

Structure Only (ours): Contact Map
  SVM   0.569   0.556   0.569   0.560   0.514   0.517   0.040
  NB    0.639   0.644   0.639   0.624   0.584   0.591   0.007
  MLP   0.563   0.563   0.563   0.544   0.498   0.515   4.255
  KNN   0.621   0.554   0.621   0.537   0.453   0.509   0.048
  RF    0.646   0.554   0.646   0.511   0.403   0.506   0.747
  LR    0.664   0.653   0.664   0.648   0.605   0.603   0.037
  DT    0.579   0.572   0.579   0.573   0.531   0.532   0.207

Sequence + Structure (ours): Contact Map + Spike2Vec
  SVM   0.789   0.805   0.789   0.769   0.729   0.713   0.152
  NB    0.847   0.853   0.847   0.843   0.821   0.810   0.008
  MLP   0.614   0.619   0.614   0.603   0.552   0.559   18.721
  KNN   0.939   0.945   0.939   0.937   0.928   0.913   0.283
  RF    0.924   0.932   0.924   0.921   0.910   0.891   2.191
  LR    0.981   0.981   0.981   0.981   0.978   0.978   0.243
  DT    0.913   0.915   0.913   0.913   0.902   0.905   0.330

Sequence + Structure (ours): Contact Map + SeqVec
  SVM   0.840   0.869   0.840   0.819   0.768   0.737   0.061
  NB    0.800   0.798   0.800   0.784   0.730   0.710   0.007
  MLP   0.690   0.697   0.690   0.688   0.632   0.639   3.287
  KNN   0.844   0.844   0.844   0.843   0.813   0.810   0.043
  RF    0.824   0.850   0.824   0.800   0.743   0.714   0.779
  LR    0.879   0.881   0.879   0.880   0.858   0.861   0.077
  DT    0.799   0.801   0.799   0.799   0.764   0.766   0.286

Sequence + Structure (ours): Contact Map + SeqVec + Spike2Vec
  SVM   0.991   0.990   0.991   0.990   0.988   0.985   81.170
  NB    0.988   0.988   0.988   0.987   0.985   0.982   0.320
  MLP   0.988   0.988   0.988   0.988   0.985   0.988   11.968
  KNN   0.940   0.942   0.940   0.939   0.924   0.909   0.330
  RF    0.979   0.980   0.979   0.979   0.974   0.970   0.745
  LR    0.986   0.986   0.986   0.986   0.983   0.980   0.889
  DT    0.956   0.957   0.956   0.956   0.948   0.951   0.410
Table 4. Average classification results (of 5 runs) for different methods on the PDB Bind dataset. The best values are shown in bold. An up arrow (↑) indicates that a higher value is better; a down arrow (↓) indicates that a lower value is better.
Algo.   Acc. ↑  Prec. ↑  Recall ↑  F1 (Weig.) ↑  F1 (Macro) ↑  ROC-AUC ↑  Train Time (s) ↓

Sequence Only (Baselines): Spike2Vec [47]
  SVM   0.960   0.965   0.960   0.961   0.954   0.975   263.112
  NB    0.943   0.956   0.943   0.944   0.931   0.964   8.230
  MLP   0.934   0.939   0.934   0.934   0.919   0.958   85.427
  KNN   0.896   0.954   0.896   0.910   0.897   0.941   1.961
  RF    0.960   0.966   0.960   0.961   0.954   0.975   6.888
  LR    0.966   0.967   0.966   0.966   0.959   0.978   8.471
  DT    0.939   0.942   0.939   0.939   0.929   0.962   4.682

Sequence Only (Baselines): SeqVec [37]
  SVM   0.845   0.881   0.845   0.846   0.857   0.909   4.124
  NB    0.301   0.550   0.301   0.299   0.300   0.632   0.209
  MLP   0.745   0.756   0.745   0.741   0.735   0.865   32.370
  KNN   0.828   0.849   0.828   0.830   0.817   0.901   0.311
  RF    0.822   0.876   0.822   0.829   0.844   0.898   7.645
  LR    0.874   0.880   0.874   0.874   0.870   0.927   19.388
  DT    0.783   0.782   0.783   0.781   0.782   0.887   7.134

Sequence Only (Baselines): SeqVec + Spike2Vec
  SVM   0.883   0.905   0.883   0.882   0.884   0.925   12.571
  NB    0.688   0.765   0.688   0.703   0.692   0.841   0.136
  MLP   0.757   0.768   0.757   0.754   0.747   0.873   9.640
  KNN   0.919   0.942   0.919   0.924   0.912   0.954   2.412
  RF    0.935   0.943   0.935   0.937   0.929   0.961   6.024
  LR    0.958   0.962   0.958   0.959   0.951   0.974   22.074
  DT    0.878   0.881   0.878   0.878   0.865   0.930   2.765

Sequence + Structure (Baseline): UPE [56]
  SVM   0.891   0.912   0.891   0.942   0.929   0.899   6.581
  NB    0.922   0.941   0.922   0.918   0.919   0.896   1.675
  MLP   0.963   0.922   0.963   0.921   0.905   0.896   4.254
  KNN   0.959   0.923   0.959   0.949   0.938   0.893   0.234
  RF    0.921   0.944   0.921   0.932   0.928   0.948   4.563
  LR    0.954   0.925   0.954   0.930   0.929   0.965   9.753
  DT    0.939   0.928   0.939   0.935   0.912   0.945   0.973

Structure Only (ours): Contact Map
  SVM   0.585   0.823   0.585   0.627   0.665   0.779   25.248
  NB    0.352   0.440   0.352   0.345   0.325   0.657   0.947
  MLP   0.502   0.570   0.502   0.505   0.510   0.748   80.370
  KNN   0.571   0.706   0.571   0.599   0.574   0.760   0.482
  RF    0.690   0.759   0.690   0.694   0.702   0.821   19.493
  LR    0.712   0.726   0.712   0.713   0.699   0.840   151.352
  DT    0.578   0.586   0.578   0.579   0.571   0.777   17.549

Sequence + Structure (ours): Contact Map + Spike2Vec
  SVM   0.678   0.827   0.678   0.706   0.731   0.821   17.635
  NB    0.426   0.501   0.426   0.431   0.409   0.690   0.575
  MLP   0.535   0.588   0.535   0.535   0.536   0.767   105.944
  KNN   0.593   0.769   0.593   0.637   0.634   0.784   0.475
  RF    0.841   0.866   0.841   0.841   0.844   0.904   14.192
  LR    0.918   0.923   0.918   0.919   0.908   0.952   138.618
  DT    0.775   0.779   0.775   0.774   0.766   0.880   12.891

Sequence + Structure (ours): Contact Map + SeqVec
  SVM   0.802   0.870   0.802   0.807   0.810   0.877   34.657
  NB    0.459   0.533   0.459   0.451   0.443   0.715   1.630
  MLP   0.553   0.598   0.553   0.554   0.558   0.777   64.202
  KNN   0.552   0.717   0.552   0.589   0.573   0.754   0.465
  RF    0.798   0.853   0.798   0.806   0.820   0.885   19.770
  LR    0.804   0.809   0.804   0.803   0.802   0.894   196.321
  DT    0.714   0.717   0.714   0.713   0.710   0.850   25.274

Sequence + Structure (ours): Contact Map + SeqVec + Spike2Vec
  SVM   0.677   0.816   0.677   0.701   0.728   0.818   17.343
  NB    0.430   0.528   0.430   0.441   0.425   0.697   0.498
  MLP   0.524   0.568   0.524   0.524   0.530   0.761   120.589
  KNN   0.671   0.726   0.671   0.682   0.686   0.828   0.429
  RF    0.839   0.860   0.839   0.839   0.844   0.904   14.844
  LR    0.968   0.972   0.968   0.969   0.966   0.980   134.948
  DT    0.764   0.772   0.764   0.766   0.762   0.875   12.175

Share and Cite

MDPI and ACS Style

Ali, S.; Chourasia, P.; Patterson, M. When Protein Structure Embedding Meets Large Language Models. Genes 2024, 15, 25. https://doi.org/10.3390/genes15010025

