Article

Three-Dimensional Graph Matching to Identify Secondary Structure Correspondence of Medium-Resolution Cryo-EM Density Maps

by Bahareh Behkamal 1, Mahmoud Naghibzadeh 1,*, Mohammad Reza Saberi 2,3, Zeinab Amiri Tehranizadeh 2, Andrea Pagnani 4,5,6 and Kamal Al Nasr 7,*

1 Department of Computer Engineering, Faculty of Engineering, Ferdowsi University of Mashhad, Mashhad 9177948974, Iran
2 Medicinal Chemistry Department, School of Pharmacy, Mashhad University of Medical Sciences, Mashhad 9177899191, Iran
3 Bioinformatics Research Group, Mashhad University of Medical Sciences, Mashhad 9177899191, Iran
4 Politecnico di Torino, Corso Duca degli Abruzzi 24, I-10129 Torino, Italy
5 Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060 Candiolo, Italy
6 INFN, Sezione di Torino, I-10125 Torino, Italy
7 Department of Computer Science, Tennessee State University, Nashville, TN 37209, USA
* Authors to whom correspondence should be addressed.
Biomolecules 2021, 11(12), 1773; https://doi.org/10.3390/biom11121773
Submission received: 30 September 2021 / Revised: 18 November 2021 / Accepted: 20 November 2021 / Published: 26 November 2021

Abstract

Cryo-electron microscopy (cryo-EM) is a structural technique that has played a significant role in protein structure determination in recent years. Compared to the traditional methods of X-ray crystallography and NMR spectroscopy, cryo-EM is capable of producing images of much larger protein complexes. However, cryo-EM reconstructions are limited to medium-resolution (~4–10 Å) for some cases. At this resolution range, a cryo-EM density map can hardly be used to directly determine the structure of proteins at atomic level resolutions, or even at their amino acid residue backbones. At such a resolution, only the position and orientation of secondary structure elements (SSEs) such as α-helices and β-sheets are observable. Consequently, finding the mapping of the secondary structures of the modeled structure (SSEs-A) to the cryo-EM map (SSEs-C) is one of the primary concerns in cryo-EM modeling. To address this issue, this study proposes a novel automatic computational method to identify SSEs correspondence in three-dimensional (3D) space. Initially, through a modeling of the target sequence with the aid of extracting highly reliable features from a generated 3D model and map, the SSEs matching problem is formulated as a 3D vector matching problem. Afterward, the 3D vector matching problem is transformed into a 3D graph matching problem. Finally, a similarity-based voting algorithm combined with the principle of least conflict (PLC) concept is developed to obtain the SSEs correspondence. To evaluate the accuracy of the method, a testing set of 25 experimental and simulated maps with a maximum of 65 SSEs is selected. Comparative studies are also conducted to demonstrate the superiority of the proposed method over some state-of-the-art techniques. The results demonstrate that the method is efficient, robust, and works well in the presence of errors in the predicted secondary structures of the cryo-EM images.

1. Introduction

Proteins are one of the essential parts of all organisms that perform most of the tasks of living species. To study the relationship between protein structure and function, it is necessary to have access to precise three-dimensional (3D) structural information [1]. Hence, understanding the protein structure is of great interest to biologists. Traditionally, protein structures have been obtained using experimental techniques such as X-ray crystallography and NMR spectroscopy. X-ray crystallography has been used to study thousands of protein complexes which are crystallizable. NMR spectroscopy is limited to small molecules of an atomic mass less than 50 kDa. Therefore, neither of these techniques can be used to study molecular complexes which can be found in nature in their near-native state [2]. More recently, cryo-electron microscopy (cryo-EM) has emerged as an experimental technique to address most of the scalability concerns of the traditional techniques by being able to image large macromolecular complexes, such as ribosomes and viruses, in their native conformations. This widely used technique does not require crystalizing before data acquisition and it is applicable on a molecule larger than ~100 kDa [3,4]. In recent years, there have been significant advances in cryo-EM imaging techniques [5]. However, for some cases, the cryo-EM reconstructions are limited to medium-resolution (~4–10 Å), where the secondary structure elements can be computationally and visually identified, but not the individual amino acid residues [6]. This lack of atomic-level resolution leads to many computational challenges for protein 3D structure determination. For the density maps at high-resolution (~2–4 Å), the backbone is recognizable, and the protein structure at the atomic level can be directly derived. 
However, for low-resolution (~10–25 Å) or medium-resolution (~4–10 Å) density maps, the protein backbone and the atomic information cannot be derived directly from the cryo-EM maps. This limitation has motivated the development of many computational methods that use the medium-resolution cryo-EM map to collect protein structural information [7,8,9,10,11,12,13,14,15]. The cryo-EM modeling pipeline involves several major steps: extracting the secondary structure elements from the cryo-EM density map and matching them to a sequence/model, placing the Cα atoms of the SSEs, building an atomic structure, and optimizing the structure [6]. One of the most challenging and critical steps is finding the mapping of the secondary structures of the modeled structure to the cryo-EM map, because this step provides the initial anchor points for locating the Cα atoms and constructing the protein backbone. The precise identification of the SSEs correspondence makes it possible to produce an accurate initial 3D structure of a protein that can be refined further by later steps in the model-building pipeline.
At medium-resolution, the analyses of cryo-EM maps rely on the availability of the known protein structures obtained by other high-resolution experimental methods (X-ray crystallography, NMR). When the atomic structure from other sources of information is not accessible, a de novo modeling approach could be utilized [9,16,17,18,19,20]. S. Abeysingh et al. [16] introduced a research study on solving the α-helix correspondence problem through shape matching by modeling both a 1D sequence and a 3D volume to attributed relational graphs. Furthermore, they developed Gorgon [21], which is an interactive molecular modeling toolkit with an interactive visualization platform. Al Nasr et al. developed a weighted directed graph to solve the secondary structure assignment and presented an approach to enumerate the top-ranked topologies instead of enumerating all possible topologies [18]. The authors conducted another study, DP-TOSS, to solve the topology determination based on a layered graph using a dynamic programming approach into a constrained k-shortest path algorithm [19]. DP-TOSS was compared with Gorgon in our previous study [19]. The results indicated that DP-TOSS was superior to Gorgon. Afterwards, Biswas et al. [22] enhanced the performance of DP-TOSS by combining the information from multiple secondary structure prediction servers. They utilized some different structural information, such as the length of secondary structures, the loop length, and the skeleton between two secondary structure traces as a scoring function. Al Nasr et al. enhanced the DP-TOSS accuracy using the efficient scoring methodology. The proposed scoring functions were a skeleton-based scoring function, a geometry-based function, and a multi-well potential energy-based function [20].
In the presence of a high-resolution structure for an insufficient resolution cryo-EM map, the fitting methods, which are categorized into flexible and rigid-body fitting, could be utilized to derive the atomic structure from the cryo-EM map [9,12,14,17]. Early studies have concentrated on searching for the optimal position and orientation of a protein’s secondary structure components with the best overlaps with the SSEs extracted from a cryo-EM density map [23,24,25,26]. Dou et al. proposed a flexible fitting of an atomic structure into a cryo-EM map which is guided by the correspondences between α-helices in the atomic model and the cryo-EM map [27]. In the work of [28], a computational method is presented to quantify the agreement between two sets of central axes of α-helices which are relevant to atomic structures and cryo-EM maps. It utilized an arc-length association strategy to characterize the lateral and the longitudinal differences of the two axes.
Our approach in this study is to introduce a novel geometrical matching approach to find the correct matches between SSEs-C and SSEs-A (SSEs correspondence). The central theme of our approach is to cast the SSEs mapping problem as that of three-dimensional graph matching. For this purpose, the SSEs matching problem is formulated as a 3D vector matching problem in Cartesian coordinate space. Then, the 3D vector matching problem is transformed into a 3D graph matching problem. To solve the 3D graph matching problem, three novel mathematical-based features, as well as two robust statistical scoring functions, are proposed. Finally, to obtain the final SSEs assignment among all possible ones, a similarity-based voting algorithm combined with the PLC concept is developed. Furthermore, the results show the superiority of the proposed method compared to some of the state-of-the-art techniques.

2. Materials and Methods

In this section, an automatic assignment method for finding the SSEs correspondence in three-dimensional space is proposed. An overview of the method is illustrated in Figure 1. The method takes the modeled structure and the medium-resolution cryo-EM density map as inputs (Figure 1a,b) and produces the SSEs correspondence as output. First, in the preprocessing step, the α-helices and β-strands are extracted from the modeled structure (SSEs-A) and from the cryo-EM map (SSEs-C) (Figure 1c,d). The extracted SSEs from both the structure and the map are then represented as vectors in a three-dimensional Cartesian coordinate system (Figure 1e,f). Next, using three mathematical-based features (i.e., angle, Euclidean distance, and relative length), the 3D vector matching problem is transformed into a 3D graph matching problem (Figure 1g,h). To solve the 3D graph matching problem, two robust statistical scoring functions, the Bhattacharyya distance (BD) and the modal assurance criterion (MAC), are proposed. Finally, a similarity-based voting algorithm (Figure 1i) extracts the SSEs correspondence.

2.1. Preprocessing

In this step, the model, generated by I-TASSER [30,31,32,33], and the cryo-EM density map are used as initial inputs and the geometrical features are returned as outputs. Generally, the protein modeling can be performed using various modeling tools such as Modeller [34], AlphaFold [35,36], RaptorX [37,38,39], and I-TASSER. I-TASSER (Zhang-Server) and AlphaFold (A7D) are two efficient and robust methods, which are based on deep residual-convolutional networks. AlphaFold utilizes artificial intelligence and deep learning methods to generate the 3D structure of proteins. The framework of the AlphaFold is based on a deep two-dimensional convolutional residual network that enables this method to create high-accuracy structures even under sequences with fewer homologous sequences. I-TASSER is developed for automated protein structure prediction, which performs the model construction by collecting the high-scoring structural templates based on the threading approaches. The hierarchical architecture is composed of four steps, including threading, structural assembly, model selection, and structure-based functional annotation. I-TASSER finds a protein template of similar super-secondary structures from the Protein Data Bank (PDB) through LOMETS [40,41]. Then, the extracted segments from the templates are reconstructed through replica-exchange Monte Carlo simulations. The performance of the generated model is assessed based on the reliability of the threading templates and the convergence parameters of the structural assembly. The server was successful in the Critical Assessment of Techniques for Protein Structure Prediction (CASP) competition in recent years. Hence, in this study, the authors opted for I-TASSER, which is available at (https://zhanggroup.org/I-TASSER/, accessed on 30 September 2021) due to its simplicity and high accuracy.
The geometrical features are the Cartesian coordinate voxels of the SSEs (α-helices and β-strands). The α-helices and β-strands are the primary elements of the secondary structure, as illustrated in Figure 2. These elements are formed by amino acid residues. Each residue has four primary backbone atoms (N, Cα, C, and O), of which the Cα atom is the most important one in the backbone of the SSEs. For the first input (i.e., the 3D model), all the Cα coordinates of the SSEs-A (the geometrical locations of the backbone alpha carbons of the α-helices and β-strands) are extracted. The second input is the cryo-EM map. In a medium-resolution cryo-EM map, the secondary structure components can be observed as density rods [17]. Various computational methods, such as SSEhunter [42], SSELearner [43], SSETracer [44], and Emap2sec [45], have been developed to detect the position, orientation, and length of α-helices and β-strands in cryo-EM images. In this study, the Cartesian coordinate voxels of the SSEs-C have been extracted using SSETracer [44].

2.2. Construction of 3D Vectors from SSEs-A and SSEs-C

This study aims to find the correspondence between the α-helices and β-strands detected in the cryo-EM map and those extracted from the modeled structure. To this end, the SSEs extracted from the map and from the 3D model are converted to 3D vectors in the Cartesian coordinate system. For visualization, a simple α-protein, 1FLP (PDB ID), is selected from the data set of interest, as demonstrated in Figure 3. The start and end voxels of the SSEs-A are used to construct the 3D vectors (Figure 3a,b). Since a medium-resolution cryo-EM map carries no information about the Cα atoms, the coordinate voxels of the central axis of the SSEs-C are used to construct the 3D vectors (Figure 3c,d).
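The vector construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sse_vector` and the sample coordinates are hypothetical, and the input is assumed to be an ordered list of 3D points (Cα positions for an SSE-A, central-axis voxels for an SSE-C).

```python
import numpy as np

def sse_vector(points):
    """Return (start, end, direction) for one SSE given its ordered 3D points."""
    pts = np.asarray(points, dtype=float)
    start, end = pts[0], pts[-1]      # first and last point of the element
    return start, end, end - start    # direction vector from start to end

# Hypothetical coordinates (in Å): eight points on a straight segment along z.
helix_points = [(0.0, 0.0, 1.5 * k) for k in range(8)]
start, end, direction = sse_vector(helix_points)
length = float(np.linalg.norm(direction))
```

Only the two end points are used in this sketch, so any curvature of the element is ignored.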

2.3. Three-Dimensional Vector Matching

To solve the vector matching problem, three mathematical-based features are proposed: the angle, the Euclidean distance, and the relative length. These features are computed from all vectors in $R^{3}_{\text{SSEs-A}}$ and $R^{3}_{\text{SSEs-C}}$. Afterward, the 3D vector matching problem is transformed into a 3D graph matching problem based on the extracted features. The construction of the graphs is described below.

Construction of Weighted Fully Connected Graphs of SSEs-A and SSEs-C

The central idea of the method is to find the correspondence between the constructed 3D vectors of $R^{3}_{\text{SSEs-A}}$ and $R^{3}_{\text{SSEs-C}}$. Hence, two weighted fully connected graphs (i.e., $G_{\text{SSEs-A}}$ and $G_{\text{SSEs-C}}$) have been constructed from $R^{3}_{\text{SSEs-A}}$ and $R^{3}_{\text{SSEs-C}}$.
Figure 4 illustrates the transformation of the 3D vectors to the 3D graphs. For the sake of simplicity, only the relevant edges of one node in the weighted fully connected graphs are illustrated.
Let $A = (A_1, A_2, \ldots, A_m)$ be the set of SSEs-A detected in the atomic structure and $C = (C_1, C_2, \ldots, C_n)$ be the set of SSEs-C extracted from the cryo-EM map. The weighted fully connected graphs of SSEs-A and SSEs-C are undirected and are represented as the 4-tuples $G_{\text{SSEs-A}} = (N_A, E_A, V_A, W_A)$ and $G_{\text{SSEs-C}} = (N_C, E_C, V_C, W_C)$, respectively. Because both graphs are constructed in the same way, only the construction of $G_{\text{SSEs-A}}$ is described below.
Given $G_{\text{SSEs-A}} = (N_A, E_A, V_A, W_A)$, the first element, $N_A$, is a nonempty set of nodes that represent the vectors of SSEs-A in 3D space. $|N_A|$ denotes the number of nodes, which is equal to the number of vectors in $R^{3}_{\text{SSEs-A}}$. The second element, $E_A$, is the set of edges representing all possible interactions between nodes. The third element, $V_A$, is the set of node labels, which are defined by the spatial positions of the Cα atoms: the pair $(s_i, e_i) = (x_i^s, y_i^s, z_i^s, x_i^e, y_i^e, z_i^e)$, formed from the start and end points of the ith vector, is assigned to the ith SSEs-A node of the graph. Here $s_i$ and $e_i$ are the first and the last Cα coordinate voxels of the ith SSEs-A, which correspond to the start and end voxels of the ith SSEs-A vector ($HV_i$). The last element, $W_A$, assigns weights to the edges of the graph according to the mathematical-based features. The three graphs built from the three mathematical-based features are constructed as follows:
i. Angle-based fully connected graph ($G^{Angle}_{\text{SSEs-A}}$): This graph uses the angle between vectors to assign weights to the edges. $W^{Angle}_{\text{SSEs-A}}(e_i, e_j)$ gives the weights of the $G^{Angle}_{\text{SSEs-A}}$ graph based on the angle between every two vectors:
$$W^{Angle}_{\text{SSEs-A}}(e_i, e_j) = \frac{e_i \cdot e_j}{\lVert e_i \rVert \, \lVert e_j \rVert}, \quad e_i, e_j \in HN_i \tag{1}$$
ii. Euclidean distance-based fully connected graph ($G^{ED}_{\text{SSEs-A}}$): This graph uses the Euclidean distance (ED) metric to assign weights to the edges of the $G^{ED}_{\text{SSEs-A}}$ graph. The weight of an edge is the Euclidean distance between the midpoints of the two vectors:
$$m_i = \frac{s_i + e_i}{2}, \quad m_j = \frac{s_j + e_j}{2}, \quad W^{ED}_{\text{SSEs-A}}(m_i, m_j) = \lVert m_i - m_j \rVert \tag{2}$$
iii. Relative length-based fully connected graph ($G^{RL}_{\text{SSEs-A}}$): This graph determines the weight of an edge from the relative length (RL) of the two vectors, computed according to Equation (3):
$$L_i = \lvert s_i - e_i \rvert, \quad W^{RL}_{\text{SSEs-A}}(L_i, L_j) = \frac{\lvert L_i - L_j \rvert}{L_i + L_j} \tag{3}$$
From these three graphs, three weighted adjacency matrices for $G_{\text{SSEs-A}}$ have been constructed. Following the same principle, three graphs and three weighted adjacency matrices have been constructed for $G_{\text{SSEs-C}}$. The $G_{\text{SSEs-A}}$ and $G_{\text{SSEs-C}}$ matrices are $m \times m$ and $n \times n$, respectively. The characteristics of the matrices are:
  • All entries on the main diagonal are zero ($x_{ii} = 0$);
  • All off-diagonal entries are positive ($x_{ij} > 0$ if $i \neq j$);
  • The matrices are symmetric ($x_{ij} = x_{ji}$).
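To make the three edge-weight definitions concrete, the sketch below builds the angle-, ED-, and RL-based weight matrices for a small set of vectors. It is an illustration under stated assumptions rather than the authors' code: the function name `weight_matrices` and the sample coordinates are hypothetical, every vector is assumed to have non-zero length, and the diagonal is set to zero explicitly to reflect the absence of self-edges.

```python
import numpy as np

def weight_matrices(starts, ends):
    """Build angle-, ED-, and RL-based weight matrices for one set of SSE vectors."""
    s = np.asarray(starts, dtype=float)
    e = np.asarray(ends, dtype=float)
    d = e - s                                  # direction vector of each SSE
    lengths = np.linalg.norm(d, axis=1)        # L_i = |s_i - e_i|
    unit = d / lengths[:, None]
    angle = unit @ unit.T                      # normalized dot product of every two vectors
    mid = (s + e) / 2.0                        # midpoints m_i
    ed = np.linalg.norm(mid[:, None, :] - mid[None, :, :], axis=-1)
    rl = np.abs(lengths[:, None] - lengths[None, :]) / (lengths[:, None] + lengths[None, :])
    for w in (angle, ed, rl):
        np.fill_diagonal(w, 0.0)               # assumption: no self-edges
    return angle, ed, rl

# Two hypothetical SSE vectors with lengths 2 and 5.
starts = [(0, 0, 0), (0, 0, 0)]
ends = [(2, 0, 0), (3, 4, 0)]
angle, ed, rl = weight_matrices(starts, ends)
```

For these two vectors the angle weight is 6/(2·5) = 0.6, the midpoint distance is √4.25, and the relative length is |2 − 5|/(2 + 5) = 3/7.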
In the next phase, two robust statistical scoring functions, BD and MAC, are proposed to compute the similarity of the nodes between the $G_{\text{SSEs-A}}$ and $G_{\text{SSEs-C}}$ graphs. The Bhattacharyya distance (BD) computes the distance between two probability distributions or variables based on the statistical moments of the data [46]. This statistical indicator has been widely applied in signal processing [47], image processing [48], speaker recognition [49], and pattern recognition [50]. In this study, the metric is used to measure the geometrical similarity and to calculate the distance between all nodes of the $G_{\text{SSEs-A}}$ and $G_{\text{SSEs-C}}$ graphs. Suppose that $r_i^{\text{SSEs-A}}$ and $r_j^{\text{SSEs-C}}$ are two rows of two weighted adjacency matrices: $r_i^{\text{SSEs-A}}$ is the ith row of $Matrix_{\text{SSEs-A}}$ and $r_j^{\text{SSEs-C}}$ is the jth row of $Matrix_{\text{SSEs-C}}$. The row $r_i^{\text{SSEs-A}}$ holds the weights of all edges adjacent to the ith SSEs-A node. Similarly, $r_j^{\text{SSEs-C}}$ holds the weights of all edges adjacent to the jth SSEs-C node. To compute the similarity score between two nodes of $G_{\text{SSEs-A}}$ and $G_{\text{SSEs-C}}$, the following formula has been applied:
$$BD\left(r_i^{\text{SSEs-A}}, r_j^{\text{SSEs-C}}\right) = -\ln\left(\sum \sqrt{r_i^{\text{SSEs-A}} \cdot r_j^{\text{SSEs-C}}}\right), \quad 1 \le i \le m, \; 1 \le j \le n \tag{4}$$
The calculated distance score (BD) determines the relative closeness of two nodes in two peer graphs. The BD scoring function ranges from 0 to 1 ($0 \le BD \le 1$), where BD = 0 represents two nodes with high similarity and larger values represent less similar nodes. We applied the BD scoring function to all nodes of the three pairs of peer graphs (i.e., $\langle G^{Angle}_{\text{SSEs-A}}, G^{Angle}_{\text{SSEs-C}} \rangle$, $\langle G^{ED}_{\text{SSEs-A}}, G^{ED}_{\text{SSEs-C}} \rangle$, and $\langle G^{RL}_{\text{SSEs-A}}, G^{RL}_{\text{SSEs-C}} \rangle$) to obtain the initial correspondence set for each pair of graphs.
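As an illustration of the BD score, the sketch below uses the standard Bhattacharyya distance between two non-negative weight rows. Several choices here are assumptions and not taken from the text: the rows are normalized to sum to one, both rows must have the same length (how a row of an m × m matrix is aligned with a row of an n × n matrix is not specified above), and the function name and the `eps` guard are hypothetical.

```python
import numpy as np

def bhattacharyya_distance(p, q, eps=1e-12):
    """Standard Bhattacharyya distance between two non-negative rows of equal length."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape:
        raise ValueError("rows must have the same length in this sketch")
    if (p < 0).any() or (q < 0).any():
        raise ValueError("weights must be non-negative")
    p = p / p.sum()                      # assumption: treat each row as a distribution
    q = q / q.sum()
    bc = np.sum(np.sqrt(p * q))          # Bhattacharyya coefficient, in [0, 1]
    return float(-np.log(max(bc, eps)))  # eps avoids log(0) for non-overlapping rows

same = bhattacharyya_distance([1, 2, 3], [1, 2, 3])
close = bhattacharyya_distance([1, 1], [1, 3])
disjoint = bhattacharyya_distance([1, 0], [0, 1])
```

With this normalization the coefficient inside the logarithm lies between 0 and 1, so the distance is zero for identical rows and grows as the rows diverge.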
The second proposed scoring function, the modal assurance criterion (MAC), is a robust statistical metric that provides a measure of consistency between two linear arrays [51,52]. The metric originates in modal analysis, where it measures the consistency between experimental and analytical modal arrays. In this study, the MAC is used as a scoring function to calculate the similarity of nodes in each pair of peer graphs according to Equation (5). Like the BD scoring function, the MAC metric takes two rows (i.e., $r_i^{\text{SSEs-A}}$ and $r_j^{\text{SSEs-C}}$) of two peer matrices (e.g., $G^{Angle}_{\text{SSEs-A}}$, $G^{Angle}_{\text{SSEs-C}}$) as inputs and computes the similarity score. The generated similarity score is in the range of 0–1, where zero indicates no consistency between the two peer nodes of the graphs and one indicates complete consistency.
$$MAC_{\text{SSEs-A},\,\text{SSEs-C}} = \frac{\left( \left(r_i^{\text{SSEs-A}}\right)^{T} \cdot r_j^{\text{SSEs-C}} \right)^{2}}{\left( \left(r_i^{\text{SSEs-A}}\right)^{T} \cdot r_i^{\text{SSEs-A}} \right) \left( \left(r_j^{\text{SSEs-C}}\right)^{T} \cdot r_j^{\text{SSEs-C}} \right)} \tag{5}$$
After applying the two aforementioned distance/similarity scoring functions on the three peer graphs, three candidate SSEs correspondence sets were generated. To extract the final SSEs correspondence among the three obtained candidate SSEs correspondence sets, a similarity-based voting algorithm has been developed.
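The MAC score described above can be sketched in a few lines. This is a minimal sketch assuming two real-valued rows of equal length; the function name `mac` and the zero-denominator convention are illustrative choices, not part of the original method description.

```python
import numpy as np

def mac(u, v):
    """Modal assurance criterion between two real vectors of equal length."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denominator = np.dot(u, u) * np.dot(v, v)
    if denominator == 0.0:
        return 0.0                      # convention chosen here for all-zero rows
    return float(np.dot(u, v) ** 2 / denominator)

parallel = mac([1, 2, 3], [2, 4, 6])    # proportional rows
orthogonal = mac([1, 0], [0, 1])        # rows with no overlap
```

By the Cauchy–Schwarz inequality the score stays within 0–1, reaching one for proportional rows and zero for orthogonal rows.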

2.4. Similarity-Based Voting Algorithm (SimVA)

The similarity-based voting algorithm (SimVA) has been proposed as a decision-making strategy to extract the final SSEs correspondence among the three generated correspondence sets. The SimVA initially takes the three obtained correspondence sets as inputs and then generates the final SSEs correspondence as output. The final correspondences are extracted in three steps, including (i) unanimous voting, (ii) majority voting, and (iii) the principle of least conflict (PLC). These steps are presented in the following in detail.

2.4.1. Unanimous Voting

In this step, the SimVA algorithm accepts an assignment if it appears in all three candidate correspondence sets. In other words, if the ith SSEs-A matches the jth SSEs-C according to all three mathematical-based features (angle, Euclidean distance, and relative length), the assignment is reported as an acceptable assignment.

2.4.2. Majority Voting

This routine accepts an assignment when it appears in two of the three candidate correspondence sets. For example, if the ith SSEs-A matches the jth SSEs-C according to two of the three mathematical-based features, the assignment is considered acceptable and is inserted into the final correspondence set.

2.4.3. Principle of Least Conflict

The principle of least conflict (PLC) handles the assignments that remain after the two previous steps. In this step, the assignment with the minimum conflict is identified and selected as an acceptable assignment. The minimum-conflict assignment is the <SSEs-A, SSEs-C> pair that has the least conflict with the other pairs. As an example, if the 1st SSEs-A should match the 4th SSEs-C (i.e., the pair <1, 4> is a true assignment), all the other assignments for the 1st SSEs-A (e.g., <1, 2>, <1, 3>, … <1, n>) are considered conflict pairs. Likewise, for the 4th SSEs-C, all other assignments except <1, 4> are also in conflict (e.g., <2, 4>, <3, 4>, … <m, 4>). After the conflict pairs have been detected for all assignments, the number of conflict pairs is counted for each assignment, and the pair with the minimum number of conflicts is selected as an acceptable assignment. This concept allows the SimVA algorithm to continue in iterations where no assignment can be found by the two voting routines. At the end, all the acceptable assignments obtained from the SimVA algorithm form the final SSEs correspondence.
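The three SimVA steps can be sketched as below. This is one possible reading of the description, not the authors' implementation: each candidate correspondence set is assumed to be a dictionary mapping an SSEs-A index to an SSEs-C index, the final correspondence is kept one-to-one, the conflict count of a remaining pair is taken as the number of other remaining candidate pairs that share its SSEs-A or its SSEs-C, and ties are broken by index order. The function name `simva` and the example sets are hypothetical.

```python
from collections import Counter

def simva(candidate_sets):
    """Combine three candidate sets {sse_a_index: sse_c_index} into one mapping."""
    votes = Counter(pair for cs in candidate_sets for pair in cs.items())
    final, used_a, used_c = {}, set(), set()

    def accept(a, c):
        final[a] = c
        used_a.add(a)
        used_c.add(c)

    # Steps 1 and 2: unanimous pairs (3 votes) first, then majority pairs (2 votes).
    for needed in (3, 2):
        for (a, c), n in sorted(votes.items()):
            if n == needed and a not in used_a and c not in used_c:
                accept(a, c)

    # Step 3: principle of least conflict over the remaining candidate pairs.
    while True:
        remaining = sorted(p for p in votes
                           if p[0] not in used_a and p[1] not in used_c)
        if not remaining:
            break

        def conflicts(p):
            # Other remaining pairs that compete for the same SSE-A or SSE-C.
            return sum(1 for q in remaining
                       if q != p and (q[0] == p[0] or q[1] == p[1]))

        accept(*min(remaining, key=conflicts))
    return final

# Hypothetical candidate sets from the angle, ED, and RL features.
angle_set = {0: 0, 1: 1, 2: 3, 3: 2}
ed_set = {0: 0, 1: 1, 2: 2, 3: 3}
rl_set = {0: 0, 1: 2, 2: 1, 3: 3}
final = simva([angle_set, ed_set, rl_set])
```

In this example <0, 0> is accepted unanimously, <1, 1> and <3, 3> by majority, and <2, 2> is the only pair left for the least-conflict step.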

3. Results

This section presents the experiments designed to evaluate the robustness of the presented method. The effectiveness of the method was validated on 25 experimental and simulated cryo-EM maps in terms of precision, sensitivity, F-measure, and accuracy. The proposed approach was validated by comparing the SSEs correspondence computed by the method with the native correspondence (true SSEs correspondence). The native correspondence is obtained from the manual labeling of the SSEs in the density map based on the known atomic structure (for simulated data) or a structural homolog (for experimental data). The accuracy, precision, sensitivity, and F-measure are calculated with the following formulas:
$$Accuracy = \frac{TP + TN}{TP + FP + FN + TN} \times 100 \tag{6}$$
$$Precision = \frac{TP}{TP + FP} \times 100 \tag{7}$$
$$Sensitivity = \frac{TP}{TP + FN} \times 100 \tag{8}$$
$$F\text{-}measure = \frac{2 \times Precision \times Sensitivity}{Precision + Sensitivity} \tag{9}$$
In these equations, true positive (TP) is the number of SSEs reported as matched that are correctly matched, true negative (TN) is the number of SSEs reported as unmatched that are correctly left unmatched, false positive (FP) is the number of SSEs reported as matched whose match is incorrect, and false negative (FN) is the number of SSEs that were incorrectly rejected, i.e., reported as unmatched although a true match exists.
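The four measures can be computed from the confusion counts as in the following sketch. The function name and the example counts are hypothetical; precision and sensitivity are kept in percent, so their harmonic mean (the F-measure) is already a percentage in this sketch.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, sensitivity, and F-measure, all in percent."""
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    if precision + sensitivity:
        f_measure = 2.0 * precision * sensitivity / (precision + sensitivity)
    else:
        f_measure = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "sensitivity": sensitivity, "f_measure": f_measure}

# Hypothetical counts for one test case with 12 SSEs.
metrics = evaluation_metrics(tp=8, tn=1, fp=2, fn=1)
```

For these counts the accuracy is 75%, the precision is 80%, the sensitivity is about 88.9%, and the F-measure is about 84.2%.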

3.1. Experimental and Simulated Cryo-EM Density Maps

The efficiency and accuracy of the automatic method were tested using 25 α-β proteins. The data set of interest consists of 10 experimental and 15 simulated cryo-EM maps. The experimental cryo-EM maps, which are reported in Table 1, were obtained from the Electron Microscopy Data Bank (EMDB) [53], with resolutions ranging from 3.7 to 8.9 Å.
The simulated maps, which are listed in Table 2, were synthesized at 10 Å resolution using the Chimera package [29], and the structures of the proteins were downloaded from the Protein Data Bank (PDB) (https://www.rcsb.org/, accessed on 30 September 2021) [54].
In the dataset of interest, the lengths of the proteins range from 117 (PDB ID: 3FIN) to 1703 (PDB ID: 6UXW) amino acid residues. The largest test case (PDB ID: 5KBU) in this dataset includes 65 SSEs-A and 54 SSEs-C. Therefore, the selected data set is appropriate to evaluate the robustness and effectiveness of the method in handling large samples.

3.2. Performance Comparison of Two Scoring Functions

As described in the previous section, three pairs of peer graphs from SSEs-A and SSEs-C (i.e., $\langle G^{Angle}_{\text{SSEs-A}}, G^{Angle}_{\text{SSEs-C}} \rangle$, $\langle G^{ED}_{\text{SSEs-A}}, G^{ED}_{\text{SSEs-C}} \rangle$, and $\langle G^{RL}_{\text{SSEs-A}}, G^{RL}_{\text{SSEs-C}} \rangle$) have been constructed from the three mathematical-based features. To measure the similarity of the nodes in each pair of peer graphs, two statistical scoring functions, BD and MAC, have been used. To assess the quality of the algorithm, each of the three mathematical-based features was evaluated with both the BD and the MAC scoring functions. The accuracy of the resulting SSEs correspondence sets (angle-, ED-, and RL-based correspondence sets) is calculated with Equation (6) and reported in Table 3.
As can be seen in Table 3, the average accuracies of the angle-, ED-, and RL-based correspondence sets with the BD scoring function are 53.20%, 69.39%, and 50.63%, respectively. With the MAC scoring function, these values are 57.59%, 70.58%, and 53.76%, respectively. The results indicate that the MAC metric is more reliable than BD in finding the similarity of the nodes of the graphs.
To extract the final SSEs correspondence set from the three produced correspondence ones, the SimVA algorithm has been designed and implemented. In the following, the effectiveness of the developed algorithm is assessed on the experimental and simulated cryo-EM map.

3.3. Impact of the SimVA Algorithm on the SSEs Correspondence Result

To improve the efficiency of the matching process, the SimVA algorithm has been proposed. The SimVA algorithm has been developed to extract the final SSEs correspondence based on the feature integration strategy. Here, the accuracy of the SimVA algorithm using two scoring functions, BD and MAC, is analyzed. Table 4 compares the performance of the method before and after incorporating the SimVA algorithm.
A comparison of the results reported in Table 4 shows that the accuracy improved for 24 out of 25 test cases after incorporating the SimVA algorithm. The total average accuracy obtained from the three mathematical-based features using BD and MAC is 57.74% and 61.51%, respectively. After incorporating the SimVA algorithm in the final step, the total average accuracy using BD and MAC is 76.17% and 76.09%, respectively. Incorporating the SimVA algorithm therefore improved the accuracy of the method by 18.43 and 14.58 percentage points, respectively.

3.4. Assessment of the Method

To analyze the robustness of the method, four performance measurements (precision (P), sensitivity (S), F-measure (F), and accuracy (A)) were used. Figure 5 demonstrates the efficiency of the method using the measurements on the data set of interest.
As can be observed in Figure 5, with SimVA_MAC the accuracy is more than 70% for most of the proteins in the data set. The results show that the method is robust and works well even in the presence of errors and uncertainties in the SSEs extracted from the cryo-EM images. This is a valuable outcome of this study.

3.5. Comparison of Method with DP-TOSS

In this section, the accuracy of the SimVA algorithm using two scoring functions, BD and MAC, has been compared with DP-TOSS [20]. Many approaches have recently been developed to solve the SSEs mapping problem for medium-resolution cryo-EM maps, as discussed in the introduction. Here, the proposed method is compared with the latest version of DP-TOSS. As can be seen in Table 5, the average of accuracy on the data set of interest for DP-TOSS, SimVA_BD, and SimVA_MAC are equal to 61.35%, 76.17%, and 76.09%, respectively.
Based on these results, it can be concluded that SimVA is more accurate than DP-TOSS. More specifically, the accuracy of the proposed method exceeds that of DP-TOSS by 14.82 percentage points with BD and by 14.74 percentage points with MAC. Furthermore, SimVA is able to handle a large protein with a total of 65 SSEs (PDB ID: 5KBU). This ability to cope with large complex proteins that contain many secondary structure elements is one of the valuable achievements of this study. Working on large complex proteins has been a challenging issue in recent studies [18,19,20,54]. In the state-of-the-art studies, the largest protein in the dataset includes 33 SSEs-A and 20 SSEs-C. In the current study, we have been able to run the designed automatic method on two very large experimental cryo-EM maps, 6UXW (PDB ID) and 5KBU (PDB ID), which consist of 1034 and 1703 amino acids, respectively.

3.6. Runtime of the Method

The proposed automatic matching algorithm consists of four main steps. The first step extracts the SSEs from the two sources of information (i.e., the PDB model and the map), the second step constructs the 3D vectors from the extracted SSEs, the third step transforms the 3D vectors into 3D graphs, and the last step applies the similarity-based voting algorithm to obtain the final SSEs correspondence. Here, the runtime of the method has been computed for the last three steps. The runtime was measured on a MacBook Pro with a 2.2 GHz 6-core Intel Core i7 processor and 16 GB of memory. The running time of the method on the benchmark data set is illustrated in Figure 6.
As Figure 6 shows, the runtime of the algorithm increases with the number of SSEs-A. For example, the shortest running time (0.46 s) is for protein 1BZ4 (PDB ID) with 5 SSEs-A, and the longest (10.58 s) is for protein 5KBU (PDB ID) with 65 SSEs-A.
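One way to see why the runtime can grow with the number of SSEs is to count the objects involved, assuming the weighted fully connected graph representation used in this article. The sketch below only counts edges and candidate node pairs; the helper names are hypothetical, and whether these counts dominate the measured runtime is an assumption, since the article reports only the timings.

```python
def edge_count(n_nodes: int) -> int:
    """Number of edges in a fully connected undirected graph with n_nodes nodes."""
    return n_nodes * (n_nodes - 1) // 2


def candidate_pairs(n_sses_a: int, n_sses_c: int) -> int:
    """Number of (SSE-A, SSE-C) node pairs that could be compared."""
    return n_sses_a * n_sses_c


# (PDB ID, #SSEs-A, #SSEs-C) for the smallest and largest cases in Tables 1 and 2.
cases = [("1BZ4", 5, 5), ("5KBU", 65, 54)]

summary = {
    pdb_id: (edge_count(n_a), edge_count(n_c), candidate_pairs(n_a, n_c))
    for pdb_id, n_a, n_c in cases
}
print(summary)
```

For 1BZ4 this gives 10 edges per graph and 25 candidate pairs; for 5KBU it gives 2080 and 1431 edges and 3510 candidate pairs, so the amount of work grows much faster than the number of SSEs.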

4. Discussion and Conclusions

Cryo-EM has played an increasing role in the structure determination of molecular complexes in recent years. Despite many advances in cryo-EM technologies, in some cases the resolution of the generated maps still falls between 4 Å and 10 Å. A medium-resolution cryo-EM map of this kind may not be adequate to determine the atomic structure of the protein directly. At medium resolution, however, the secondary structure elements can be extracted and visualized by various methods. In this study, an automatic assignment method has been developed to find the mapping of the secondary structures of the modeled structure to the cryo-EM map. Knowing this assignment allows us to form an initial hypothesis on the structure of the protein backbone. The key idea of the 3D matching strategy is to represent the SSEs extracted from the density map and from the modeled structure in a common way, and then to build the correspondence between these two representations. The common representation is a 3D weighted fully connected graph, with nodes representing the SSEs and edges representing the connectivity between the SSEs. The key contributions of the geometrical matching method can be summarized as follows: (i) modeling the SSEs as geometrical vectors in 3D space, (ii) transforming the 3D vectors into 3D graphs based on the proposed mathematical-based features, (iii) introducing two robust statistical scoring functions, BD and MAC, to measure the similarity of the nodes of the graphs, and (iv) developing a similarity-based voting algorithm combined with the PLC concept to find the true correspondence. It is important to mention that the SSEs correspondence may not be a bijection: due to the noise and uncertainty in a typical map, the SSEs detection algorithms may fail to find the location of all the SSEs within the map and may also identify false SSEs.
We demonstrated the performance of the method on simulated as well as experimental data sets in the presence of errors. Comparative studies were also conducted to demonstrate the superiority of the 3D matching method over some of the existing state-of-the-art techniques. The results show that the automatic method reaches 76.09% overall accuracy with SimVA_MAC (76.17% with SimVA_BD) and works well for large cryo-EM maps. Moreover, a key strength of the matching method is that it requires neither prior segmentation of the density map nor skeleton data to obtain the SSEs correspondence. In addition, the automatic method is able to work on large cryo-EM data (PDB ID 5KBU) containing 65 SSEs-A and 54 SSEs-C, with 81.62% accuracy in less than 11 s.
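The graph representation and the two scoring functions can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each SSE is assumed to be given as a (start, end) pair of 3D points, the edge features (angle between direction vectors and distance between midpoints) are illustrative choices, and `mac` and `bhattacharyya_distance` use the general textbook definitions of the modal assurance criterion and the Bhattacharyya distance. How the article applies them to node features is not reproduced here.

```python
import math
from itertools import combinations


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def midpoint(start, end):
    return tuple((s + e) / 2.0 for s, e in zip(start, end))


def mac(a, b):
    """Modal assurance criterion: squared cosine of the angle between a and b."""
    return dot(a, b) ** 2 / (dot(a, a) * dot(b, b))


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete probability distributions.

    Assumes the distributions overlap; zero overlap would give log(0).
    """
    coefficient = sum(math.sqrt(x * y) for x, y in zip(p, q))
    return -math.log(coefficient)


def build_graph(segments):
    """Fully connected graph: one node per SSE, one edge per pair of SSEs.

    Returns a dict mapping node-index pairs (i, j) to illustrative edge features.
    """
    edges = {}
    for i, j in combinations(range(len(segments)), 2):
        (si, ei), (sj, ej) = segments[i], segments[j]
        vi, vj = sub(ei, si), sub(ej, sj)
        cos = dot(vi, vj) / math.sqrt(dot(vi, vi) * dot(vj, vj))
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cos))))
        distance = math.dist(midpoint(si, ei), midpoint(sj, ej))
        edges[(i, j)] = {"angle": angle, "distance": distance}
    return edges


# Three hypothetical helix axes given as (start, end) points in angstroms.
helices = [((0, 0, 0), (10, 0, 0)), ((0, 5, 0), (0, 15, 0)), ((0, 0, 8), (10, 0, 8))]
graph = build_graph(helices)
print(len(graph), graph[(0, 2)])
```

Building one such graph from the modeled structure and one from the density map gives two objects of the same kind, which is what makes a node-by-node similarity comparison possible even when the two graphs have different numbers of nodes.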

5. Code Availability

The source code and data of the method are publicly available at https://github.com/Bahareh-Behkamal/Match_SSEs_CryoEM, accessed on 20 November 2021. Instructions for using the method can be found in the shared readme file.

Author Contributions

Conceptualization, B.B., M.N., K.A.N. and M.R.S.; methodology, B.B., M.N., Z.A.T. and A.P.; validation, B.B., K.A.N. and A.P.; formal analysis, B.B., Z.A.T. and K.A.N.; investigation, B.B., M.N., A.P., K.A.N. and M.R.S.; writing—original draft preparation, B.B. and M.N.; writing—review and editing, B.B., M.N. and K.A.N.; supervision, M.N. and K.A.N. All authors have reviewed and approved the final version of this manuscript.

Funding

K.A.N.’s research was funded by the NIH Academic Research Enhancement Award (R15 AREA), grant number 1R15GM126509 01.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Xiang, Z.; Gong, W.; Li, Z.; Yang, X.; Wang, J.; Wang, H. Predicting Protein–Protein Interactions via Gated Graph Attention Signed Network. Biomolecules 2021, 11, 799.
  2. Bhattacharya, S.; Lin, X. Recent Advances in Computational Protocols Addressing Intrinsically Disordered Proteins. Biomolecules 2019, 9, 146.
  3. Doerr, A. Single-Particle Electron Cryomicroscopy. Nat. Methods 2014, 11, 30.
  4. Glaeser, R.M. How Good Can Cryo-EM Become. Nat. Methods 2016, 13, 28–32.
  5. Carrasco, M.; Toledo, P.; Tischler, N.D. Macromolecule Particle Picking and Segmentation of a KLH Database by Unsupervised Cryo-EM Image Processing. Biomolecules 2019, 9, 809.
  6. Baker, M.L.; Baker, M.R.; Hryc, C.F.; DiMaio, F. Analyses of Subnanometer Resolution Cryo-EM Density Maps, 1st ed.; Elsevier Inc.: Amsterdam, The Netherlands, 2010; Volume 483.
  7. Saha, M.; Morais, M.C. FOLD-EM: Automated Fold Recognition in Medium-and Low-Resolution (4–15 Å) Electron Density Maps. Bioinformatics 2012, 28, 3265–3273.
  8. Si, D.; He, J. Tracing Beta Strands Using StrandTwister from Cryo-EM Density Maps at Medium Resolutions. Structure 2014, 22, 1665–1676.
  9. Lindert, S.; Alexander, N.; Wötzel, N.; Karakaş, M.; Stewart, P.L.; Meiler, J. EM-Fold: De Novo Atomic-Detail Protein Structure Determination from Medium-Resolution Density Maps. Structure 2012, 20, 464–478.
  10. Casañal, A.; Shakeel, S.; Passmore, L.A. Interpretation of Medium Resolution CryoEM Maps of Multi-Protein Complexes. Curr. Opin. Struct. Biol. 2019, 58, 166–174.
  11. Ng, A.; Si, D. Genetic Algorithm Based Beta-Barrel Detection for Medium Resolution Cryo-EM Density Maps. In Proceedings of the 13th International Symposium on Bioinformatics Research and Applications, Honolulu, HI, USA, 29 May–2 June 2017; pp. 174–185.
  12. Zhang, B.; Zhang, X.; Pearce, R.; Shen, H.-B.; Zhang, Y. A New Protocol for Atomic-Level Protein Structure Modeling and Refinement Using Low-to-Medium Resolution Cryo-EM Density Maps. J. Mol. Biol. 2020, 432, 5365–5377.
  13. Behkamal, B.; Naghibzadeh, M.; Pagnani, A.; Saberi, M.R.; Al Nasr, K. Solving the α-Helix Correspondence Problem at Medium-Resolution Cryo-EM Maps through Modeling and 3D Matching. J. Mol. Graph. Model. 2020, 103, 107815.
  14. Leelananda, S.P.; Lindert, S. Iterative Molecular Dynamics-Rosetta Membrane Protein Structure Refinement Guided by Cryo-EM Densities. J. Chem. Theory Comput. 2017, 13, 5131–5145.
  15. Fàbrega-Ferrer, M.; Cuervo, A.; Fernández, F.J.; Machón, C.; Pérez-Luque, R.; Pous, J.; Vega, M.C.; Carrascosa, J.L.; Coll, M. Using a Partial Atomic Model from Medium-Resolution Cryo-EM to Solve a Large Crystal Structure. Acta Crystallogr. Sect. D Struct. Biol. 2021, 77, 11–18.
  16. Abeysinghe, S.; Ju, T.; Baker, M.L.; Chiu, W. Shape Modeling and Matching in Identifying 3D Protein Structures. CAD Comput. Aided Des. 2008, 40, 708–720.
  17. Lindert, S.; Staritzbichler, R.; Wötzel, N.; Karakaş, M.; Stewart, P.L.; Meiler, J. EM-Fold: De Novo Folding of α-Helical Proteins Guided by Intermediate-Resolution Electron Microscopy Density Maps. Structure 2009, 17, 990–1003.
  18. Al Nasr, K.; Ranjan, D.; Zubair, M.; He, J. Ranking Valid Topologies of the Secondary Structure Elements Using a Constraint Graph. J. Bioinform. Comput. Biol. 2011, 9, 415–430.
  19. Al Nasr, K.; Ranjan, D.; Zubair, M.; Chen, L.; He, J. Solving the Secondary Structure Matching Problem in Cryo-EM De Novo Modeling Using a Constrained K-Shortest Path Graph Algorithm. IEEE/ACM Trans. Comput. Biol. Bioinform. 2014, 11, 419–430.
  20. Al Nasr, K.; Yousef, F.; Jebril, R.; Jones, C. Analytical Approaches to Improve Accuracy in Solving the Protein Topology Problem. Molecules 2018, 23, 28.
  21. Baker, M.L.; Abeysinghe, S.S.; Schuh, S.; Coleman, R.A.; Abrams, A.; Marsh, M.P.; Hryc, C.F.; Ruths, T.; Chiu, W.; Ju, T. Modeling Protein Structure at near Atomic Resolutions with Gorgon. J. Struct. Biol. 2011, 174, 360–373.
  22. Biswas, A.; Ranjan, D.; Zubair, M.; Zeil, S.; Al Nasr, K.; He, J. An Effective Computational Method Incorporating Multiple Secondary Structure Predictions in Topology Determination for Cryo-EM Images. IEEE/ACM Trans. Comput. Biol. Bioinform. 2017, 14, 578–586.
  23. Fabiola, F.; Chapman, M.S. Fitting of High-Resolution Structures into Electron Microscopy Reconstruction Images. Structure 2005, 13, 389–400.
  24. Jiang, W.; Baker, M.L.; Ludtke, S.J.; Chiu, W. Bridging the Information Gap: Computational Tools for Intermediate Resolution Structure Interpretation. J. Mol. Biol. 2001, 308, 1033–1044.
  25. Rossmann, M.G. Fitting Atomic Models into Electron-Microscopy Maps. Acta Crystallogr. Sect. D Biol. Crystallogr. 2000, 56, 1341–1349.
  26. Wriggers, W.; Chacon, P. Modeling Tricks and Fitting Techniques for Multiresolution Structures. Structure 2001, 9, 779–788.
  27. Dou, H.; Burrows, D.W.; Baker, M.L.; Ju, T. Flexible Fitting of Atomic Models into Cryo-EM Density Maps Guided by Helix Correspondences. Biophys. J. 2017, 112, 2479–2493.
  28. Zeil, S.; Kovacs, J.; Wriggers, W.; He, J. Comparing an Atomic Model or Structure to a Corresponding Cryo-Electron Microscopy Image at the Central Axis of a Helix. J. Comput. Biol. 2017, 24, 52–67.
  29. Pettersen, E.F.; Goddard, T.D.; Huang, C.C.; Couch, G.S.; Greenblatt, D.M.; Meng, E.C.; Ferrin, T.E. UCSF Chimera—A Visualization System for Exploratory Research and Analysis. J. Comput. Chem. 2004, 25, 1605–1612.
  30. Zhang, Y. I-TASSER Server for Protein 3D Structure Prediction. BMC Bioinform. 2008, 9, 40.
  31. Roy, A.; Kucukural, A.; Zhang, Y. I-TASSER: A Unified Platform for Automated Protein Structure and Function Prediction. Nat. Protoc. 2010, 5, 725.
  32. Yang, J.; Yan, R.; Roy, A.; Xu, D.; Poisson, J.; Zhang, Y. The I-TASSER Suite: Protein Structure and Function Prediction. Nat. Methods 2015, 12, 7.
  33. Yang, J.; Zhang, Y. I-TASSER Server: New Development for Protein Structure and Function Predictions. Nucleic Acids Res. 2015, 43, W174–W181.
  34. Eswar, N.; Webb, B.; Marti-Renom, M.A.; Madhusudhan, M.S.; Eramian, D.; Shen, M.; Pieper, U.; Sali, A. Comparative Protein Structure Modeling Using MODELLER. Curr. Protoc. Bioinform. 2014, 47, 5–6.
  35. Senior, A.W.; Evans, R.; Jumper, J.; Kirkpatrick, J.; Sifre, L.; Green, T.; Qin, C.; Žídek, A.; Nelson, A.W.R.; Bridgland, A. Protein Structure Prediction Using Multiple Deep Neural Networks in the 13th Critical Assessment of Protein Structure Prediction (CASP13). Proteins Struct. Funct. Bioinform. 2019, 87, 1141–1148.
  36. Senior, A.W.; Evans, R.; Jumper, J.; Kirkpatrick, J.; Sifre, L.; Green, T.; Qin, C.; Žídek, A.; Nelson, A.W.R.; Bridgland, A. Improved Protein Structure Prediction Using Potentials from Deep Learning. Nature 2020, 577, 706–710.
  37. Wang, S.; Sun, S.; Li, Z.; Zhang, R.; Xu, J. Accurate de Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model. PLoS Comput. Biol. 2017, 13, e1005324.
  38. Wang, S.; Li, W.; Zhang, R.; Liu, S.; Xu, J. CoinFold: A Web Server for Protein Contact Prediction and Contact-Assisted Protein Folding. Nucleic Acids Res. 2016, 44, W361–W366.
  39. Ma, J.; Wang, S.; Wang, Z.; Xu, J. Protein Contact Prediction by Integrating Joint Evolutionary Coupling Analysis and Supervised Learning. Bioinformatics 2015, 31, 3506–3513.
  40. Wu, S.; Zhang, Y. LOMETS: A Local Meta-Threading-Server for Protein Structure Prediction. Nucleic Acids Res. 2007, 35, 3375–3382.
  41. Zheng, W.; Zhang, C.; Wuyun, Q.; Pearce, R.; Li, Y.; Zhang, Y. LOMETS2: Improved Meta-Threading Server for Fold-Recognition and Structure-Based Function Annotation for Distant-Homology Proteins. Nucleic Acids Res. 2019, 47, W429–W436.
  42. Baker, M.L.; Ju, T.; Chiu, W. Identification of Secondary Structure Elements in Intermediate-Resolution Density Maps. Structure 2007, 15, 7–19.
  43. Si, D.; Ji, S.; Al Nasr, K.; He, J. A Machine Learning Approach for the Identification of Protein Secondary Structure Elements from Electron Cryo-Microscopy Density Maps. Biopolymers 2012, 97, 698–708.
  44. Si, D.; He, J. Beta-Sheet Detection and Representation from Medium Resolution Cryo-EM Density Maps. In Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics, Washington, DC, USA, 22–25 September 2013; pp. 764–770.
  45. Subramaniya, S.; Terashi, G.; Kihara, D. Protein Secondary Structure Detection in Intermediate-Resolution Cryo-EM Maps Using Deep Learning. Nat. Methods 2019, 16, 911–917.
  46. Aherne, F.J.; Thacker, N.A.; Rockett, P.I. The Bhattacharyya Metric as an Absolute Similarity Measure for Frequency Coded Data. Kybernetika 1998, 34, 363–368.
  47. Kailath, T. The Divergence and Bhattacharyya Distance Measures in Signal Selection. IEEE Trans. Commun. Technol. 1967, 15, 52–60.
  48. Goudail, F.; Réfrégier, P.; Delyon, G. Bhattacharyya Distance as a Contrast Parameter for Statistical Processing of Noisy Optical Images. JOSA A 2004, 21, 1231–1240.
  49. You, C.H.; Lee, K.A.; Li, H. An SVM Kernel with GMM-Supervector Based on the Bhattacharyya Distance for Speaker Recognition. IEEE Signal. Process. Lett. 2008, 16, 49–52.
  50. Patra, B.K.; Launonen, R.; Ollikainen, V.; Nandi, S. A New Similarity Measure Using Bhattacharyya Coefficient for Collaborative Filtering in Sparse Data. Knowl.-Based Syst. 2015, 82, 163–177.
  51. Allemang, R.; Modal, D.B. A Correlation Coefficient for Modal Vector Analysis. In Proceedings of the 1st International Modal Analysis Conference, Orlando, FL, USA, 8–10 November 1982; Volume 1, pp. 110–116.
  52. Pastor, M.; Binda, M.; Harčarik, T. Modal Assurance Criterion. Procedia Eng. 2012, 48, 543–548.
  53. Lawson, C.; Patwardhan, A.; Pintilie, G.D.; Garcia, E.S.; Lagerstedt, I.; Baker, M.L.; Sala, R.; Ludtke, S.J.; Berman, H.M.; Kleywegt, G. Emdatabank: Unified Data Resource for 3DEM. Biophys. J. 2013, 104, 351.
  54. Berman, H.M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T.N.; Weissig, H.; Shindyalov, I.N.; Bourne, P.E. The Protein Data Bank. Nucleic Acids Res. 2000, 28, 235–242.
Figure 1. Different stages of the framework pipeline: (a) the inputs, including the modeled structure (PDB ID: 1BJ7, chain A) visualized by Chimera [29]; (b) the density map simulated at 10 Å resolution using protein structure 1BJ7 and Chimera package [29]; (c) the secondary structure elements extracted from the 3D modeled structure in the preprocessing step (SSEs-A); (d) the secondary structure elements extracted from the cryo-EM density map (SSEs-C); (e) the 3D vectors constructed based on the extracted SSEs-A; (f) the 3D vectors constructed based on the extracted SSEs-C; (g,h) the 3D graphs are constructed; (i) the similarity-based voting algorithm is proposed as a decision making strategy for finding the SSEs correspondence; (j) the secondary structure elements correspondence.
Figure 2. Secondary structure elements (α-helices and β-strands) in the fitted atomic structure with cryo-EM map visualized by Chimera [29].
Figure 3. Construction of 3D vectors from extracted SSEs: (a) 3D structure of protein 1FLP (PDB ID) shown with Chimera [29]; (b) each α-helix in the atomic model is considered as a helix vector (HV) in the Cartesian coordinate system ($\mathbb{R}^{3}_{\text{SSEs-A}}$); (c) the cryo-EM density map and the SSEs-C detected on it. The map is simulated at 10 Å resolution using protein structure 1FLP (PDB ID). The location of SSEs-C is illustrated as purple cylinders with Gorgon [21]; (d) extracted SSEs-C on the map considered as stick vectors (SV) in three-dimensional Cartesian space $\mathbb{R}^{3}_{\text{SSEs-C}}$.
Figure 4. Transformation of 3D vectors into the weighted fully connected graph: (a) α-helix vectors in $\mathbb{R}^{3}_{\text{SSEs-A}}$; (b) construction of the weighted fully connected graph of α-helices ($G_{\text{SSEs-A}}$). The $i$th helix vector ($HV_i$) is transformed into the $i$th helix node ($HN_i$); (c) stick vectors in $\mathbb{R}^{3}_{\text{SSEs-C}}$; (d) construction of the weighted fully connected graph of sticks ($G_{\text{SSEs-C}}$). The $i$th stick vector ($SV_i$) is transformed into the $i$th stick node ($SN_i$).
Figure 5. Assessment of the method concerning the performance measurements: (a) precision, (b) sensitivity, (c) F-measure, (d) accuracy.
Figure 6. The runtime of the method with respect to the number of SSEs-A in proteins. (PDB ID (#SSEs-A)).
Table 1. The information of the experimental cryo-EM maps.

| No | EMDB ID a | PDB ID b | Chain c | Length d | # SSEs-A e | # SSEs-C f | Resolution (Å) g |
|---|---|---|---|---|---|---|---|
| 1 | 5030 | 3FIN * | R | 117 | 7 | 7 | 6.4 |
| 2 | 3888 | 6EM3 * | A | 291 | 11 | 9 | 4.2 |
| 3 | 8625 | 5UZB * | A | 177 | 13 | 9 | 7 |
| 4 | 4176 | 6F36 * | M | 327 | 13 | 11 | 3.7 |
| 5 | 1733 | 3C91 * | A | 233 | 18 | 15 | 6.8 |
| 6 | 8070 | 5I1M * | V | 458 | 19 | 17 | 7 |
| 7 | 2526 | 4CHV * | A | 361 | 23 | 22 | 7 |
| 8 | 3761 | 5O8O * | A | 349 | 24 | 22 | 6.8 |
| 9 | 20934 | 6UXW * | A | 1703 | 43 | 35 | 8.9 |
| 10 | 8231 | 5KBU * | A | 1034 | 65 | 54 | 7.8 |

a The EMDB ID of the protein used in the test; b the PDB ID of the protein used in the test. β-containing proteins are marked with *; c the protein chain; d the number of amino acid residues in the sequence; e the total number of secondary structure elements (α-helices and β-strands) in the atomic structure; f the total number of secondary structure elements (α-helices and β-strands) extracted from the cryo-EM map; g the resolution of the experimental map in angstrom (Å).
Table 2. The information of the simulated cryo-EM maps.

| No | Name a | PDB ID b | UniProt ID c | Chain d | Length e | # SSEs-A f | # SSEs-C g |
|---|---|---|---|---|---|---|---|
| 1 | Apolipoprotein E | 1BZ4 | P02649 | A | 144 | 5 | 5 |
| 2 | Hemoglobin-1 | 1FLP | P41260 | A | 142 | 7 | 7 |
| 3 | Gag polyprotein | 2Y4Z * | P03336 | A | 140 | 8 | 8 |
| 4 | Uncharacterized protein YqeY | 1NG6 | P54464 | A | 148 | 9 | 7 |
| 5 | Phosphatidylinositol | 1HG5 | O55012 | A | 289 | 11 | 9 |
| 6 | Class IV chitinase Chia4-Pa2 | 3HBE | Q6WSR8 | X | 204 | 11 | 7 |
| 7 | Phospholipase C | 1P5X | P09598 | A | 245 | 13 | 9 |
| 8 | Tetracycline repressor protein class D | 2XB5 | P0ACT4 | A | 207 | 13 | 9 |
| 9 | Protein LlR18A | 1ICX * | P52778 | A | 155 | 13 | 11 |
| 10 | N-glycosylase/DNA lyase | 1XQO | Q8ZVK6 | A | 256 | 14 | 14 |
| 11 | AlphaRep-4 | 3LTJ | __ | A | 201 | 16 | 12 |
| 12 | 4,4’-diapophytoene synthases | 3ACW | A9JQL9 | A | 293 | 17 | 14 |
| 13 | Flagellar motor switch protein FliG | 3HJL | O66891 | A | 329 | 20 | 20 |
| 14 | Symplekin | 3ODS | Q92797 | A | 415 | 21 | 16 |
| 15 | Albumin | 2XVV | P02768 | A | 585 | 33 | 19 |

a The name of the protein; b the PDB ID of the protein used in the test. β-containing proteins are marked with *; c the UniProt ID of the protein; d the protein chain; e the number of amino acid residues in the sequence; f the total number of secondary structure elements (α-helices and β-strands) in the atomic structure; g the total number of secondary structure elements extracted from the cryo-EM map.
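Table 3. The accuracy of the three SSEs correspondence sets using two scoring functions.

| No | PDB ID | BD: Angle | BD: ED | BD: RL | MAC: Angle | MAC: ED | MAC: RL |
|---|---|---|---|---|---|---|---|
| 1 | 1BZ4 | 80 | 80 | 80 | 80 | 60 | 80 |
| 2 | 1FLP | 42.85 | 57.14 | 28.57 | 57.14 | 71.42 | 57.14 |
| 3 | 2Y4Z | 50 | 58.33 | 58.33 | 58.33 | 50 | 50 |
| 4 | 1NG6 | 44.44 | 88.88 | 66.66 | 44.44 | 88.88 | 77.77 |
| 5 | 1HG5 | 72.72 | 36.36 | 36.36 | 54.54 | 45.45 | 54.54 |
| 6 | 3HBE | 81.81 | 90.9 | 81.81 | 81.81 | 90.9 | 72.72 |
| 7 | 1P5X | 69.23 | 84.16 | 61.53 | 76.92 | 100 | 69.23 |
| 8 | 2XB5 | 38.46 | 76.92 | 69.23 | 46.15 | 53.84 | 69.23 |
| 9 | 1ICX | 76.19 | 77.38 | 53.57 | 84.52 | 70.23 | 63.09 |
| 10 | 1XQO | 64.28 | 57.14 | 50 | 71.42 | 78.57 | 28.57 |
| 11 | 3LTJ | 43.75 | 93.75 | 37.5 | 100 | 43.75 | 62.5 |
| 12 | 3ACW | 35.29 | 64.7 | 47.05 | 35.29 | 52.94 | 35.29 |
| 13 | 3HJL | 20 | 90 | 30 | 40 | 95 | 30 |
| 14 | 3ODS | 33.33 | 52.38 | 33.33 | 23.8 | 57.14 | 42.58 |
| 15 | 2XVV | 60.6 | 78.78 | 45.45 | 63.63 | 78.78 | 54.54 |
| 16 | 3FIN | 58.33 | 58.33 | 29.16 | 45.83 | 87.5 | 58.33 |
| 17 | 6EM3 | 70.83 | 47.91 | 58.33 | 81.25 | 54.16 | 52.08 |
| 18 | 5UZB | 55.55 | 66.66 | 44.44 | 55.55 | 66.66 | 55.55 |
| 19 | 6F36 | 38.46 | 92.3 | 53.84 | 38.46 | 100 | 53.84 |
| 20 | 3C91 | 62.5 | 63.75 | 60 | 62.5 | 68.75 | 45 |
| 21 | 5I1M | 36.84 | 52.63 | 57.89 | 31.57 | 47.36 | 36.84 |
| 22 | 4CHV | 53.33 | 73.33 | 46.66 | 53.33 | 93.33 | 66.66 |
| 23 | 5O8O | 52.38 | 66.66 | 52.38 | 50 | 92.85 | 50 |
| 24 | 6UXW | 41.21 | 79.84 | 48.18 | 49.69 | 67.27 | 41.66 |
| 25 | 5KBU | 47.63 | 46.59 | 35.51 | 53.78 | 49.76 | 36.97 |
| Average | | 53.20 | 69.39 | 50.63 | 57.59 | 70.58 | 53.76 |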
Table 4. The accuracy of the method incorporating the SimVA algorithm.

| No | PDB ID a | BD b | SimVA_BD c | MAC d | SimVA_MAC e |
|---|---|---|---|---|---|
| 1 | 1BZ4 | 80 | 80 | 73.33 | 80 |
| 2 | 1FLP | 42.85 | 57.14 | 61.9 | 85.71 |
| 3 | 2Y4Z | 55.55 | 66.66 | 55.55 | 66.66 |
| 4 | 1NG6 | 66.66 | 100 | 70.37 | 77.77 |
| 5 | 1HG5 | 48.48 | 54.54 | 51.51 | 72.72 |
| 6 | 3HBE | 84.84 | 90.9 | 81.81 | 90.9 |
| 7 | 1P5X | 71.79 | 92.3 | 82.05 | 84.61 |
| 8 | 2XB5 | 61.53 | 76.92 | 56.4 | 69.23 |
| 9 | 1ICX | 69.04 | 84.52 | 72.61 | 91.66 |
| 10 | 1XQO | 57.14 | 78.57 | 59.52 | 64.28 |
| 11 | 3LTJ | 58.33 | 100 | 62.5 | 56.25 |
| 12 | 3ACW | 49.01 | 70.58 | 41.17 | 70.58 |
| 13 | 3HJL | 46.66 | 85 | 55 | 75 |
| 14 | 3ODS | 39.68 | 61.9 | 41.26 | 66.66 |
| 15 | 2XVV | 61.61 | 66.66 | 65.65 | 63.63 |
| 16 | 3FIN | 48.61 | 70.83 | 63.88 | 70.83 |
| 17 | 6EM3 | 59.02 | 64.58 | 62.5 | 87.5 |
| 18 | 5UZB | 55.55 | 77.77 | 59.25 | 66.66 |
| 19 | 6F36 | 61.53 | 69.23 | 64.1 | 76.92 |
| 20 | 3C91 | 62.08 | 87.5 | 58.75 | 78.75 |
| 21 | 5I1M | 49.12 | 78.94 | 38.59 | 84.21 |
| 22 | 4CHV | 57.77 | 86.66 | 71.11 | 86.66 |
| 23 | 5O8O | 57.14 | 47.61 | 64.28 | 85.71 |
| 24 | 6UXW | 56.41 | 84.84 | 52.87 | 67.87 |
| 25 | 5KBU | 43.24 | 70.73 | 46.84 | 81.62 |
| Average | | 57.74 | 76.17 | 61.51 | 76.09 |

a The PDB ID of the protein; b the total accuracy obtained from three mathematical-based features using the BD scoring function; c the accuracy of the SimVA algorithm using the BD scoring function; d the total accuracy obtained from three mathematical-based features using the MAC scoring function; e the accuracy of the SimVA algorithm using the MAC scoring function.
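Table 5. Comparison between DP-TOSS and SimVA.

| No | PDB ID a | DP-TOSS b | SimVA_BD c | SimVA_MAC d |
|---|---|---|---|---|
| 1 | 1BZ4 | 100 | 80 | 80 |
| 2 | 1FLP | 100 | 57.14 | 85.71 |
| 3 | 2Y4Z | 50 | 66.66 | 66.66 |
| 4 | 1NG6 | 71.40 | 100 | 77.77 |
| 5 | 1HG5 | 55.60 | 54.54 | 72.72 |
| 6 | 3HBE | 57.10 | 90.9 | 90.9 |
| 7 | 1P5X | 55.60 | 92.3 | 84.61 |
| 8 | 2XB5 | 66.70 | 76.92 | 69.23 |
| 9 | 1ICX | 45.50 | 84.52 | 91.66 |
| 10 | 1XQO | 71.4 | 78.57 | 64.28 |
| 11 | 3LTJ | 83.30 | 100 | 56.25 |
| 12 | 3ACW | 100 | 70.58 | 70.58 |
| 13 | 3HJL | 100 | 85 | 75 |
| 14 | 3ODS | 100 | 61.9 | 66.66 |
| 15 | 2XVV | 89.40 | 66.66 | 63.63 |
| 16 | 3FIN | 100 | 70.83 | 70.83 |
| 17 | 6EM3 | 44.40 | 64.58 | 87.5 |
| 18 | 5UZB | 55.50 | 77.77 | 66.66 |
| 19 | 6F36 | 100 | 69.23 | 76.92 |
| 20 | 3C91 | 46.70 | 87.5 | 78.75 |
| 21 | 5I1M | 41.20 | 78.94 | 84.21 |
| 22 | 4CHV | 0 | 86.66 | 86.66 |
| 23 | 5O8O | 0 | 47.61 | 85.71 |
| 24 | 6UXW | 0 | 84.84 | 67.87 |
| 25 | 5KBU | 0 | 70.73 | 81.62 |
| Average | | 61.35 | 76.17 | 76.09 |

a The PDB ID of the protein; b the accuracy of the DP-TOSS method; c the accuracy of the SimVA algorithm using the BD scoring function; d the accuracy of the SimVA algorithm using the MAC scoring function.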
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Behkamal, B.; Naghibzadeh, M.; Saberi, M.R.; Tehranizadeh, Z.A.; Pagnani, A.; Al Nasr, K. Three-Dimensional Graph Matching to Identify Secondary Structure Correspondence of Medium-Resolution Cryo-EM Density Maps. Biomolecules 2021, 11, 1773. https://doi.org/10.3390/biom11121773
