1. Introduction
Given the rapid pace with which wet and dry laboratories are generating molecular structure data, there is now a growing demand for machine learning (ML) methods to handle, summarize, and make observations from such data [
1,
2,
3,
4,
5,
6]. This is particularly true for proteins, where the strong relationship between three-dimensional (tertiary) structure and biological function [
7] is spurring renewed interest in featurizing structure [
8].
Research on featurizations of protein structure that retain information on biological activity is active. Such research falls primarily into two categories: one where researchers hand-engineer features and evaluate them in some function prediction task, and another where statistical and ML methods are employed instead to discover such features. To the best of our knowledge, no single published review covers all such methods, but we point readers to the work in [
9] on protein function prediction via feature engineering. Our focus in this paper is on the second category, as it removes the demand on researchers to acquire domain-specific insight. Instead, statistical or ML methods promise to discover in a data-driven manner the pertinent features that summarize structure while retaining the functional information encoded in structure. Research on such methods shows varying performance. Among statistical methods, linear, variance-maximizing methods, such as Principal Component Analysis (PCA) [
10] have been favored due to their ease of implementation and evaluation [
11,
12,
13]. Some work has also considered nonlinear methods [
14], such as Isomap [
15], Locally Linear Embedding [
16], Diffusion Maps [
17], and others [
18]. Other work has considered topic models and has drawn observations regarding structure and function relationships in the universe of known protein structures [
1]. The discovered features have been leveraged in important recognition tasks, such as predicting protein folds, function, and other properties [
19], as well as in expediting the search for more structures and structural transitions of target proteins [
20,
21,
22,
23,
24,
25,
26,
27].
In the past decade, autoencoders (AEs) have gained popularity in the ML community for unsupervised feature learning [
28,
29]. AEs present highly versatile architectures that can be tuned to yield linear or nonlinear featurizations of data. Open-source neural network libraries, such as Keras, which we employ here, greatly facilitate constructing, training, and evaluating AE architectures and conducting model search.
AEs have yet to gain popularity in the structural biology community, but some attempts have been made. The first occurrence can be found in [
30], where an AE is applied to tertiary structures of a small molecule of 24 atoms. The presented AE is a deep one (as related in
Section 2), but the risk of overfitting its numerous parameters in the presence of little data blunts the impact of this early work. In more recent work [
31], a deep AE is applied to tertiary structures of two small molecules (one of 12 backbone dihedral angles, and another of 20 amino acids); the structures are collected from molecular dynamics simulations, and the goal is to reveal collective variables with which to expedite the sampling of more equilibrium structures. Work in [
32] investigates a similar AE to summarize the folding landscape of Trp-Cage, a small polypeptide of 20 amino acids. Despite the focus being on small systems and on elucidating very specific properties of these systems, these applications of AEs motivate us to further consider and evaluate AEs for featurizations of tertiary structure data at scale.
Specifically, motivated by rapid progress in neural network research and some early adoption of AEs for analysis of molecular structures, we investigate and evaluate AEs yielding linear and nonlinear featurizations of protein tertiary structures. We build over preliminary published work [
33], where we compare linear and nonlinear architectures. In this paper, we expand this analysis to more architectures that additionally allow incorporating external constraints on the sought features. We point to a best-performing architecture for featurizing tertiary protein structures generated by template-free protein structure prediction methods. In addition, we demonstrate the utility of AEs in a practical context. Employing AE-based featurizations, we address the classic problem of decoy selection in protein structure prediction. Utilizing off-the-shelf supervised learning methods, we demonstrate that the featurizations are meaningful and allow detecting biologically active tertiary structures, thus opening the way for further research on AEs and their utilization for structure–function studies of proteins and other molecular systems.
The rest of this paper proceeds as follows.
Section 2 briefly relates some preliminaries and summarizes AEs.
Section 3 describes the AE architectures we investigate, relates various details regarding training and evaluation, and describes in greater detail the utilization of AEs for the problem of decoy selection.
Section 4 presents a detailed comparative evaluation of various AEs against a baseline linear model (PCA) on decoy data over a benchmark set of protein targets often used by decoy generation algorithms, as well as relates results on the decoy selection task.
Section 5 concludes the paper with a discussion of future work.
2. Preliminaries
We do not aim to provide a detailed overview of AEs and their history in this paper. The interested reader is pointed to Refs. [
29,
34]. Here, we summarize preliminaries most pertinent to our study.
We begin by pointing out that all AEs contain an
encoder and a
decoder. Each contains one or more layers of neurons/units. The neurons in the first layer of the encoder are fed the elements of the input. The encoder
maps the input layer
x to its output layer
y. The decoder mirrors the encoder and maps the same layer
y to its output layer
z. The layer
y contains the learned code or reduced representation learned for the input
x. The top panel in
Figure 1 shows a vanilla AE, where a 4-dimensional input
x is mapped to a 2-dimensional code
y. The bottom panel shows a deep AE, where the encoder and decoder contain several hidden layers. An alternative architecture, which stacks vanilla AEs, has been investigated in [
33]. We do not dwell on it here, as recent work in [
33] shows that, while stacked AEs converge faster than deep AEs, the quality of their reconstruction (described below) of tertiary structures is comparatively poor.
The encoder is a deterministic mapping $f_{\theta}$, parameterized by a vector of parameters $\theta = \{W, b\}$, that transforms x into y. Typically, one seeks a reduced representation ($d_y < d_x$), and $f_{\theta}$ is an affine mapping that can be followed by a nonlinearity: $y = f_{\theta}(x) = s(Wx + b)$. Here, $s$ is the sigmoid function $s(t) = 1/(1 + e^{-t})$, and W and b are the weights and biases that connect neurons of one layer to those of another. The sigmoid is a specific activation function; there are many others. The decoder performs $z = g_{\theta'}(y) = s(W'y + b')$, where $\theta' = \{W', b'\}$; $W'$ and $b'$ are the weights and biases, and $g_{\theta'}$ is an affine mapping followed (or not) by a nonlinearity. The decoder seeks to reconstruct x via z.
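For concreteness, the following minimal NumPy sketch shows how the encoder and decoder mappings defined above act on a single input vector. The dimensionalities (mirroring the 4-to-2 vanilla AE in Figure 1) and the random weights are placeholders, not trained parameters.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
d_x, d_y = 4, 2                                       # hypothetical input and code dimensionalities

W, b = rng.normal(size=(d_y, d_x)), np.zeros(d_y)     # encoder parameters (theta)
W_p, b_p = rng.normal(size=(d_x, d_y)), np.zeros(d_x) # decoder parameters (theta')

x = rng.normal(size=d_x)                              # an input instance
y = sigmoid(W @ x + b)                                # encoder: the code / reduced representation
z = sigmoid(W_p @ y + b_p)                            # decoder: reconstruction of x
print(np.sum((x - z) ** 2))                           # squared reconstruction error
```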
AE Training: An AE learns
y in a data-driven manner. The
training of an AE is guided by a loss function that, for real-valued data, measures the reconstruction error $L(x, z) = \|x - z\|^2$. Parameters $\theta$ and $\theta'$ are learned via gradient-based minimization of this error. The Adam optimizer [
35] has been shown superior in many applications, including in our recent work [
33]. Since the parameter space over which the loss function is optimized may be high-dimensional, the optimization proceeds in epochs. In each epoch, the training data is divided into batches; parameters are updated after each batch is passed forward, with the negative gradient of the loss function evaluated and passed backwards to update the weights and biases.
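As an illustration of this training loop, the Keras sketch below wires a vanilla AE and trains it with the Adam optimizer against a mean-squared-error reconstruction loss; the layer sizes, number of epochs, batch size, and placeholder data are hypothetical, and the dimensionalities actually used in this study are given in Section 3.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

d_x, d_y = 150, 10                                                    # hypothetical dimensionalities

inputs = keras.Input(shape=(d_x,))
code = layers.Dense(d_y, activation="sigmoid", name="code")(inputs)  # encoder
outputs = layers.Dense(d_x, activation="linear")(code)               # decoder
vanilla_ae = keras.Model(inputs, outputs)

vanilla_ae.compile(optimizer=keras.optimizers.Adam(), loss="mse")    # reconstruction loss

X_train = np.random.normal(size=(1000, d_x)).astype("float32")       # placeholder training data
vanilla_ae.fit(X_train, X_train, epochs=10, batch_size=64, verbose=0)  # target equals input
```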
Vanilla AEs: If there are no other layers between the input layer
x and the code layer
y (and, mirrorwise, the code layer
y and the output layer
z), one obtains a shallow/vanilla AE (shown in the top panel in
Figure 1). We will refer to it as vAE. The number of weights (biases not included) in this shallow architecture is $d_x \cdot d_y + d_y \cdot d_z$. Since $d_z = d_x$, this number is $2 \cdot d_x \cdot d_y$. Typically, the desired $d_y$ is low (not a function of $d_x$), so the number of weights is $O(d_x)$. In a back-of-the-envelope calculation, when considering a molecule with 50 atoms, if x consists of the Cartesian coordinates of the CA atoms (the main carbon atom of each amino acid), so that $d_x = 150$, then the number of weights in a vAE is 22,500.
Deep AEs: Deep AEs, to which we refer as dAEs from now on, contain possibly many intermediate layers of neurons (of typically decreasing number from the input layer to the code layer), as shown in the bottom panel in Figure 1. There are no prescriptions on the number of layers and the number of neurons per layer. Using the schematic in the bottom panel in Figure 1 as a reference, the number of weights that need to be learned is $2 \sum_{l=1}^{L} d_{l-1} \cdot d_l$, with $L$ denoting the number of encoder layers ($0$ indexing the input layer), $d_l$ the number of neurons in layer $l$, and the decoder mirroring the encoder in its architecture; this number increases even further if one additionally considers the biases.
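As a quick sanity check of this count, the short Python snippet below tallies the weights of a hypothetical encoder architecture 150–250–50–10 (mirrored in the decoder); the layer sizes are illustrative only.

```python
# Hypothetical encoder layer sizes: input, two hidden layers, code.
dims = [150, 250, 50, 10]

# Weights in the encoder: sum over consecutive layer pairs; the decoder mirrors it.
encoder_weights = sum(d_prev * d_next for d_prev, d_next in zip(dims, dims[1:]))
total_weights = 2 * encoder_weights   # encoder plus mirrored decoder, biases excluded

print(total_weights)  # 2 * (150*250 + 250*50 + 50*10) = 101000
```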
3. Methods
3.1. Number of Neurons per Layer
The number of neurons in the code layer y shared between the encoder and decoder determines the desired dimensionality of the feature space. In this paper, expanding upon preliminary work in [
33], we consider architectures where $d_y \ll d_x$. Considering that tertiary structures occupy a Cartesian space of thousands or more dimensions, this is a drastic reduction in dimensionality.
3.2. Restricting Number of Layers
Depth comes at a cost, as it directly impacts the number of parameters (weights and biases) that have to be learned; equivalently, the dimensionality of the loss function/surface grows rapidly with increasing depth, which makes it particularly challenging for optimization algorithms to converge to a global minimum of the loss function. While data size in the structural biology community has steadily increased, it does not approach the millions of instances available for image data. Significant computational resources are needed, for instance, to generate around 50–60K structures of a given protein sequence, as we do in this paper. Therefore, we consider AEs of limited depth.
Based on our preliminary evaluation in [
33], which relates challenges with training very deep architectures, we restrict the dAEs investigated here to only two intermediate hidden layers in the encoder and decoder. Specifically, we investigate the architecture $d_x$–$d_1$–$d_2$–$d_y$ (mirrored in the decoder). While $d_2$ is much smaller than $d_x$ but much larger than $d_y$, $d_1$ is chosen to be bigger than $d_x$, as this is shown to prevent overfitting and improve generalization [
29]. Our preliminary work in [
33] shows no overfitting for both vAE and dAE models.
3.2.1. Regularization via Weight Tying
To further reduce the number of weights that have to be learned during training, we employ the so-called “weight-tying” trick. The weights of the decoder are not free parameters. Instead, $W' = W^{\top}$. This trick is a form of regularization, as it adds a constraint, thus reducing the dimensionality of the loss function surface.
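A minimal sketch of how such tying could be realized in Keras follows; the layer name DenseTied, the helper encoder_dense, and the dimensionalities are illustrative choices rather than our released code. The decoder layer reuses the transpose of the encoder's kernel and only learns its own bias.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class DenseTied(layers.Layer):
    """Decoder layer whose kernel is the transpose of a given encoder Dense layer."""
    def __init__(self, tied_to, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.tied_to = tied_to
        self.activation = keras.activations.get(activation)

    def build(self, input_shape):
        # Only the bias is a free parameter; the kernel is W^T of the tied encoder layer.
        out_dim = self.tied_to.kernel.shape[0]
        self.bias = self.add_weight(name="bias", shape=(out_dim,), initializer="zeros")

    def call(self, inputs):
        return self.activation(
            tf.matmul(inputs, self.tied_to.kernel, transpose_b=True) + self.bias)

d_x, d_y = 150, 10                                   # hypothetical dimensionalities
encoder_dense = layers.Dense(d_y, activation="sigmoid")
inputs = keras.Input(shape=(d_x,))
code = encoder_dense(inputs)
outputs = DenseTied(encoder_dense, activation="linear")(code)
tied_ae = keras.Model(inputs, outputs)
```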
3.2.2. Regularization via Orthogonality
Alternatively, one can add an orthogonality constraint, where the weight matrices of the encoding and decoding layers are orthogonal to each other. This also means that we do not need to train the encoder and decoder separately, with the benefit of reducing the dimensionality of the loss function. To summarize, we have $W \cdot W' = I$, where $I$ is the identity matrix. The same orthogonality constraint is enforced on all pairs of corresponding intermediate encoding and decoding layers.
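One way to realize such a constraint in practice, sketched below under the assumption that a soft penalty (rather than a hard constraint) is acceptable, is to penalize the deviation of $W \cdot W'$ from the identity and add that term to the reconstruction loss; the weighting factor lambda_ortho is purely illustrative. During training, Keras adds this penalty to the compiled MSE loss.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

d_x, d_y = 150, 10                      # hypothetical dimensionalities
lambda_ortho = 1e-2                     # illustrative penalty weight

inputs = keras.Input(shape=(d_x,))
enc = layers.Dense(d_y, activation="sigmoid", name="enc")
dec = layers.Dense(d_x, activation="linear", name="dec")
outputs = dec(enc(inputs))
model = keras.Model(inputs, outputs)

# Soft orthogonality penalty: drive the product of encoder and decoder weights toward I.
def ortho_penalty():
    prod = tf.matmul(dec.kernel, enc.kernel)        # (d_y, d_y); equals (W . W')^T above
    return lambda_ortho * tf.reduce_sum(tf.square(prod - tf.eye(d_y)))

model.add_loss(ortho_penalty)                       # penalty is summed with the MSE loss
model.compile(optimizer="adam", loss="mse")
```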
3.3. Activation Functions
One of the primary motivations for investigating AEs for featurizing tertiary protein structure data is their versatility in yielding linear versus nonlinear feature spaces via the choice of the activation function. We consider the following popular activation functions in the encoder and decoder: identity (I), sigmoid ($\sigma$), leaky RELU (LR), and parametric LR (PLR). Briefly, $I(x) = x$; $\sigma(x) = 1/(1 + e^{-x})$; $LR(x) = x$ for $x > 0$ and $\alpha x$ otherwise; PLR turns $\alpha$ into a hyper-parameter learned during training. We note that we do not intend to exhaust all activation functions published in the deep learning literature. With four options for the activation function in the encoder and four in the decoder, this yields $4 \times 4 = 16$ different variants for each architecture considered (shallow versus deep). For instance, we refer to a vAE with sigmoid in the encoder but LR in the decoder as vAE$_{\sigma\text{-}LR}$ and to a dAE with PLR in both the encoder and decoder as dAE$_{PLR\text{-}PLR}$.
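In Keras, these four choices can be expressed as follows. This is a minimal sketch of one possible encoding; the negative-slope value 0.01 for leaky ReLU is shown only as a placeholder for the value used in our experiments.

```python
from tensorflow import keras
from tensorflow.keras import layers

d_x, d_y = 150, 10                                       # hypothetical dimensionalities
inputs = keras.Input(shape=(d_x,))

# Identity (linear) versus sigmoid encoders are set via the `activation` argument.
code_linear  = layers.Dense(d_y, activation="linear")(inputs)
code_sigmoid = layers.Dense(d_y, activation="sigmoid")(inputs)

# Leaky ReLU and parametric ReLU are applied as separate layers after a linear Dense.
pre_code = layers.Dense(d_y)(inputs)
code_lr  = layers.LeakyReLU(0.01)(pre_code)              # fixed negative slope
code_plr = layers.PReLU()(pre_code)                      # negative slope learned in training
```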
3.4. Exploring Model Space in Search of a Best Model
Considering two settings for the dimensionality of the code layer, 16 combinations of activation functions, vAE versus dAE architectures, and architectures with or without the additional orthogonality constraint, we design and train 128 different models. Each model is trained over training data and tested over testing data. The squared (reconstruction) error is measured over every instance in a testing dataset, and the mean of these values, the mean squared reconstruction error (MSE), is employed as a primary metric to evaluate a model. The MSE-based comparison related in
Section 4 includes PCA as a baseline model, due to its popularity. In our
Supplementary Materials, we investigate AE architectures with different dimensionalities, and the results clearly show that the dAE architecture identified here still dominates in performance. We note that, due to its linearity, one can easily obtain the MSE for a PCA model. To keep the comparison in
Section 4 fair, we “train” PCA over the same training dataset and “test” it over the same testing dataset as an AE model. The
pca.fit and
pca.transform functions in Python’s sklearn library allow easily doing so; the
pca.inverse_transform function reconstructs the data, from which the MSE over a desired dataset is readily computed.
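The PCA baseline evaluation described above takes only a few lines of scikit-learn; the snippet below is a sketch in which the number of components and the arrays X_train and X_test stand in for the actual code dimensionality and centralized datasets.

```python
import numpy as np
from sklearn.decomposition import PCA

# X_train, X_test: matrices of centralized atomic deviations (one structure per row).
X_train = np.random.normal(size=(1000, 150))            # placeholder data
X_test = np.random.normal(size=(200, 150))

pca = PCA(n_components=10)                              # code dimensionality, illustrative value
pca.fit(X_train)                                        # "train" PCA on the training dataset

codes = pca.transform(X_test)                           # project test instances onto the PCs
recon = pca.inverse_transform(codes)                    # reconstruct in the original space

mse = np.mean(np.sum((X_test - recon) ** 2, axis=1))    # mean squared reconstruction error
print(mse)
```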
Handling Non-Determinism
An AE model can converge to a different local minimum of the loss function during training. The optimization process depends on the initial values of the parameters, which are set at random. Therefore, we train each AE model 3 times (each time starting with random initial parameters), resulting in 3 trained variants. When evaluating a particular architecture, we relate the mean of the MSEs obtained over the 3 variants.
3.5. Interpreting the Learned Latent Features
The ability to interpret the learned features is valuable in protein studies. Typically, PCA is preferred, as the interpretation can easily be carried out over the axes of the latent space (the eigenvectors/principal components). A structure is selected as reference, and changes to it are introduced by deforming the structure along one latent axis while keeping the others constant. Visualization is then employed to note the type of structural changes encoded in the feature space.
We carry out a similar process here to visualize “walks” in the feature space. Specifically, we select a structure as reference. However, since one cannot deform a structure along an axis in a nonlinear space, we “hop” between structures whose encodings in the feature space are close to a line parallel to a selected axis. We show the corresponding structural changes in
Section 4, providing insight into what information on structure variation is encoded in the latent dimensions of the feature space.
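One possible way to realize such a “walk”, sketched below under the assumption that an array codes holds the encodings of all structures in a dataset and ref_code the encoding of the chosen reference structure (both names are illustrative): starting from the reference code, we step along one latent axis and, at each step, retrieve the dataset structure whose code is nearest to the current point.

```python
import numpy as np

N, d_y = 1000, 10
codes = np.random.normal(size=(N, d_y))       # placeholder encodings of all structures
ref_code = codes[0]                           # encoding of the reference structure

axis = 0                                      # latent dimension to walk along
steps = np.linspace(-2.0, 2.0, num=9)         # illustrative step values

walk_indices = []
for step in steps:
    target = ref_code.copy()
    target[axis] += step                      # move parallel to the chosen axis
    nearest = np.argmin(np.linalg.norm(codes - target, axis=1))
    walk_indices.append(nearest)              # structure to visualize at this step

print(walk_indices)
```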
3.6. Supervised Learning over AE-Obtained Features
Beyond visualization of the latent feature space, a practical question concerns the utility of learned features for prediction tasks. In [
33], we show that features learned with an AE can be used in a supervised setting to predict the dissimilarity of a computed structure from an experimentally known, biologically active/native structure. Buoyed by these results and building over the more comprehensive model search in this paper, we evaluate the features learned in an unsupervised manner by the top AE model(s) for predicting the lRMSD of a tertiary structure, computed by a template-free structure prediction method, from a known native structure. lRMSD refers to the popular least root-mean-squared-deviation metric [
36]. The latter first finds an optimal superimposition of a decoy to a known native structure (extracted from the Protein Data Bank (PDB) [
37]) to remove differences due to translation and rotation in 3D and then averages the Euclidean distance over the atoms.
For a target protein, the data at hand (over which we train and test AE architectures and models) consist of 50–60K tertiary structures generated with the Rosetta AbInitio protocol [
38]. A proof-of-concept evaluation, which we carry out in [
33], would be as follows. Split the AE-featurized data into a training and a testing dataset for a target protein, train a supervised learning method
on that protein, and then evaluate the model on
that protein’s testing dataset. We are not interested in such a task here. Instead, we consider the following, more general setting. Over a list of target proteins organized in three categories of difficulty (based on the quality of the tertiary structures generated for each protein), we select half the proteins in each category to constitute the training dataset and the rest to constitute the testing dataset. This setting is more realistic, as we build one model per category. We then consider yet another setting, where we build one model over all categories. We also show the performance of supervised learning over PCA features, as well as over Isomap features; Isomap is another, nonlinear dimensionality-reduction method.
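The sketch below illustrates this per-category setting with scikit-learn. The feature matrices and lRMSD targets are random placeholders for the actual featurized decoy datasets, and the choice of a random-forest regressor is one reasonable off-the-shelf option rather than a prescription of the exact learner evaluated in Section 4.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Assumed available per protein: AE-encoded decoy features and their lRMSDs to the native.
rng = np.random.default_rng(0)
train_proteins = {"p1": (rng.normal(size=(500, 10)), rng.uniform(0, 15, 500)),
                  "p2": (rng.normal(size=(500, 10)), rng.uniform(0, 15, 500))}
test_proteins  = {"p3": (rng.normal(size=(500, 10)), rng.uniform(0, 15, 500))}

# One model per category: pool the featurized decoys of the category's training proteins.
X_train = np.vstack([X for X, _ in train_proteins.values()])
y_train = np.concatenate([y for _, y in train_proteins.values()])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Evaluate on decoys of proteins held out from training.
for name, (X_test, y_test) in test_proteins.items():
    print(name, mean_absolute_error(y_test, model.predict(X_test)))
```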
3.7. Datasets, Implementation Details, and Experimental Setup
Data Collection
The evaluation is carried out on 18 proteins of varying lengths (53 to 146 amino acids long) and varying folds that are used as a benchmark to evaluate structure prediction methods [
39,
40]. In an abuse of convention but in the interest of expediency, we refer to these proteins not by their actual names but by the PDB id of a representative native structure deposited for each of them in the PDB (Column 2 in
Table 1). The names of the proteins can be found in the
Supplementary Materials. On each target protein, we have run the Rosetta AbInitio protocol to obtain a dataset of no fewer than 50,000 structures. The protocol takes as input an amino-acid sequence (in FASTA format) and a fragment library (which we have generated using the ROSETTA server). The protocol is run in an embarrassingly parallel fashion, submitting batch jobs to our Mason ARGO supercomputing cluster. The slight differences in the number of structures obtained for each protein are largely due to small variations in allotted time and the increasing computational cost of the protocol to obtain all-atom structures for long sequences and/or complex folds.
Table 1 presents all 18 proteins arranged into three categories/levels of difficulty (easy, medium, and hard). These levels have been determined using the minimum lRMSD between Rosetta-generated decoys and a known native structure of the corresponding target protein (obtained from the PDB); the four-letter PDB ids are shown in Column 2; the fifth letter identifies the chain in a multi-chain PDB entry. These codes are used in an abuse of notation to refer to a particular protein (and its decoy dataset). The size of the dataset for each target is shown in Column 5. Column 7 shows the percentage of near-native decoys within an lRMSD threshold of the known native structure; the values of these thresholds vary by dataset and have been determined and related in prior work that focuses on clustering protein tertiary structures [42]. Column 7 thus relates the imbalance of the decoy datasets; in some cases, the near-native decoys constitute only a very small fraction of the dataset, which indicates that decoy selection is a challenging ML problem.
Data Preparation
For each Rosetta-generated structure, we only retain its CA atoms. In each dataset, we designate a structure as the
reference structure. We select this arbitrarily to be the first structure in a dataset. All structures are then optimally superimposed to the reference structure to minimize differences due to rigid-body motions [
36] (that is, differences due to translations and rotations in 3D). The superimposition changes the coordinates of each structure except for the reference one. The reference structure is then subtracted from each superimposed structure to obtain atomic deviations (coordinate differences per atom). This “centralization” is common practice in how PCA is applied to molecular structure data, and we follow the same process to prepare a dataset for training an AE model. Thus, the input fed to a model does not consist of atomic coordinates, but rather atomic deviations. It is, however, easy to obtain a reconstructed structure. Let us consider an input structure $S$. After its superimposition to the reference structure $S_{ref}$, its atomic deviations from the reference structure are $\Delta = S - S_{ref}$. The encoder uses $\Delta$ to obtain $y$, which is the learned latent representation (the code). The decoder provides the reconstructed $\hat{\Delta}$. One can easily obtain the reconstructed structure $\hat{S} = S_{ref} + \hat{\Delta}$.
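A minimal sketch of this preparation for a single structure follows, using SciPy for the optimal superimposition; the array names and the use of Rotation.align_vectors are illustrative choices, not the exact code of our pipeline.

```python
import numpy as np
from scipy.spatial.transform import Rotation

# S and S_ref: (n_CA, 3) arrays of CA coordinates of a decoy and the reference structure.
rng = np.random.default_rng(0)
S_ref = rng.normal(size=(53, 3))
S = S_ref + 0.5 * rng.normal(size=(53, 3))     # placeholder decoy

# Remove translation by centering both structures at their centroids.
S_c, S_ref_c = S - S.mean(axis=0), S_ref - S_ref.mean(axis=0)

# Remove rotation: find the rotation that best aligns the decoy onto the reference.
rot, _ = Rotation.align_vectors(S_ref_c, S_c)
S_aligned = rot.apply(S_c)

# Centralization: atomic deviations from the reference are the AE (and PCA) input.
delta = (S_aligned - S_ref_c).flatten()        # d_x = 3 * n_CA features

# Given a reconstructed deviation vector from the decoder, the reconstructed structure
# is the reference plus that deviation (shown here with delta itself).
S_reconstructed = S_ref_c + delta.reshape(-1, 3)
```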
The so-centralized dataset for each protein is split to obtain training, validation, and testing datasets. We note that the performance over the validation dataset is monitored in tandem with the performance over the training dataset during training to ensure no overfitting or underfitting occurs.
Metrics of Model Performance
As summarized above, the squared (reconstruction) error is measured over every instance in a testing dataset, and the mean of these values, to which we refer as MSE, is a primary metric to evaluate a model. Specifically, the SE calculated on an instance corresponding to a structure $S$ is $\|\Delta - \hat{\Delta}\|^2$; recall that the actual input to an AE (and to PCA) consists of the atomic deviations corresponding to each structure. It is not hard to see that this evaluates to the same quantity as $\|S - \hat{S}\|^2$, as one can write that $S - \hat{S} = (S_{ref} + \Delta) - (S_{ref} + \hat{\Delta}) = \Delta - \hat{\Delta}$.
Implementation Details
We use Keras to implement, train and evaluate the various AEs investigated in this paper [
43]; Keras is an open-source neural-network library written in Python. Each of the investigated AEs is trained for a total of 100 epochs with a batch size of 256. A small learning rate is employed to prevent premature convergence to local optima. In [
33], various dropout and learning rates are evaluated by hyper-parameter search. When the LR activation function is employed, the negative slope coefficient $\alpha$ is set to a small fixed value. Training times vary with the size of the training dataset. Since the proteins shown in
Table 1 vary from 53 to 146 amino acids, input instances x vary in dimensionality from $53 \times 3 = 159$ to $146 \times 3 = 438$. So, in a vAE trained on tertiary structures of a protein of 53 amino acids, $d_x = 159$.
In a dAE, we set the dimensionality of the first hidden layer to 250 for all datasets where the input dimensionality is below this value, and to a larger value otherwise. The dimensionality of the second hidden layer is set between that of the code layer and that of the input layer, as described in Section 3.2.
5. Conclusions
In this paper, we investigate and evaluate AEs and AE-based featurizations of protein tertiary structures. A systematic evaluation points to a top-performing architecture. The utility of the learned representations is evaluated via supervised learning in discriminating between native and non-native structures.
Altogether, we believe that AEs hold great promise for the reduction and summarization of molecular structure data. Platforms such as Keras make them easy to implement, evaluate, and thus adopt, opening the way to further research on exploiting AE-featurized structures for structure–function recognition in molecular biology. Many directions of research are promising. Pursuing additional regularizations will help in further lowering the dimensionality of the loss surface. Variational AEs are another direction of future research that can help with generating novel tertiary structures for data augmentation and other applications.