Antimicrobial Peptide Screening from Microbial Genomes in Sludge Based on Deep Learning

Liu, Yin-Xuan; Jin, Xue-Bo; Xu, Chun-Ming; Ma, Hui-Jun; Wu, Qi; Liu, Hao-Si; Li, Zi-Meng

doi:10.3390/app14051936

by Yin-Xuan Liu ¹, Xue-Bo Jin ¹, Chun-Ming Xu ²,*, Hui-Jun Ma ¹, Qi Wu ², Hao-Si Liu ² and Zi-Meng Li ¹

¹ College of Computers and Artificial Intelligence, Beijing Technology and Business University, Beijing 100048, China

² College of Light Industry Science and Engineering, Beijing Technology and Business University, Beijing 100048, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(5), 1936; https://doi.org/10.3390/app14051936

Submission received: 7 January 2024 / Revised: 6 February 2024 / Accepted: 7 February 2024 / Published: 27 February 2024

(This article belongs to the Special Issue Deep Learning and Machine Learning Applications in Biomedicine)

Abstract:

As the issue of traditional antibiotic resistance continues to worsen, exploring new antimicrobial substances has become crucial to addressing this challenge. Antimicrobial peptides (AMPs), recognized for their low resistance levels and minimal bacterial mutation frequencies, have garnered significant attention from researchers. However, traditional screening methods for AMPs are inefficient and costly. This study proposes a combined AMP screening model based on long short-term memory (LSTM) neural networks with an attention mechanism. By analyzing the characteristics of peptide segments, which are simulated enzymatic hydrolysis products of proteins expressed in sludge microbial genomes, the model accurately identifies peptide segments with potential antimicrobial activity. Molecular docking and dynamic simulation results validate three potential antimicrobial peptide candidates: LLPRLLARRY, GVREIHGLNPGGCLHTVRLVCR, and FRTTLAPHVLTRLLAPCW. These candidates exhibit high binding stability and affinity with target proteins, confirming the efficiency of the proposed AMP screening model.

Keywords:

machine learning; LSTM; microbial genome; antimicrobial peptides

1. Introduction

Antimicrobial resistance (AMR) has emerged as a formidable challenge in the realm of global health, precipitated by the misuse or overuse of antibiotics [1]. This overreliance on antibiotics complicates infection treatment, as bacteria evolve resistance to commonly utilized antimicrobial agents, potentially culminating in treatment failures, aggravated illness, or even mortality [2]. In addition to curbing antibiotic resistance, the development of new antimicrobial drugs has become a crucial strategy in addressing this challenge. Antimicrobial peptides (AMPs), favored by researchers for their low acquisition resistance rate and low bacterial mutagenesis rate, are considered as potential alternatives to antibiotics [3]. Therefore, the screening of novel AMPs has become a critical research direction in the field of biomedicine. Generally, AMPs can be obtained through chemical synthesis or from animals, plants, and microorganisms [4]. Recent studies have found that sludge microbiota exhibit rich biodiversity, making them a significant resource for obtaining antimicrobial peptides. Therefore, exploring novel antimicrobial peptides from sludge microbiota offers new possibilities for addressing the issue of antibiotic resistance.

Traditional AMP screening methodologies involve laborious and time-intensive separation and identification processes, hampering high-throughput capabilities [5]. However, with the advent of machine learning and bioinformatics, an array of machine learning-based peptide screening techniques has surfaced. Presently, these include neural networks, random forests, support vector machines, decision trees, Bayesian methods, and ensemble learning, among others [6,7,8,9]. Neural networks adeptly capture non-linear data dependencies, while random forests provide an interpretable ensemble learning approach. Support vector machines excel in managing high-dimensional data, aiding in the differentiation between AMPs and non-AMPs. Comprehensive research underscores the efficacy of machine learning in predicting AMPs, marking a significant improvement in screening efficiency over traditional methods [10].

The interdependence relationships within gene sequences are crucial information for distinguishing AMPs from non-AMPs in subsequent analyses, and analyzing them accurately is a key factor in improving screening performance. These relationships exist not only between adjacent genomic regions but also between more distant gene sequences [11,12]. Moreover, the interdependence relationships that are critical for AMPs typically manifest in only a small portion of the genome, so it is essential to accurately identify and focus on these regions.

The long short-term memory (LSTM) network is a commonly used type of recurrent neural network (RNN). Compared to traditional RNNs, LSTM is better at preserving long- and short-term dependencies when handling sequence data, mitigating issues such as vanishing and exploding gradients. Building on LSTM and convolutional neural networks, Wang et al. proposed the CL-ACP model for screening anticancer peptides [13]. Similarly, Dee proposed a model that combines LSTM and convolutional neural networks to classify antimicrobial peptides, enhancing prediction accuracy [14]. Yaseen et al. developed the HemoNet model, which utilizes neural network methods and LSTM to predict the hemolytic activity of peptides [9]. Hussain proposed the Samp-pfpDeep model, which combines LSTM with deep neural networks to accurately screen short antimicrobial peptides [15]. These studies suggest that LSTM-based deep network models are effective in antimicrobial peptide prediction and significantly improve accuracy compared to traditional machine learning methods.

However, as mentioned earlier, the crucial dependency information in the genome is vital during the screening process, and treating all dependencies equally is not prudent. In this study, we first constructed the ALSTM model by adding an attention mechanism to the LSTM model. This model is designed to emphasize specific key gene relationships for the screening of peptide segments with antimicrobial activity. Furthermore, to accommodate both long- and short-dependency relationships between genomes, we designed a combined model. This model integrates the LSTM and ALSTM networks, achieving effective extraction and analysis of antimicrobial peptide genomic data. It simultaneously considers long- and short-term dependency relationships in genomic data and utilizes an attention mechanism to emphasize contextual relationships of key features.

The innovations of this paper include:

(1): In response to the demand for AMP screening, this study innovatively constructed the ALSTM model based on the LSTM model, incorporating an attention mechanism to emphasize key gene relationships for the effective screening of peptide segments with antimicrobial activity.
(2): This paper introduces a combined screening deep network designed to consider both long- and short-term dependency relationships in genomic data. It utilizes an attention mechanism to emphasize contextual relationships of key features, accurately screening genes and peptide segments with potential antimicrobial activity.
(3): In the validation experiments, this paper introduces molecular docking and dynamic simulations to simulate the interaction between the screened peptide segments and potential antibiotic target proteins. This allows for the study of the interaction between antimicrobial peptides and target proteins, thereby validating the effectiveness of the proposed antimicrobial peptide screening model.

2. Materials and Methods

2.1. Dataset for Training the Network

Given the absence of a standardized dataset in antimicrobial peptide prediction, the negative samples were obtained from Zhang et al.’s datasets [16], while the positive samples were selected from a pool of over 6000 antimicrobial peptides available in DRAMP (http://dramp.cpu-bioinfor.org/ (accessed on 5 April 2023)). The dataset comprises a total of 14,256 samples (see Table 1). The dataset was split into training and validation sets in a 7:3 ratio. The training set includes 4370 positive samples (antimicrobial peptides) and 5132 negative samples (non-antimicrobial peptides), while the validation set comprises 2211 positive samples and 2543 negative samples.
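As a quick arithmetic check on the counts quoted above, the following sketch sums the four class counts and reports the realized training fraction. The variable names are illustrative and are not taken from the authors' code.

```python
# Sample counts as stated in the text.
train_pos, train_neg = 4370, 5132
val_pos, val_neg = 2211, 2543

train_total = train_pos + train_neg   # training samples
val_total = val_pos + val_neg         # validation samples
total = train_total + val_total       # should match the stated 14,256

train_fraction = train_total / total
print(f"total = {total}, training fraction = {train_fraction:.3f}")
```

The four counts add up to the stated total of 14,256 samples. The realized training share is about two thirds, which is close to but slightly below a nominal 7:3 split.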

2.2. ALSTM

The attention-based LSTM (ALSTM) model is an extension of the traditional LSTM model, incorporating both LSTM and an attention mechanism. The LSTM model consists of an input layer, a hidden layer, and an output layer. The input layer receives the dataset as input, and the hidden layer contains multiple LSTM units responsible for processing sequence data and extracting key features. The output layer is a fully connected layer that outputs the probability of antimicrobial activity for peptide segments.

The LSTM model comprises numerous LSTM units, each encompassing distinct components, including a forget gate, an input gate, and an output gate. In this architecture, the forget gate manages the retention or omission of information from the preceding time step in the memory cell. Simultaneously, the input gate regulates the acceptance or rejection of input at the current time step by the memory cell. Finally, the output gate determines whether the output at the current time step proceeds to the subsequent LSTM unit or the output layer. The calculation is as follows:

\tilde{c}_{t} = \tanh (w_{c} (h_{t - 1} + x_{t}) + b_{c})

(1)

a_{t} = \tanh (w_{a} (h_{t - 1} + x_{t}) + b_{a})

(2)

f_{t} = \sigma (w_{f} (h_{t - 1} + x_{t}) + b_{f})

(3)

i_{t} = a_{t} + \tilde{c}_{t}

(4)

c_{t} = f_{t} \circ c_{t - 1} + i_{t} \circ \tilde{c}_{t}

(5)

o_{t} = \sigma (w_{o} (h_{t - 1} + x_{t}) + b_{o})

(6)

h_{t} = o_{t} \circ \tanh (c_{t})

(7)

In these equations, c_{t} represents the cell state at the current time step t, a key internal variable used to carry information through the network. h_{t - 1} is the hidden state from the previous time step t - 1, and x_{t} is the input at the current time step. The weight matrices w (w_{a}, w_{c}, w_{f}, w_{o}) and bias vectors b (b_{a}, b_{c}, b_{f}, b_{o}) are parameters learned by the network. The hyperbolic tangent activation function tanh is used to generate the candidate state \tilde{c}_{t} and the activation value a_{t}, while the sigmoid activation function \sigma is used to control the forget gate f_{t} and the output gate o_{t}. i_{t} represents the value of the input gate, determining how much new information will be incorporated into the cell state. c_{t - 1} is the cell state from the previous time step. h_{t} is the hidden state at the current time step, calculated as the Hadamard product (element-wise multiplication) of the current output gate value o_{t} and the current cell state c_{t} processed through tanh.
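To make the update order of Equations (1) to (7) concrete, here is a minimal pure-Python sketch of a single time step. It follows the equations as printed, including the additive form h_{t-1} + x_{t}, which requires the input and hidden state to share one dimension. The function names, the `params` dictionary layout, and the name `c_tilde` for the candidate state of Equation (1) are assumptions made for illustration; this is not the authors' PyTorch implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(w, v, b):
    """Return w @ v + b for a matrix w (list of rows) and vectors v, b."""
    return [sum(wij * vj for wij, vj in zip(row, v)) + bi
            for row, bi in zip(w, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One time step following Eqs. (1)-(7) as written in the text."""
    s = [h + x for h, x in zip(h_prev, x_t)]                  # h_{t-1} + x_t
    c_tilde = [math.tanh(z) for z in affine(params["w_c"], s, params["b_c"])]  # Eq. (1)
    a_t = [math.tanh(z) for z in affine(params["w_a"], s, params["b_a"])]      # Eq. (2)
    f_t = [sigmoid(z) for z in affine(params["w_f"], s, params["b_f"])]        # Eq. (3)
    i_t = [a + c for a, c in zip(a_t, c_tilde)]                                # Eq. (4)
    c_t = [f * cp + i * ct
           for f, cp, i, ct in zip(f_t, c_prev, i_t, c_tilde)]                 # Eq. (5)
    o_t = [sigmoid(z) for z in affine(params["w_o"], s, params["b_o"])]        # Eq. (6)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                         # Eq. (7)
    return h_t, c_t

# Toy parameters (all zeros) for a 2-dimensional state; purely illustrative.
zeros_m = [[0.0, 0.0], [0.0, 0.0]]
zeros_v = [0.0, 0.0]
toy_params = {"w_c": zeros_m, "b_c": zeros_v, "w_a": zeros_m, "b_a": zeros_v,
              "w_f": zeros_m, "b_f": zeros_v, "w_o": zeros_m, "b_o": zeros_v}
h_t, c_t = lstm_step([1.0, -1.0], [0.0, 0.0], [2.0, 0.0], toy_params)
print(h_t, c_t)
```

With all-zero parameters the forget and output gates both equal 0.5 and the candidate state vanishes, so the new cell state is simply half of the previous one. This makes the toy call easy to verify by hand.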

Relying on an attention mechanism, ALSTM can focus on the parts of the input sequence that are most relevant for making predictions. Its input layer, like that of the LSTM model, receives sequential data. In the attention-based hidden layer, the attention mechanism is applied to process the input data. The attention weights determine how strongly the model focuses on different parts of the input sequence, and they are calculated as follows:

a_{t} = \frac{\exp (\mathrm{score} (h_{t - 1}, x_{t}))}{\sum_{j = 1}^{T_{x}} \exp (\mathrm{score} (h_{t - 1}, x_{j}))}

(8)

In attention mechanisms, a scoring function is commonly used to calculate the relevance score of each component in the input sequence (such as each hidden state in an LSTM network) to the current output. This score reflects the contribution size of each input element in generating the current output, enabling the model to dynamically focus attention on the most crucial parts of the input sequence. This mechanism, when applied to genomic sequences, allows the model to consider the critical features of the input sequence during prediction, enhancing the accuracy and efficiency of predictions. The method of calculating the score is as follows:

\mathrm{score} (h_{t - 1}, x_{t}) = v_{a}^{T} \tanh (W_{a} [h_{t - 1}, x_{t}] + b_{a})

(9)

Using the attention weights from Formula (8), the integrated context information of the gene sequence is calculated as:

c_{t} = \sum_{j = 1}^{T_{x}} a_{j} x_{j}

(10)
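The attention computation in Formulas (8) to (10) can be sketched in a few lines of pure Python: an additive score for each position, a softmax over the scores, and a weighted sum of the inputs. The function names and the toy parameter values are assumptions made for illustration, and the bracket [h_{t-1}, x_{t}] is interpreted here as vector concatenation.

```python
import math

def additive_score(h_prev, x, W_a, b_a, v_a):
    """Formula (9): v_a^T tanh(W_a [h_prev, x] + b_a)."""
    z = h_prev + x  # list concatenation, i.e. [h_prev, x]
    hidden = [math.tanh(sum(w * zi for w, zi in zip(row, z)) + b)
              for row, b in zip(W_a, b_a)]
    return sum(v * hj for v, hj in zip(v_a, hidden))

def attention_context(h_prev, xs, W_a, b_a, v_a):
    """Formulas (8) and (10): softmax weights and the weighted sum of inputs."""
    scores = [additive_score(h_prev, x, W_a, b_a, v_a) for x in xs]
    m = max(scores)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]          # Formula (8)
    dim = len(xs[0])
    context = [sum(w * x[k] for w, x in zip(weights, xs))
               for k in range(dim)]              # Formula (10)
    return weights, context

# Toy example: hidden size 1, input size 2, three sequence positions.
W_a = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]
b_a = [0.0, 0.1]
v_a = [1.0, -1.0]
xs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
weights, context = attention_context([0.2], xs, W_a, b_a, v_a)
print(weights, context)
```

The weights are non-negative and sum to one, so the context vector is a convex combination of the inputs. When all scores are equal, it reduces to their plain average.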

Subsequently, the context vector is used to update the output in the LSTM calculations:

C_{t} = f_{t} \times C_{t - 1} + i_{t} \times {\tilde{C}}_{t} (c_{t})

(11)

h_{t} = o_{t} \times \tanh (C_{t})

(12)

We observed that ALSTM, similar to the LSTM model, calculates outputs but incorporates information processed by the attention mechanism. This allows its output to emphasize the contextual relationships of key genomic features, thereby enhancing the effectiveness of extracting and analyzing antimicrobial peptide genomic data.

2.3. Combined Antimicrobial Peptide Screening Deep Network

This prediction framework (as shown in Figure 1) is employed for the recognition of antimicrobial peptides and combines LSTM and ALSTM networks to form a composite model. This dual-model architecture aims to leverage the respective strengths of LSTM and ALSTM, resulting in more refined prediction outcomes. The sigmoid function can map any variable to a value between 0 and 1, making it well-suited for generating probability outputs. The linear function is essentially an identity function, aiming to directly output continuous prediction values.

The final prediction of AMPs is determined through a consensus method, where both the LSTM and ALSTM models must independently identify a peptide as an antimicrobial peptide. This fusion method ensures that only peptides identified as potential AMPs by both models are considered, thereby improving prediction accuracy and minimizing false positives. The validation process confirms that the fusion model is significantly better than individual LSTM or ALSTM models, demonstrating enhanced predictive power and reliability.

P_{\mathrm{consensus}} = P_{\mathrm{LSTM}} \ \&\& \ P_{\mathrm{ALSTM}}

(13)

Here, “&&” represents the logical AND operator: the consensus value is set to 1 only when the probabilities output by both models are greater than 0.5, and a peptide is considered an antimicrobial peptide only when the consensus value equals 1.
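The consensus rule can be written as a one-line function. The sketch below assumes each model returns a probability in [0, 1]; the function and argument names are illustrative.

```python
def consensus(p_lstm, p_alstm, threshold=0.5):
    """Return 1 only if both model probabilities exceed the threshold."""
    return int(p_lstm > threshold and p_alstm > threshold)

print(consensus(0.9, 0.8))  # both models agree on AMP
print(consensus(0.9, 0.4))  # ALSTM disagrees
```

Because both conditions must hold, the rule can only remove positives predicted by a single model. It therefore trades some recall for fewer false positives.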

3. Experiments

3.1. Experimental Process

In the experimental identification of novel antimicrobial peptides (AMPs) from sludge microbial communities using deep learning technology, the process includes training a combined deep network, constructing a microbial genome dataset for screening based on sludge samples, utilizing the combined deep network for AMP screening, and finally, validating the screening results of the deep model through molecular docking and dynamic simulations.

Specifically, the process comprises the following four steps (refer to Figure 2):

Step 1: Following the method provided in Section 2, train the combined deep network using the dataset and evaluate the effectiveness of model training using metrics such as accuracy, precision, recall, and F1 score. The calculation formulas are as follows:

\mathrm{acc} = \frac{TP + TN}{TP + TN + FN + FP}

(14)

\mathrm{pre} = \frac{TP}{TP + FP}

(15)

\mathrm{recall} = \frac{TP}{TP + FN}

(16)

F1 = \frac{2 \cdot \mathrm{pre} \cdot \mathrm{recall}}{\mathrm{pre} + \mathrm{recall}}

(17)

where TP represents True Positive, TN represents True Negative, FP represents False Positive, and FN represents False Negative.
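The four metrics can be computed directly from confusion-matrix counts. The sketch below uses the standard definitions, recall = TP / (TP + FN) and F1 = 2 · pre · recall / (pre + recall); the function name and the example counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    pre = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * pre * rec / (pre + rec) if (pre + rec) else 0.0
    return {"acc": acc, "pre": pre, "recall": rec, "f1": f1}

# Hypothetical counts, chosen only to illustrate the calculation.
example = classification_metrics(tp=90, tn=80, fp=10, fn=20)
print(example)
```

The guards against empty denominators return 0.0 when a model predicts no positives or when the data contain none. That convention is a design choice of this sketch.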

Step 2: Construction of microbial genome dataset based on sludge samples.

A 10 L sludge sample was collected from Shandong Yingxuan Industrial Co., Ltd. (Shandong, China). Sterilized containers and sampling tools were used during sampling, avoiding direct contact with large chunks of sludge as much as possible. After sampling, the sample container was sealed promptly and transferred to the laboratory as quickly as possible. To prevent the loss of microbial activity, the samples were stored at low temperature and protected from light.

The OMEGA kit (Beijing Kinglith Technology Co., Ltd., Beijing, China) was employed to extract DNA from the samples. NEBNext dsDNA Fragmentase (Beijing Kinglith Technology Co., Ltd., Beijing, China) was used to fragment the DNA into pieces of approximately 300 base pairs through enzymatic cleavage at 37 °C for 30 min. The fragments were incubated with a mixture of T4 DNA polymerase, T4 polynucleotide kinase, and Klenow DNA polymerase at 20 °C for 30 min to blunt and phosphorylate their ends. They were then incubated with Klenow fragment (3′ to 5′ exo-) at 37 °C for 30 min to add an ‘A’ base at the 3′ end of all fragments. Subsequently, Illumina TruSeq DNA adapters with ‘T’ overhangs were ligated to the fragments using T4 DNA ligase at room temperature for 15 min. SPRI beads were employed to select fragments of 300–500 bp based on bead-to-sample ratios of 0.8X and 0.1X, removing shorter fragments and adapter dimers. The adapter-ligated DNA fragments were PCR-amplified using high-fidelity DNA polymerase, dNTPs, MgCl2, and PCR buffer, with a forward primer (5′-AATGATACGGCGACCACCGAGATCTACAC-3′) and a reverse primer (5′-CAAGCAGAAGACGGCATACGAGAT-3′). The PCR conditions included an initial denaturation at 95 °C for 3 min, followed by 10 cycles of 95 °C for 30 s (denaturation), 55 °C for 30 s (annealing), and 72 °C for 30 s (extension), with a final extension at 72 °C for 5 min. Size distribution was assessed using an Agilent Bioanalyzer (Beijing Novogene Technology Co., Ltd., Beijing, China), and quantification was performed using Qubit to ensure an average size of approximately 300 bp and minimize adapter dimers. The concentration was adjusted based on the qPCR results for pooling. Finally, the libraries were pooled in equimolar amounts, and the final concentration was adjusted to 4 nM. Raw sequencing files in FASTQ format were obtained on the Illumina sequencing platform.
The experimental materials, including reagents and kits, were all supplied by Beijing Jingli Science and Technology Co., Ltd. (Beijing, China). Precision sequencing was entrusted to Beijing Novogene Technology Co., Ltd. (Beijing, China). The company was responsible for handling the sequencing process, and the specific materials used for sequencing were internally provided by the company.

The getorf function in EMBOSS was used to search for open reading frames in the DNA sequence and predict protein sequences. BLASTp was employed to remove non-protein family sequences. Subsequently, in the Protein Digestion Simulator software (AMBER20), pepsin 1 and 2, and trypsin were selected for simulated hydrolysis. The resulting peptide segments from simulated hydrolysis were merged with the original protein sequences to create a new dataset. CD-Hit was then used to remove known antimicrobial peptide sequences and duplicate sequences.
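For readers unfamiliar with simulated hydrolysis, the following sketch illustrates the idea with the commonly used simplified trypsin rule: cleave on the C-terminal side of lysine (K) or arginine (R) unless the next residue is proline (P). It is not the Protein Digestion Simulator used in this study, it allows no missed cleavages, and it does not model pepsin. The function name and the example sequence are hypothetical.

```python
def trypsin_digest(protein):
    """Split a protein after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):          # keep the C-terminal remainder
        peptides.append(protein[start:])
    return peptides

# Hypothetical protein sequence, for illustration only.
fragments = trypsin_digest("MKTAYRPLKG")
print(fragments)
```

Joining the fragments in order reproduces the input sequence. This is a convenient sanity check before merging the peptides with the original protein sequences.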

The amino acid sequences were digitally encoded, mapping each amino acid letter to a unique number. Subsequently, a length check was performed on the generated numeric encoding string. If it was less than 300 numbers, “0” was added for padding to ensure the encoded sequence length was 300. Finally, positive and negative sample labels were added to the sequences. The obtained sequences for screening totaled 5763.
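The encoding and padding step described above can be sketched as follows. The particular letter-to-number mapping (alphabetical order of the 20 standard amino acids, starting at 1) is an assumption, since the text does not specify it. Rejecting sequences longer than 300 residues is likewise a choice of this sketch rather than documented behavior.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering; 0 is reserved for padding
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
MAX_LEN = 300

def encode_peptide(seq, max_len=MAX_LEN):
    """Map each residue to an integer and right-pad with zeros to max_len."""
    codes = [AA_TO_INT[aa] for aa in seq.upper()]
    if len(codes) > max_len:
        raise ValueError(f"sequence longer than {max_len} residues")
    return codes + [0] * (max_len - len(codes))

encoded = encode_peptide("LLPRLLARRY")
print(encoded[:12], len(encoded))
```

Reserving 0 for padding keeps padded positions distinguishable from real residues, which matters if an embedding layer or mask is applied downstream.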

Step 3: AMP screening based on the combined deep network.

The sequences to be screened are fed into the model trained in Step 1, and the model output serves as the screening result of the combined deep learning model.

Step 4: Validate the screening results of the deep model using molecular docking and molecular dynamics simulations.

The structures of the candidate peptides are built from their amino acid sequences in ChemDraw (20.0) and Chem3D (20.0), and docking is performed using AutoDock (1.5.7). Before the docking parameters are set, excess ions and water molecules are removed. After docking, the three models with the highest scores are selected. Further validation is conducted through dynamic simulations using Discovery Studio (2019). Finally, RMSD and RMSF plots are used to assess the interaction between the screened peptides and potential antibiotic target proteins, validating the accuracy of the screening results from the combined deep model.

The code and raw data have been uploaded to GitHub, and the website is: https://github.com/LuckyCoder2023/Antimicrobial-Peptides-Screening-from-Microbial-Genomes-in-Sludge-Based-on-Deep-Learning.git (accessed on 10 July 2023).

3.2. Training of the Combined Deep Network and Screening Results

The combined screening networks were trained using the training set described in Section 2.1. All experiments were performed on a desktop computer equipped with an AMD R7-5800 processor (4.0 GHz) and 16 GB of memory. In the training experiments, we used the open-source deep learning library PyTorch to build the combined network model, and the Adam algorithm was used to optimize the model parameters. Both LSTM and ALSTM had 2 layers.

The ROC curve for training is shown in Figure 3. In the ROC curve, the horizontal axis represents the false positive rate (FPR), and the vertical axis represents the recall rate. By comparing the ROC curves, we observed that both models exhibit excellent performance, with the ALSTM model performing particularly well. This indicates that critical information plays a crucial role in DNA fragments, and utilizing ALSTM allows for comprehensive modeling of these features, thereby enhancing the recognition of crucial information.
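To illustrate how the two axes of an ROC curve are obtained, the sketch below sweeps a decision threshold over predicted scores, records the false positive rate and the recall (true positive rate) at each threshold, and integrates the curve with the trapezoidal rule. The labels and scores are made up for illustration, and the function names are hypothetical; both classes must be present in the labels.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs for thresholds swept from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up labels (1 = AMP) and scores, for illustration only.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_auc = auc(roc_points(toy_labels, toy_scores))
print(toy_auc)
```

The resulting area equals the fraction of positive-negative pairs in which the positive sample receives the higher score, which is why an AUC close to 1 indicates good separation of the two classes.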

Based on the confusion matrix, we calculated evaluation metrics such as accuracy, precision, recall, and F1 score (as shown in Table 2 and Figure 4). An accuracy close to 1 indicates excellent overall model performance in sample classification, high precision suggests high accuracy in positive class predictions, and high recall indicates successful capture of true positive samples by the model. The F1 score combines precision and recall, and it tends to be high when both precision and recall are relatively high. Our fusion model demonstrates outstanding performance, especially in terms of precision and recall. Specifically, the LSTM model showed a high recall rate of 0.966, while the ALSTM model performed better in terms of precision with a score of 0.952.

By combining the predictions of the two models, our fusion strategy ensures that only highly reliable peptide segments are ultimately predicted as AMPs, providing a finely tuned prediction framework. After fusion, the model only has true positives and true negatives, with counts of 2211 and 2543, respectively. The accuracy, precision, recall, and F1 all reach 1, achieving successful accurate predictions.

Using the screening model, we selected 17 short peptide sequences with high scores, as shown in Table 3.

3.3. Model Verification

We modeled all the short peptide sequences in Table 3 and performed molecular docking with the antimicrobial peptide receptor protein (PDB ID: 3lfz). In molecular docking, we employed three evaluation metrics: binding energy (Affinity), RMSD distance from the reference structure (Dist from RMSD), and the RMSD of the best mode (Best Mode RMSD). Binding energy is an indicator of the strength of binding between molecules, where a more negative value typically indicates a tighter binding. Dist from RMSD measures the RMSD distance between the molecular structure and the reference structure, where RMSD reflects the structural similarity of two molecules in three-dimensional space. We selected the conformation with the minimum RMSD distance as the best conformation to ensure its close resemblance to the reference structure. Best Mode RMSD measures the RMSD value between the best mode selected in the docking simulation and the reference structure, serving as a crucial indicator for assessing the accuracy of the docking model.

Considering the three metrics in Table 4, we selected three optimal models corresponding to the peptide sequences LLPRLLARRY, GVREIHGLNPGGCLHTVRLVCRR, and FRTTLAPHVLTRLLAPCW.

After molecular docking visualization, as illustrated in Figure 5, the structure and binding interactions of the receptor–ligand complex are presented. The first set of images indicates that the complex is primarily supported by van der Waals forces at various binding sites, facilitating interactions between the two molecules. Van der Waals interactions between molecules play a crucial role in intermolecular interactions. These interactions include electrostatic and van der Waals potentials. Studies have shown that van der Waals interactions play a significant role in the attractive forces between molecules, especially when intramolecular covalent bonds are stretched [17]. Additionally, in molecular docking, van der Waals forces significantly increase the binding energy between the two molecules compared to other models, as they involve a small number of pi bonds and attractive charges.

In the second and third sets of models, different sites form salt bridges, attractive charges, hydrogen bonds, hydrophobic interactions, pi bonds, and alkyl interactions between molecules. Studies have found that interactions between positively charged amino acid residues and negatively charged residues form salt bridges, enhancing the conformational stability of proteins [18]. Additionally, geometrically optimized salt bridges can accelerate folding speed and decelerate unfolding speed, while poorly shaped salt bridges slow down folding speed and slightly accelerate unfolding speed [19]. Hydrogen bonds play a crucial role in receptor–ligand binding, regulating the degree of protein structural fluctuations, thereby reducing the hydrogen/deuterium exchange (HDX) rate during ligand binding [20]. HDX is an analytical technique that measures the rate at which hydrogen atoms in the protein structure are replaced by deuterium. This replacement process is influenced by protein folding and ligand binding. The decrease in the HDX rate implies a reduction in the protein’s dynamics during ligand binding, leading to a more stable structure. This helps minimize errors and instability factors that may arise during the docking process, thereby enhancing the reliability and accuracy of the docking model. Additionally, changes in hydrogen bond energy during the binding process help filter out unrealistic poses during docking and integrate them into the free energy calculation model, improving the accuracy of binding energy calculations [21,22]. Therefore, the docking scores of the two sets of models are relatively high.

At the cellular level, antimicrobial peptides exhibit effective antibacterial activity against Escherichia coli, primarily achieved by increasing the permeability of the bacterial cell membrane. This process induces morphological changes, swelling, cytoplasmic lysis, and membrane damage in the cells. Such damage results in a decrease in the intracellular ATP concentration and the leakage of essential components, including critical enzymes and nucleic acids. These effects collectively act to disrupt vital physiological functions in bacteria, ultimately leading to bacterial death and, consequently, the manifestation of antimicrobial effects [23].

RMSF (root-mean-square fluctuation) and RMSD (root-mean-square deviation) plots are widely utilized in molecular simulation studies to reveal the dynamics and stability of molecular structures. The RMSF plot illustrates the root-mean-square fluctuations at each atomic position during molecular simulation, while the RMSD plot is used to compare the root-mean-square deviations of atoms between the simulated structure and the reference structure. This aids in gaining a deeper understanding of the evolutionary trajectory and overall stability of the simulated structure, providing information about the relative stability and flexibility of various structural components. Condic-Jurkic et al. [24] applied both RMSF and RMSD plots in their discussion of the challenges and limitations of using molecular dynamics simulations to study the multidrug transporter protein P-glycoprotein (p-gp) in a membrane environment. They confirmed the reliability of these plots.
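The two quantities can be computed from trajectory coordinates in a few lines. The sketch below assumes that every frame has already been superimposed on the reference structure (no alignment is performed) and that coordinates are given as (x, y, z) tuples; the function names and the toy coordinates are illustrative.

```python
import math

def rmsd(coords, reference):
    """Root-mean-square deviation between one frame and a reference structure."""
    sq = [sum((a - b) ** 2 for a, b in zip(p, q))
          for p, q in zip(coords, reference)]
    return math.sqrt(sum(sq) / len(sq))

def rmsf(frames):
    """Per-atom root-mean-square fluctuation around the time-averaged position."""
    n_frames, n_atoms = len(frames), len(frames[0])
    result = []
    for i in range(n_atoms):
        mean = [sum(f[i][k] for f in frames) / n_frames for k in range(3)]
        msf = sum(sum((f[i][k] - mean[k]) ** 2 for k in range(3))
                  for f in frames) / n_frames
        result.append(math.sqrt(msf))
    return result

# Toy two-atom system with two frames, for illustration only.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
frame = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(frame, reference))
print(rmsf([[(0.0, 0.0, 0.0), (5.0, 5.0, 5.0)],
            [(2.0, 0.0, 0.0), (5.0, 5.0, 5.0)]]))
```

RMSD yields one value per frame and tracks overall drift from the reference, whereas RMSF yields one value per atom and highlights which parts of the structure are flexible.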

We selected three models with the highest docking scores for dynamic simulation experiments and obtained RMSF and RMSD results as shown in Figure 6. In the RMSF plot, the error range consistently stays between 0.7 and 0.8, demonstrating a relatively good performance. In the RMSD plot, as the system stabilizes, the error remains between 0.5 and 0.6. The comprehensive analysis of the docking scores and dynamic simulation images indicates that the selected three models fall within an acceptable range of errors. However, further examination and validation are required for the results of the first set of RMSD plots. This comprehensive analysis provides crucial insights for a deeper understanding of the performance of these models.

4. Discussion

Deep learning models exhibit enormous potential in handling and learning the complex relationships within large peptide sequences and their structural properties, particularly in antimicrobial peptide prediction [25]. Through appropriate training and validation, these models can learn physicochemical features such as amino acid sequences and molecular properties, enabling a more comprehensive prediction. However, the quality and availability of data, especially high-quality annotated data, are crucial for the model’s generalization ability. Insufficient or inappropriate model design with respect to data can lead to overfitting or underfitting, thereby affecting prediction accuracy [26]. Therefore, when utilizing machine learning models for antimicrobial peptide prediction, the selection of an appropriate model, features, and data preprocessing methods, along with consideration of the biological background, become crucial factors [27].

As shown in Table 5, in a recent study, Ma et al. [28] utilized machine learning methods to synthesize 216 peptides from 241 segments and found that 181 of them exhibited antimicrobial activity, achieving a hit rate of 83.8%, surpassing previous efforts. Similarly, the University of Macau developed “deep-ampep30” [29] using deep learning to study peptides. This method, based on an optimal feature set and convolutional neural network, achieved an accuracy of 77% and an area under the ROC curve of 85%. “AMP-EBiLSTM” demonstrated the application of deep learning technology, specifically the enhanced bidirectional long short-term memory model, in AMP identification, achieving an accuracy of 92.39% and an AUC close to 0.98. Bomin Wei’s DeepLPI model [30] achieved an AUROC of 0.857 on the BindingDB dataset and an AUROC of 0.925 on the Davis dataset. This model is composed of a one-dimensional convolutional neural network (1D CNN) based on ResNet and a bidirectional long short-term memory network (biLSTM). Considering the experimental data results of this study, ALSTM achieved a remarkable AUC of 0.99, while LSTM reached an AUC of 0.95. The LSTM model demonstrated a high recall of 0.966, and the ALSTM model excelled in precision with a value of 0.952. In comparison, this study shows superior performance in evaluation metrics.

The results of molecular docking revealed that peptide segments extracted from Clostridiales, Caloramator sp. E03, and Synergistaceae exhibited high affinity and low root-mean-square deviation (RMSD) values when interacting with the antimicrobial peptide receptor protein (PDB ID: 3LFZ). Similarly, Aliye et al. [32] utilized AutoDock Vina to conduct molecular docking studies on compounds isolated from the plant Ocimum cufodontii against Escherichia coli GyraseB (PDB ID: 6F86) to screen for antibacterial compounds. In this study, the model with the highest affinity showed a binding free energy of −7.4 kcal/mol, similar to Amer’s [33] findings on the antibacterial activity of nucleosides and Schiff base derivatives and their molecular docking interactions with target proteins. The analysis above indicates that antimicrobial peptides selected through machine learning techniques exhibit high binding efficiency and specificity at the molecular level. In comparison to related research, the experimental methods in this study demonstrate rationality and foresight in antimicrobial peptide screening and molecular docking analysis. The newly identified antimicrobial peptide candidates not only expand the resource library of antimicrobial peptides but also provide a new perspective and approach to addressing antibiotic resistance issues.

Recent research reveals significant differences between ribosomally synthesized and post-translationally modified peptides (RiPPs), such as lantibiotics, and non-ribosomally synthesized peptides (NRPs). RiPPs are synthesized on the ribosome and then post-translationally modified, resulting in distinctive structures such as lanthipeptides, which contain unusual amino acids such as dehydroalanine and lanthionine residues [34].

NRPs, a vital class of antimicrobial agents, are synthesized by non-ribosomal peptide synthetase (NRPS) complexes independently of mRNA templates. The structural diversity and biological activity of NRPs contribute significantly to antimicrobial drug development [35]. However, the current system cannot directly identify or predict NRPs, because NRP synthesis is encoded by complex gene clusters rather than by a single open reading frame (ORF).

This limitation suggests that, while deep learning provides a powerful tool for screening and identifying potential ribosomal peptides (RPs), further method development and algorithm optimization are required for the comprehensive exploration of antimicrobial peptide resources in the environment, especially NRPs. Therefore, in future research, exploring a broader range of antimicrobial peptides, including NRPs, will be a crucial direction for our work.

At the same time, we must recognize that translating these putative antimicrobial peptides from in vitro experiments to practical clinical applications still faces numerous challenges. First, the antimicrobial activity of these peptides still has to be assessed with traditional microbiological techniques such as agar diffusion assays and minimum inhibitory concentration determinations. Second, the transition from the laboratory to clinical application requires not only further stability and safety assessments but also pharmacokinetic studies and detailed clinical trials.

5. Conclusions

This study employed machine learning techniques to screen potential antimicrobial peptides from sludge microbiota. By integrating deep learning and bioinformatics, a novel antimicrobial peptide screening method incorporating an attention mechanism model was proposed. This approach identified antimicrobial peptide candidates with potential therapeutic effects, offering a new perspective for combating multidrug-resistant bacterial infections. It also addresses limitations of traditional screening methods, improving screening efficiency and accuracy.

Results validated through molecular docking and molecular dynamics simulations indicated that the screened peptide segments LLPRLLARRY, GVREIHGLNPGGCLHTVRLVCRR, and FRTTLAPHVLTRLLAPCW exhibited high binding affinity and good stability with the target protein, suggesting potential antimicrobial activity. These findings provide robust support for the future development of antimicrobial drugs. However, while this study has made significant progress in the discovery and identification of antimicrobial peptides, the ultimate clinical application of these findings remains a long-term and complex process. It will require interdisciplinary collaboration and sustained effort to ensure that these newly discovered antimicrobial peptides can be safely and effectively applied to human health.

Author Contributions

Conceptualization, Y.-X.L.; methodology, X.-B.J.; software, H.-J.M.; validation, C.-M.X.; formal analysis, H.-S.L.; investigation, Z.-M.L.; resources, Q.W.; data curation, X.-B.J.; writing—original draft preparation, Y.-X.L.; writing—review and editing, C.-M.X.; visualization, Y.-X.L.; supervision, H.-S.L.; project administration, C.-M.X.; funding acquisition, H.-J.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China, No. 62173007.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Blair, J.M.A.; Webber, M.A.; Baylay, A.J.; Ogbolu, D.O.; Piddock, L.J.V. Molecular mechanisms of antibiotic resistance. Nat. Rev. Microbiol. 2015, 13, 42–51.
2. Tarín-Pelló, A.; Suay-García, B.; Pérez-Gracia, M.-T. Antibiotic resistant bacteria: Current situation and treatment options to accelerate the development of a new antimicrobial arsenal. Expert Rev. Anti Infect. Ther. 2022, 20, 1095–1108.
3. Zhang, Q.Y.; Yan, Z.B.; Meng, Y.M.; Hong, X.Y.; Shao, G.; Ma, J.J.; Cheng, X.R.; Liu, J.; Kang, J.; Fu, C.Y. Antimicrobial peptides: Mechanism of action, activity and clinical potential. Mil. Med. Res. 2021, 8, 48.
4. Kardani, K.; Bolhassani, A. Antimicrobial/anticancer peptides: Bioactive molecules and therapeutic agents. Immunotherapy 2021, 13, 669–684.
5. Ji, S.; An, F.; Zhang, T.; Lou, M.; Guo, J.; Liu, K.; Zhu, Y.; Wu, J.; Wu, R. Antimicrobial peptides: An alternative to traditional antibiotics. Eur. J. Med. Chem. 2023, 265, 116072.
6. Lertampaiporn, S.; Vorapreeda, T.; Hongsthong, A.; Thammarongtham, C. Ensemble-AMPPred: Robust AMP Prediction and Recognition Using the Ensemble Learning Method with a New Hybrid Feature for Differentiating AMPs. Genes 2021, 12, 137.
7. Wani, M.A.; Garg, P.; Roy, K.K. Machine learning-enabled predictive modeling to precisely identify the antimicrobial peptides. Med. Biol. Eng. Comput. 2021, 59, 2397–2408.
8. Xu, J.; Li, F.; Leier, A.; Xiang, D.; Shen, H.-H.; Lago, T.T.M.; Li, J.; Yu, D.-J.; Song, J. Comprehensive assessment of machine learning-based methods for predicting antimicrobial peptides. Briefings Bioinform. 2021, 22, bbab083.
9. Yaseen, A.; Gull, S.; Akhtar, N.; Amin, I.; Minhas, F. HemoNet: Predicting hemolytic activity of peptides with integrated feature learning. J. Bioinform. Comput. Biol. 2021, 19, 2150021.
10. Bhadra, P.; Yan, J.; Li, J.; Fong, S.; Siu, S.W.I. AmPEP: Sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Sci. Rep. 2018, 8, 1697.
11. Li, C.; Warren, R.L.; Birol, I. Models and data of AMPlify: A deep learning tool for antimicrobial peptide prediction. BMC Res. Notes 2023, 16, 1–4.
12. Singh, O.; Hsu, W.-L.; Su, E.C.-Y. Co-AMPpred for in silico-aided predictions of antimicrobial peptides by integrating composition-based features. BMC Bioinform. 2021, 22, 389.
13. Wang, H.; Zhao, J.; Zhao, H.; Li, H.; Wang, J. CL-ACP: A parallel combination of CNN and LSTM anticancer peptide recognition model. BMC Bioinform. 2021, 22, 1–22.
14. Dee, W. LMPred: Predicting antimicrobial peptides using pre-trained language models and deep learning. Bioinform. Adv. 2022, 2, vbac021.
15. Hussain, W. sAMP-PFPDeep: Improving accuracy of short antimicrobial peptides prediction using three different sequence encodings and deep neural networks. Brief. Bioinform. 2021, 23, bbab487.
16. Zhang, Y.; Lin, J.; Zhao, L.; Zeng, X.; Liu, X. A novel antibacterial peptide recognition algorithm based on BERT. Brief. Bioinform. 2021, 22, bbab200.
17. Distasio, R.A., Jr.; Von Lilienfeld, O.A.; Tkatchenko, A. Collective many-body van der Waals interactions in molecular systems. Proc. Natl. Acad. Sci. USA 2012, 109, 14791–14795.
18. Kumar, S.; Nussinov, R. Salt bridge stability in monomeric proteins. J. Mol. Biol. 1999, 293, 1241–1255.
19. Meuzelaar, H.; Vreede, J.; Woutersen, S. Influence of Glu/Arg, Asp/Arg, and Glu/Lys salt bridges on α-helical stability and folding kinetics. Biophys. J. 2016, 110, 2328–2341.
20. Sowole, M.A.; Konermann, L. Effects of Protein–Ligand Interactions on Hydrogen/Deuterium Exchange Kinetics: Canonical and Noncanonical Scenarios. Anal. Chem. 2014, 86, 6715–6722.
21. Zhao, H.; Huang, D. Hydrogen Bonding Penalty upon Ligand Binding. PLoS ONE 2011, 6, e19923.
22. Ma, B.; Kumar, S.; Tsai, C.J.; Wolfson, H.; Sinha, N.; Nussinov, R. Protein–Ligand Interactions: Induced Fit. In Encyclopedia of Life Sciences; Wiley: Hoboken, NJ, USA, 2008.
23. Zhou, L.; Lian, K.; Wang, M.; Jing, X.; Zhang, Y.; Cao, J. The antimicrobial effect of a novel peptide LL-1 on Escherichia coli by increasing membrane permeability. BMC Microbiol. 2022, 22, 220.
24. Condic-Jurkic, K.; Subramanian, N.; Mark, A.E.; O’Mara, M.L. The reliability of molecular dynamics simulations of the multidrug transporter P-glycoprotein in a membrane environment. PLoS ONE 2018, 13, e0191882.
25. Wang, G.; Vaisman, I.I.; Van Hoek, M.L. Machine learning prediction of antimicrobial peptides. In Computational Peptide Science: Methods and Protocols; Springer: Berlin/Heidelberg, Germany, 2022; pp. 1–37.
26. Plisson, F.; Ramírez-Sánchez, O.; Martínez-Hernández, C. Machine learning-guided discovery and design of non-hemolytic peptides. Sci. Rep. 2020, 10, 16581.
27. Jin, X.-B.; Wang, Z.-Y.; Kong, J.-L.; Bai, Y.-T.; Su, T.-L.; Ma, H.-J.; Chakrabarti, P. Deep Spatio-Temporal Graph Network with Self-Optimization for Air Quality Prediction. Entropy 2023, 25, 247.
28. Ma, Y.; Guo, Z.; Xia, B.; Zhang, Y.; Liu, X.; Yu, Y.; Tang, N.; Tong, X.; Wang, M.; Ye, X.; et al. Identification of antimicrobial peptides from the human gut microbiome using deep learning. Nat. Biotechnol. 2022, 40, 921–931.
29. Wang, Y.; Wang, L.; Li, C.; Pei, Y.; Liu, X.; Tian, Y. AMP-EBiLSTM: Employing novel deep learning strategies for the accurate prediction of anti-microbial peptides. Front. Genet. 2023, 14, 1232117.
30. Wei, B.; Zhang, Y.; Gong, X. DeepLPI: A novel deep learning-based model for protein–ligand interaction prediction for drug repurposing. Sci. Rep. 2022, 12, 18200.
31. Yan, J.; Bhadra, P.; Li, A.; Sethiya, P.; Qin, L.; Tai, H.K.; Wong, K.H.; Siu, S.W. Deep-AmPEP30: Improve Short Antimicrobial Peptides Prediction with Deep Learning. Mol. Ther. Nucleic Acids 2020, 20, 882–894.
32. Aliye, M.; Dekebo, A.; Tesso, H.; Abdo, T.; Eswaramoorthy, R.; Melaku, Y. Molecular docking analysis and evaluation of the antibacterial and antioxidant activities of the constituents of Ocimum cufodontii. Sci. Rep. 2021, 11, 10101.
33. Amer, H.H.; Eldrehmy, E.H.; Abdel-Hafez, S.M.; Alghamdi, Y.S.; Hassan, M.Y.; Alotaibi, S.H. Antibacterial and molecular docking studies of newly synthesized nucleosides and Schiff bases derived from sulfadimidines. Sci. Rep. 2021, 11, 17953.
34. Zhao, X.; Kuipers, O.P. Identification and classification of known and putative antimicrobial compounds produced by a wide variety of Bacillales species. BMC Genom. 2016, 17, 882.
35. Martínez-Núñez, M.A.; López, V.E.L.Y. Nonribosomal peptides synthetases and their applications in industry. Sustain. Chem. Process. 2016, 4, 13.

Figure 1. The structure of the combined AMP screening models. Gene sequences are encoded and predicted using both LSTM and ALSTM. The sigmoid function serves as the activation function, the linear function is employed, and the “AND” operation is used for element-wise conjunction to obtain the results.
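
The “AND” combination described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes each model emits one sigmoid probability per peptide and that a peptide is called positive at a threshold of 0.5; the threshold, the function name `combine_predictions`, and the example probabilities are illustrative assumptions, not values taken from this study.

```python
def combine_predictions(lstm_probs, alstm_probs, threshold=0.5):
    """Element-wise AND of two models' thresholded sigmoid outputs.

    A peptide is kept only if BOTH models score it at or above the
    threshold. The 0.5 default is an assumption for illustration.
    """
    if len(lstm_probs) != len(alstm_probs):
        raise ValueError("both models must score the same peptides")
    return [
        (p_lstm >= threshold) and (p_alstm >= threshold)
        for p_lstm, p_alstm in zip(lstm_probs, alstm_probs)
    ]


# Hypothetical sigmoid outputs for four peptides (not data from the paper).
lstm_probs = [0.91, 0.72, 0.30, 0.88]
alstm_probs = [0.97, 0.41, 0.85, 0.93]

combined = combine_predictions(lstm_probs, alstm_probs)
print(combined)  # [True, False, False, True]
```

Requiring agreement between both models trades some recall for fewer false positives, which suits a screening step where each retained candidate is expensive to validate.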

Figure 2. The experimental workflow for deep learning and biological validation. A large dataset of positive and negative samples was collected and divided into training and test sets. An attention-based LSTM (ALSTM) model was developed for processing sequence data to predict the antimicrobial activity of peptides. Post-training, the model’s efficacy was validated with the test set, and molecular docking simulations further verified the interactions between the peptides and the target proteins.

Figure 3. ROC curves for LSTM, ALSTM, and fusion models. The blue curve represents LSTM with an AUC of 0.95, the orange curve represents ALSTM with an AUC of 0.99, and the green curve represents the merged model with an AUC of 1.00. The dashed line represents the line y = x. The points on the line represent the results of a classifier using a random guessing strategy, included as a reference to highlight model performance in the figure.
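
As a rough illustration of what the AUC values in Figure 3 measure, the sketch below computes AUC as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, which is equivalent to the area under the ROC curve. The function name `roc_auc` and the toy labels and scores are assumptions made for this example; they are not the study's data or code.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a random positive > score of a random negative).

    Ties count as one half. This pairwise form is quadratic in the sample
    count, so it is only meant for small illustrative inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): three positives followed by three negatives.
labels = [1, 1, 1, 0, 0, 0]
perfect = roc_auc(labels, [0.9, 0.8, 0.7, 0.3, 0.2, 0.1])   # fully separated
constant = roc_auc(labels, [0.5] * 6)                       # uninformative
mixed = roc_auc(labels, [0.9, 0.4, 0.7, 0.6, 0.2, 0.1])     # one inversion
print(perfect, constant, round(mixed, 3))
```

A fully separating scorer reaches 1.0, while a scorer that cannot distinguish the classes sits at 0.5, the level of the dashed y = x reference line.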

Figure 4. Confusion matrices for LSTM, ALSTM, and fusion models. (a) ALSTM, (b) Combined, (c) LSTM. The four parts of the matrix represent True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
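
The four confusion-matrix cells named in the Figure 4 caption determine the accuracy, precision, and recall values of the kind reported in Table 2. The sketch below shows the standard definitions; the function name `classification_metrics` and the counts are hypothetical, chosen only to make the arithmetic easy to follow, and are not the counts from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, and recall from confusion-matrix counts.

    Assumes at least one predicted positive (tp + fp > 0) and at least
    one actual positive (tp + fn > 0), so no denominator is zero.
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all samples classified correctly
        "precision": tp / (tp + fp),     # share of predicted positives that are real
        "recall": tp / (tp + fn),        # share of real positives that are found
    }


# Hypothetical counts, not taken from the paper.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```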

Figure 5. Visualization of molecular docking for LLPRLLARRY, GVREIHGLNPGGCLHTVRLVCRR, and FRTTLAPHVLTRLLAPCW. (a) Receptor–ligand complex, (b) Receptor–ligand complex (zoomed-in view), (c) Ligands (antimicrobial peptide candidate), (d) Key interactions in the receptor–ligand binding.

Figure 6. Molecular dynamics simulation results for LLPRLLARRY, GVREIHGLNPGGCLHTVRLVCRR, and FRTTLAPHVLTRLLAPCW. The red dashed line represents the RMSD plot, and the blue dashed line represents the RMSF plot.

Table 1. The dataset used for model construction.

Dataset	Positive	Negative
Training set	4370	5132
Test set	2211	2543

Table 2. Model evaluation parameters.

Model	Accuracy	Precision	Recall	F1 Score
LSTM	0.850	0.784	0.966	0.866
ALSTM	0.952	0.992	0.912	0.950
Combined	1	1	1	1
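
As a quick consistency check on Table 2, the F1 score can be recomputed from the reported precision and recall using the standard harmonic-mean definition. The sketch below copies the precision, recall, and F1 values from the table; the function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision, recall, and reported F1 copied from Table 2.
table2 = {
    "LSTM": {"precision": 0.784, "recall": 0.966, "f1": 0.866},
    "ALSTM": {"precision": 0.992, "recall": 0.912, "f1": 0.950},
    "Combined": {"precision": 1.0, "recall": 1.0, "f1": 1.0},
}

recomputed = {
    name: round(f1_score(row["precision"], row["recall"]), 3)
    for name, row in table2.items()
}
for name, value in recomputed.items():
    print(f"{name}: reported {table2[name]['f1']:.3f}, recomputed {value:.3f}")
```

The recomputed values agree with the reported F1 scores to three decimal places.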

Table 3. Short peptide prediction results.

Prediction Sequence	Systematics
LLPRLLARRY	Clostridiales
FRVPLAPYVLPPLLARC	Fervidobacterium
GVREIHGLNPGGCLHTVRLVCRR	Caloramator sp. E03
IRTTLPPYVFPRLLARCW	Actinobacteria
FRITPSPHVLPPLRGRVC	Peptostreptococcaceae
FRITLTPHVLPRLLARS	Romboutsia
FRLTFRTHVLPRPLGRC	Betaproteobacteria
GLLHTRGIAGSGLRPLSKIPHCCRP	Gammaproteobacteria
RLRRWPCKSCVKTPGSTREVPGKPAGWCAA	Thermosipho africanus TCF52B
FRITLATYVLRRLLLPCS	Turicibacter sp. H121
FRSTLAPYGLPRLLGRC	Tepidanaerobacter acetatoxydans Re1
FRTTLAPHVLTRLLAPCW	Synergistaceae
RLHPLCYRGCWHRVSRCLFCE	Acinetobacter
GLHHSRGMAGSGLPPLSNIPHCSHP	Bradyrhizobiaceae
FRTTLAPYVFPRLLHRS	Rubrobacter xylanophilus
FRTTLAPYVLPRLLARCW	Nocardioides sp. CF8
GLLHSRGIAGSGLPPLSNIPHCSLP	Defluviicoccus vanus

Table 4. Molecular docking data sheet.

Model	Affinity (kcal/mol)	Dist. from Best Mode, RMSD l.b.	Dist. from Best Mode, RMSD u.b.
1	−7.4	14.129	20.562
2	−6.2	2.168	3.017
3	−7.3	1.326	2.133
4	−5.9	3.126	10.883
5	−7.0	2.274	4.324
6	−6.9	3.077	6.023
7	−6.9	7.883	17.806
8	−6.3	3.657	11.755
9	−7.0	2.395	5.414
10	−7.0	12.373	20.499
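
To make the reading of Table 4 concrete, the sketch below ranks the ten docking models by affinity, where a more negative binding free energy indicates stronger predicted binding. The rows are copied from the table; the tuple layout and variable names are assumptions made for this illustration.

```python
# (model, affinity in kcal/mol, RMSD lower bound, RMSD upper bound), from Table 4.
poses = [
    (1, -7.4, 14.129, 20.562),
    (2, -6.2, 2.168, 3.017),
    (3, -7.3, 1.326, 2.133),
    (4, -5.9, 3.126, 10.883),
    (5, -7.0, 2.274, 4.324),
    (6, -6.9, 3.077, 6.023),
    (7, -6.9, 7.883, 17.806),
    (8, -6.3, 3.657, 11.755),
    (9, -7.0, 2.395, 5.414),
    (10, -7.0, 12.373, 20.499),
]

# Most negative affinity first; Python's sort is stable, so ties keep table order.
ranked = sorted(poses, key=lambda pose: pose[1])
best = ranked[0]

print(f"best model: {best[0]} at {best[1]} kcal/mol")
for model, affinity, rmsd_lb, rmsd_ub in ranked[:3]:
    print(model, affinity, rmsd_lb, rmsd_ub)
```

Model 1 has the strongest affinity (−7.4 kcal/mol), followed closely by model 3 (−7.3 kcal/mol), which also has the smallest RMSD values in the table.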

Table 5. Comparison of the proposed method with other antimicrobial peptide screening methods.

Methods	Major Contribution	Recall	AUC
Machine learning [28]	The synthesis yielded 216 peptides from a pool of 241 segments, revealing that 181 of them possessed antimicrobial activity.	83.80%	×
Deep-AmPEP30 [31]	A new method, Deep-AmPEP30, has been proposed for predicting short-chain (≤30 amino acids) antimicrobial peptides (AMPs). This approach combines the optimal feature set reduced by PseKRAAC amino acid composition and convolutional neural networks. The genome sequences of gut commensal yeast pseudohyphae were screened, leading to the discovery of a peptide comprising 20 amino acids.	×	0.85
AMP-EBiLSTM [29]	A deep learning strategy named AMP-EBiLSTM has been proposed for the accurate prediction of antimicrobial peptides. In the realms of deep learning and ensemble learning, the authors effectively utilized binary profile features (BPF) and pseudo-amino acid composition (PSEAAC) to capture local sequences and extract amino acid information.	92.39%	0.97
DeepLPI [30]	With an AUROC of 0.857 on the BindingDB dataset and an AUROC of 0.925 on the Davis dataset, the authors introduced a novel deep learning-based model for predicting protein–ligand interactions. This model is particularly suitable for drug repurposing and is primarily composed of a one-dimensional convolutional neural network (1D CNN) based on ResNet and a bidirectional long short-term memory network (biLSTM).	×	0.925
Deep learning	The application of the ALSTM model, which combines the LSTM model with an attention mechanism, successfully predicted and screened antimicrobial peptides, ensuring high accuracy.	96.60%	0.99

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Liu, Y.-X.; Jin, X.-B.; Xu, C.-M.; Ma, H.-J.; Wu, Q.; Liu, H.-S.; Li, Z.-M. Antimicrobial Peptide Screening from Microbial Genomes in Sludge Based on Deep Learning. Appl. Sci. 2024, 14, 1936. https://doi.org/10.3390/app14051936
