Next Article in Journal
Whole-Genome Sequencing Data Reveal New Loci Affecting Milk Production in German Black Pied Cattle (DSN)
Previous Article in Journal
Efficient Selection of Gaussian Kernel SVM Parameters for Imbalanced Data
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

A Grid Search-Based Multilayer Dynamic Ensemble System to Identify DNA N4—Methylcytosine Using Deep Learning Approach

1
Department of Computer Science and Engineering, Jagannath University, Dhaka 1100, Bangladesh
2
School of Information Technology, Deakin University, Geelong 3125, Australia
3
Department of Computer Science and Engineering, Hamdard University Bangladesh, Munshiganj 1510, Bangladesh
4
Department of Computer Science and Engineering, Eastern University, Dhaka 1345, Bangladesh
5
School of IT, Melbourne Institute of Technology, Melbourne 3000, Australia
*
Authors to whom correspondence should be addressed.
Genes 2023, 14(3), 582; https://doi.org/10.3390/genes14030582
Submission received: 28 January 2023 / Revised: 17 February 2023 / Accepted: 18 February 2023 / Published: 25 February 2023
(This article belongs to the Topic Complex Systems and Artificial Intelligence)

Abstract

:
DNA (Deoxyribonucleic Acid) N4-methylcytosine (4mC), a kind of epigenetic modification of DNA, is important for modifying gene functions, such as protein interactions, conformation, and stability in DNA, as well as for the control of gene expression throughout cell development and genomic imprinting. This simply plays a crucial role in the restriction–modification system. To further understand the function and regulation mechanism of 4mC, it is essential to precisely locate the 4mC site and detect its chromosomal distribution. This research aims to design an efficient and high-throughput discriminative intelligent computational system using the natural language processing method “word2vec” and a multi-configured 1D convolution neural network (1D CNN) to predict 4mC sites. In this article, we propose a grid search-based multi-layer dynamic ensemble system (GS-MLDS) that can enhance existing knowledge of each level. Each layer uses a grid search-based weight searching approach to find the optimal accuracy while minimizing computation time and additional layers. We have used eight publicly available benchmark datasets collected from different sources to test the proposed model’s efficiency. Accuracy results in test operations were obtained as follows: 0.978, 0.954, 0.944, 0.961, 0.950, 0.973, 0.948, 0.952, 0.961, and 0.980. The proposed model has also been compared to 16 distinct models, indicating that it can accurately predict 4mC.

1. Introduction

DNA methylation is an epigenetic modification in which chromatin structure, DNA orientation, DNA integrity, inactivation of the X chromosome, regulation of gene expression, cell differentiation, cancer development, and DNA–protein interactions are altered, keeping the original gene sequence unmodified. This modification plays a significant role in developmental and pathological processes, such as aging, carcinogenesis, genomic imprinting, transposable elements repression, X chromosome inactivation, etc. [1,2,3]. On the other hand, changes in DNA methylation cause several illnesses including tumorigenesis, abnormalities of imprinting, cardiovascular diseases, autoimmune diseases, neurological diseases, cancer, etc. [1,4]. N4-Methylcytosine (4mC) occurs at the C4 position of cytosine in both prokaryotic and eukaryotic cells. It is responsible for participation in the restriction–modification system to provide a bacterial immune response against occupied DNA, as well as DNA repair, expression, or replication. As for experimental methods and data, current knowledge about other biological functions of 4mC is insufficient, and the dataset used to identify 4mC and studies related to 4mC is limited [5]. Although there are several approaches to identifying 4mC methylation, such as mass spectrometry, methylation-precise PCR, single-molecule real-time (SMRT) sequencing, and 4mC-Tet-assisted bisulfite, those experiments on the functional or biological behavior of 4mC sites are expensive, time-consuming, and fail to identify 4mC areas with time efficiency when applied to large sequencing data [5,6,7,8]. Recently, machine learning approaches have been applied to identify 4mC sites with promising outcomes. Machine learning-based approaches have enhanced 4mC identification research and have been highly successful in predicting 4mC sites. The success of machine learning-based techniques (i.e., their predictive power) in distinguishing 4mC sites from non-4mC sites is highly dependent on the quality of the extracted features [1]. Deep learning is also used extensively in studying proteins, DNA sequences, and RNA sequences as a powerful and popular tool in machine learning [9]. Ensemble methods in machine learning have also been used in several studies to detect 4mC sites. Although the state-of-the-art methods have continuously yielded promising findings, their lack of generalizability necessitates the creation and development of new prediction algorithms to accurately detect 4mC sites. Therefore, a reliable prediction model for detecting 4mC sites at a large scale in a gene is highly desirable to fully understand the biological role of 4mC, as 4mC may play a supporting role in genetic stabilization, regeneration, and development. The following summarizes the aims and major contributions of this research work to achieving the objectives:
  • To develop a grid search-based weighted average ensemble (WAE) system for identifying DNA N4-Methylcytosine where learned information is transmitted along from a layer to the next (details are provided in Section 4).
  • The model applies multi-configuration 1D CNN to perform the ensemble method in each layer to improve prediction efficiency and find the best results, in addition to transferring learning knowledge from one layer to another.
  • A grid search-based weighted average ensemble (WAE) technique has been applied to find the optimal accuracy at each layer.
  • We trained the model with eight large publicly accessible datasets and compared its performance with the existing models.
The remaining paper is organized as follows. Section 2 describes tasks related to the determination of DNA N4-methylcytosine, existing methods, and available techniques. In Section 3, we give details about our proposed methodology. Section 4 contains the experimental results, as well as a discussion of the proposed model. Section 5 ends with a conclusion and future work.

2. Literature Review

To identify 4mC sites, several experimental studies have been conducted. We evaluated existing works in this section by analyzing these related works, and the limitations of these works are identified and documented in depth below.
S. Zhang et al. (2022) [10] proposed a multi-source feature and gradient boosting decision tree-based identification model of DNA N4-methylcytosine sites. They proceeded by extracting features from original sequences using multi-source feature representation methods, which include the mononucleotide binary and K-mer frequency, dinucleotide binary and position-specific frequency, ring-function hydrogen-chemical properties, dinucleotide-based DNA properties, and trinucleotide-based DNA properties. Following this, a gradient boosting decision tree was used to find the optimal feature set and to remove redundant information. Finally, a support vector machine was utilized to predict whether a site was 4mC or not.
Yu. Lezheng et al. (2022) [11] proposed a convolutional recurrent neural network model to identify DNA N4-methylcytosine. For representing DNA sequences, they considered one-hot and dictionary encoding methods. Following that, three representative deep learning algorithms with different network architectures for classification were chosen, including a convolutional neural network (CNN), a recurrent neural network (RNN), and a recurrent neural network (RNN) with bidirectional long-short-term memory cells (BiLSTM).
L. Wang et al. (2022) [12] proposed 4mCPred-FSVM, a tool that uses a fuzzy model to identify DNA N4-methylcytosine sites. They used position-specific trinucleotide propensity to construct the feature vectors, which were then fed into the FSVM (fuzzy support vector machine) to build the final model.
J. Jin et al. (2022) [13] proposed a deep learning model for predicting DNA N4-methylcytosine sites in the mouse genome. This network architecture is composed of three major segments: (A) the adaptive embedding module, (B) the GRU module, and (C) the classification module. The adaptive embedding module was created to tokenize an original DNA sequence and obtain appropriate embeddings adaptively using the encoding matrix. Following that, the GRU module was then utilized to extract long and short distant information in sequence. The classification module used a max-pooling layer to determine which feature in each GRU unit is the most relevant or important.
J. Khanal et al. (2021) [3] proposed a deep learning model to identify DNA N4-methylcytosine sites in the Rosaceae genome, relying on distributed feature representation. In the first stage, the K-mer (k = 3) feature encoding technique was used to represent each DNA sequence as a fixed length of words. Then, the ‘word2vec’ embedding method was applied to map each word to its corresponding vector form. The authors used convolutional neural networks (CNN) in the second stage to classify 4mCs and non-4mCs based on selected features.
A. Wahab et al. (2021) [5] proposed an identification model of N4-methylcytosine using deep learning, and natural language processing. The K-mer (k = 2, 3, 4) feature encoding technique was used to transfer the combination of nucleotides (A, C, G, T) into the sequence of words in this model. The ‘word2vec’ embedding method was applied to present each word in the vector form. A continuous bag-of-word (CBOW) approach was employed to train the word2vec model. Finally, a CNN was used to classify the 4mC and non-4mC sites. Three different k-fold cross-validations (three-fold, five-fold, and ten-fold) were applied to carry out the preeminent identification.
G. Fang et al. (2021) [6] proposed a word2vec based deep learning network to predict DNA N4-methylcytosine sites. Using the word2vec embedding method, each feature derived from K-mers (K = K = 3, 4, 5, 6) was presented in vector form. The feature matrix sequence was then fed into 3-CNN to categorize the 4mCs and non-4mCs.
M. Tahir et al. (2021) [9] proposed an intelligent and robust computational prediction model for DNA N4-methylcytosine sites via natural language processing. The authors examined two input sequence representations. The initial encoding method was one-hot encoding, which encoded each DNA sequence with (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), and (0, 0, 0, 1) for A, C, G, and T, respectively. The distributed feature representation learnt by the word2vec model, on the other hand, was the second encoding approach. Then, new word representation was fed into the convolution neural network (CNN) model. The CNN model has various layers, including a convolution layer, pooling layer, ReLU layer, normalization layer, dropout layer, fully connected layer, etc.
H. Zulfiqar et al. (2021) [14] proposed Deep-4mCW2V, where a sequence-based predictor was used to identify N4-methylcytosine sites. In this model, the training data were first converted into numerical feature vectors using the “word2vec” technique. Following this, the feature vectors were fed into the 1-D CNN for classification. A 10-fold CV was applied to split the entire dataset into 10 groups of relatively equal size.
D.Y. Lim et al. (2021) [15] developed the iRG-4mC tool, which is based on the CNN-LSTM model. The DNA sequence was encoded using a combination of one-hot encoding and nucleotide chemical properties in the proposed system (NCP). The final sequence was transmitted into the LSTM for feature optimization, and three fully connected layers were used for the final outcome.
M.M. Hasan et al. (2020) [8] proposed a model named i4mC-Mouse to identify DNA N4-methylcytosine sites in the mouse genome using multiple encoding schemes. The authors used six encoding techniques, namely K-space nucleotide composition (KSNC), K-mer nucleotide composition (K-mer), mononucleotide binary encoding (MBE), dinucleotide binary encoding (DBE), electron–ion interaction pseudopotential (EIIP), and dinucleotide physicochemical composition (DPC), to represent a DNA sequence as fixed-length feature vectors. The WR feature selection method was used to remove the noisy feature. Finally, five machine learning classifiers were employed to differentiate between 4mC and non-4mC sites, namely random forest (RF), Bayesian network (NB), support vector machine (SVM), K-nearest neighbor (KNN), and AdaBoost (AB). A 10-fold cross-validation was applied to partition the whole data into training and testing sets.
A. Wahab et al. (2020) [16] proposed a model named DNC4mC-Deep to identify DNA N4-methylcytosine sites based on different encoding schemes by using deep learning. Deep learning techniques were used to classify a DNA sequence that was represented as a fixed length of feature vectors. To generate methylcytosine samples, six types of feature encoding methods were used: binary encoding (BE), DNC (2-mer), TNC (3-mer), multivariate mutual information (MMI), nucleotide chemical property (NCP), and nucleotide chemical property and nucleotide frequency (NCPNF).
Z. Zhao et al. (2020) [17] proposed a model of DNA N4-methylcytosine sites via boost-learning various types of sequence features. The DNA sequence was first encoded with one-hot binary (OHB), sequential nucleotide frequency (SNF), K-nucleotide frequency (KNF), K-spectrum nucleotide pair frequency (KSNPF), and PseDNC. Top-ranked features from the XGBoost training procedure were chosen. Finally, the chosen features were used to train the support vector machine (SVM) classification model. A K-fold cross-validation (K = 10) was used to split the dataset into training and test sets.
J. Khanal et al. (2019) [18] proposed an identification model of N4-methylcytosine sites in prokaryotes using a convolutional neural network. In this model, every DNA sequence was represented as a binary vector using a one-hot encoding technique. The K-fold cross-validation technique was used to split the datasets into the training and testing sets and, finally, CNN was employed to classify the 4mCs and non-4mCs sequences.
B. Manavalan et al. (2019) [19] proposed a sequence-based meta-predictor for accurate DNA 4mC site prediction using effective feature representation. They began by generating 56 probabilistic features using four ML algorithms: SVM, random forest [RF], gradient boosting [GB], and extremely randomized tree [ERT], as well as seven feature encodings: K-mer composition, binary profile [BPF], dinucleotide binary profile encoding [DPE], local position-specific dinucleotide frequency [LPDF], ring-function-hydrogen-chemical properties [RFHC], and dinucleotide physicochemical composition. They subsequently fed these probabilistic features into a SVM to construct a final prediction model.
Limitations: All the aforementioned models are unable to expand their current knowledge from their resources. We have proposed a multilayer dynamic system in which the learning knowledge flows from layer to layer, and every layer can achieve an optimal accuracy that enhances the overall model’s performance. The related studies covered in this section is summarized in Table 1 with respect to diverse attributes.

3. Materials and Methods

3.1. Dataset Construction

The benchmark and independent datasets in this paper are collected from different sources [3,5,6,8,9,16,17,18,19] to measure the efficiency of the proposed model for a fair comparison with current predictors. These datasets contain eight different species, namely Caenorhabditis elegans (C. elegans), Drosophila melanogaster (D. melanogaster), Arabidopsis thaliana (A. thaliana), Escherichia coli (E. coli), Geoalkalibacter subterraneus (G. subterraneus), Geobacter pickeringii (G. pickeringi), Fragaria vesca (F. vesca), and Rosa chinensis (R. chinensis). All samples are 41 bp long, with the 4mC site in the middle. The threshold of CD-HIT was adjusted at 80% in order to eliminate redundant sequences and to prevent the predictor from overfitting. As a result, the number of negative samples will outnumber the number of positive samples. An identical number of negative samples were picked at random from the eight species in order to equalize the positive and negative samples. Therefore, these datasets can be written in the following form:
S i = S i + S i   Where   i = 1 , 2 , 3 , 4 , ....... , 8
where S i represents the total number of positive and negative samples for the eight species. The eight positive datasets from the various species are included in the subsets S i + ( i = 1 , 2 , 3 , 4 , ....... , 8 ) ; S i   ( i = 1 , 2 , 3 , 4 , ....... , 8 ) contains the negative samples. In set theory, the union is represented by the symbol . Table 2 lists the specifics of the eight benchmark datasets.

3.2. Feature Representation

A DNA sequence is represented as a fixed length of feature vectors which can be classified by deep learning algorithms. We used the K-mer approach to break the genomic sequence into fixed-length words (3-mer). It is a standard feature encoding algorithm widely used in various prediction tasks. When K = 2, we call this method dinucleotide composition (DNC), and when K = 3, we call this method trinucleotide composition (TNC). In this study, we have used K = 3. Trinucleotide composition in K-mers is commonly used to predict DNA N4-methylcytosine (4mC) using deep learning approaches, because 4mC modification tends to occur in specific patterns within DNA sequences, and these patterns are often related to the surrounding trinucleotide context of the 4mC site. In K-mer analysis, the K-mer size is the length of the substring that is extracted from the DNA sequence for analysis. Trinucleotide composition specifically refers to K-mers of length 3, where the substring consists of three adjacent nucleotides. By using trinucleotide composition in K-mers of length 3, we can capture information about the surrounding nucleotides in a DNA sequence that are most relevant for predicting 4mC modification. This is because the presence of 4mC is known to be influenced by the nucleotides that immediately precede and follow the modification site. Therefore, analyzing trinucleotide composition allows the model to capture these important contextual features in the data. In addition, using trinucleotide composition in K-mers of length 3 also reduces the total number of K-mers that need to be considered in the analysis, compared to using larger K-mer sizes. This can improve the efficiency of the analysis and reduce the computational requirements for training the deep learning model. In TNC, all samples of 41 nt produce 39 components with the equation of L − k + 1. Here, L stands for the sequence length, and k stands for the K-mer value as an integer [11,12,13,14,15,16,17,18,19,20,21,22,23,24]. ATG, TGC, GCG, and CGA are the four 3-mers that can be tokenized from the DNA sequence “ATGCGA,” for instance.

3.3. Distributed Feature Representation

We converted each K-mer word into a 100-dimensional vector format in attempt to acquire discriminating information between the two classes. We used a word embedding approach known as word to vector (word2vec) to transfer the sequences into vector form. This technique generates an optimal set of feature vectors based on the distributional hypothesis. Each word (K-mer) in word2vec is represented by an n-dimensional vector. Here, (L-2) × n array shapes represent the length of each sequence. It is a two-layer neural network that processes text by vectorizing words. It receives input as a text corpus, and its output is feature vectors representing words in that corpus. This technique decreases computational complexity and reduces the noise, ultimately leading to improved performance in the resultant computational model [3,6,25,26]. Either the continuous bag-of-words (CBOW) approach or the skip-gram method can be used to apply the word2vec model. The current word ( w ( t ) ) or input is employed in the skip-gram model to predict the surrounding window of context words. The CBOW technique, in contrast, makes an attempt to infer the target word from its nearby (context) words [3]. A five-window CBOW model was developed using the following inputs:
k = 2 , k 0 2 w ( t + k )
The skip-gram is more practical and produces better results for infrequent words. We used CBOW for word2vec training since in our study we are interested in frequent words. Each word sequence (3-mer) was fed into a word2vec model with two layers. Each sequence of length L was represented by an array of form (L-2) 100, and each word had its own 100-dimension (D) vector representation. As an illustration, the words “ATG” and “TGC” were each represented as a 100-(D) vector of letters [ 0.14 1 , 0.32 2 , 0.33 3 , ........... , 0.11 100 ] and [ 0.32 1 , 0.14 2 , 0.31 3 , ........... , 0.23 100 ] , respectively. Table 3 provides a list of the word2vec training parameters. The abovementioned three steps are illustrated in Figure 1a.

3.4. Proposed Deep Learning Model

In the final step, the weighted average ensemble technique is adopted on the training and testing set to identify 4mC and minimize generalization error. It combines the predictions from multiple models, where the contribution of each model is weighted proportionally to its capability or skill. As the basic learning model for this work, we used a 1D CNN (deep feed forward neural network) that included several parameters to tune. Afterwards, the convolution neural network (CNN) model was fed the feature vectors as input. Convolution layer, pooling layer, activation layer, normalizing layer, dropout layer, fully connected layer, etc., are some of the layers in the CNN model. The training phase involves adjusting the hyper-parameters. The number of convolution layers, the number of filters within each convolution layer, the size of these filters, and the dropout rate are the hyper-parameters of each CNN. Table 4 displays the ranges of these parameters. Each CNN in the proposed approach is composed of a pair of one-dimensional convolution layers. Each basic learning model has 16 and 32, 32 and 42, and 42 and 64 filters in the first and second layers, respectively. Each model’s first and second layers have filters with sizes of 3 and 5, 5 and 7, and 7 and 9, respectively. A ReLU activation function is applied after each convolution layer. Each convolution layer was followed by a dropout layer with a rate of 0.7 in order to address the overfitting issue. Each model’s two convolution layers’ best features are given to a dropout layer with a probability of 0.5, 0.3, and 0.6 separately, followed by a fully connected layer with one node and a sigmoid function for prediction. Furthermore, the activation layer predicts whether a given DNA sample contains N4-methylcytosine or not, depending on the target class. This layer’s output is normalized to lie within the interval [0,1]. The detailed configurations of the proposed 4mCCNN model are shown in Table 5. This model utilizes a stratified prediction technique. Based on the same training dataset, the three models at each level categorize the same test set. After each layer’s classification is complete, the correctly classified data is added to the pre-training data and compared with the predefined target values in the original data set and inserted into the next layer. In the next iteration, the misclassified data is used as new test data. This process continues until all models have improved performance (TP and TN = 0). However, a complicated aspect of using weighted average ensemble is choosing the relative weight for each ensemble member to achieve the optimal accuracy in each layer. At each layer, a grid search method is used to find the optimal weights for the model, resulting in more “accurate” predictions. The whole approach is called a grid search-based multilayer dynamic system (GS-MLDS). These steps are illustrated in Figure 1b. The working procedure is demonstrated in Algorithm 1. The overall accuracy for GS-MLDS can be calculated using the following formula:
Number of train data in layer, i : Number of train data in layer ( i 1 ) + Number of TP, TN in layer ( i 1 ) .
Test data in layer, i : Number of FP, FN in layer ( i 1 ) .
Where i = layer number; i = 1, 2, 3, …, n.
T A = i = 1 n ( T P + T N ) i S T e s t ( S + + S )
where T A , ( T P + T N ) i , and S T e s t represent the total accuracy, total number of correctly classified data in each layer, and total number of testing data, respectively. Furthermore, S + , and S represent the total number of 4mC and non 4mC sites, respectively.
In Table 5, the Conv1D ( f , s , t ) is a one-dimensional convolution operator where f , s , and t stand for the number, size, and stride of filters, respectively. The operator Dropout ( p ) represents a dropout layer with a probability of p . Dense ( n ) is a densely connected layer with n nodes. Finally, the sigmoid function is then used to determine if a sequence has a 4mC site or not.
Algorithm 1: Proposed Algorithm
Input:
The DNA sequence dataset.
Output:
Classify 4mC and non-4mC
1. Begin
2. Remove the redundant sequences.
3. If total number of ( S i + ) total number of ( S i ) then
5.   Randomly select the equal number of ( S i + ) and ( S i )
6. Else
7.    S i = S i + S i   [Where i = 1 , 2 , 3 , 4 , ....... , 8 ]
8. End If
9. Transform each DNA sequence into a fixed length of word.
10. Convert each word into n-dimensional vector form.
11. Split the dataset into training ( S T r a i n ) and testing ( S T e s t ) set. Where ( S T r a i n ) > ( S T e s t ) . The ratio between S i +
   and S i of each data set is the same as in the entire dataset.
12. Tuning the base classifiers with numerous hyper-parameters.
13. Apply training set to fit each classifier.
14. Integrate the base classifiers and apply gird search to choose the relative weight for each ensemble member.
15. Predict new data.
16. For i = 2 to n do
17.    If TN and TP 0 then
18.      S T r a i n = S T r a i n   ( i 1 ) + Number of TP, TN in layer ( i 1 )
19.      S T e s t = Number of FP, FN in layer ( i 1 )   [Where i = layer number]
20.    Repeat Step 13 to 15.
21.   Else
22.    Calculate total accuracy using the following formula:
       T A = i = 1 n ( T P + T N ) i S T e s t ( S + + S )   [Where = 1, 2, 3,……., n]
23.   End If
24. End For
25. Stop

4. Result and Discussion

The effectiveness of a machine learning model is evaluated using a performance matrix. The “scikit-learn” library’s matrix module contains the methods required to compute performance evaluation metrics. The model’s performance is calculated using a confusion matrix. The confusion matrix produces four results based on the datasets: TP (true positive), TN (true negative), FP (false positive), and FN (false negative). The confusion matrix results of our proposed GS-MLDS are shown graphically in Figure 2a,b. The accuracy, precision, true positive rate, false positive rate, true negative rate, and false negative rate [18] are calculated by employing the following equations:
  Accuracy = 1 R + + R + R + + R
  Precision = R + R + + R +
  Sensitivity / True   Positive   Rate / Recall = 1 R + R +
  False   Positive   Rate = R + R + + R
  Specificity / True   Negative   Rate = 1 R + R
  False   Negative   Rate = R + R
where R + is the total number of 4mC investigated, while R + is the number of 4mC incorrectly classified as non-4mC sequences. Here, R is the total number of non-4mC investigated, while R + is the number of non-4mC incorrectly classified as 4mC.
True positive and true negative are represented as TP and TN in Figure 2a,b above, with the correct number of identified sequences related to 4mC and non-4mC, respectively. False positive and false negative denoted are as FP and FN, respectively, and indicate the false number of identified sequences for 4mC and non-4mC. We have split each dataset 80:20, where 80% is the training set and 20% is the testing set. The total number of 4mC sites correctly identified by this proposed model for C. elegans, D. melanogaster, A. thaliana, E. coli, G. subterraneus, G. pickeringi, F. vesca, and R. chinensis datasets are 226 out of 231, 337 out of 354, 379 out of 396, 75 out of 78, 177 out of 182, 112 out of 114, 852 out of 865, and 476 out of 485, respectively. The non-4mC sites correctly identified by this proposed model for the same datasets are 222 out of 231, 331 out of 354, 352 out of 396, 74 out of 78, 179 out of 181, 108 out of 114, 799 out of 864, and 461 out of 484, respectively.
ROC curves are a tool that aids in the understanding of probabilistic forecasts for binary (two-class) classifications of predictive modeling problems. The ROC curve demonstrates how well a classification model performs at different thresholds. ROC is a probability curve. TPR and FPR are used to plot the ROC curve, with TPR on the y-axis and FPR on the x-axis. Compared to the existing model, the proposed model outperforms it at classifying and testing datasets when the ROC curve has a lower x-axis and a higher y-axis value. The level or measurement of separability is represented by AUC. The entire two-dimensional region from (0, 0) to (1, 1) below the entire ROC is measured by AUC. A model with a higher AUC predicts 0 s as 0 s and 1 s as 1 s more frequently. The AUC ROC values are graphically presented in Figure 3.
In Figure 3, the blue dots represent the True Positive Rate (TPR) at different classification thresholds, while the orange dots represent the AUC of this proposed model. The proposed model obtained AUC values of 96.97, 94.35, 92.30, 95.51, 98.07, 96.49, 95.49, and 96.70 for the C. elegans, D. melanogaster, A. thaliana, E. coli, G. subterraneus, G. pickeringi, F. vesca, and R. chinensis datasets, respectively. That is, the proposed model has 96.97%, 94.35%, 92.30%, 95.51%, 98.07%, 96.49%, 95.49%, and 96.70% probability of distinguishing 4mC from non-4mC for each dataset, respectively. The results of other performance evaluation matrices are shown in Table 6.
The proposed predictor obtained accuracies greater than 90% for each dataset. This model enhances prediction accuracy through layer-by-layer classification techniques. To maximize the accuracy at each layer, we used a grid-based search technique to find the best variety of weights for each layer. For example, Table 7 demonstrate the classification performance of GS-MLDS in each layer. The precision values of this suggested model for each dataset are 96.17%, 93.61%, 89.59%, 94.93%, 98.88%, 94.91%, 92.91%, and 95.39%, respectively. A model’s accuracy specifies how many identified objects are genuinely relented. Precision can also be thought of as the likelihood that a randomly chosen item that is marked as “important” is a true positive, in addition to being a measurement of model performance. Recall and TNR values of more than 90% were obtained for each dataset using the proposed approach. The recall is determined as the proportion of positive samples that were correctly identified as positive to all positive samples. The recall measures how well the model can identify positive samples. The more positive samples that are identified, the larger the recall. This is often referred to as sensitivity or true positive rate. The specificity, also known as the true negative rate, is the likelihood of a negative test if it is actually negative. For each dataset, the corresponding false positive and true negative ratios are 3.89%, 6.49%, 11.11%, 5.12%, 1.10%, 5.26%, 7.52%, and 4.75%, respectively. For each dataset, the false negative and actual positive ratios are 2.16%, 4.80%, 4.29%, 3.8%, 2.7%, 1.75%, 1.5%, and 1.85%, respectively.
Based on experimental results [27], we mentioned, “The appropriately categorized data from one layer may be utilized as a new training set for the following layer. In circumstances, such as these, this training set can reveal new hidden patterns in incorrectly classified data. That is, appropriately identified input from the preceding layer contributes to the model gaining new knowledge”. Although the prior version of the approach gradually improves the accuracy in each layer based on an improved training set, one drawback of this method is that it does not provide guidance on how to achieve maximum accuracy in a layer. As a result, this is a time-consuming process due to the increased number of layers. Furthermore, our main concern in this research work is how to determine the maximum accuracy at each layer to limit the number of additional layers in order to improve the training set to extract new hidden patterns from its resources. We found that each layer’s best combination of weight sets can overcome this limitation. Other researchers commonly used these eight datasets. We have tested the accuracy and compared the proposed model with the existing models on the same independent datasets, as shown in Table 8. We directly submitted the independent datasets to the GS-MLDS. The proposed GS-MLDS yielded accuracies of 0.978, 0.954, 0.944, 0.961, 0.950, 0.973, 0.948, 0.952, 0.961, 0.953, and 0.980. The GS-MLDS outperformed other models with different ratios. GS-MLDS performs well due to the following factors: extending its learning knowledge from its resources to continue the classification process from one layer to another layer, using multi-configured 1D CNN to perform ensemble methods at each layer, and using the grid search technique to determine the optimal combination of weights for each layer.
The benefits of the model that we have developed are as follows. This model can carry out the classification process from one layer to another by expanding its learning knowledge. During the classification phase, the learning knowledge is improved from its internal resources. Each layer determines the ideal weighting combination to produce the best results. Finally, the predictions from multiple CNN models are combined to reduce the variance of predictions and reduce generalization error. The limitations of our proposed model are as follows. The computational time is high for a high dimensional dataset. Lots of training data is required. Furthermore, it cannot encode the position or orientation of a DNA sequence and, finally, it needs more memory space.

5. Conclusions and Future Work

DNA 4mC is a crucial epigenetic modification that causes various diseases and is also a restriction–modification system. Therefore, accurately identifying 4mC sites is an essential step towards understanding the exact biological functions. This work presented an influential computational model named GS-MLDS for predicting 4mC and non-4mC sites. We have proposed a reusable knowledge model that aids in the transmission of a previous layer’s knowledge to the next layer. This study finds the optimal accuracy of each layer by applying a grid-based search technique to find the best weight set combinations in each layer. This study has experimented with eight different types of datasets. With an 80:20 split between the training and testing sets, our proposed model attained accuracy levels of 0.978, 0.954, 0.944, 0.961, 0.950, 0.973, 0.978, 0.954, 0.948, 0.952, 0.953, and 0.980. The correct classification probability of 4mC and non-4mC of the proposed system has been pointed out by the AUC curve. That is, the proposed model has 96.97%, 94.35%, 92.30%, 95.51%, 98.07%, 96.49%, 95.49%, and 96.70% probability of distinguishing 4mC from non-4mC for each dataset, respectively. The suggested methodology has shown improved performance compared to other machine learning models. Our proposed approach outperformed the state-of-the-art predictors for each dataset in terms of detecting 4mC sites in both balanced and unbalanced class labels. The study presented in the paper may be useful for more pervasive bioinformatics applications. The goal of this research’s future work is to produce more semantic characteristics and create a model that has already been trained on a big dataset to address a similar issue.

Author Contributions

Conceptualization, R.K.H.; data curation, F.H. and N.J.; formal analysis, R.K.H., M.A.U., S.A., M.A.I., F.H., A.K. and A.A.; funding acquisition, R.K.H., M.N.U., M.A.U., S.A., M.A.I., F.H., N.J., A.K. and A.A.; investigation, M.N.U., M.A.U., S.A. and M.A.I.; methodology, R.K.H.; project administration, M.N.U.; resources, R.K.H., F.H. and N.J.; supervision, M.A.U.; validation, F.H.; visualization, S.A.; writing—original draft, R.K.H. and M.N.U.; writing—review and editing, M.N.U., M.A.U., S.A., M.A.I. and F.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by the Air Force Office of Scientific Research (Grant under number FA2386-20-1-4005), and Cyber Security, and Computing Technology Enhancement Lab, Jagannath University, Dhaka, Bangladesh.

Institutional Review Board Statement

Not applicable, as the publicly available dataset is used.

Informed Consent Statement

Not applicable.

Data Availability Statement

All datasets are available at the link below: https://github.com/rajib1346/GS-MLDS.git; and are available from the corresponding author upon request (2 November 2022).

Conflicts of Interest

The authors declare that they have no conflict of interest and no known competing financial interest or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Rehman, M.U.; Tayara, H.; Chong, K.T. DCNN-4mC: Densely connected neural network based N4-methylcytosine site prediction in multiple species. Comput. Struct. Biotechnol. J. 2021, 19, 6009–6019. [Google Scholar] [CrossRef] [PubMed]
  2. Manavalan, B.; Hasan, M.; Basith, S.; Gosu, V.; Shin, T.-H.; Lee, G. Empirical Comparison and Analysis of Web-Based DNA N4-Methylcytosine Site Prediction Tools. Mol. Ther. Nucleic Acids 2020, 22, 406–420. [Google Scholar] [CrossRef] [PubMed]
  3. Khanal, J.; Tayara, H.; Zou, Q.; Chong, K.T. Identifying DNA N4-methylcytosine sites in the rosaceae genome with a deep learning model relying on distributed feature representation. Comput. Struct. Biotechnol. J. 2021, 19, 1612–1619. [Google Scholar] [CrossRef] [PubMed]
  4. Kandi, V.; Vadakedath, S. Effect of DNA Methylation in Various Diseases and the Probable Protective Role of Nutrition: A Mini-Review. Cureus 2015, 7, e309. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  5. Wahab, A.; Tayara, H.; Xuan, Z.; Chong, K.T. DNA sequences performs as natural language processing by exploiting deep learning algorithm for the identification of N4-methylcytosine. Sci. Rep. 2021, 11, 1–9. [Google Scholar] [CrossRef]
  6. Fang, G.; Zeng, F.; Li, X.; Yao, L. Word2vec based deep learning network for DNA N4-methylcytosine sites identification. Procedia Comput. Sci. 2021, 187, 270–277. [Google Scholar] [CrossRef]
  7. Zulfiqar, H.; Sun, Z.-J.; Huang, Q.-L.; Yuan, S.-S.; Lv, H.; Dao, F.-Y.; Lin, H.; Li, Y.-W. Deep-4mCW2V: A sequence-based predictor to identify N4-methylcytosine sites in Escherichia coli. Methods 2021, 203, 558–563. [Google Scholar] [CrossRef] [PubMed]
  8. Hasan, M.; Manavalan, B.; Shoombuatong, W.; Khatun, M.; Kurata, H. i4mC-Mouse: Improved identification of DNA N4-methylcytosine sites in the mouse genome using multiple encoding schemes. Comput. Struct. Biotechnol. J. 2020, 18, 906–912. [Google Scholar] [CrossRef]
  9. Tahir, M.; Tayara, H.; Hayat, M.; Chong, K.T. Intelligent and robust computational prediction model for DNA N4-methylcytosine sites via natural language processing. Chemom. Intell. Lab. Syst. 2021, 217, 1–6. [Google Scholar] [CrossRef]
  10. Zhang, S.; Yao, Y.; Wang, J.; Liang, Y. Identification of DNA N4-methylcytosine sites based on multi-source features and gradient boosting decision tree. Anal. Biochem. 2022, 652, 1–10. [Google Scholar] [CrossRef]
  11. Yu, L.; Zhang, Y.; Xue, L.; Liu, F.; Chen, Q.; Luo, J.; Jing, R. Systematic Analysis and Accurate Identification of DNA N4-Methylcytosine Sites by Deep Learning. Front. Microbiol. 2022, 13, 1–16. [Google Scholar] [CrossRef] [PubMed]
  12. Wang, L.; Ding, Y.; Xu, J.; Lu, W.; Tang, J.; Guo, F. Identification of DNA N4-methylcytosine sites via fuzzy model on self representation. Appl. Soft Comput. 2022, 122, 108840. [Google Scholar] [CrossRef]
  13. Jin, J.; Yu, Y.; Wei, L. Mouse4mC-BGRU: Deep learning for predicting DNA N4 methylcytosine sites in mouse genome. Methods 2022, 204, 258–262. [Google Scholar] [CrossRef] [PubMed]
  14. Zulfiqar, H.; Khan, R.S.; Hassan, F.; Hippe, K.; Hunt, C.; Ding, H.; Song, X.-M.; Cao, R. Computational identification of N4-methylcytosine sites in the mouse genome with machine-learning method. Math. Biosci. Eng. 2021, 18, 3348–3363. [Google Scholar] [CrossRef] [PubMed]
  15. Lim, D.; Rehman, M.; Chong, K. iRG-4mC: Neural Network Based Tool for Identification of DNA 4mC Sites in Rosaceae Genome. Symmetry 2021, 13, 899. [Google Scholar] [CrossRef]
  16. Wahab, A.; Mahmoudi, O.; Kim, J.; Chong, K. DNC4mC-Deep: Identification and Analysis of DNA N4-Methylcytosine Sites Based on Different Encoding Schemes By Using Deep Learning. Cells 2020, 9, 1756. [Google Scholar] [CrossRef]
  17. Zhao, Z.; Zhang, X.; Chen, F.; Fang, L.; Li, J. Accurate prediction of DNA N4-methylcytosine sites via boost-learning various types of sequence features. BMC Genom. 2020, 21, 1–11. [Google Scholar] [CrossRef]
  18. Khanal, J.; Nazari, I.; Tayara, H.; Chong, K.T. 4mCCNN: Identification of N4-Methylcytosine Sites in Prokaryotes Using Convolutional Neural Network. IEEE Access 2019, 7, 145455–145461. [Google Scholar] [CrossRef]
  19. Manavalan, B.; Basith, S.; Shin, T.H.; Wei, L.; Lee, G. Meta-4mCpred: A Sequence-Based Meta-Predictor for Accurate DNA 4mC Site Prediction Using Effective Feature Representation. Mol. Ther. Nucleic Acids 2019, 16, 733–744. [Google Scholar] [CrossRef] [Green Version]
  20. Zulfiqar, H.; Huang, Q.-L.; Lv, H.; Sun, Z.-J.; Dao, F.-Y.; Lin, H. Deep-4mCGP: A deep learning approach to predict 4MC sites in Geobacter pickeringii by using correlation-based feature selection technique. Int. J. Mol. Sci. 2022, 23, 1251. [Google Scholar] [CrossRef]
  21. Hasan, M.; Manavalan, B.; Khatun, M.S.; Kurata, H. i4mC-ROSE, a bioinformatics tool for the identification of DNA N4-methylcytosine sites in the Rosaceae genome. Int. J. Biol. Macromol. 2019, 157, 752–758. [Google Scholar] [CrossRef] [PubMed]
  22. Yang, H.; Yang, W.; Dao, F.; Lv, H.; Ding, H.; Chen, W.; Lin, H. A comparison and assessment of computational method for identifying recombination hotspots in Saccharomyces cerevisiae. Brief. Bioinform. 2019, 21, 1568–1580. [Google Scholar] [CrossRef]
  23. Zhou, Y.; Zeng, P.; Li, Y.H.; Zhang, Z.; Cui, Q. SRAMP: Prediction of mammalian N6- methyladenosine (m6A) sites based on sequence-derived features. Nucleic Acids Res. 2016, 44, e91. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  24. Li, Y.; Zhao, Z.; Teng, Z. i4mC-EL: Identifying DNA N4-Methylcytosine Sites in the Mouse Genome Using Ensemble Learning. BioMed Res. Int. 2021, 2021, 1–11. [Google Scholar] [CrossRef]
  25. Tian, W.; Li, J.; Li, H. A Method of Feature Selection Based on Word2Vec in Text Categorization. In Proceedings of the 2018 37th Chinese Control Conference (CCC), Wuhan, China, 25–27 July 2018; pp. 9452–9455. [Google Scholar] [CrossRef]
  26. Muhammad, P.F.; Kusumaningrum, R.; Wibowo, A. Sentiment Analysis Using Word2vec And Long Short-Term Memory (LSTM) For Indonesian Hotel Reviews. Procedia Comput. Sci. 2021, 179, 728–735. [Google Scholar] [CrossRef]
  27. Uddin, M.N.; Halder, R.K. An ensemble method based multilayer dynamic system to predict cardiovascular disease using machine learning approach. Informatics Med. Unlocked 2021, 24, 100584. [Google Scholar] [CrossRef]
Figure 1. (a) (Unit A) Dataset contraction, feature representation, and distributed feature representation block.
Figure 1. (a) (Unit A) Dataset contraction, feature representation, and distributed feature representation block.
Genes 14 00582 g001a
Figure 1. (b) (Unit B (Classification Block)) Grid search-based multilayer dynamic system (GS-MLDS) to predict 4mC sites.
Figure 1. (b) (Unit B (Classification Block)) Grid search-based multilayer dynamic system (GS-MLDS) to predict 4mC sites.
Genes 14 00582 g001b
Figure 2. (a) Comparison between TP and FN, (b) Comparison between TN and FP.
Figure 2. (a) Comparison between TP and FN, (b) Comparison between TN and FP.
Genes 14 00582 g002aGenes 14 00582 g002b
Figure 3. ROC curve and AUC of the proposed model.
Figure 3. ROC curve and AUC of the proposed model.
Genes 14 00582 g003
Table 1. Brief description of related studies adopted from the Refs. [3,5,6,7,8,9,10,11,12,13,15,16,17,18,19].
Table 1. Brief description of related studies adopted from the Refs. [3,5,6,7,8,9,10,11,12,13,15,16,17,18,19].
AuthorsContributionsDatasetsPerformances
S. Zhang et al. (2022) [10]1. Applied three types of feature extraction methods to extract sequence information.
2. Used a gradient boosting decision tree (GBDT) to select the important features and remove redundant information.
3. Used six benchmark datasets to test the performance of the model.
C. elegans (3108 samples)Acc: 0.851, Mcc: 0.703, Sn: 0.872, Sp: 0.83
D. melanogaster (3538 samples)Acc: 0.859, Mcc: 0.717, Sn: 0.868, Sp: 0.849
A. thaliana (3956 samples) Acc: 0.801, Mcc: 0.602, Sn: 0.793, Sp: 0.81
E. coli (776 samples)Acc: 0.881, Mcc: 0.763, Sn: 0.886, Sp: 0.877
G. subterraneus (1812 samples)Acc: 0.859, Mcc: 0.719, Sn: 0.862, Sp: 0.856
G. pickeringi (1138 samples)Acc: 0.901, Mcc: 0.802, Sn: 0.898, Sp: 0.905
Yu. Lezheng et al. (2022) [11]1. Focused on three types of deep learning architectures: convolutional neural networks (CNNs), recurrent neural networks (RNNs), and convolutional recurrent neural networks (CNN-RNNs).
2. Analyzed several important factors, such as model architecture and its hyperparameters (the number of filters, kernel, pooling, and BiLSTM sizes, etc.), encoding methods, and attention mechanisms.
3. Introduced UMAP and Deep SHAP to better analyze and understand deep learning models.
C. elegans (3108 samples)Acc: 0.839, Mcc: 0.678, Recall: 0.84
D. melanogaster (3538 samples)Acc: 0.894, Mcc: 0.789, Recall: 0.903
A. thaliana (3956 samples)Acc: 0.874, Mcc: 0.749, Recall: 0.896
L. Wang et al. (2022) [12]1. Fuzzy support vectors were used to design the machine learning model.
2. Trinucleotide positional preference was used to convert the DNA sequence into a numerical vector that fully extracts the information from the benchmark dataset.
3. Analyzed model architecture and hyperparameters (for example, the number of filters, kernel, pooling, and BiLSTM sizes), encoding methods, and attention mechanisms.
4. Applied a grid search algorithm to find the best parameters.
C. elegans (3108 samples)Acc: 0.875, Mcc: 0.750, Sn: 0.873, Sp: 0.876
D. melanogaster (3538 samples)Acc: 0.871, Mcc: 0.743, Sn: 0.874, Sp: 0.868
A. thaliana (3956 samples)Acc: 0.828, Mcc: 0.756, Sn: 0.824, Sp: 0.834
E. coli (776 samples)Acc: 0.962, Mcc: 0.923, Sn: 0.962, Sp: 0.961
G. subterraneus (1812 samples)Acc: 0.901, Mcc: 0.804, Sn: 0.909, Sp: 0.893
G. pickeringi (1138 samples)Acc: 0.911, Mcc: 0.824, Sn: 0.914, Sp: 0.910
J. Jin et al. (2022) [13]1. The adaptive embedding approach that has been proposed may automatically change the original input feature to better represent the prediction task.
2. An effective network of bidirectional gated recurrent units was applied to efficiently extract the complete and meaningful representation of the full DNA sequence from both near and far information.
G. subterraneus (1812 samples)Acc: 0.825, Mcc: 0.651, Sn: 0.8, Sp: 0.85
J. Khanal et al. (2021) [3]1. Applied a word embedding method to capture the high-level input features.
2. Double-layer 1D CNN was applied to process the captured features.
F. vesca (25,922 samples), Acc: 0.869, Auc: 0.940, Sn: 0.897, Sp: 0.860
R. chinensis (14,502 samples)Acc: 0.854, Auc: 0.937, Sn: 0.871, Sp: 0.885
A. Wahab et al. (2021) [5]1. Used word2vec embedding to transfer the sequences into the form of vectors.
2. Fed the vectors of word embedding into CNN by applying grid search-based algorithm.
C. elegans(12,726 samples).Acc: 0.935, Auc: 0.973, Sn: 0.956, Sp: 0.899
G. Fang et al. (2021) [6]1. Used word2vec encoding and one-hot encoding to conduct comparative tests on the same dataset.
2. Designed a deep learning framework to combine a three-CNN module to extract the hidden high-level features and more biological features.
C. elegans(17,808 samples).Acc: 0.932, Auc: 0.971, Sn: 0.950, Sp: 0.916
H. Zulfiqar et al. (2021) [7]1. Applied a word embedding method to capture the high-level input features.
2. Double-layer 1D CNN was applied to process the captured features on the basis of 10-fold cross-validation.
E. coli (776 samples)Acc: 0.861, Mcc: 0.670, Sn: 0.876, Sp: 0.773
D.Y. Lim et al. (2021) [15]1. Applied LSTM to solve the vanishing gradient problem by adding extra interactions.
2. Hyper-parameter tuning was applied to optimize filter sizes, number of filters, number of convolution layers, pool sizes, stride length, and dropout values.
F.vesca (12,942 samples)Acc: 0.859, Mcc: 0.706, Sn: 0.835, Sp: 0.882
R. chinensis (6232 samples)Acc: 0.865, Mcc: 0.714, Sn: 0.854, Sp: 0.875
M. Tahir et al. (2021) [9]1. Applied “word2vec” to automatically extract features from DNA sequences and feed the vectors into CNN.
2. Compared to the state-of-the-art methods for all six benchmark datasets and independent datasets.
C. elegans (3108 samples)Acc: 0.87, Auc: 0.937, Sn: 0.887, Sp: 0.857
D. melanogaster (3538 samples)Acc: 0.882, Auc: 0.938, Sn: 0.894, Sp: 0.870
A. thaliana (3956 samples)Acc: 0.840, Auc: 0.902, Sn: 0.894, Sp: 0.829
E. coli (776 samples)Acc: 0.868, Auc: 0.946, Sn: 0.873, Sp: 0.866
G. subterraneus (1812 samples)Acc: 0.886, Auc: 0.949, Sn: 0.888, Sp: 0.884
G. pickeringi (1138 samples)Acc: 0.925, Auc: 0.968, Sn: 0.912, Sp: 0.938
M.M. Hasan et al. (2020) [8]1. Employed six encoding schemes to cover various aspects of DNA sequences.
2. Optimized the successive features via the WR feature selection method.
G. subterraneus (1812 samples)Acc: 0.816, Auc: 0.920, Sn: 0.807, Sp: 0.825
A. Wahab el al. (2020) [16]1. Applied six encoding techniques.
2. Applied a grid search algorithm to obtain the optimal model.
F. vesca (12,942 samples)Acc: 0.815, Auc: 0.89, Sn: 0.878, Sp: 0.753
R. chinensis (6232 samples)Acc: 0.783, Auc: 0.87, Sn: 0.801, Sp: 0.765
Cross-species (17,722 Samples)Acc: 0.780, Auc: 0.85, Sn: 0.849, Sp: 0.706
Z. Zhao et al. (2020) [17]1. Applied embedded feature selection scheme to rank with the feature relevance scores.
2. Extracted the sequence properties of contiguous nucleotides as features to characterize the sequences.
C. elegans (3108 samples),Acc: 0.826, Mcc: 0.653, Sn: 0.849, Sp: 0.804
D. melanogaster (3538 samples)Acc: 0.843, Mcc: 0.686, Sn: 0.854, Sp: 0.832
A. thaliana (3956 samples)Acc: 0.794, Mcc: 0.589, Sn: 0.783, Sp: 0.805
E. coli (776 samples)Acc: 0.843, Mcc: 0.686, Sn: 0.861, Sp: 0.825
G. subterraneus (1812 samples)Acc: 0.847, Mcc: 0.694, Sn: 0.836, Sp: 0.857
G. pickeringi (1138 samples).Acc: 0.877, Mcc: 0.754, Sn: 0.863, Sp: 0.891
J. Khanal et al. (2019) [18]1. Selected the best performing hyper-parameters using the grid search method.
2. Extracted the features of the 4mC sites from DNA sequence automatically using the CNN model.
C. elegans (3108 samples)Acc: 0.842, Mcc: 0.694, Sn: 0.894, Sp: 0.825
D. melanogaster (3538 samples)Acc: 0.853, Mcc: 0.686, Sn: 0.864, Sp: 0.853
A. thaliana (3956 samples),Acc: 0.797, Mcc: 0.621, Sn: 0.803, Sp: 0.792
E. coli (776 samples)Acc: 0.859,Mcc: 0.687, Sn: 0.881, Sp: 0.788
G. subterraneus (1812 samples)Acc: 0.860, Mcc: 0.703, Sn: 0.851, Sp: 0.843
G. pickeringi (1138 samples)Acc: 0.871, Mcc: 0.750, Sn: 0.857, Sp: 0.893
B. Manavalan et al. (2019) [19]1. Applied a feature representation learning scheme and generated 56 probabilistic features based on 4 different ML algorithms and 7 feature encodings covering diverse sequence information, including compositional, physicochemical, and NT position-specific information.
2. Compared the performance of the proposed predictor with those of three state-of-the art predictors.
C. elegans (3108 samples)Acc: 0.826, Auc: 0.892, Sn: 0.840, Sp: 0.812
D. melanogaster (3538 samples)Acc: 0.842, Auc: 0.904, Sn: 0.831, Sp: 0.854
A. thaliana (3956 samples)Acc: 0.792, Auc: 0.861, Sn: 0.761, Sp: 0.822
E. coli (776 samples)Acc: 0.848, Auc: 0.911, Sn: 0.869, Sp: 0.827
G. subterraneus (1812 samples)Acc: 0.855, Auc: 0.856, Sn: 0.854, Sp: 0.904
G. pickeringi (1138 samples)Acc: 0.891, Auc: 0.951, Sn: 0.884, Sp: 0.898
Table 2. The number of positive and negative samples collected from each of the eight species.
Table 2. The number of positive and negative samples collected from each of the eight species.
Dataset Name S i + S i S i = S i + S i
Caenorhabditis elegans (C. elegans)115411542308
Drosophila melanogaster (D. melanogaster)176917693538
Arabidopsis thaliana (A. thaliana)197819783956
Escherichia coli (E. coli)388388776
Geoalkalibacter subterraneus (G. subterraneus)9069061812
Geobacter pickeringii (G. pickeringi)5695691165
Fragaria vesca (F. vesca)432143218642
Rosa chinensis (R. chinensis)242124214842
Table 3. Training parameters used in word2vec.
Table 3. Training parameters used in word2vec.
Parametersword2vec Model
Training MethodCBOW
CorpusC. elegans, D. melanogaster, A. thaliana, E. coli, G. subterraneus, G. pickeringi, F. vesca, and
R. chinensis.
Context word3-mer
Vector size100
Window size5
Minimum count5
Negative sampling5
Epochs25
Table 4. Hyper-parameters tuning demonstration.
Table 4. Hyper-parameters tuning demonstration.
ParametersRange
The number of convolutional layers(1, 2, 3, 4, 5)
The number of filters in each convolutional layer(16, 32, 42, 64)
The size of filters of the convolutional layers(3, 5, 7, 9)
Dropout rate(0.2, 0.4, 0.5, 0.6)
Table 5. The detailed architecture of the proposed CNN.
Table 5. The detailed architecture of the proposed CNN.
Layer (Model: 1)Layer (Model: 2)Layer (Model: 3)
Input; Shape (39, 100)Input; Shape (39, 100)Input; Shape (39, 100)
Conv 1D (16, 3, 1)Conv 1D (32, 5, 1)Conv 1D (42, 7, 1)
Conv 1D (32, 5, 1)Conv 1D (42, 7, 1)Conv 1D (64, 9, 1)
Dropout (0.5)Dropout (0.3)Dropout (0.6)
Dense (1)Dense (1)Dense (1)
SigmoidSigmoidSigmoid
Table 6. Performance evaluation metrics result from the proposed approach (GS-MLDS) based on a different datasets.
Table 6. Performance evaluation metrics result from the proposed approach (GS-MLDS) based on a different datasets.
Dataset NameParameters (%)
ACCAUCPrecisionRecall
/TRP
FPRTNRFNR
C. elegans96.9696.9796.1797.833.8996.102.16
D. melanogaster94.3594.3593.6195.196.4993.504.80
A. thaliana92.2992.3089.5995.7011.1188.884.29
E. coli95.5195.5194.9396.155.1294.873.84
G. subterraneus98.0798.0798.8897.251.1098.892.74
M. pickeringi96.4996.4994.9198.245.2694.731.75
F. vesca95.4895.4992.9198.497.5292.471.50
R. chinensis96.6996.7095.3998.144.7595.241.85
Table 7. Layer to layer classification performance of GS-MLDS.
Table 7. Layer to layer classification performance of GS-MLDS.
Dataset NameLayer
No.
Train   Data ,   x i =
x i 1 + ( T P , T N ) i 1
Test   Data ,   y i = ( F P , F N ) i 1 MethodTotal
Accuracy,
i = 1 N ( T P , T N ) i i n i t i a l ,   y
AE
(Average Ensemble)
WAE
(Weighted Average
Ensemble)
TP|TNFP|FNTP|TNFP|FNWeights
C. elegans
(17,808 Instances)
118464621243381962660.3, 0.1, 0.242.42
22042266782601181480.0, 0.1, 0.367.96
321601482323788600.0, 0.1, 0.287.01
42248601222546140.1, 0.5, 0.296.96
D. melanogaster
(3538 Instances)
128307083623463843240.2, 0.1, 0.354.23
232143841941521971270.2, 0.1, 0.082.06
33411197668687400.0, 0.3, 0.294.35
A. thaliana
(3956 Instances)
131647923214713374550.0, 0.4, 0.242.55
235014251033682272280.1, 0.3, 0.771.21
3372819887281167610.3, 0.5, 0.292.29
E. coli
(776 Instances)
16201568769110460.3, 0.5, 0.270.51
27304637323970.2, 0.1, 0.495.51
G. subterraneus
(1812 Instances)
114493631232402001630.1, 0.0, 0.355.00
21649645518599640.1, 0.2, 0.582.36
317487361495770.1, 0.0, 0.298.07
G. pickeringi
(1165 Instances)
19102281061221081200.4, 0.1, 0.247.36
21018120675577430.0, 0.2, 0.181.14
310954322333580.4, 0.2, 0.696.49
F. vesca
(25,922 Instances)
16913172985787211845450.2, 0.1, 0.368.47
280975451277453432020.1, 0.4, 0.688.31
3844020253692124780.1, 0.0, 0.295.48
R. chinensis
(14,502 Instances)
138379693536166892800.1, 0.4, 0.271.10
24526280157459248320.5, 0.3, 0.296.69
Table 8. Comparison of GS-MLDS with another author’s model utilizing their datasets.
Table 8. Comparison of GS-MLDS with another author’s model utilizing their datasets.
AuthorsSplitting RatioDataset NameAuthors Model
Accuracy
GS-MLDS
Accuracy
S. Zhang et al. (2022) [10]10-foldC. elegans0.8510.978
D. melanogaster0.8590.954
A. thaliana0.8010.944
E. coli0.8700.961
G. subterraneus0.8590.950
G. pickeringi0.9010.973
Yu. Lezheng et al. (2022) [11]10-foldC. elegans0.8940.978
D. melanogaster0.8740.954
A. thaliana0.8390.944
L. Wang et al. (2022) [12]10-foldC. elegans0.8750.978
D. melanogaster0.8710.954
A. thaliana0.8280.944
E. coli0.9620.961
G. subterraneus0.9010.950
G. pickeringi0.9110.973
S. Zhang et al. (2022) [20]10-foldC. elegans0.8510.978
D. melanogaster0.8590.954
A. thaliana0.8010.944
E. coli0.870.961
G. subterraneus0.8590.950
G. pickeringi0.9010.973
J. Jin et al. (2022) [13]10-foldG. subterraneus0.8250.950
J. Khanal et al. (2021) [3]5-foldF. vesca0.8690.948
R. chinensis0.8540.952
G. Fang et al. (2021) [6]3-foldC. elegans0.9320.961
A. Wahab et al. (2021) [5]3-foldC. elegans0.9350.961
M. Tahir et al. (2021) [9]10-foldC. elegans0.8720.978
D. melanogaster0.8820.954
A. thaliana0.8400.944
E. coli0.8680.961
G. subterraneus0.8860.950
G. pickeringi0.9250.973
H. Zulfiqar et al. (2021) [14]10-foldE. coli0.8610.961
D.Y. Lim et al. (2021) [15]10-foldF. vesca0.8590.953
R. chinensis0.8650.953
A. Wahab el al. (2020) [16]10-foldF. vesca0.8150.953
R. chinensis0.7830.953
Z. Zhao et al. (2020) [17]10-foldC. elegans0.8260.978
D. melanogaster0.8420.954
A. thaliana0.7940.944
E. coli0.8430.961
G. subterraneus0.8470.950
G. pickeringi0.8770.973
M.M. Hasan et al. (2020) [8]80:20G. subterraneus0.8160.980
J. Khanal et al. (2019) [18]10-foldC. elegans0.8420.978
D. melanogaster0.8530.954
A. thaliana0.7970.944
E. coli0.8590.961
G. subterraneus0.8600.950
G. pickeringi0.8710.973
B. Manavalan et al. (2019) [19]10-foldC. elegans0.8260.978
D. melanogaster0.8420.954
A. thaliana0.7920.944
E. coli0.8480.961
G. subterraneus0.8550.950
G. pickeringi0.8910.973
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Halder, R.K.; Uddin, M.N.; Uddin, M.A.; Aryal, S.; Islam, M.A.; Hossain, F.; Jahan, N.; Khraisat, A.; Alazab, A. A Grid Search-Based Multilayer Dynamic Ensemble System to Identify DNA N4—Methylcytosine Using Deep Learning Approach. Genes 2023, 14, 582. https://doi.org/10.3390/genes14030582

AMA Style

Halder RK, Uddin MN, Uddin MA, Aryal S, Islam MA, Hossain F, Jahan N, Khraisat A, Alazab A. A Grid Search-Based Multilayer Dynamic Ensemble System to Identify DNA N4—Methylcytosine Using Deep Learning Approach. Genes. 2023; 14(3):582. https://doi.org/10.3390/genes14030582

Chicago/Turabian Style

Halder, Rajib Kumar, Mohammed Nasir Uddin, Md. Ashraf Uddin, Sunil Aryal, Md. Aminul Islam, Fahima Hossain, Nusrat Jahan, Ansam Khraisat, and Ammar Alazab. 2023. "A Grid Search-Based Multilayer Dynamic Ensemble System to Identify DNA N4—Methylcytosine Using Deep Learning Approach" Genes 14, no. 3: 582. https://doi.org/10.3390/genes14030582

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop