Article

A Novel Predictor for the Analysis and Prediction of Enhancers and Their Strength via Multi-View Features and Deep Forest

1 School of Systems and Technology, University of Management and Technology, Lahore 54770, Pakistan
2 Biomedical Center, Lund University, 22362 Lund, Sweden
3 Department of Computer Science, Abdul Wali Khan University, Mardan 23200, Pakistan
* Author to whom correspondence should be addressed.
Information 2023, 14(12), 636; https://doi.org/10.3390/info14120636
Submission received: 20 October 2023 / Revised: 19 November 2023 / Accepted: 21 November 2023 / Published: 28 November 2023
(This article belongs to the Special Issue Applications of Deep Learning in Bioinformatics and Image Processing)

Abstract

Enhancers are short DNA segments (50–1500 bp) that effectively activate gene transcription when transcription factors (TFs) are present. Genetic variation in enhancers is correlated with numerous human disorders, including cancer and inflammatory bowel disease. In computational biology, the accurate categorization of enhancers can yield important information for drug discovery and development. High-throughput experimental approaches are vital tools for researching the key characteristics of enhancers; however, because these techniques are labor-intensive and time-consuming, it is difficult for researchers to use them to predict enhancers and their strengths at scale. Computational techniques are therefore considered an alternative strategy for handling this issue. Based on the types of algorithms used to construct predictors, current methodologies can be divided into three primary categories: traditional ML-based techniques, ensemble-based methods, and deep learning-based approaches. In this study, we developed a novel two-layer deep-forest-based predictor for accurate enhancer and strength prediction, namely, NEPERS. Enhancers and non-enhancers are separated at the first level by NEPERS, whereas strong and weak enhancers are separated at the second level. To evaluate the effectiveness of feature fusion, a block-wise deep forest and other algorithms were combined with multi-view features such as PSTNPss, PSTNPdss, CKSNAP, and NCP via 10-fold cross-validation and independent testing. Our proposed technique performs better than competing models across all parameters, with an ACC of 0.876, Sen of 0.864, Spe of 0.888, MCC of 0.753, and AUC of 0.940 for layer 1 and an ACC of 0.959, Sen of 0.960, Spe of 0.958, MCC of 0.918, and AUC of 0.990 for layer 2 on the benchmark dataset. Similarly, for the independent test, the ACC, Sen, Spe, MCC, and AUC were 0.863, 0.865, 0.860, 0.725, and 0.948 for layer 1 and 0.890, 0.940, 0.840, 0.784, and 0.951 for layer 2, respectively. This study provides conclusive insights for the accurate and effective detection and characterization of enhancers and their strengths.

1. Introduction

Cells are the basic unit of life and are classified as prokaryotic (the primitive form of cells) or eukaryotic (more advanced and complex forms of cells). In eukaryotic cells, the nucleus is the organelle that contains the chromosomes. Chromosomes are composed of both DNA and proteins, and DNA is composed of nucleotide chains [1]. All cellular information is stored in DNA in the form of a specific arrangement of nucleotides, and DNA is responsible for passing this information from parent to daughter cells. During the flow of information in a cell, DNA is used to make an RNA molecule (transcription), and the RNA molecule is then used to make a protein (translation). Transcription involves the synthesis of RNA molecules from DNA [2]. Promoters, enhancers, and silencers are regulatory elements located within the DNA that are involved in the regulation of transcription: promoters and enhancers increase transcription, while silencers decrease it. Transcription factors, by contrast, are not part of the DNA molecule itself but are proteins that bind to promoters and enhancers to regulate transcription [2].
Enhancers are distal regulatory elements that affect a number of cellular processes, such as gene expression, tissue specificity, cell growth, carcinogenesis, and viral activity in the cell. However, not all enhancers in the DNA affect the promoter sites of a gene: some are active in the cell, whereas others are inactive. Enhancers are of particular interest to researchers because they can alter gene expression, and mutations (permanent alterations in DNA) in enhancers can cause serious genetic diseases [3]. Enhancers must also be distinguished from other regulatory elements because they exist in several states, each with unique biological activities and effects on genes [4].
Various conventional methods have been investigated to predict enhancers and their strengths (details are provided in the Related Work section). Although existing methods have yielded significant results, there is still room for improvement, as the recent accumulation of high-throughput data on enhancers has raised the need for efficient computational methods capable of accurately predicting enhancer positions at the genome-wide level. This led us to introduce a new two-layer bioinformatics tool, the NEw Predictor for EnhanceRs and their Strength Prediction (NEPERS). The significant contributions of this study, illustrated in Figure 1, are summarized below:
  • NEPERS formulates the identification of enhancers and the assessment of their strength as two binary classification problems and solves them using a cascade deep forest algorithm.
  • It takes advantage of multi-view features, such as position-specific trinucleotide propensity based on single-stranded characteristics (PSTNPss), position-specific trinucleotide propensity based on double-stranded characteristics (PSTNPdss), the composition of k-spaced nucleic acid pairs (CKSNAP), and nucleotide chemical properties (NCP), to transform biological sequences into numerical descriptors.
  • A block-wise deep forest algorithm was applied, and performance was quantified using accuracy (ACC), specificity (Spe), sensitivity (Sen), and the Matthews correlation coefficient (MCC); 10-fold cross-validation (10CV) and independent dataset tests were used to evaluate the performance of NEPERS.
  • Our method outperformed existing predictors by achieving high predictive rates.

2. Related Work

In previous studies, many techniques based on different machine learning (ML), ensemble, and deep learning (DL) approaches have been used to successfully discover enhancers and their strengths. These methods include iEnhancer-2L [4], EnhancerPred [5], Enhancer-TNC [6], iEnhancer-5Steps [7], Enhancer-PCWM [8], iEnhancer-RF [9], and iEnhancer-MFGBDT [10], which are classification methods based on conventional machine learning approaches. iEnhancer-EL [11], iEnhancer-XG [12], and iEnhancer-EBLSTM [13] are classification methods based on ensemble approaches. iEnhancer-ECNN [14], Enhancer-RNN [15], Enhancer-DSNet [16], spEnhancer [17], iEnhancer-RD [18], Enhancer-BERT-2D [19], and iEnhancer-DHF [20] are classification methods based on DL approaches. All of these methods are discussed from multiple aspects, such as benchmark and independent dataset construction; feature generation and the selection of optimized features; ML-, ensemble-, and DL-based learning models; and evaluation parameters.
In 2016, Liu et al. created the first two-layer enhancer predictor, iEnhancer-2L [4]. In this model, DNA sequences were represented using pseudo k-tuple nucleotide composition (PseKNC). A jackknife test was conducted to evaluate the predictor's performance, and the five-fold cross-validation (5CV) technique was utilized to find the best parameters; SVM performed the best among all classifiers [21]. In 2016, Jia et al. proposed EnhancerPred [5], which utilized three types of sequence-based features to convert biological sequences into attributes. The F1-score, in combination with SVM, was used to obtain the optimal features. The study utilized different classifiers, among which the SVM provided outstanding overall performance. In 2017, Tahir et al. developed Enhancer-TNC [6], which utilized dinucleotide composition (DNC) and trinucleotide composition (TNC) to extract numerical descriptors from biological sequences; in this model, the SVM classifier, along with the TNC approach, performed best for classifying enhancers. In 2019, Le et al. developed a classifier named iEnhancer-5Steps [7] using the concept of pseudo-amino acid composition with 5CV; the scikit-learn package was used to run the SVM on the dataset. In 2021, Yang et al. introduced Enhancer-PCWM, a predictor based on physicochemical weight matrix (PCWM) features [8]; an SVM with a linear kernel is used in the first layer, and the AdaBoost ensemble method is used in the second layer to classify enhancers and their strengths, respectively. In 2021, Lim et al. developed iEnhancer-RF [9]. In this study, the RF method and a light gradient boosting machine (LGBM) model were applied, with the 5CV strategy used in both layers; independent tests were also performed on the data obtained from iEnhancer-EL.
Similarly, in 2018, Liu et al. developed an ensemble-based method, iEnhancer-EL [11]. The BioSeq-Analysis tool was used to transform sequences into vectors, and the jackknife CV method was applied to the training dataset. K-mers, subsequence profiles, and PseKNC were employed for feature extraction, with SVM used for enhancer prediction. In 2018, Cai et al. developed iEnhancer-XG [12]. In this model, the subsequence profile technique was used together with the XGBoost learning algorithm for prediction, with 10CV and independent dataset tests used as evaluation strategies. In 2021, Niu et al. developed iEnhancer-EBLSTM [13]. In this study, the ensemble bidirectional long short-term memory (EBLSTM) method was used, which works in two steps: extracting features using 3-mers and identifying enhancers using the LSTM method. In 2021, Liang et al. developed iEnhancer-MFGBDT [10]. In this study, multiple features, such as k-mer and reverse-complement k-mer nucleotide composition (RCKmers) based on the DNA sequence, along with the second-order moving average, normalized Moreau–Broto auto-cross-correlation, and Moran auto-cross-correlation based on the dinucleotide physical structural property matrix, were fused, and a gradient boosting decision tree (GBDT) was then used to select features and perform classification. In 2019, Nguyen et al. developed a CNN-based model called iEnhancer-ECNN [14]. For data transformation, one-hot encoding and k-mers were used, and the trained CNNs were combined for model construction. In 2020, Li et al. developed Enhancer-RNN [15]. In this study, Random Forest (RF) was used as the most efficient method for optimal feature selection, applied along with the 10-fold CV strategy; enhancer classification was performed using 3-mer, word-to-vector, and RNN models. In 2020, Ibrahim et al. introduced Enhancer-DSNet [16], a two-layer classification model based on a linear classifier, in which 5CV and independent tests were used along with an SVM classifier. In 2021, Mu et al. developed spEnhancer [17]. This study used SeqPose to convert DNA sequences into numerical vectors and a Chi-squared test to remove redundant features; a three-class classification model was also implemented, which collects samples from three different classes and trains them via a bidirectional LSTM model. In 2021, Yang et al. developed the iEnhancer-RD method [18], with optimal features selected using recursive feature elimination (RFE); for parameter optimization, an SVM with a linear kernel was used together with the 5CV evaluation strategy. Enhancer-BERT-2D [19] was introduced by Le et al. in 2021; BERT and CNN were used as feature extraction techniques, and the model's performance was tested using independent data and the 5CV strategy. In 2021, Inayat et al. developed iEnhancer-DHF [20]. This method represents DNA sequences as feature vectors using the PseKNC and FastText methods; the model follows Chou's five-step rule, is based on a DNN algorithm, and was evaluated with an independent dataset and the 10CV strategy.

3. Materials and Methods

3.1. Data Collection

For appropriate predictor training and assessment, a reliable benchmark dataset must be created or selected. The following two datasets (benchmark and independent) have mostly been utilized in the literature for the prediction of enhancers and their strengths from primary sequences [2,5,7,15,16], and are defined as follows:
$$S_B = S^{+} \cup S^{-}$$

$$S^{+} = S^{+}_{strong} \cup S^{+}_{weak}$$

$$S_{IND} = S^{+}_{IND} \cup S^{-}_{IND}$$

$$S^{+}_{IND} = S^{+}_{IND\text{-}strong} \cup S^{+}_{IND\text{-}weak}$$
The dataset $S_B$ is the benchmark dataset for discriminating enhancers from non-enhancers; it contains 1484 enhancer samples (positive) and 1484 non-enhancer samples (negative), denoted $S^{+}$ and $S^{-}$, respectively, and is used for layer-1 prediction in the current predictor. The dataset $S^{+}$ is the layer-2 benchmark dataset, in which the 1484 enhancer samples are further divided into strong and weak enhancers; there are 742 samples each of strong and weak enhancers, denoted $S^{+}_{strong}$ and $S^{+}_{weak}$, respectively. Likewise, $S_{IND}$ is the independent dataset, with 200 samples each of enhancers and non-enhancers. Its enhancers ($S^{+}_{IND}$) are further categorized into strong and weak enhancers, with 100 samples each, as shown in Table 1.
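For concreteness, the composition of these splits can be expressed directly in code. The following minimal Python sketch is illustrative only (the dictionary layout and the checking helper are our assumptions, not part of the published pipeline); it records the class counts defined above and verifies a loaded copy of the data against them:

```python
# Class counts of the benchmark and independent datasets described above.
# The dictionary layout and helper are illustrative assumptions.
DATASETS = {
    "S_B (layer 1)":    {"enhancer": 1484, "non_enhancer": 1484},
    "S+ (layer 2)":     {"strong": 742, "weak": 742},
    "S_IND (layer 1)":  {"enhancer": 200, "non_enhancer": 200},
    "S_IND+ (layer 2)": {"strong": 100, "weak": 100},
}

def check_counts(loaded: dict, expected: dict) -> None:
    """Assert that each class of a loaded split has the expected sample count."""
    for label, n in expected.items():
        assert len(loaded[label]) == n, f"{label}: got {len(loaded[label])}, expected {n}"
```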

3.2. Feature Engineering

3.2.1. PSTNPss

Position-specific trinucleotide propensity based on single-stranded characteristics (PSTNPss) is a statistical technique based on single-stranded DNA/RNA properties [22,23]. There are 64 trinucleotides in total: AAA, AAC, ..., TTT. An 81 bp sample contains 79 trinucleotide positions, so its trinucleotide position specificity may be described by the following 64 × 79 matrix:
$$Z = \begin{bmatrix} z_{1,1} & z_{1,2} & \cdots & z_{1,79} \\ z_{2,1} & z_{2,2} & \cdots & z_{2,79} \\ \vdots & \vdots & \ddots & \vdots \\ z_{64,1} & z_{64,2} & \cdots & z_{64,79} \end{bmatrix}$$

where

$$z_{i,j} = F^{+}(3mer_i \mid j) - F^{-}(3mer_i \mid j)$$

Here, $i = 1, 2, \ldots, 64$ and $j = 1, 2, \ldots, 79$. $F^{+}(3mer_i \mid j)$ and $F^{-}(3mer_i \mid j)$ denote the frequency of the $i$th trinucleotide ($3mer_i$) at the $j$th position in the positive ($S^{+}$) and negative ($S^{-}$) datasets, respectively, with $3mer_1 = \mathrm{AAA}$, $3mer_2 = \mathrm{AAC}$, ..., $3mer_{64} = \mathrm{TTT}$. A sample can therefore be expressed as:

$$S = [\varphi_1, \varphi_2, \ldots, \varphi_u, \ldots, \varphi_{79}]^{T}$$

where $T$ is the transpose operator and $\varphi_u$ ($1 \le u \le 79$) is defined as follows [24]:

$$\varphi_u = \begin{cases} z_{1,u} & \text{when } N_u N_{u+1} N_{u+2} = \mathrm{AAA} \\ z_{2,u} & \text{when } N_u N_{u+1} N_{u+2} = \mathrm{AAC} \\ z_{3,u} & \text{when } N_u N_{u+1} N_{u+2} = \mathrm{AAG} \\ \vdots \\ z_{64,u} & \text{when } N_u N_{u+1} N_{u+2} = \mathrm{TTT} \end{cases}$$
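The computation above translates directly into code. The following Python sketch is our illustration of the formulas, not the authors' implementation; it builds the 64 × 79 matrix Z from positive and negative training sequences and maps an 81 bp sequence to its 79-dimensional PSTNPss vector:

```python
import itertools
import numpy as np

# All 64 trinucleotides in lexicographic order: AAA, AAC, ..., TTT.
TRINUCS = ["".join(p) for p in itertools.product("ACGT", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRINUCS)}

def pstnp_matrix(pos_seqs, neg_seqs, seq_len=81):
    """Z[i, j] = F+(3mer_i | j) - F-(3mer_i | j): a 64 x (seq_len - 2) matrix."""
    n_pos = seq_len - 2  # 79 trinucleotide positions for 81 bp sequences

    def freq(seqs):
        f = np.zeros((len(TRINUCS), n_pos))
        for s in seqs:
            for j in range(n_pos):
                f[INDEX[s[j:j + 3]], j] += 1
        return f / max(len(seqs), 1)  # position-wise trinucleotide frequencies

    return freq(pos_seqs) - freq(neg_seqs)

def pstnp_features(seq, Z):
    """phi_u = Z[i, u], where 3mer_i is the trinucleotide starting at position u."""
    return np.array([Z[INDEX[seq[j:j + 3]], j] for j in range(Z.shape[1])])
```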

3.2.2. PSTNPdss

PSTNPdss is a statistical feature based on double-stranded DNA properties and complementary base pairing; hence, it exhibits more pronounced statistical characteristics [25]. Here, A is regarded as equivalent to T, and C as equivalent to G, so each sample may be transformed into a sequence containing only A and C. As a result, the trinucleotide position specificity of an 81 bp sample can be conveyed by the following 8 × 79 matrix:
$$Z = \begin{bmatrix} z_{1,1} & z_{1,2} & \cdots & z_{1,79} \\ z_{2,1} & z_{2,2} & \cdots & z_{2,79} \\ \vdots & \vdots & \ddots & \vdots \\ z_{8,1} & z_{8,2} & \cdots & z_{8,79} \end{bmatrix}$$

where

$$z_{i,j} = F^{+}(3mer_i \mid j) - F^{-}(3mer_i \mid j)$$

Here, $i = 1, 2, \ldots, 8$ and $j = 1, 2, \ldots, 79$. $F^{+}(3mer_i \mid j)$ and $F^{-}(3mer_i \mid j)$ denote the frequency of the $i$th trinucleotide ($3mer_i$) at the $j$th position in the positive ($S^{+}$) and negative ($S^{-}$) datasets, respectively, with $3mer_1 = \mathrm{AAA}$, $3mer_2 = \mathrm{AAC}$, ..., $3mer_8 = \mathrm{CCC}$. A sample can therefore be expressed as:

$$S = [\varphi_1, \varphi_2, \ldots, \varphi_u, \ldots, \varphi_{79}]^{T}$$

where $T$ is the transpose operator and $\varphi_u$ ($1 \le u \le 79$) is defined as follows [24]:

$$\varphi_u = \begin{cases} z_{1,u} & \text{when } N_u N_{u+1} N_{u+2} = \mathrm{AAA} \\ z_{2,u} & \text{when } N_u N_{u+1} N_{u+2} = \mathrm{AAC} \\ z_{3,u} & \text{when } N_u N_{u+1} N_{u+2} = \mathrm{ACA} \\ \vdots \\ z_{8,u} & \text{when } N_u N_{u+1} N_{u+2} = \mathrm{CCC} \end{cases}$$
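Under this equivalence, PSTNPdss can reuse the same machinery as PSTNPss on a collapsed two-letter alphabet, as in the following sketch (again an illustration under the assumptions above, not the authors' code):

```python
# Collapse the alphabet under double-stranded equivalence: A,T -> A and C,G -> C,
# leaving 2^3 = 8 possible trinucleotides (AAA, AAC, ..., CCC).
COLLAPSE = str.maketrans("ACGT", "ACCA")

def collapse(seq: str) -> str:
    return seq.translate(COLLAPSE)

# The 8 x 79 matrix Z and the feature vector are then computed exactly as in
# pstnp_matrix / pstnp_features above, applied to the collapsed sequences and
# restricted to the 8 trinucleotides over {A, C}.
```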

3.2.3. CKSNAP

The composition of k-spaced nucleic acid pairs (CKSNAP) encodes the frequency of nucleic acid pairs that are k steps apart from one another [26,27]. The CKSNAP feature has 16 values corresponding to the nucleic acid pairs AA, AC, AG, …, TG, TT. By default, the maximum value of k is 5. Using k = 1 as an example, CKSNAP is computed as follows:
$$V = \left[ \frac{N_{A*A}}{N_{Total}}, \frac{N_{A*C}}{N_{Total}}, \frac{N_{A*G}}{N_{Total}}, \ldots, \frac{N_{T*T}}{N_{Total}} \right]$$
where * denotes any one nucleotide among A, C, G, and T; $N_{X*Y}$ denotes the number of X*Y nucleic acid pairs occurring in the sequence; and $N_{Total}$ is the total number of one-spaced nucleic acid pairs in the sequence [28].
To further clarify the CKSNAP feature calculation, let us consider a small example using the DNA sequence 'ATGCATGC'. For k = 1:
$$V = \left[ \frac{N_{A*A}}{N_{Total}}, \frac{N_{A*C}}{N_{Total}}, \frac{N_{A*G}}{N_{Total}}, \frac{N_{A*T}}{N_{Total}}, \frac{N_{C*A}}{N_{Total}}, \frac{N_{C*C}}{N_{Total}}, \frac{N_{C*G}}{N_{Total}}, \frac{N_{C*T}}{N_{Total}}, \frac{N_{G*A}}{N_{Total}}, \frac{N_{G*C}}{N_{Total}}, \frac{N_{G*G}}{N_{Total}}, \frac{N_{G*T}}{N_{Total}}, \frac{N_{T*A}}{N_{Total}}, \frac{N_{T*C}}{N_{Total}}, \frac{N_{T*G}}{N_{Total}}, \frac{N_{T*T}}{N_{Total}} \right]$$
where:
  • $N_{A*A}$ represents the count of one-spaced 'AA' pairs;
  • $N_{A*C}$ represents the count of one-spaced 'AC' pairs;
  • $N_{A*G}$ represents the count of one-spaced 'AG' pairs;
  • and so on for all 16 possible nucleic acid pairs;
  • $N_{Total}$ is the total count of one-spaced nucleic acid pairs in the sequence.
For our example sequence 'ATGCATGC' there are 8 − 1 − 1 = 6 one-spaced pairs: A*G, T*C, G*A, C*T, A*G, and T*C. Hence $N_{A*G}/N_{Total} = 2/6$, $N_{T*C}/N_{Total} = 2/6$, $N_{G*A}/N_{Total} = 1/6$, $N_{C*T}/N_{Total} = 1/6$, and all other entries of the feature vector are 0. This illustrates how CKSNAP captures the composition of one-spaced nucleic acid pairs in a sequence.
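The worked example above can be reproduced with a short routine such as the following sketch (our illustration; the pair ordering follows the fixed order AA, AC, …, TT given earlier):

```python
import itertools

def cksnap(seq: str, k: int = 1) -> list[float]:
    """Normalized counts of k-spaced pairs (seq[i], seq[i + k + 1]),
    reported in the fixed order AA, AC, ..., TT."""
    pairs = ["".join(p) for p in itertools.product("ACGT", repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = len(seq) - k - 1  # number of k-spaced pairs in the sequence
    for i in range(total):
        counts[seq[i] + seq[i + k + 1]] += 1
    return [counts[p] / total for p in pairs]

# 'ATGCATGC' with k = 1 yields the 6 pairs A*G, T*C, G*A, C*T, A*G, T*C,
# so AG and TC map to 2/6, GA and CT to 1/6, and all other pairs to 0.
features = cksnap("ATGCATGC", k=1)
```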

3.2.4. Nucleotide Chemical Property (NCP)

NCP is a simple encoding technique based on nucleotide chemical properties. Its distinctiveness stems from the chemical nature of each nucleotide: judged on three chemical qualities, no two nucleotides share the same combination of attributes [29]. As a result, NCP-encoded characteristics extracted from a sequence contain sufficient structural detail to solve a binary classification problem. Each sequence sample is converted into a 3 × N-dimensional matrix using NCP, where N is the sequence length. The four distinct nucleotides of a DNA sequence, adenine (A), guanine (G), cytosine (C), and thymine (T), have different chemical structures and bonds. Guanine and adenine are purines with two fused aromatic rings, whereas cytosine and thymine are pyrimidines with a single aromatic ring; the aromatic cyclic structures of both groups attach to sugar molecules [30]. Furthermore, the amino group is shared by adenine and cytosine, whereas the keto group is shared by guanine and thymine. The number of hydrogen bonds formed between adenine and thymine, on the other hand, is smaller than that formed between guanine and cytosine. These three distinct sets of chemical characteristics therefore provide binary encoding criteria for the four nucleotides: based on their chemical properties, A, C, G, and T are represented by the coordinates [1, 1, 1], [0, 1, 0], [1, 0, 0], and [0, 0, 1], respectively [31,32].
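These encoding rules amount to a simple lookup table, as in the following minimal sketch:

```python
import numpy as np

# Chemical-property codes given above: A = [1, 1, 1], C = [0, 1, 0],
# G = [1, 0, 0], T = [0, 0, 1].
NCP_CODE = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "T": (0, 0, 1)}

def ncp_encode(seq: str) -> np.ndarray:
    """Encode a DNA sequence as a 3 x N matrix of chemical-property bits."""
    return np.array([NCP_CODE[b] for b in seq]).T

ncp_encode("ATGC")  # -> 3 x 4 matrix
```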

3.3. Model Training and Evaluation

3.3.1. Deep Forest

Deep forest (DF) is a decision-tree-based deep model with strong modeling ability and minimal hyperparameters, and it is parallel-friendly [33]. Layer-by-layer processing, in-model feature transformation, and sufficient model complexity are the three aspects identified as contributing to the success of deep models, and DF retains them without relying on neural networks [33]. Compared with DNN approaches, DF models often produce superior outcomes; when dealing with small sample sizes, however, some basic deep forest models may encounter overfitting and ensemble-diversity issues. As classifiers, these methods are capable of learning relevant high-level features [34], and DF has become a dominant classifier in a variety of fields, including radar high-resolution range profile recognition, face anti-spoofing, hyperspectral imaging, self-interacting protein prediction, and cancer subtype classification. For feature learning in each layer, DF is an ensemble-based technique that uses decision trees rather than neural network structures. Thanks to its cascade-type design, DF is reliable and suitable for training on small amounts of data, and its hyperparameters are easier to tune than those of a DNN. Unlike other classification algorithms, the deep forest classifier's greater discriminative capacity is due to its strong learning potential [35]. We created a new cascade deep forest (CDF) technique, an extension of the gcForest model that includes the random forest (RF), extreme gradient boosting (XGBoost), and extremely randomized trees (ERT) classifiers. The boosting parameter of the XGBoost classifier was set to k = 500, the number of trees in ERT was also set to 500, and the number of decision trees in RF was similarly set to 500, with node features chosen at random [36]. The details of the parameters used to train the final model are provided in Table S1.
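To make the cascade idea concrete, the following simplified Python sketch shows one cascade level built from RF, ERT, and XGBoost with 500 trees each, where each learner's out-of-fold class probabilities are appended to the features passed to the next level. This is a hedged illustration of the general gcForest-style scheme, not the exact NEPERS implementation (whose hyperparameters are listed in Table S1):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from xgboost import XGBClassifier

def cascade_level(X, y, random_state=0):
    """One cascade level: augment X with out-of-fold class probabilities
    from RF, ERT, and XGBoost (500 trees each, as described above)."""
    learners = [
        RandomForestClassifier(n_estimators=500, max_features="sqrt",
                               random_state=random_state),
        ExtraTreesClassifier(n_estimators=500, random_state=random_state),
        XGBClassifier(n_estimators=500, eval_metric="logloss",
                      random_state=random_state),
    ]
    augmented = [X]
    for clf in learners:
        proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")
        augmented.append(proba)   # stacked probability features
        clf.fit(X, y)             # refit on all data for prediction time
    return learners, np.hstack(augmented)

# Levels would be grown until validation accuracy stops improving; the final
# prediction averages the class probabilities of the last level's learners.
```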

3.3.2. Two-Layer Classification Framework

This framework classifies enhancers at two levels: it not only identifies enhancers but also classifies them as strong or weak based on their strength. Enhancers and non-enhancers are separated at the first level, while enhancers are divided into two subgroups, strong and weak, at the second level. In the two-layer classification framework, the benchmark dataset is used to train the predictor and the independent dataset is used to test it. Enhancers should be distinguished as strong or weak because of their varied levels of biological activity and the differing regulatory impact they have on target genes [4]; this is why we also adopted a two-layer classification framework in our model.
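The decision flow of the two-layer framework can be summarized as in the following sketch, where `layer1` and `layer2` stand for any trained binary classifiers and the 0/1 label coding is an assumption for illustration:

```python
def predict_two_layer(x, layer1, layer2):
    """Layer 1: enhancer vs. non-enhancer; layer 2: strong vs. weak enhancer."""
    if layer1.predict([x])[0] == 0:   # assumed coding: 0 = non-enhancer
        return "non-enhancer"
    if layer2.predict([x])[0] == 1:   # assumed coding: 1 = strong enhancer
        return "strong enhancer"
    return "weak enhancer"
```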

4. Evaluation Parameters

Performance evaluation is considered important in machine learning as a way to assess the effectiveness of a computational model after its development [37,38,39]. Depending on the data types and classification problem, different performance measures are employed for this purpose. Performance measures are derived from a confusion matrix, which records the correct and incorrect predictions for each class and compares them with the actual outcomes [40]. The columns of the confusion matrix indicate the actual values for each class, whereas its rows indicate the predicted class [4]. The performance evaluation criteria adopted in the ongoing research work are as follows:
$$ACC = \frac{TP + TN}{TP + FP + FN + TN}$$

$$Sen = \frac{TP}{TP + FN}$$

$$Spe = \frac{TN}{TN + FP}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
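These four measures follow directly from the confusion-matrix counts, as in the sketch below:

```python
from math import sqrt

def evaluate(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute ACC, Sen, Spe, and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    sen = tp / (tp + fn)   # sensitivity: recall on the positive class
    spe = tn / (tn + fp)   # specificity: recall on the negative class
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"ACC": acc, "Sen": sen, "Spe": spe, "MCC": mcc}
```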

5. Results and Analysis

5.1. Feature Analysis (Individual vs. Fusion)

To evaluate the performance of feature fusion, we combined the PSTNPss, PSTNPdss, CKSNAP, and NCP feature encodings with the CDF technique, as reported in Table 2. The ACCs of PSTNPss, PSTNPdss, CKSNAP, and NCP for layer 1 were 0.810, 0.801, 0.761, and 0.744, respectively. The ACC of the four features combined was 0.876, which is greater than the ACC of any individual sequence feature, as shown in Figure 2A, with the ROC curve in Figure 2B. Additionally, the ACCs for the second layer were 0.916, 0.658, 0.635, and 0.606, respectively. Once more, the fusion of all four features performed better than any single feature, with an ACC of 0.959, as shown in Figure 3A, with the ROC curve in Figure 3B.
We also assessed how well the models performed in independent tests. According to Table 2, the layer-1 independent-test ACCs for PSTNPss, PSTNPdss, CKSNAP, and NCP were 0.853, 0.778, 0.748, and 0.503, respectively. The ACC of the four features combined, however, was 0.863 on the independent dataset, which is higher than the ACC of any individual sequence feature, as shown in Figure 2C, with the ROC curve in Figure 2D. Additionally, the layer-2 independent-test ACCs were 0.855, 0.805, 0.655, and 0.650, respectively. Once more, the fusion of all four features exceeded each one separately, with an ACC of 0.890, as shown in Figure 3C, with the ROC curve in Figure 3D. In conclusion, these findings demonstrate that the feature fusion approach successfully enhances our model's predictive capacity.
It is extensively documented in the machine learning literature that harnessing the combination of multiple features, as opposed to depending on a solitary feature set, can markedly improve model performance [41,42]. 'PSTNPss' may exert a dominant influence on the model's performance in comparison to the other features. In certain instances, a singular feature might encapsulate the essential information requisite for precise predictions, diminishing the impact of additional features [43,44,45]. 'PSTNPss' captures pivotal patterns in both enhancer and non-enhancer sequences, playing a crucial role in the model's efficacy. This feature contains a wealth of information and distinctive patterns that differentiate enhancers from non-enhancers, especially when contrasted with 'NCP' and 'CKSNAP'. The model appears to rely heavily on the information encoded in 'PSTNPss' for making accurate predictions. Therefore, we also tested and evaluated different combinations (sets of two and three) of the individual features, and their results are presented in Table S2. The comparison shows that these partial combinations perform worse than the fusion of all features. Thus, for the final training of the proposed model, all features were used for both layers 1 and 2 on both the benchmark and independent datasets.
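Assuming that feature fusion here means column-wise concatenation of the per-view encodings, as is typical for such predictors, the fusion step reduces to a sketch like the following:

```python
import numpy as np

def fuse_features(views):
    """Concatenate per-view feature matrices (same sample order) column-wise.
    `views` would hold the PSTNPss, PSTNPdss, CKSNAP, and NCP matrices."""
    assert len({v.shape[0] for v in views}) == 1, "views must share samples"
    return np.hstack(views)
```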

5.2. Analysis of Various Classifiers

In this section, we compare the performance of our suggested classifier, CDF, with that of other classifiers, namely RF, SVM, XGB, GBDT, LightGBM, DNN, MLP, and CNN, via the benchmark dataset, with respect to both CV and independent-test performance. Table 3 displays the 10-fold CV outcomes, with the highest value of each metric shown in bold. It is evident that CDF performs better than the other classifiers on all measurement parameters, with an ACC of 0.876, Sen of 0.864, Spe of 0.888, MCC of 0.753, and AUC of 0.940 for layer 1 and an ACC of 0.959, Sen of 0.960, Spe of 0.958, MCC of 0.918, and AUC of 0.990 for layer 2. Table 3 also displays the independent testing results. For layer 1, the CDF has the highest ACC of 0.863, Sen of 0.865, Spe of 0.860, MCC of 0.725, and AUC of 0.948. For the second layer, the CDF has the highest ACC of 0.890, Sen of 0.940, MCC of 0.784, and AUC of 0.951, while the DNN has the highest Spe of 0.990. Based on the ACC of both layers, CDF is the best of the tested classifiers, as shown in Figure 4; we therefore use CDF to analyze our features on both cross-validation and the independent dataset.

5.3. Comparison with Existing Methods via 10-Fold CV

In this section, we compare the cross-validation performance of our suggested method against existing benchmark techniques from other research teams working on the enhancer classification problem: iEnhancer-2L, EnhancerPred, Enhancer-TNC, iEnhancer-5Step, Enhancer-PCWM, iEnhancer-RF, iEnhancer-EL, iEnhancer-XG, iEnhancer-EBLSTM, iEnhancer-MFBGDT, iEnhancer-ECNN, Enhancer-RNN, Enhancer-DSNet, spEnhancer, iEnhancer-RD, Enhancer-BERT-2D, and iEnhancer-DHF. Table 4 presents the results, with the highest value of each metric shown in bold. It is evident that our suggested technique performs better than competing models across all parameters, with an ACC of 0.876, Sen of 0.864, Spe of 0.888, MCC of 0.753, and AUC of 0.940 for layer 1 and an ACC of 0.959, Sen of 0.960, Spe of 0.958, MCC of 0.918, and AUC of 0.990 for layer 2. Thus, we provide an alternative method for identifying enhancers and their strengths that outperforms the existing predictors.

5.4. Independent Test Comparison with Existing Methods

In this section, we evaluate the independent-test performance of our suggested model against other benchmark techniques for the enhancer classification problem: iEnhancer-5Step, Enhancer-PCWM, iEnhancer-RF, iEnhancer-EL, iEnhancer-XG, iEnhancer-EBLSTM, iEnhancer-MFBGDT, iEnhancer-ECNN, Enhancer-DSNet, spEnhancer, iEnhancer-RD, Enhancer-BERT-2D, and iEnhancer-DHF. The results are shown in Table 5, where the highest value of each metric is bolded. It is evident that our suggested technique performs better than competing models across all parameters, with an ACC of 0.863, Sen of 0.865, Spe of 0.860, MCC of 0.725, and AUC of 0.948 for layer 1 and an ACC of 0.890, Sen of 0.940, Spe of 0.840, MCC of 0.784, and AUC of 0.951 for layer 2. Thus, we provide an alternative method for identifying enhancers and their strengths that outperforms the other predictors in independent data testing as well.

5.5. Model Interpretation

In this section, we use the Shapley Additive Explanations (SHAP) algorithm [4] to interpret the proposed models for the prediction of enhancers and their strength. SHAP is a unified predictive interpretation framework proposed in 2017 as the only locally consistent and accurate expectation-based feature attribution method [28]. This technique can interpret the feature-importance values of complex learning models and provide interpretable predictions for test samples [28]. SHAP scores are proposed as a consistent measure of feature importance because they assign each feature a value (the Shapley value) that represents the impact of including that feature in the model's predictions [28,46].
The top 20 features for the enhancer and its strength are displayed in Figure 5A,B, where each row represents the distribution of a feature's SHAP values. The larger a SHAP value is in the positive direction, the more that feature contributes in the positive direction, and vice versa. The magnitude of the feature value is represented by color: red for high feature values and blue for low ones. For instance, for PSTNPdss39 and CKSNAP4, a larger feature value corresponds to a stronger positive SHAP value. A SHAP value greater than 0 denotes that the prediction favors the positive class, and the variously colored points show the distribution of each feature's impact on the model output; conversely, a SHAP value of less than zero indicates a tendency toward the negative class.
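A typical way to produce such a summary plot is shown below; this hedged sketch explains a tree-based stand-in model on synthetic data with the `shap` package, rather than reproducing the exact NEPERS analysis:

```python
import numpy as np
import shap
from xgboost import XGBClassifier

# Synthetic stand-in data; in the paper this would be the fused feature
# matrix with labels for enhancers/non-enhancers or strong/weak enhancers.
rng = np.random.default_rng(0)
X = rng.random((200, 30))
y = rng.integers(0, 2, 200)
names = [f"feat_{i}" for i in range(30)]

model = XGBClassifier(n_estimators=200, eval_metric="logloss").fit(X, y)
explainer = shap.TreeExplainer(model)   # Shapley values for tree ensembles
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X, feature_names=names, max_display=20)
```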

6. Conclusions

This study includes a comprehensive analysis of existing enhancer predictors based on ML techniques, ensemble approaches, and DL methods. We performed a comparative analysis and performance assessment of all available predictors in terms of dataset utilization, feature construction, and ACC and MCC outcomes. We observed that enhancer prediction relies on two datasets: a benchmark dataset for discriminating enhancers from other regulatory elements and estimating enhancer strength, and an independent dataset for testing. In terms of efficiency and acceptability, our comparative analysis identified two existing approaches that produce the best results: iEnhancer-5Step for the $S_B$ dataset and Enhancer-RNN for the $S^{+}$ dataset. This study also provides information that will help design and develop new predictors for the precise and reliable identification and prediction of enhancers and their strengths. Our model incorporates a cascade-type structure consisting of two layers, where enhancers and non-enhancers are separated at the first level and strong and weak enhancers are differentiated at the second level. We used the PSTNPss, PSTNPdss, CKSNAP, and NCP encodings in conjunction with the CDF technique, first evaluating the individual feature encodings and then the performance of feature fusion and selection. Using the benchmark dataset, we evaluated how well our suggested classifier, CDF, performed in comparison with other classifiers, namely RF, SVM, XGB, GBDT, LightGBM, DNN, MLP, and CNN; the performances of these classifiers were compared via 10-fold CV and independent tests. CDF clearly outperforms the other classifiers on all measurement parameters, making it a better predictor than the other models. It is anticipated that, with the help of this comparative analysis and assessment, and based on the outcomes of the proposed method, future researchers will find NEPERS effective in predicting enhancers and their strengths.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/info14120636/s1. Table S1: Hyperparameter search details for Cascade deep forest classifiers; Table S2: Performance analysis of various feature subsets via benchmark and independent datasets.

Author Contributions

Conceptualization, M.G., S.A. and M.K.; methodology, M.G. and S.A.; validation, M.G., S.A. and M.K.; investigation, M.G., S.A., M.K. and M.H.; data curation, M.G. and S.A.; writing—original draft preparation, M.G., S.A., M.K. and M.H.; writing—review and editing, M.G., S.A., M.K. and M.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors are very thankful to the University of Management and Technology, Lahore, Pakistan for providing support for this work.

Conflicts of Interest

All the authors of this manuscript declare no conflict of interest.

References

  1. Tahir, M.; Hayat, M. iNuc-STNC: A sequence-based predictor for identification of nucleosome positioning in genomes by extending the concept of SAAC and Chou's PseAAC. Mol. BioSyst. 2016, 12, 2587–2593.
  2. Akui, T.; Fujiwara, K.; Sato, G.; Takinoue, M.; Shin-ichiro, M.N.; Doi, N. System concentration shift as a regulator of transcription-translation system within liposomes. iScience 2021, 24, 102859.
  3. Hu, Z.; Tee, W.W. Enhancers and chromatin structures: Regulatory hubs in gene expression and diseases. Biosci. Rep. 2017, 37, BSR20160183.
  4. Liu, B.; Fang, L.; Long, R.; Lan, X.; Chou, K.-C. iEnhancer-2L: A two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition. Bioinformatics 2016, 32, 362–369.
  5. Jia, C.; He, W. EnhancerPred: A predictor for discovering enhancers based on the combination and selection of multiple features. Sci. Rep. 2016, 6, 38741.
  6. Tahir, M.; Hayat, M.; Kabir, M. Sequence based predictor for discrimination of enhancer and their types by applying general form of Chou's trinucleotide composition. Comput. Methods Programs Biomed. 2017, 146, 69–75.
  7. Le, N.Q.K.; Yapp, E.K.Y.; Ho, Q.-T.; Nagasundaram, N.; Ou, Y.-Y.; Yeh, H.-Y. iEnhancer-5Step: Identifying enhancers using hidden information of DNA sequences via Chou's 5-step rule and word embedding. Anal. Biochem. 2019, 571, 53–61.
  8. Yang, H.; Wang, S. Identifying enhancers and their strength based on PCWM feature by a two-layer predictor. In Proceedings of the Fifth International Conference on Biological Information and Biomedical Engineering, Hangzhou, China, 20–22 July 2021; pp. 1–8.
  9. Lim, D.Y.; Khanal, J.; Tayara, H.; Chong, K.T. iEnhancer-RF: Identifying enhancers and their strength by enhanced feature representation using random forest. Chemom. Intell. Lab. Syst. 2021, 212, 104284.
  10. Liang, Y.; Zhang, S.; Qiao, H.; Cheng, Y. iEnhancer-MFGBDT: Identifying enhancers and their strength by fusing multiple features and gradient boosting decision tree. Math. Biosci. Eng. 2021, 18, 8797–8814.
  11. Liu, B.; Li, K.; Huang, D.-S.; Chou, K.-C. iEnhancer-EL: Identifying enhancers and their strength with ensemble learning approach. Bioinformatics 2018, 34, 3835–3842.
  12. Cai, L.; Ren, X.; Fu, X.; Peng, L.; Gao, M.; Zeng, X. iEnhancer-XG: Interpretable sequence-based enhancers and their strength predictor. Bioinformatics 2021, 37, 1060–1067.
  13. Niu, K.; Luo, X.; Zhang, S.; Teng, Z.; Zhang, T.; Zhao, Y. iEnhancer-EBLSTM: Identifying enhancers and strengths by ensembles of bidirectional long short-term memory. Front. Genet. 2021, 12, 385.
  14. Nguyen, Q.H.; Nguyen-Vo, T.-H.; Le, N.Q.K.; Do, T.T.; Rahardja, S.; Nguyen, B.P. iEnhancer-ECNN: Identifying enhancers and their strength using ensembles of convolutional neural networks. BMC Genom. 2019, 20, 951.
  15. Li, Q.; Xu, L.; Li, Q.; Zhang, L. Identification and classification of enhancers using dimension reduction technique and recurrent neural network. Comput. Math. Methods Med. 2020, 2020, 8852258.
  16. Asim, M.N.; Ibrahim, M.A.; Malik, M.I.; Dengel, A.; Ahmed, S. Enhancer-DSNet: A supervisedly prepared enriched sequence representation for the identification of enhancers and their strength. In Proceedings of the International Conference on Neural Information Processing, Bangkok, Thailand, 23–27 November 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 38–48.
  17. Mu, X.; Wang, Y.; Duan, M.; Liu, S.; Li, F.; Wang, X.; Zhang, K.; Huang, L.; Zhou, F. A novel position-specific encoding algorithm (SeqPose) of nucleotide sequences and its application for detecting enhancers. Int. J. Mol. Sci. 2021, 22, 3079.
  18. Yang, H.; Wang, S.; Xia, X. iEnhancer-RD: Identification of enhancers and their strength using RKPK features and deep neural networks. Anal. Biochem. 2021, 630, 114318.
  19. Le, N.Q.K.; Ho, Q.-T.; Nguyen, T.-T.-D.; Ou, Y.-Y. A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information. Brief. Bioinform. 2021, 22, bbab005.
  20. Inayat, N.; Khan, M.; Iqbal, N.; Khan, S.; Raza, M.; Khan, D.M.; Khan, A.; Wei, D.Q. iEnhancer-DHF: Identification of enhancers and their strengths using optimize deep neural network with multiple features extraction methods. IEEE Access 2021, 9, 40783–40796.
  21. MacPhillamy, C.; Alinejad-Rokny, H.; Pitchford, W.S.; Low, W.Y. Cross-species enhancer prediction using machine learning. Genomics 2022, 114, 110454.
  22. Chen, Z.; Zhao, P.; Li, F.; Marquez-Lago, T.T.; Leier, A.; Revote, J.; Zhu, Y.; Powell, D.R.; Akutsu, T.; Webb, G. iLearn: An integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief. Bioinform. 2020, 21, 1047–1057.
  23. Liu, B.; Gao, X.; Zhang, H. BioSeq-Analysis2.0: An updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches. Nucleic Acids Res. 2019, 47, e127.
  24. He, W.; Jia, C.; Duan, Y.; Zou, Q. 70ProPred: A predictor for discovering sigma70 promoters based on combining multiple features. BMC Syst. Biol. 2018, 12, 99–107.
  25. Chen, Z.; Zhao, P.; Li, F.; Leier, A.; Marquez-Lago, T.T.; Wang, Y.; Webb, G.I.; Smith, A.I.; Daly, R.J.; Chou, K.-C. iFeature: A Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 2018, 34, 2499–2502.
  26. Xu, H.; Jia, P.; Zhao, Z. Deep4mC: Systematic assessment and computational prediction for DNA N4-methylcytosine sites by deep learning. Brief. Bioinform. 2021, 22, bbaa099.
  27. Tahir, M.; Tayara, H.; Hayat, M.; Chong, K.T. kDeepBind: Prediction of RNA-proteins binding sites using convolution neural network and k-gram features. Chemom. Intell. Lab. Syst. 2021, 208, 104217.
  28. Bi, Y.; Xiang, D.; Ge, Z.; Li, F.; Jia, C.; Song, J. An interpretable prediction model for identifying N7-methylguanosine sites based on XGBoost and SHAP. Mol. Ther. Nucleic Acids 2020, 22, 362–372.
  29. Chen, W.; Yang, H.; Feng, P.; Ding, H.; Lin, H. iDNA4mC: Identifying DNA N4-methylcytosine sites based on nucleotide chemical properties. Bioinformatics 2017, 33, 3518–3523.
  30. Zhang, M.; Sun, J.-W.; Liu, Z.; Ren, M.-W.; Shen, H.-B.; Yu, D.-J. Improving N6-methyladenosine site prediction with heuristic selection of nucleotide physical–chemical properties. Anal. Biochem. 2016, 508, 104–113.
  31. Nguyen-Vo, T.-H.; Nguyen, Q.H.; Do, T.T.; Nguyen, T.-N.; Rahardja, S.; Nguyen, B.P. iPseU-NCP: Identifying RNA pseudouridine sites using random forest and NCP-encoded features. BMC Genom. 2019, 20, 971.
  32. Tahir, M.; Tayara, H.; Hayat, M.; Chong, K.T. Intelligent and robust computational prediction model for DNA N4-methylcytosine sites via natural language processing. Chemom. Intell. Lab. Syst. 2021, 217, 104391.
  33. Zhou, Z.-H.; Feng, J. Deep forest: Towards an alternative to deep neural networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017.
  34. Guo, Y.; Liu, S.; Li, Z.; Shang, X. BCDForest: A boosting cascade deep forest model towards the classification of cancer subtypes based on gene expression data. BMC Bioinform. 2018, 19, 118.
  35. Arif, M.; Kabir, M.; Ahmad, S.; Khan, A.; Ge, F.; Khelifi, A.; Yu, D.-J. DeepCPPred: A deep learning framework for the discrimination of cell-penetrating peptides and their uptake efficiencies. IEEE/ACM Trans. Comput. Biol. Bioinform. 2022, 19, 2749–2759.
  36. Wang, M.; Li, F.; Wu, H.; Liu, Q.; Li, S. PredPromoter-MF(2L): A novel approach of promoter prediction based on multi-source feature fusion and deep forest. Interdiscip. Sci. Comput. Life Sci. 2022, 14, 697–711.
  37. Jia, C.; Bi, Y.; Chen, J.; Leier, A.; Li, F.; Song, J. PASSION: An ensemble neural network approach for identifying the binding sites of RBPs on circRNAs. Bioinformatics 2020, 36, 4276–4282.
  38. Shoombuatong, W.; Basith, S.; Pitti, T.; Lee, G.; Manavalan, B. THRONE: A new approach for accurate prediction of human RNA N7-methylguanosine sites. J. Mol. Biol. 2022, 434, 167549.
  39. Charoenkwan, P.; Ahmed, S.; Nantasenamat, C.; Quinn, J.M.W.; Moni, M.A.; Lio', P.; Shoombuatong, W. AMYPred-FRL is a novel approach for accurate prediction of amyloid proteins by using feature representation learning. Sci. Rep. 2022, 12, 7697.
  40. Schaduangrat, N.; Nantasenamat, C.; Prachayasittikul, V.; Shoombuatong, W. Meta-iAVP: A sequence-based meta-predictor for improving the prediction of antiviral peptides using effective feature representation. Int. J. Mol. Sci. 2019, 20, 5743.
  41. Li, G.-Q.; Liu, Z.; Shen, H.-B.; Yu, D.-J. TargetM6A: Identifying N6-methyladenosine sites from RNA sequences via position-specific nucleotide propensities and a support vector machine. IEEE Trans. Nanobiosci. 2016, 15, 674–682.
  42. Dietterich, T.G. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems; Springer: Berlin/Heidelberg, Germany, 2000; pp. 1–15.
  43. Ribeiro, M.T.; Singh, S.; Guestrin, C. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144.
  44. Domingos, P. A few useful things to know about machine learning. Commun. ACM 2012, 55, 78–87.
  45. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
  46. Ahmad, S.; Charoenkwan, P.; Quinn, J.M.; Moni, M.A.; Hasan, M.M.; Lio', P.; Shoombuatong, W. SCORPION is a stacking-based ensemble learning framework for accurate prediction of phage virion proteins. Sci. Rep. 2022, 12, 4106.
Figure 1. Framework for the proposed method, NEPERS, for the prediction of enhancers and their strength.
Figure 2. Performance analysis of various features via benchmark datasets on layer 1 and layer 2. (A) shows the performance metrics values for layer 1, (B) shows the metrics values for layer 2, (C) represents the ROC curve for layer 1, and (D) indicates the ROC curve for layer 2.
Figure 3. Performance analysis of various features for independent datasets on layer 1 and layer 2. (A) shows the performance metrics values for layer 1, (B) shows the metrics values for layer 2, (C) represents the ROC curve for layer 1, and (D) indicates the ROC curve for layer 2.
Figure 4. Performance analysis of various classifiers. (A) indicates layer 1 and (C) layer 2 for the benchmark dataset, whereas the independent dataset results are represented in (B) for layer 1 and (D) for layer 2.
Figure 5. Representation of essential features utilized by NEPERS in enhancer predictions: SHAP values indicate the directionality of the top 20 features, with negative and positive SHAP values influencing the predictions for layer 1 (A) and layer 2 (B), respectively.
Table 1. A summary of training and independent test datasets used in enhancer predictors.
| Type | Dataset | Samples | CD-HIT Threshold |
|---|---|---|---|
| Enhancer | S_B (layer 1) | 1484 | 0.8 |
| Non-enhancer | S_B (layer 1) | 1484 | 0.8 |
| Strong enhancer | S+ (layer 2) | 742 | 0.8 |
| Weak enhancer | S+ (layer 2) | 742 | 0.8 |
| Enhancer | S_IND (layer 1) | 200 | 0.8 |
| Non-enhancer | S_IND (layer 1) | 200 | 0.8 |
| Strong enhancer | S_IND+ (layer 2) | 100 | 0.8 |
| Weak enhancer | S_IND+ (layer 2) | 100 | 0.8 |
Table 2. Performance analysis of various features via benchmark and independent datasets. (B): benchmark dataset; (Ind): independent testing.
| Layers | Features | ACC (B) | Sen (B) | Spe (B) | MCC (B) | AUC (B) | ACC (Ind) | Sen (Ind) | Spe (Ind) | MCC (Ind) | AUC (Ind) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Layer 1 | PSTNPss | 0.810 | 0.800 | 0.820 | 0.630 | 0.890 | 0.853 | 0.845 | 0.850 | 0.715 | 0.948 |
| Layer 1 | PSTNPdss | 0.801 | 0.837 | 0.763 | 0.607 | 0.874 | 0.778 | 0.835 | 0.720 | 0.559 | 0.852 |
| Layer 1 | CKSNAP | 0.761 | 0.713 | 0.809 | 0.528 | 0.839 | 0.748 | 0.700 | 0.795 | 0.497 | 0.822 |
| Layer 1 | NCP | 0.744 | 0.711 | 0.777 | 0.493 | 0.827 | 0.503 | 1.000 | 0.005 | 0.050 | 0.798 |
| Layer 1 | All features | 0.876 | 0.864 | 0.888 | 0.753 | 0.940 | 0.863 | 0.865 | 0.860 | 0.725 | 0.949 |
| Layer 2 | PSTNPss | 0.916 | 0.939 | 0.892 | 0.833 | 0.973 | 0.855 | 0.800 | 0.910 | 0.714 | 0.934 |
| Layer 2 | PSTNPdss | 0.658 | 0.751 | 0.565 | 0.322 | 0.710 | 0.805 | 0.760 | 0.850 | 0.612 | 0.910 |
| Layer 2 | CKSNAP | 0.635 | 0.733 | 0.536 | 0.276 | 0.688 | 0.655 | 0.780 | 0.530 | 0.320 | 0.713 |
| Layer 2 | NCP | 0.606 | 0.684 | 0.527 | 0.216 | 0.646 | 0.650 | 0.800 | 0.500 | 0.314 | 0.730 |
| Layer 2 | All features | 0.959 | 0.960 | 0.958 | 0.918 | 0.990 | 0.890 | 0.940 | 0.840 | 0.784 | 0.951 |
Table 3. Performance analysis of various classifiers via benchmark and independent datasets using all features. (10CV): 10-fold cross-validation; (Ind): independent testing.
| Layers | Classifiers | ACC (10CV) | Sen (10CV) | Spe (10CV) | MCC (10CV) | AUC (10CV) | ACC (Ind) | Sen (Ind) | Spe (Ind) | MCC (Ind) | AUC (Ind) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Layer 1 | RF | 0.781 | 0.754 | 0.808 | 0.568 | 0.862 | 0.810 | 0.800 | 0.820 | 0.620 | 0.900 |
| Layer 1 | SVM | 0.736 | 0.721 | 0.751 | 0.476 | 0.817 | 0.770 | 0.800 | 0.740 | 0.541 | 0.846 |
| Layer 1 | XGB | 0.785 | 0.765 | 0.805 | 0.575 | 0.875 | 0.780 | 0.765 | 0.795 | 0.560 | 0.828 |
| Layer 1 | GBDT | 0.797 | 0.781 | 0.814 | 0.600 | 0.878 | 0.763 | 0.730 | 0.795 | 0.526 | 0.858 |
| Layer 1 | LightGBM | 0.818 | 0.804 | 0.832 | 0.640 | 0.901 | 0.798 | 0.770 | 0.825 | 0.596 | 0.851 |
| Layer 1 | DNN | 0.851 | 0.849 | 0.852 | 0.704 | 0.930 | 0.485 | 0.135 | 0.835 | −0.042 | 0.477 |
| Layer 1 | MLP | 0.627 | 0.579 | 0.676 | 0.255 | 0.632 | 0.715 | 0.765 | 0.665 | 0.432 | 0.727 |
| Layer 1 | CNN | 0.712 | 0.682 | 0.742 | 0.430 | 0.781 | 0.733 | 0.805 | 0.660 | 0.470 | 0.804 |
| Layer 1 | CDF | 0.876 | 0.864 | 0.888 | 0.753 | 0.940 | 0.863 | 0.865 | 0.860 | 0.725 | 0.948 |
| Layer 2 | RF | 0.801 | 0.826 | 0.775 | 0.604 | 0.887 | 0.790 | 0.760 | 0.820 | 0.581 | 0.871 |
| Layer 2 | SVM | 0.588 | 0.574 | 0.601 | 0.176 | 0.620 | 0.740 | 0.940 | 0.540 | 0.524 | 0.850 |
| Layer 2 | XGB | 0.841 | 0.856 | 0.826 | 0.684 | 0.920 | 0.770 | 0.780 | 0.760 | 0.540 | 0.840 |
| Layer 2 | GBDT | 0.857 | 0.863 | 0.851 | 0.714 | 0.930 | 0.825 | 0.730 | 0.920 | 0.662 | 0.859 |
| Layer 2 | LightGBM | 0.693 | 0.722 | 0.664 | 0.388 | 0.754 | 0.810 | 0.810 | 0.810 | 0.620 | 0.913 |
| Layer 2 | DNN | 0.635 | 0.606 | 0.665 | 0.275 | 0.697 | 0.525 | 0.060 | 0.990 | 0.136 | 0.540 |
| Layer 2 | MLP | 0.522 | 0.837 | 0.207 | 0.054 | 0.528 | 0.630 | 1.000 | 0.260 | 0.387 | 0.635 |
| Layer 2 | CNN | 0.673 | 0.672 | 0.674 | 0.348 | 0.724 | 0.640 | 0.710 | 0.570 | 0.283 | 0.688 |
| Layer 2 | CDF | 0.959 | 0.960 | 0.958 | 0.918 | 0.990 | 0.890 | 0.940 | 0.840 | 0.784 | 0.951 |
Table 4. Performance comparison of various predictors via benchmark dataset. (L1): layer 1; (L2): layer 2; "–": not reported.
| Predictors | ACC (L1) | Sen (L1) | Spe (L1) | MCC (L1) | AUC (L1) | ACC (L2) | Sen (L2) | Spe (L2) | MCC (L2) | AUC (L2) |
|---|---|---|---|---|---|---|---|---|---|---|
| iEnhancer-2L | 0.769 | 0.781 | 0.759 | 0.540 | 0.850 | 0.619 | 0.622 | 0.618 | 0.240 | 0.660 |
| EnhancerPred | 0.773 | 0.719 | 0.828 | 0.550 | – | 0.682 | 0.712 | 0.652 | 0.360 | – |
| Enhancer-TNC | 0.773 | 0.758 | 0.786 | 0.550 | – | 0.647 | 0.720 | 0.544 | 0.300 | – |
| iEnhancer-5Step | 0.823 | 0.811 | 0.835 | 0.650 | – | 0.681 | 0.753 | 0.608 | 0.370 | – |
| Enhancer-PCWM | 0.815 | 0.816 | 0.814 | 0.631 | 0.895 | 0.631 | 0.829 | 0.434 | 0.286 | 0.692 |
| iEnhancer-RF | 0.762 | 0.736 | 0.787 | 0.526 | 0.840 | 0.625 | 0.684 | 0.566 | 0.253 | 0.670 |
| iEnhancer-EL | 0.781 | 0.756 | 0.804 | 0.561 | 0.855 | 0.651 | 0.690 | 0.611 | 0.315 | 0.696 |
| iEnhancer-XG | 0.811 | 0.757 | 0.865 | 0.626 | – | 0.668 | 0.749 | 0.585 | 0.334 | – |
| iEnhancer-EBLSTM | 0.772 | 0.755 | 0.795 | 0.534 | 0.835 | 0.658 | 0.812 | 0.536 | 0.324 | 0.688 |
| iEnhancer-MFBGDT | 0.787 | 0.775 | 0.798 | 0.574 | 0.862 | 0.660 | 0.706 | 0.616 | 0.323 | 0.719 |
| iEnhancer-ECNN | 0.769 | 0.782 | 0.752 | 0.537 | 0.832 | 0.678 | 0.791 | 0.564 | 0.368 | 0.748 |
| Enhancer-RNN | 0.767 | 0.733 | 0.801 | 0.699 | – | 0.849 | 0.858 | 0.840 | 0.699 | – |
| Enhancer-DSNet | 0.760 | 0.760 | 0.760 | 0.520 | – | 0.630 | 0.630 | 0.670 | 0.260 | – |
| spEnhancer | 0.779 | 0.708 | 0.850 | 0.523 | 0.846 | 0.642 | 0.850 | 0.305 | 0.211 | 0.614 |
| iEnhancer-RD | 0.788 | 0.810 | 0.570 | 0.576 | 0.844 | 0.705 | 0.840 | 0.570 | 0.426 | 0.792 |
| Enhancer-BERT-2D | 0.800 | 0.765 | 0.729 | 0.531 | – | – | – | – | – | – |
| iEnhancer-DHF | 0.801 | 0.858 | 0.863 | 0.722 | 0.910 | 0.696 | 0.696 | 0.693 | 0.392 | 0.720 |
| NEPERS | 0.876 | 0.864 | 0.888 | 0.753 | 0.940 | 0.959 | 0.960 | 0.958 | 0.918 | 0.990 |
Table 5. Performance comparison of various predictors via independent dataset. (L1): layer 1; (L2): layer 2; "–": not reported.
| Predictors | ACC (L1) | Sen (L1) | Spe (L1) | MCC (L1) | AUC (L1) | ACC (L2) | Sen (L2) | Spe (L2) | MCC (L2) | AUC (L2) |
|---|---|---|---|---|---|---|---|---|---|---|
| iEnhancer-5Steps | 0.790 | 0.820 | 0.760 | 0.580 | – | 0.635 | 0.740 | 0.530 | 0.280 | – |
| Enhancer-PCWM | 0.770 | 0.785 | 0.755 | 0.540 | 0.821 | 0.695 | 0.810 | 0.580 | 0.401 | 0.756 |
| iEnhancer-RF | 0.797 | 0.785 | 0.810 | 0.595 | – | 0.850 | 0.930 | 0.770 | 0.709 | – |
| iEnhancer-EL | 0.747 | 0.710 | 0.785 | 0.496 | 0.817 | 0.610 | 0.540 | 0.680 | 0.222 | 0.680 |
| iEnhancer-XG | 0.757 | 0.740 | 0.775 | 0.515 | – | 0.635 | 0.700 | 0.570 | 0.272 | – |
| iEnhancer-EBLSTM | 0.728 | 0.774 | 0.726 | 0.498 | 0.788 | 0.622 | 0.740 | 0.572 | 0.310 | 0.664 |
| iEnhancer-MFGBDT | 0.775 | 0.768 | 0.796 | 0.561 | 0.859 | 0.685 | 0.726 | 0.668 | 0.386 | 0.752 |
| iEnhancer-ECNN | 0.769 | 0.785 | 0.752 | 0.537 | 0.832 | 0.678 | 0.791 | 0.564 | 0.368 | 0.748 |
| iEnhancer-DSNet | 0.780 | 0.780 | 0.770 | 0.560 | – | 0.830 | 0.830 | 0.670 | 0.700 | – |
| spEnhancer | 0.772 | 0.830 | 0.715 | 0.579 | 0.824 | 0.620 | 0.910 | 0.330 | 0.370 | 0.625 |
| iEnhancer-RD | 0.788 | 0.810 | 0.765 | 0.576 | 0.844 | 0.705 | 0.840 | 0.570 | 0.892 | 0.792 |
| Enhancer-BERT-2D | 0.756 | 0.800 | 0.712 | 0.514 | – | – | – | – | – | – |
| iEnhancer-DHF | 0.832 | 0.821 | 0.844 | 0.711 | – | 0.675 | 0.659 | 0.687 | 0.312 | – |
| NEPERS | 0.863 | 0.865 | 0.860 | 0.725 | 0.948 | 0.890 | 0.940 | 0.840 | 0.784 | 0.951 |