DeepSGP: Deep Learning for Gene Selection and Survival Group Prediction in Glioblastoma

Kirtania, Ritaban; Banerjee, Subhashis; Laha, Sayantan; Shankar, B. Uma; Chatterjee, Raghunath; Mitra, Sushmita

doi:10.3390/electronics10121463


by Ritaban Kirtania ^1,†, Subhashis Banerjee ^1,†, Sayantan Laha ^2,†, B. Uma Shankar ^1,†, Raghunath Chatterjee ^2,† and Sushmita Mitra ^1,*,†

^1 Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India

^2 Human Genetics Unit, Indian Statistical Institute, Kolkata 700108, India

^* Author to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Electronics 2021, 10(12), 1463; https://doi.org/10.3390/electronics10121463

Submission received: 17 April 2021 / Revised: 10 June 2021 / Accepted: 14 June 2021 / Published: 18 June 2021

(This article belongs to the Special Issue 10th Anniversary of Electronics: Recent Advances in Artificial Intelligence)


Abstract:

Glioblastoma Multiforme (GBM) is an aggressive form of glioma, exhibiting very poor survival. Genomic input, in the form of RNA sequencing data (RNA-seq), is expected to provide vital information about the characteristics of the genes that affect the Overall Survival (OS) of patients. This could have a significant impact on treatment planning. We present a new Autoencoder (AE)-based strategy for the prediction of survival (low or high) of GBM patients, using the RNA-seq data of 129 GBM samples from The Cancer Genome Atlas (TCGA). This is a novel interdisciplinary approach to integrating genomics with deep learning towards survival prediction. First, the Differentially Expressed Genes (DEGs) were selected using EdgeR. These were further reduced using correlation-based analysis. This was followed by the application of ranking with different feature subset selection and feature extraction algorithms, including the AE. In each case, fifty features were selected/extracted for subsequent prediction with different classifiers. An exhaustive study for survival group prediction, using eight different classifiers evaluated by accuracy and Area Under the Curve (AUC), established the superiority of the AE-based feature extraction method, called DeepSGP. It produced a very high average accuracy (0.83) and AUC (0.90). Of the eight classifiers using the features extracted by DeepSGP, the MLP was the best at OS prediction, with an accuracy of 0.89 and an AUC of 0.97. The biological significance of the genes extracted by the AE was also analyzed to establish their importance. Finally, the statistical significance of the predicted output of the DeepSGP algorithm was established using the concordance index.

Keywords:

survival prediction; deep learning; glioblastoma; RNA sequencing data; differentially expressed genes

1. Introduction

Cancer is a disease of genetic instability, often associated with genes directly involved in cell growth and proliferation, differentiation, survival, and apoptosis, or indirectly involved through genes participating in cell signal transduction pathways. Research over the past decade has revealed that many of the defective genes affect the same set of molecular pathways [1]; defects in only a small number of such pathways are actually responsible for many different tumor types [2]. Therefore, therapeutics directed at specific molecular pathways (involving cancer-altered genes) can be used to effectively treat different types of tumors. Current developments in genomics have enabled molecular profiling of biological specimens by simultaneously revealing the expression levels of thousands of genes. The gene expression patterns of cancer tissues can reveal its etiology, prognosis, and response to therapy and facilitate the individualized selection of therapies. Here, dimensionality reduction and feature selection help in filtering out and focusing on only those sets of differentially expressed genes, from the diseased tissue, that can be best correlated with patient prognosis and clinical outcome [3]. This helps reduce the computational complexity of the problem and make the appropriate modeling feasible.

Gliomas constitute 70% of malignant primary brain tumors in adults [4] and are usually classified as High-Grade Gliomas (HGGs) and Low-Grade Gliomas (LGGs), with the former being more aggressive and infiltrative. Glioblastoma Multiforme (GBM) is an aggressive form of HGG. It is one of the most common and invasive brain tumors, with a median survival of 12 to 15 months. Standard treatments are surgery followed by chemotherapy or radiotherapy [5]. Surgical removal of the tumor can increase survival substantially [6]. Post-surgical chemotherapy with radiation, followed by adjuvant temozolomide, was shown to improve survival further [5]. However, an estimate of the overall survival is useful for effective treatment planning. Although standard treatment protocols help enhance the survival of GBM patients, the prediction of their survival time and the identification of (possible) causative genes can increase the efficiency of these protocols. Genomic features (RNA-seq), obtained invasively, can be used for survival prediction in GBM patients.

Deep Learning (DL) is a kind of representation learning, where a machine learns from raw data to automatically discover the representations needed for regression or classification [7]. It involves multiple levels (depth) of representation, such that each transforms the representation at the preceding level into one at a higher, slightly more abstract level. Some of the inherent limitations of deep learning include high computational cost and the requirement of a large number of training data. Among the different DL models, the AE [8] is used to learn efficient data encoding through representation learning and is typically employed for dimensionality reduction.

The problem of predicting OS in brain cancer patients is challenging. It depends on multiple factors such as cancer grade, molecular biomarkers, non-biological factors (tumor location, whether it was surgically removed or reduced), age, sex, and other comorbidities. Besides, the presence of censored data causes further problems. The application of deep learning models for OS prediction is limited also due to the absence of large annotated publicly available datasets [9]. This is because training a deep model from scratch generally requires a large amount of training data to generalize, while the data remain typically scarce in medical applications with expert annotation being expensive.

The prediction of the OS of patients, from multi-modal MRI using deep learning, has started to attract attention [10]. However, it has been observed that the majority of the literature on survival prediction is based on radiomics [11,12,13,14,15,16,17,18]. Reference [19] reported a very low accuracy (51.5%) on the test set even by the best performing model. In more recent times, clinical (age) and radiomic (volumetric, shape, and texture) features from multimodal MRI scans were used [20] for two-class (short and long) and three-class (short, medium, and long) survival group prediction. It may be noted that the clinical feature (age), very obviously, contributed the most towards the high accuracy reported for survival prediction. Moreover, survival analysis, employing regression on genomic data, has also gained attention, with most of the literature attempting to find a gene signature related to the survival of a patient [21,22,23,24,25,26,27,28,29]. A study to discover genes with prognostic potential for GBM patients was reported [9], using an MLP with two hidden layers. The selection of 27 significant genes was reported from the TCGA-GBM gene expression data employing survival analysis.

In this research, OS prediction was transformed into a classification problem by dividing the patients into two survival groups, viz. high and low. The division was based on a threshold, determined by selecting an equal number of samples on either side of the median survival point [30]. This was a novel interdisciplinary approach integrating deep learning with genomics towards survival prediction.

The contributions and output of this article are as follows:

Development of a new AE-based feature extraction model (DeepSGP) for survival group prediction in glioblastoma;
A novel interdisciplinary application to survivability prediction, in terms of genes, using deep learning;
An exhaustive experimental study, including comparison with state-of-the-art (related) feature extraction and feature selection algorithms;
Quantitative evaluation to establish the statistically significant performance of the model;
Analysis of the biological significance of genes extracted by the AE.

The rest of the paper is organized as follows. The list of symbols is given in Table 1. Preprocessing is reported in Section 2. Section 3 describes the methods for the selection of the genes. Section 4 provides the details on dimensionality reduction. This is followed by the experimental results in Section 5, encompassing the comparative study with the related methods and analysis of the biological significance of the extracted genes. Finally, Section 6 concludes the article.

2. Preprocessing

Here, we present the dataset preparation for the simulation study, based on TCGA-GBM RNA sequencing data (https://www.cancer.gov/tcga (accessed on 8 November 2018)). The Cancer Genome Atlas Program provided 174 samples containing RNA sequencing (RNA-seq) data, comprising 5 control and 169 disease samples. The control data being the normal samples, these were not found suitable for this study; hence, they were discarded. After eliminating the 40 censored samples, the threshold between the high and low survival groups was selected based on the median [30], computed at 384 days on the remaining 129 samples. Note that the sample median can be treated as a good representation of the population statistics [30]. This yielded 64 samples (with high survival) and 65 samples (with low survival) of glioblastoma cases, to formulate the two-class framework. The raw counts of RNA-seq for these 129 samples were employed in our experiments. Here, the RNA-seq data for each sample consisted of the raw read counts of 60,483 genes, which included both protein-coding and non-coding genes.
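The median-based grouping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the survival values are made up (not TCGA data), the helper name `split_by_median` is hypothetical, and assigning a patient who sits exactly on the median to the low group is an assumption, since the text does not say how such a tie was handled.

```python
from statistics import median

def split_by_median(survival_days):
    """Label each uncensored patient 'high' or 'low' relative to the median OS.

    Assumption: a value exactly equal to the median is labelled 'low'.
    """
    threshold = median(survival_days)
    labels = ["high" if d > threshold else "low" for d in survival_days]
    return threshold, labels

# Hypothetical overall-survival values in days (illustrative only).
os_days = [120, 200, 384, 450, 900, 95, 610]
threshold, labels = split_by_median(os_days)
print(threshold, labels.count("high"), labels.count("low"))
```

With an odd number of samples, as in the 129-sample cohort, the two groups necessarily differ in size by one.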

The EdgeR package [31] (in R) was used to prepare the data. It is a Bioconductor software package that detects differentially expressed genes from normalized read count data. The raw count of the RNA sequence, with a cutoff of 200 per million, was employed for low abundance data cleaning, to estimate the effective library size of the entire data towards normalization. Genes with low expression values are basically noise and hinder the evaluation of differentially expressed genes. Next, the RNA-seq dataset was normalized by adjusting the library sizes to minimize the log-fold changes. A major step in the analysis of Differentially Expressed Genes (DEGs) is to estimate their dispersion parameters. EdgeR uses an overdispersed Poisson model to analyze the biological and statistical variances, with the empirical Bayes method being employed to normalize the degree of overdispersion across transcripts.

Differential expression analysis [32] was used to select groups of DEGs having significant quantitative changes in expression levels between the experimental groups (involving normalized read count data and statistical analysis) reporting high and low survival in glioblastoma samples. With a threshold of p < 0.002 [31] (indicating the significance of genes), a total of 877 DEGs were extracted by the above procedure. These 877 genes were considered as our dataset for subsequent feature (gene) selection or extraction.

A volcano plot [33] was generated using the differential expression analysis results obtained from EdgeR. Volcano plots are a useful means of visualizing the data in RNA-seq, proteomics, and other high-throughput experiments. A volcano plot is a scatter plot in which the statistical significance (p-value) of each gene is plotted against the fold change of its expression. From this plot, one can easily perceive changes in large datasets and thereby identify the best candidate genes relevant to the study. We used the Bioconductor package “EnhancedVolcano” to construct the volcano plot. As in the differential expression analysis, we took a p-value threshold of less than 0.002 to denote significance. Additionally, we used a log fold-change cut-off of one, signifying at least a two-fold change in the expression between the two conditions. The volcano plot is provided in Figure 1. The genes marked in red were statistically significant on account of both their p-values and their fold changes. The genes represented in blue, on the other hand, were significant in terms of their p-values (≤0.002) but not in terms of the magnitude of their fold changes (which were less than two-fold). In other words, the genes depicted in “red” were deregulated by at least a two-fold change in expression between the two conditions (high- vs. low-survival GBM). In general, these two groups of genes were relevant in our study for further downstream analyses.
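The colour coding of the volcano plot amounts to two threshold tests per gene. The sketch below mirrors the cut-offs quoted in the text (p < 0.002 and a log fold-change cut-off of one); the function name `volcano_category` and the example values are hypothetical, and this is not the API of the EnhancedVolcano R package.

```python
def volcano_category(p_value, log2_fc, p_cut=0.002, lfc_cut=1.0):
    """Classify a gene the way the volcano plot colours it.

    'red'  : significant p-value AND at least a two-fold change
    'blue' : significant p-value only
    """
    significant = p_value < p_cut
    large_change = abs(log2_fc) >= lfc_cut
    if significant and large_change:
        return "red"
    if significant:
        return "blue"
    return "not significant"

# Hypothetical (p-value, log2 fold change) pairs, for illustration only.
examples = {"geneA": (0.0005, 1.4), "geneB": (0.001, -0.6), "geneC": (0.03, 2.0)}
for gene, (p, lfc) in examples.items():
    print(gene, volcano_category(p, lfc))
```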

As some samples in the TCGA-GBM RNA sequencing dataset had a very high expression value for a particular gene, as compared to that of the majority of samples in the same class, this was likely to be noise. Therefore, we did not consider $\log_{2}(FC)$, where $FC$ (Fold Change) is the ratio of the expression values of a particular gene between the two classes, as a deciding criterion during DEG selection. Low abundance cleaning based on expression count rate and a p-value less than 0.002 were the deciding criteria.

Due to high correlation among the genes of RNA-seq data, it became necessary to reduce the redundancy amongst features. Correlation was applied for dimensionality reduction. This is a statistical method that evaluates how strongly a pair of features are related. The correlation coefficient uses the standard deviation ($σ$) and covariance ($cov$) to measure the relationship between features $f_{1}$ and $f_{2}$ as:

$corr(f_{1}, f_{2}) = \frac{cov(f_{1}, f_{2})}{σ_{f_{1}} σ_{f_{2}}} .$

(1)

Correlation-based feature selection helps reduce the feature set while preserving the data characteristics. It helps select a subset of features that ensures minimum feature-to-feature correlation and redundancy. We selected only one gene from each correlated cluster, with the correlation coefficient > 0.7, based on the (minimum) p-value.
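One way to realise this filtering step is a greedy pass over the genes in order of increasing p-value, keeping a gene only if its correlation with every gene kept so far does not exceed 0.7. This is a sketch under assumptions: the text does not specify how the correlated clusters were formed, so the greedy scheme, the helper names `pearson` and `correlation_filter`, and the toy expression values are all illustrative.

```python
from math import sqrt

def pearson(a, b):
    """Correlation coefficient of Equation (1): cov(a, b) / (sd_a * sd_b)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    return cov / (sd_a * sd_b)

def correlation_filter(expr, p_values, threshold=0.7):
    """Keep the lowest-p-value gene of each group of correlated genes (greedy)."""
    kept = []
    for gene in sorted(expr, key=lambda g: p_values[g]):
        if all(pearson(expr[gene], expr[k]) <= threshold for k in kept):
            kept.append(gene)
    return kept

# Hypothetical expression profiles over five samples.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.1, 9.8],  # nearly proportional to geneA
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
}
p_values = {"geneA": 0.0010, "geneB": 0.0004, "geneC": 0.0015}
selected = correlation_filter(expr, p_values)
print(selected)
```

Here geneA is dropped because it is highly correlated with geneB, which has the smaller p-value. A gene with zero variance would need special handling, which this sketch omits.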

As the sample size of the dataset was very small (only 129) and the number of features was very high, employing deep learning became infeasible. Therefore, data augmentation was performed to enhance the sample size (from 129 to 804). Over-sampling was employed for this purpose [34], using the SMOTE algorithm [35]. The Synthetic Minority Oversampling TEchnique (SMOTE) improves random oversampling by generating additional samples based on the feature space characteristics of a dataset.
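The core idea of SMOTE, interpolating between a sample and one of its nearest same-class neighbours, can be sketched without any library. This is a simplified illustration rather than the reference algorithm or the authors' code: the function name `smote_like`, the choice of k, and the toy points are assumptions. In practice a maintained implementation, such as the `SMOTE` class in the imbalanced-learn package, would normally be used.

```python
import random

def smote_like(samples, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating towards same-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # k nearest neighbours of `base` within the same class (excluding itself)
        neighbours = sorted(
            (s for s in samples if s is not base),
            key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append([a + gap * (b - a) for a, b in zip(base, mate)])
    return synthetic

# Hypothetical two-feature samples from one survival class.
low_survival = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_like(low_survival, n_new=5)
print(len(new_points))
```

Because every synthetic point lies on a segment between two real samples of the same class, it stays inside the region already occupied by that class.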

3. Selection of Genes

The selection of features can be supervised, as well as unsupervised, depending on class information availability in the data. The algorithms are typically categorized as filter and wrapper models [36], with a differing emphasis on dimensionality reduction or accuracy enhancement. The wrapper methods assess feature subsets according to their usefulness to a given predictor. However, selecting a good set of features is usually suboptimal for building a predictor, particularly in the presence of redundant variables. Since finding the best feature subset is intractable or NP-hard [37], heuristic and non-deterministic strategies are deemed more practical. It is important to select a highly informative and non-redundant feature (gene) subset for the efficiency and accuracy of decision making.

Feature selection can be performed through ranking, as well as subset selection strategies [38]. The former computes the ranking score for each gene according to its discriminative power, followed by a simple choice of the top k genes. Although preferable for high-dimensional problems, due to its computational scalability, such subsets of selected genes remain prone to redundancy. This is because the ranking score is computed independently for each gene, while completely ignoring its correlation with others. On the contrary, the feature subset selection methods focus on choosing a subset of genes that jointly possess higher discriminative power. In general, sophisticated subset selection methods demonstrate improved decision making. However, the high computational cost usually limits their applicability to high-dimensional domains. Since a combinatorial search of the optimal combination of features is computationally prohibitive, a judicious combination of both strategies becomes desirable.

In this study, EdgeR was used to find the significantly deregulated genes between the two groups under study. Prior to carrying out the statistical testing, the low abundance genes were filtered out to improve the power of the statistical testing [39] (as described in Section 2). This preprocessing was followed by a sequential combination of ranking and subset selection in order to select the relevant genes. The entire process is outlined in Figure 2.

3.1. Ranking

The simple two-tailed t-test [40], which measures the significance of the difference of the means between two distributions, was used to evaluate the discriminatory power of each individual gene in separating the two survival classes. If the features are assumed to come from normal distributions with unknown, but equal variances, the t-statistic is defined as:

$t = \frac{μ_{1}^{f} - μ_{2}^{f}}{\sqrt{\frac{(N'_{1} - 1) (σ_{1}^{f})^{2} + (N'_{2} - 1) (σ_{2}^{f})^{2}}{N'_{1} + N'_{2} - 2} \left(\frac{1}{N'_{1}} + \frac{1}{N'_{2}}\right)}},$

(2)

where $μ_{1}^{f}$, $(σ_{1}^{f})^{2}$ and $μ_{2}^{f}$, $(σ_{2}^{f})^{2}$ are the sample mean and variance of feature $f$ in the two classes, respectively, and $N'_{1}$ and $N'_{2}$ are the corresponding numbers of samples in the two classes. Here, the features correspond to the expression values of the genes.

Since the correlation between features is not considered, redundant features are inevitably selected by ranking alone, thereby affecting the classification performance. Therefore, we used this strategy only to initially select the more discriminatory genes by applying a cut-off $p < α$, where $α = 0.05$ is the significance level for rejecting the null hypothesis. Here, the p-value indicates the probability, calculated using the null distribution, of obtaining a test statistic at least as extreme as the one observed.
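The pooled-variance statistic of Equation (2) can be computed directly with the Python standard library. The sketch below is illustrative only: the function name `pooled_t` and the expression values are hypothetical. It returns the statistic but not the p-value, which additionally requires the t-distribution with $N'_{1} + N'_{2} - 2$ degrees of freedom (for example, via `scipy.stats.ttest_ind` with equal variances assumed).

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(class1, class2):
    """Two-sample t-statistic with pooled (equal) variance, as in Equation (2)."""
    n1, n2 = len(class1), len(class2)
    pooled_var = ((n1 - 1) * variance(class1) + (n2 - 1) * variance(class2)) / (n1 + n2 - 2)
    return (mean(class1) - mean(class2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

# Hypothetical expression values of one gene in the two survival classes.
high = [5.1, 4.8, 5.6, 5.0]
low = [3.9, 4.2, 3.7, 4.0]
t = pooled_t(high, low)
print(round(t, 2))
```

Swapping the two classes only flips the sign of the statistic, which is why a two-tailed test is appropriate when the direction of deregulation is not known in advance.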

3.2. Subset Selection

In the second stage, subset selection was employed to evaluate the effectiveness of the discriminative genes selected by the t-statistic, as outlined above. Let $J$ be the criterion function for feature selection, which measures the importance or usefulness of a feature (gene) subset with respect to classification. It is also known as the “relevance index” in the literature. Let $F_{k}$ be the feature that is being evaluated, $C$ be the class label, and $S$ be the set of features that are already selected.

Let $I(X; Y)$ denote the mutual information between two random variables $X$ and $Y$, such that

$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p'(x, y) \log \frac{p'(x, y)}{p'(x) p'(y)},$

where $p'(x)$ and $p'(x, y)$ are the probability and joint probability of occurrence, respectively. We have:

Mutual Information Maximization (MIM) [41]:

$J_{mim}(F_{k}) = I(F_{k}; C) .$

(3)

Here, each feature is considered independently and ranked according to the score. On the basis of the rank, the required number of features is selected;
Joint Mutual Information (JMI) [42]:

$J_{jmi}(F_{k}) = \sum_{F_{j} \in S} I(F_{k} F_{j}; C) .$

(4)

The objective is to select a feature $F_{k}$ that is complementary to the relevant features $F_{j}$ already selected;
Minimum Redundancy Maximum Relevance (MRMR) [43]:

$J_{mrmr}(F_{k}) = I(F_{k}; C) - \frac{1}{|S|} \sum_{F_{j} \in S} I(F_{k}; F_{j})$

(5)

focuses on selecting features having the maximum relevance to the class and the minimum redundancy with the features in $S$, with $|S|$ denoting the cardinality of the set;
Double Input Symmetrical Relevance (DISR) [44] is a modification of the criterion of Equation (4) and given by:

$J_{disr}(F_{k}) = \sum_{F_{j} \in S} \frac{I(F_{k} F_{j}; C)}{H(F_{k} F_{j} C)} .$

(6)

Here, $H(F_{k} F_{j} C)$ is the joint entropy;
Mutual Information Feature Selection (MIFS) [45], like Equation (5), also attempts to balance the “relevance-redundancy” tradeoff among the features. We computed:

$J_{mifs}(F_{k}) = I(F_{k}; C) - β \sum_{F_{j} \in S} I(F_{k}; F_{j}),$

(7)

with $β$ being configured experimentally. Note that $β = 0$ leads to $J_{mim}(F_{k})$, and $β = \frac{1}{|S|}$ results in $J_{mrmr}(F_{k})$. Since it has been experimentally observed [45] that $β = 1$ is often optimal, this value was used in our study;
Interaction Capping (ICAP) [46], formulated as:

$J_{icap}(F_{k}) = I(F_{k}; C) - \sum_{F_{j} \in S} \max [0, I(F_{k}; F_{j}) - I(F_{k}; F_{j} | C)],$

(8)

and Conditional Infomax Feature Extraction (CIFE) [47], given by:

$J_{cife}(F_{k}) = I(F_{k}; C) - \sum_{F_{j} \in S} I(F_{k}; F_{j}) + \sum_{F_{j} \in S} I(F_{k}; F_{j} | C),$

(9)

are two other ways to handle the relevance-redundancy tradeoff in feature selection.
Finally, a Consensus (CONSS) of all seven subsets, generated by Equations (3)–(9), was also considered to incorporate the aggregated effect of the algorithms. Those genes occurring in at least three of the seven selected subsets were considered, of which the top 50 genes were taken for the present study.
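To make the relevance-redundancy criteria concrete, the sketch below estimates mutual information from discrete values and applies the MRMR score of Equation (5) in a greedy forward search, with the MIM ranking of Equation (3) shown for contrast. It is an illustration under assumptions: gene expression is continuous and would first have to be discretised (the binning is not shown and is not described in the text), and the names `mutual_information` and `mrmr_select`, together with the toy data, are hypothetical.

```python
from collections import Counter
from math import log

def mutual_information(x, y):
    """I(X; Y) estimated from the empirical frequencies of two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )

def mrmr_select(features, labels, k):
    """Greedy forward selection with the MRMR criterion of Equation (5)."""
    selected, candidates = [], set(features)
    while candidates and len(selected) < k:
        def score(name):
            relevance = mutual_information(features[name], labels)
            if not selected:
                return relevance
            redundancy = sum(mutual_information(features[name], features[s]) for s in selected)
            return relevance - redundancy / len(selected)
        best = max(sorted(candidates), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical binarised expression of three genes over eight samples.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "g1": [0, 0, 0, 1, 1, 1, 1, 1],  # informative
    "g2": [0, 0, 0, 1, 1, 1, 1, 1],  # exact duplicate of g1 (redundant)
    "g3": [0, 1, 0, 0, 1, 1, 0, 1],  # weaker, but not redundant with g1
}
mim_top = sorted(features, key=lambda g: -mutual_information(features[g], labels))[:2]
chosen = mrmr_select(features, labels, k=2)
print(mim_top, chosen)
```

In this toy case the MIM ranking keeps both copies of the duplicated gene, whereas MRMR replaces the duplicate with the less relevant but non-redundant one, which is the behaviour the redundancy term is meant to produce.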

4. Dimensionality Reduction

Although feature selection can choose individual features that are important for classification, it often neglects the interfeature relationship. An individual feature may monothetically fail to represent a dataset properly, yet polythetically, its inter-relationship with other features can function better. Feature extraction encompasses projection to a new dimension for generating transformed attributes in a reduced space without losing significant information. Given a gene dataset, it is therefore possible that this relation between genes holds the key to the accurate prediction of survival in a patient. Here lies the utility of autoencoders in dimensionality reduction.

4.1. Statistical Methods

Some of the well-known, traditional feature extraction strategies include Principal Component Analysis (PCA) [48], Kernel PCA (KPCA) [49], and Independent Component Analysis (ICA) [50]. PCA extracts the dominant information from the patterns by identifying the eigenvectors with the largest eigenvalues as the most relevant attributes. KPCA, on the other hand, is a nonlinear formulation of PCA. It employs a nonlinear map, defined through integral operator kernel functions, to extract the principal components of a dataset. ICA separates a multivariate signal into individual non-Gaussian components, such that the generated attributes remain statistically independent. It serves as an example of blind source separation.

4.2. Autoencoder

Although statistical feature selection algorithms can efficiently detect singleton features, they typically ignore the combined effect of multiple features. This intergene relationship is not properly explored in the high-dimensional genomic data. In this scenario, the latent layer of an AE allows representing a combined effect of the entire input feature set. AEs are global nonlinear dimensionality reduction methods. They capture the topology of the manifold and tend to capture more of these global characteristics than other traditional dimensionality reduction algorithms. They can thereby learn very complicated nonlinear transformations of the features that interact in a nonlinear way, as observed in the case of high-dimensional genomic data.

The autoencoder [51] is a classical unsupervised learning model used here for gene data analysis. The basic AE is depicted in Figure 3 and includes an encoder and a decoder. The encoder compresses high-dimensional data into a low-dimensional feature space. This is an effective way of dimensionality reduction [52,53], particularly for redundant gene data. The decoder is then responsible for restoring the compressed features back to their input space.

Let $X = [x_{1}, x_{2}, \dots, x_{m}] \in R^{n \times m}$ be the input, containing $m$ samples with $n$ features. The function of the encoder can be expressed as $E : X \in R^{n \times m} ⟶ Y \in R^{p \times m}$, where $p ≪ n$ and $Y$ is the output of the hidden layer. The function of the decoder is $D : Y \in R^{p \times m} ⟶ \hat{X} \in R^{n \times m}$, where $\hat{X}$ is the output of the AE. Here, $E$ and $D$ represent the encoder and decoder transformations, respectively. The Mean-Squared Error (MSE) optimization function of the AE ($MSE_{AE}$), which minimizes the reconstruction error, can be represented as:

$MSE_{AE} = \arg\min_{E, D} || X - D(E(X)) ||^{2} .$

(10)

The AE was employed to reduce the feature space in order to efficiently perform subsequent decision making in the form of survival prediction.
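The reconstruction objective can be illustrated with a deliberately tiny linear autoencoder trained by plain gradient descent. This is a stand-in under stated assumptions, not the DeepSGP network: it has a single linear encoding layer and no activation functions, the data are made up, and the names `matmul`, `transpose`, `reconstruction_error`, and `train` are hypothetical. A practical implementation would use a deep learning framework with nonlinear hidden layers.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def reconstruction_error(X, E, D):
    """Squared reconstruction error ||X - D(E(X))||^2 for linear maps E and D."""
    X_hat = matmul(D, matmul(E, X))
    return sum((x - xh) ** 2 for rx, rh in zip(X, X_hat) for x, xh in zip(rx, rh))

def train(X, E, D, steps=500, lr=0.05):
    """Gradient descent on the reconstruction error with respect to E and D."""
    for _ in range(steps):
        Y = matmul(E, X)                                        # latent code, p x m
        X_hat = matmul(D, Y)                                    # reconstruction, n x m
        R = [[xh - x for xh, x in zip(rh, rx)] for rh, rx in zip(X_hat, X)]
        grad_D = matmul(R, transpose(Y))                        # n x p (up to factor 2)
        grad_E = matmul(transpose(D), matmul(R, transpose(X)))  # p x n (up to factor 2)
        D = [[d - 2 * lr * g for d, g in zip(rd, rg)] for rd, rg in zip(D, grad_D)]
        E = [[e - 2 * lr * g for e, g in zip(re_, rg)] for re_, rg in zip(E, grad_E)]
    return E, D

# Hypothetical data: n = 3 features (rows), m = 4 samples (columns), reduced to p = 1.
X = [[0.1, 0.2, 0.3, 0.4],
     [0.2, 0.4, 0.6, 0.8],
     [0.05, 0.1, 0.15, 0.2]]
rng = random.Random(0)
E0 = [[rng.uniform(-0.1, 0.1) for _ in range(3)]]   # encoder, p x n
D0 = [[rng.uniform(-0.1, 0.1)] for _ in range(3)]   # decoder, n x p
error_before = reconstruction_error(X, E0, D0)
E1, D1 = train(X, E0, D0)
error_after = reconstruction_error(X, E1, D1)
print(error_after < error_before)
```

The three toy features are exact multiples of one another, so a single latent attribute suffices to reconstruct them; real expression data would leave a residual error that shrinks as the latent dimension $p$ grows.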

5. Experimental Results

In this section, we present the results for survival group prediction, using eight classification algorithms on the reduced feature space (involving the selection and projection of genes). Preprocessing (as in Section 2) on the RNA-seq data with EdgeR over 60,483 genes, considering 129 uncensored samples, led to the extraction of 877 DEGs. The application of correlation-based selection, with a coefficient threshold of $corr = 0.7$, resulted in the choice of a reduced set of 315 genes.

5.1. Selection of Genes

The 315 genes thus obtained were considered for feature ranking using Equation (2). A subset of 50 genes was selected by each of the seven subset selection algorithms of Equations (3)–(9), from the 315 gene set. An overall summary of the seven subsets is presented in Table 2. The maximum p-value across all the algorithms was ≤0.0019, thereby establishing the statistical significance of the selections. This was followed by considering a consensus among these seven subsets of genes, such that those present in at least three of the seven subsets were selected. Of these, the first 50 genes were chosen. The results are depicted in the last row of Table 2, as CONSS. Subsets of gene IDs, selected before and after the application of the seven filters, are reported in Supplementary Materials Sections S1 (Table S1) and S2, respectively.
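The consensus rule (keep genes that appear in at least three of the seven subsets, then take the first 50) can be sketched as a simple vote count. The ordering used to pick "the first 50" is not specified in the text, so sorting by number of votes and then by name is an assumption, as are the function name `consensus` and the toy gene labels.

```python
from collections import Counter

def consensus(subsets, min_votes=3, top_k=50):
    """Genes selected by at least `min_votes` subsets, most-voted first."""
    votes = Counter(g for subset in subsets for g in set(subset))
    eligible = [g for g, v in votes.items() if v >= min_votes]
    eligible.sort(key=lambda g: (-votes[g], g))  # assumed ordering: votes, then name
    return eligible[:top_k]

# Seven hypothetical subsets, standing in for the outputs of Equations (3)-(9).
subsets = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["A", "C", "E"],
    ["A", "B", "F"],
    ["A", "D", "G"],
    ["B", "C", "H"],
    ["A", "E", "I"],
]
print(consensus(subsets))
```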

5.2. Extraction of Reduced Attributes

Next, the AE model DeepSGP was employed for attribute extraction. The set of 315 genes was encoded to 50 transformed attributes, taken from the latent (middle) layer of the AE, which was preceded by two hidden layers with 200 and 100 nodes, respectively (Figure 3). The decoded output layer can reconstruct these back to their original gene space. Multiple AE architectures were tried in this experiment (hidden layers with 200, 150; 150, 100; and 150, 75 nodes were employed). It was experimentally observed that stepwise decrementing (and incrementing) of the number of nodes over the consecutive hidden layers (from 200 through 100 to 50) provided improved output, as compared to that of a sudden drop. The latent layer constituted the transformed attributes. The 50 projected attributes (chosen for the parity of comparison with the other methods) formed the reduced set for further investigation.

For comparison with other feature extraction procedures (explained in Section 4.1), we also extracted the most relevant 50 attributes from the 315 genes using the PCA (variance between 50.84 and 2.03), KPCA, and ICA algorithms. Finally, eight subsets of selected features (using feature selection and consensus algorithms), one subset of encoded attributes (DeepSGP), and three sets of extracted attributes were used by the eight classifiers considered in our study. Note that in each case, the 50 attributes (or features) were selected/extracted, based on their performance, for subsequent prediction with different classifiers. By increasing or decreasing the features, an improvement in performance was not evident. Moreover, taking 50 features (or attributes) in all the cases was reasonable, particularly for maintaining parity.

5.3. Classification into Survival Groups

The capability of feature selection and dimensionality reduction (including AE), using a 50-dimensional reduced input space, was next evaluated for survival class prediction of glioblastoma patients into the low and high survival classes. Eight classifiers, viz. Logistic Regression [54] (LR), Support Vector Machine [55] (SVM, linear kernel), k Nearest Neighbors [56] (kNN, averaged over k = 3, 5, and 7), Naive Bayes [57] (NB), Multilayer Perceptron [58] (MLP, ReLU activation), Classification and Regression Tree [59] (CART with entropy), Random Forest (RF, with n_estimators = 100) [60], and XGBoost (XGB, with booster gbtree) [61] were used for this purpose. The hyperparameters were selected based on several experiments. The classifiers were tested on the seven subsets of selected genes and one consensus subset of genes along with four subsets of extracted attributes. All the experiments were implemented using Python 3.6.

The results are depicted in Table 3 and Table 4, evaluated for the average Accuracy ($Acc$) and AUC, respectively. As the sample size of the dataset was very small, we did not use 10-fold cross-validation. Instead, the Leave-One-Out (LOO) strategy was used for the experiments. The metric Area Under the Curve ($AUC$) measured the distinguishability between the two survival classes (low and high). The algorithms considered included CIFE, DISR, ICAP, JMI, MIFS, MIM, MRMR, and Consensus (CONSS), for selection, and ICA, PCA, KPCA, and AE, for extraction. The average score of each algorithm, over all eight classifiers, is indicated in the last column of both tables. The AE attained the highest average accuracy (0.83) and AUC (0.90), outperforming the rest of the algorithms by a large margin. The next highest value (on average) was obtained by ICAP, with an accuracy of 0.71 and an AUC of 0.81, which was considerably lower than that of the proposed DeepSGP. The ICA performed the worst, with an average accuracy of only 0.61 and an average AUC of 0.71. From the perspective of a classifier, the best results were obtained using the MLP, with an accuracy of 0.89 and an AUC of 0.97. The F1-score (the harmonic mean of the precision and recall) is defined as $F1 = 2 \times (precision \times recall) / (precision + recall)$, where $precision = TP / (TP + FP)$ and $recall = TP / (TP + FN)$. Similarly, $specificity = TN / (TN + FP)$ and $sensitivity = TP / (TP + FN)$. Here, $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. These are depicted in Table 5. The Receiver Operating Characteristic (ROC) curve is shown in Figure 4. From Table 5 and Figure 4, it was evident that (in all cases) DeepSGP outperformed the other methods. The experiments were also performed with DeepSGP involving fewer nodes. For example, with 10, 20, 30, and 40 nodes used for testing, the average accuracy and AUC were always less than 0.80 and 0.88, respectively. It may be noted that increasing the size of the subset did not ensure better results. Subsequent analysis of the significance of the genes, decoded from the attributes selected by the AE (DeepSGP), is pursued in Section 5.4. The predictive ability of the extracted attributes from DeepSGP was estimated using the concordance index, which was found to be a reasonable 0.81. The proposed method was compared with state-of-the-art feature extraction/selection algorithms and demonstrated superiority in all cases.
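The leave-one-out protocol and the metrics defined above can be sketched as follows. A 1-nearest-neighbour rule is used purely as a stand-in classifier to keep the example self-contained; it is not one of the configurations reported in the tables. The one-dimensional data, the function names, and the choice of the high-survival class (label 1) as the positive class are likewise assumptions.

```python
def nearest_neighbour_predict(train_X, train_y, x):
    """Label of the training sample closest to x (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    return train_y[dists.index(min(dists))]

def leave_one_out(X, y):
    """Predict each sample with a model built from all the other samples."""
    preds = []
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        preds.append(nearest_neighbour_predict(train_X, train_y, X[i]))
    return preds

def binary_metrics(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # same as sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}

# Hypothetical one-attribute samples; 0 = low survival, 1 = high survival.
X = [[0.0], [0.3], [0.5], [1.2], [1.0], [1.6], [1.9], [0.9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
preds = leave_one_out(X, y)
metrics = binary_metrics(y, preds)
print(preds, metrics)
```

With N samples, leave-one-out trains N models, each tested on a single held-out sample, which is why it suits small cohorts better than a 10-fold split.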

All the experiments were conducted on a computer with an Intel Xeon E5 processor using 128 GB of RAM and an NVIDIA TITAN Xp GPU. The programming language R and Python 3.6 were used for the experiments [62,63].

5.4. Biological Significance of Extracted Genes

The weights of the AE were analyzed to infer the importance of its encoded genes towards decision making. Since our model had two hidden layers, preceding the latent layer, it became difficult to directly decode the genes based on their connectivity from the latent layer. Therefore, an analogous single-layer AE was designed with 50 nodes in its latent layer. Although this smaller AE also generated a high prediction accuracy of 0.84, it was less than that of the DeepSGP version represented in Figure 3. The genes whose normalized final weights were above a threshold of 0.5 or below a threshold of −0.5 were considered to be important. In this manner, twenty-one genes could be extracted. Their biological relevance could also be validated from the existing literature. The results are listed in Table 6, along with the corresponding references, reporting their relevance.
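
The thresholding step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the weights are normalized by the largest absolute weight, a gene is retained if any of its input-to-latent weights exceeds the threshold in magnitude, and the weight matrix and gene names are toy values. The text does not specify the normalization or aggregation actually used.

```python
def select_genes_by_weight(weights, gene_names, threshold=0.5):
    """Return genes having at least one normalized weight beyond +/-threshold.

    weights[i][j] is the weight connecting input gene i to latent node j
    (hypothetical layout). Assumption: weights are scaled by the largest
    absolute value so that they lie in [-1, 1] before thresholding.
    """
    max_abs = max(abs(w) for row in weights for w in row)
    if max_abs == 0:
        return []
    selected = []
    for name, row in zip(gene_names, weights):
        if any(abs(w / max_abs) > threshold for w in row):
            selected.append(name)
    return selected


# Toy weight matrix: four hypothetical genes, two latent nodes.
toy_weights = [
    [0.9, -0.1],
    [0.2, 0.3],
    [-1.2, 0.4],
    [0.5, -0.55],
]
toy_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]

strict = select_genes_by_weight(toy_weights, toy_genes, threshold=0.5)
looser = select_genes_by_weight(toy_weights, toy_genes, threshold=0.4)
print(strict)
print(looser)
```

Lowering the threshold can only keep or enlarge the selected set, so the gene lists obtained at successively smaller thresholds are nested.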

5.4.1. Comparisons with Other Feature Sets

Analogously, employing thresholds of ±0.4 and ±0.3 resulted in the extraction of 52 and 100 significant genes, respectively. The results for the accuracy and AUC, with the eight classifiers on all three sets of selected genes, viz. 21, 52, and 100, are provided in Table 7. It can be observed that the accuracy and AUC with 315 genes were not better than those obtained using the AE. This highlights the importance of the AE towards extracting significant genes for survival group prediction. It may be emphasized that while Table 3 and Table 4 provide prediction evaluation in terms of 50 projected attributes (for DeepSGP with a 200 × 100 × 50 × 100 × 200 configuration), Table 7 presents the results obtained by genes extracted by analyzing the AE weight values (with a 200 × 50 × 200 configuration). It may be concluded that a judicious combination of features (genes) through DeepSGP led to improved survival group prediction.

5.4.2. Pathway Analysis

The shortlisted set of 100 genes was used to perform pathway analysis using the KEGG [92] and Reactome [93] pathway databases. The IDs of the subset of the shortlisted 100 genes for pathway analysis are listed in Supplementary Materials Section S3 (Table S2). The list of identified pathways is given in Supplementary Materials Section S4 (Tables S3 and S4). We obtained only two significant pathways in the case of KEGG, while eight significant processes were obtained upon performing the analysis on the Reactome Database. Of these processes, the noteworthy ones were the activation of matrix metalloproteinases and collagen formation and degradation [94], associated with glioblastoma.

5.4.3. Gene Ontology Analysis

Gene Ontology analysis was carried out on these top 100 differentially expressed genes, extracted from the AE (Table 7). The Panther (http://www.pantherdb.org/ (accessed on 2 February 2020)) and GOrilla (http://cbl-gorilla.cs.technion.ac.il/ (accessed on 2 February 2020)) gene ontology databases were used to perform the enrichment analysis. Fisher’s exact test with FDR correction yielded no significantly enriched biological processes (i.e., FDR ≤ 0.05). On that account, we withheld any statistical analysis and looked into the biological significance of these genes independently.
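
To make the enrichment criterion concrete, the sketch below implements a one-sided Fisher's exact test (the upper tail of the hypergeometric distribution) followed by a Benjamini-Hochberg FDR adjustment, using only the Python standard library. It is a minimal illustration and not the procedure run by Panther or GOrilla; the background size, term names, and counts are hypothetical.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test p-value for over-representation.

    k: selected genes annotated with the term
    n: number of selected genes
    K: background genes annotated with the term
    N: number of background genes
    Returns P(X >= k) for X hypergeometrically distributed.
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical setting: 100 selected genes against a background of 20000.
background_size = 20000
selected_size = 100
# term name -> (selected genes with the term, background genes with the term)
terms = {"term_A": (6, 400), "term_B": (3, 300), "term_C": (1, 150)}

p_values = [
    fisher_enrichment_p(k, selected_size, K, background_size)
    for k, K in terms.values()
]
fdr = benjamini_hochberg(p_values)
enriched = [name for name, q in zip(terms, fdr) if q <= 0.05]
print(dict(zip(terms, fdr)))
print(enriched)
```

A term is reported as enriched only if its adjusted value is at most 0.05, so an empty list corresponds to the outcome described above, in which no biological process passes the FDR cutoff.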

The selected list revealed several genes related to the prognosis and pathogenesis of glioblastoma cells. Among these, Heparin-Binding Epidermal Growth Factor-like growth factor (HBEGF) has been reported to be expressed in several cases of malignant gliomas, and it has been shown to induce glioblastoma in mice [95]. Various subtypes of glioblastoma have been shown to be characterized by the increased expression of genes related to the extracellular matrix and invasion [96]. Accordingly, our set included the cadherins PCDHA7 and PCDHB15 among the deregulated genes. Another extracellular matrix protein, COL22A1, has been shown to be involved in head-and-neck cancers. Spainhour and Qiu [75] identified a possible association between COL22A1 expression and survival in glioblastoma and low-grade glioma patients. Several matrix metalloproteases were found to be deregulated, including MMP7 and MMP9, which have been reported to be associated with glioblastoma. A low level of expression of MMP9 was found to be correlated with better survival rates and is considered to be a prognostic biomarker in patients with primary glioblastoma [97]. Mu et al. [98] demonstrated that PRL3 promotes invasion and proliferation in glioma cells, and this process is mediated by the increased expression of MMP7 through the ERK and JNK signaling pathways. Another family of proteins, namely A Disintegrin And Metalloproteinases (ADAMs), has been found to be involved in the progression and pathogenicity of cancers. Of these, ADAMTS14 was found to be significantly deregulated in our dataset [99]. B3GLCT, a glucosyltransferase, has been found to be highly overexpressed in gliomas, among other categories of malignancies, as seen in the TCGA dataset (https://www.proteinatlas.org/ENSG00000187676-B3GLCT/pathology) (accessed on 2 February 2020).

Bioinformatic analysis integrating both gene expression and methylation profiles in glioblastoma using TCGA RNA-seq and methylation data, followed by further validation in other TCGA gene expression and TCGA methylation datasets, revealed several genes (including RPP25) to be among the eight-gene signature that divided the GBM patients into high-risk and low-risk groups, with the high-risk individuals being characterized by lower survival rates [84]. The lncRNA FER1L4 has been established to be upregulated in gliomas, with the expression levels being significantly higher in high-grade gliomas as compared to gliomas of a lower grade [89]. Another differentially expressed lncRNA, LINC02731, has been found to have significantly elevated expression in brain tissues as compared to other tissues (https://www.ncbi.nlm.nih.gov/gene/100128239#gene-expression (accessed on 2 February 2020)); however, whether or not it has any role in glioblastoma is yet to be examined.

Using the Panther Database to examine the pathways regulated by these genes, we found that several genes were involved in the cholecystokinin receptor-mediated signaling pathway (CCKR map in GO terms). These included HBEGF, IL8, and matrix metalloproteinases (MMP-3,7,9). The matrix metalloproteases were also found to be involved in other pathways such as the plasminogen activating cascade, which is known to be associated with glioma cell invasion [100] and the Alzheimer disease presenilin pathway. Other regulated pathways, with corresponding genes that were differentially expressed in our set, included the gonadotropin-releasing hormone receptor pathway (FOSB, MAP2K3, and AL035425), the cadherin signaling pathway (PCDHA7 and PCDHB15), the integrin signaling pathway (AL035425, MAP2K3), and the EGF receptor signaling pathway (HBEGF and MAP2K3).

6. Conclusions

We developed an AE-based feature extraction model, DeepSGP, for survival group prediction, producing very high-quality results. It can be inferred that the RNA-seq data had a good potential to predict the survival prognosis of GBM patients, with the appropriate selection of genes. The extraction by the AE was compared with that from several state-of-the-art feature selection and extraction algorithms. A set of eight classifiers was used for learning and predicting, based on the selected and extracted reduced attributes’ set. DeepSGP outperformed all the methods on average, and was particularly good with respect to MLP. The biological significance of the genes, extracted by the AE, was also analyzed to establish their importance. The concordance index on the AE-extracted features demonstrated that the results were quite reasonable. In our study, feature selection was employed as a means of extracting the highest correlated genes. Though this method will potentially identify the most relevant genes, several genes specific to disease biology will be omitted by this strategy. Moreover, the results are yet to be implemented in clinical care. One of the major reasons is the lack of a multicentric approach in validating the prediction of the model before the implementation of these observations [101]. In the future, we will try to discover the correlation between the selected genes and the imaging phenotypes (extracted from multi-modal MR images) for a better understanding of brain cancer patients.

Supplementary Materials

The following are available online at https://www.mdpi.com/article/10.3390/electronics10121463/s1.

Author Contributions

Data curation, R.K. and R.C.; Formal analysis, R.K. and S.M.; Investigation, R.K. and S.B.; Methodology, R.K. and S.B.; Resources, R.K., S.B. and S.L.; Software, R.K. and S.B.; Supervision, R.C., S.M. and B.U.S.; Validation, R.K. and S.L.; Visualization, R.K. and S.L.; Writing—original draft, R.K., R.C. and B.U.S.; Writing—review & editing, R.K., R.C. and S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are openly available at https://portal.gdc.cancer.gov/ (accessed on 2 February 2021).

Acknowledgments

This work was carried out under the J. C. Bose National Fellowship (JCB/2020/000033) awarded to S. Mitra.

Conflicts of Interest

The authors declare no conflict of interest.

References

Reyna, M.A.; Haan, D.; Paczkowska, M.; Verbeke, L.P.; Vazquez, M.; Kahraman, A.; Pulido-Tamayo, S.; Barenboim, J.; Wadi, L.; Dhingra, P.; et al. Pathway and network analysis of more than 2500 whole cancer genomes. Nat. Commun. 2020, 11, 1–17.
Sever, R.; Brugge, J.S. Signal Transduction in Cancer. Cold Spring Harb. Perspect. Med. 2015, 5, a006098.
Golub, T.R.; Slonim, D.K.; Tamayo, P.; Huard, C.; Gaasenbeek, M.; Mesirov, J.P.; Coller, H.; Loh, M.L.; Downing, J.R.; Caligiuri, M.A.; et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 1999, 286, 531–537.
Bauer, S.; Wiest, R.; Nolte, L.P.; Reyes, M. A survey of MRI-based medical image analysis for brain tumor studies. Phys. Med. Biol. 2013, 58, R97–R129.
Stupp, R.; Mason, W.P.; Van Den Bent, M.J.; Weller, M.; Fisher, B.; Taphoorn, M.J.; Belanger, K.; Brandes, A.A.; Marosi, C.; Bogdahn, U.; et al. Radiotherapy plus concomitant and adjuvant temozolomide for glioblastoma. N. Engl. J. Med. 2005, 352, 987–996.
Lacroix, M.; Abi-Said, D.; Fourney, D.R.; Gokaslan, Z.L.; Shi, W.; DeMonte, F.; Lang, F.F.; McCutcheon, I.E.; Hassenbusch, S.J.; Holland, E.; et al. A multivariate analysis of 416 patients with glioblastoma multiforme: Prognosis, extent of resection, and survival. J. Neurosurg. 2001, 95, 190–198.
LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
Hinton, G.E.; Krizhevsky, A.; Wang, S.D. Transforming auto-encoders. In Proceedings of the 21st International Conference on Artificial Neural Networks, Espoo, Finland, 14–17 June 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 44–51.
Wong, K.K.; Rostomily, R.; Wong, S.T. Prognostic Gene Discovery in Glioblastoma Patients using Deep Learning. Cancers 2019, 11, 53.
Spyridon, B.; Reyes, M.; Jakab, A.; Bauer, S.; Rempfler, M.; Crimi, A.; Shinohara, R.T.; Berger, C.; Ha, S.M.; Rozycki, M.; et al. Identifying the Best Machine Learning Algorithms for Brain Tumor Segmentation, Progression Assessment, and Overall Survival Prediction in the BraTS Challenge. arXiv 2018, arXiv:1811.02629.
Alex, V.; Safwan, M.; Krishnamurthi, G. Automatic Segmentation and Overall Survival Prediction in Gliomas Using Fully Convolutional Neural Network and Texture Analysis. arXiv 2017, arXiv:1712.02066.
Banerjee, S.; Mitra, S.; Uma Shankar, B. Multi-Planar Spatial-ConvNet for segmentation and survival prediction in brain cancer. In Proceedings of the 4th International Workshop BrainLes 2018: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Granada, Spain, 16 September 2018; Springer International Publishing: Cham, Switzerland, 2019; Volume 11384, pp. 94–104.
Isensee, F.; Kickingereder, P.; Wick, W.; Bendszus, M.; Maier-Hein, K.H. Brain tumor segmentation and radiomics survival prediction: Contribution to the BraTS 2017 Challenge. In Proceedings of the 3rd International Workshop BrainLes 2017: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Quebec City, QC, Canada, 14 September 2017; Springer: Cham, Switzerland, 2017; pp. 287–297.
Feng, X.; Tustison, N.; Meyer, C. Brain Tumor Segmentation Using an Ensemble of 3D U-Nets and Overall Survival Prediction Using Radiomic Features. In Proceedings of the 4th International Workshop BrainLes 2018: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Granada, Spain, 16 September 2018; Springer: Cham, Switzerland, 2018; pp. 279–288.
Puybareau, E.; Tochon, G.; Chazalon, J.; Fabrizio, J. Segmentation of Gliomas and Prediction of Patient Overall Survival: A Simple and Fast Procedure. In Proceedings of the 4th International Workshop BrainLes 2018: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Granada, Spain, 16 September 2018; Springer: Cham, Switzerland, 2018; pp. 199–209.
Sun, L.; Zhang, S.; Luo, L. Tumor Segmentation and Survival Prediction in Glioma with Deep Learning. In Proceedings of the 4th International Workshop BrainLes 2018: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Granada, Spain, 16 September 2018; Springer: Cham, Switzerland, 2018; pp. 83–93.
Weninger, L.; Rippel, O.; Koppers, S.; Merhof, D. Segmentation of Brain Tumors and Patient Survival Prediction: Methods for the BraTS 2018 Challenge. In Proceedings of the 4th International Workshop BrainLes 2018: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Granada, Spain, 16 September 2018; Springer: Cham, Switzerland, 2018; pp. 3–12.
Wu, G.; Wang, Y.; Yu, J. Overall survival time prediction for high grade gliomas based on sparse representation framework. In Proceedings of the 3rd International Workshop BrainLes 2017: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Quebec City, QC, Canada, 14 September 2017; Springer: Cham, Switzerland, 2017; pp. 77–87.
Suter, Y.; Jungo, A.; Rebsamen, M.; Knecht, U.; Herrmann, E.; Wiest, R.; Reyes, M. Deep Learning versus Classical Regression for Brain Tumor Patient Survival Prediction. In Proceedings of the 4th International Workshop BrainLes 2018: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Granada, Spain, 16 September 2018; Springer: Cham, Switzerland, 2018; pp. 429–440.
Sanghani, P.; Ang, B.T.; King, N.K.K.; Ren, H. Overall survival prediction in glioblastoma multiforme patients from volumetric, shape and texture features using machine learning. Surg. Oncol. 2018, 27, 709–714.
Chai, R.; Zhang, K.; Wang, K.; Li, G.; Huang, R.; Zhao, Z.; Liu, Y.; Chen, J. A novel gene signature based on five glioblastoma stem-like cell relevant genes predicts the survival of primary glioblastoma. J. Cancer Res. Clin. Oncol. 2018, 144, 439–447.
Fatai, A.A.; Gamieldien, J. A 35-gene signature discriminates between rapidly-and slowly-progressing glioblastoma multiforme and predicts survival in known subtypes of the cancer. BMC Cancer 2018, 18, 377.
Gao, W.Z.; Guo, L.M.; Xu, T.Q.; Yin, Y.H.; Jia, F. Identification of a multidimensional transcriptome signature for survival prediction of postoperative glioblastoma multiforme patients. J. Transl. Med. 2018, 16, 368.
Jain, R.; Poisson, L.; Narang, J.; Gutman, D.; Scarpace, L.; Hwang, S.N.; Holder, C.; Wintermark, M.; Colen, R.R.; Kirby, J.; et al. Genomic mapping and survival prediction in glioblastoma: Molecular subclassification strengthened by hemodynamic imaging biomarkers. Radiology 2013, 267, 212–220.
Srinivasan, S.; Patric, I.R.P.; Somasundaram, K. A ten-microRNA expression signature predicts survival in glioblastoma. PLoS ONE 2011, 6, e17438.
Wang, Y.; Liu, X.; Guan, G.; Zhao, W.j.; Zhuang, M. A risk classification system with five-gene for survival prediction of glioblastoma patients. Front. Neurol. 2019, 10, 745.
Zhang, X.Q.; Sun, S.; Lam, K.F.; Kiang, K.M.Y.; Pu, J.K.S.; Ho, A.S.W.; Lui, W.M.; Fung, C.F.; Wong, T.S.; Leung, G.K.K. A long non-coding RNA signature in glioblastoma multiforme predicts survival. Neurobiol. Dis. 2013, 58, 123–131.
Zhang, Y.; Xu, J.; Zhu, X. A 63 signature genes prediction system is effective for glioblastoma prognosis. Int. J. Mol. Med. 2018, 41, 2070–2078.
Zhou, M.; Zhang, Z.; Zhao, H.; Bao, S.; Cheng, L.; Sun, J. An immune-related six-lncRNA signature to improve prognosis prediction of glioblastoma multiforme. Mol. Neurobiol. 2018, 55, 3684–3697.
Zhou, M.; Chaudhury, B.; Hall, L.O.; Goldgof, D.B.; Gillies, R.J.; Gatenby, R.A. Identifying spatial imaging biomarkers of glioblastoma multiforme for survival group prediction. J. Magn. Reson. Imaging 2017, 46, 115–123.
Robinson, M.D.; McCarthy, D.J.; Smyth, G.K. EdgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010, 26, 139–140.
Anders, S.; Huber, W. Differential expression analysis for sequence count data. Genome Biol. 2010, 11, R106.
Blighe, K.; Rana, S.; Lewis, M. EnhancedVolcano: Publication-Ready Volcano Plots with Enhanced Colouring and Labeling. R Package Version 1.10.0. 2019. Available online: https://bioconductor.org/packages/release/bioc/html/EnhancedVolcano.html (accessed on 2 February 2021).
Erickson, B.J.; Korfiatis, P.; Kline, T.L.; Akkus, Z.; Philbrick, K.; Weston, A.D. Deep Learning in Radiology: Does One Size Fit All? J. Am. Coll. Radiol. 2018, 15, 521–526.
Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling TEchnique. J. Artif. Intell. Res. 2002, 16, 321–357.
Kohavi, R.; John, G. Wrappers for feature selection. Artif. Intell. 1997, 97, 273–324.
Amaldi, E.; Kann, V. On the approximation of minimizing non zero variables or unsatisfied relations in linear systems. Theor. Comput. Sci. 1998, 209, 237–260.
Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
Sha, Y.; Phan, J.H.; Wang, M.D. Effect of Low-Expression Gene Filtering on Detection of Differentially Expressed Genes in RNA-seq Data. In Proceedings of the 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Milan, Italy, 31 March 2015; pp. 6461–6464.
Seiler, M.C.; Seiler, F.A. Numerical recipes in C: The art of scientific computing. Risk Anal. 1989, 9, 415–416.
Lewis, D.D. Feature selection and feature extraction for text categorization. In Proceedings of the Workshop on Speech and Natural Language, Pacific Grove, CA, USA, 19–22 February 1991; Association for Computational Linguistics: San Francisco, CA, USA, 1992; pp. 212–217.
Yang, H.; Moody, J. Data Visualization and Feature Selection: New Algorithms for NonGaussian Data. In NIPS’99: Proceedings of the 12th International Conference on Neural Information Processing Systems, Denver, CO, USA, 29 November–4 December 1999; MIT Press: Cambridge, MA, USA, 2000; Volume 12, pp. 687–693.
Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
Rothlauf, F.; Branke, J.; Cagnoni, S. (Eds.) On the Use of Variable Complementarity for Feature Selection in Cancer Classification; Applications of Evolutionary Computing: EvoWorkshops, LNCS; Springer: Berlin, Germany, 2006; Volume 3907.
Battiti, R. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Netw. 1994, 5, 537–550.
Jakulin, A. Machine Learning Based on Attribute Interactions. Ph.D. Thesis, University of Ljubljana, Ljubljana, Slovenia, 2005.
Lin, D.; Tang, X. Conditional infomax learning: An integrated framework for feature extraction and fusion. In Proceedings of the 9th European Conference on Computer Vision (ECCV 2006), Graz, Austria, 7–13 May 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 68–82.
Wold, S.; Esbensen, K.; Geladi, P. Principal component analysis. Chemom. Intell. Lab. Syst. 1987, 2, 37–52.
Schölkopf, B.; Smola, A.; Müller, K.R. Kernel principal component analysis. In Proceedings of the International Conference on Artificial Neural Networks, Lausanne, Switzerland, 8–10 October 1997; Springer: Berlin/Heidelberg, Germany, 1997; pp. 583–588.
Hyvärinen, A.; Oja, E. Independent component analysis: Algorithms and applications. Neural Netw. 2000, 13, 411–430.
Baldi, P. Autoencoders, unsupervised learning, and deep architectures. In Proceedings of the ICML Workshop on Unsupervised and Transfer Learning, Bellevue, WA, USA, 2 July 2012; pp. 37–49.
Wang, W.; Huang, Y.; Wang, Y.; Wang, L. Generalized autoencoder: A neural network framework for dimensionality reduction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, OH, USA, 23–28 June 2014; pp. 490–497.
Wang, Y.; Yao, H.; Zhao, S. Auto-encoder based dimensionality reduction. Neurocomputing 2016, 184, 232–242.
Peng, C.Y.J.; Lee, K.L.; Ingersoll, G.M. An introduction to logistic regression analysis and reporting. J. Educ. Res. 2002, 96, 3–14.
Suykens, J.A.; Vandewalle, J. Least squares support vector machine classifiers. Neural Process. Lett. 1999, 9, 293–300.
Cover, T.M.; Hart, P.E. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
Rish, I. An Empirical Study of the Naive Bayes Classifier. Available online: https://www.cc.gatech.edu/fac/Charles.Isbell/classes/reading/papers/Rish.pdf (accessed on 2 February 2021).
Baum, E.B. On the capabilities of multilayer perceptrons. J. Complex. 1988, 4, 193–215.
De’ath, G.; Fabricius, K.E. Classification and regression trees: A powerful yet simple technique for ecological data analysis. Ecology 2000, 81, 3178–3192.
Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2013.
Van Rossum, G.; Drake, F. Python 3 Reference Manual; CreateSpace: Scotts Valley, CA, USA, 2009.
Baldari, S.; Ubertini, V.; Garufi, A.; D’orazi, G.; Bossi, G. Targeting MKK3 as a novel anticancer strategy: Molecular mechanisms and therapeutical implications. Cell Death Dis. 2015, 6, e1621.
Daniel, P.M.; Filiz, G.; Tymms, M.J.; Ramsay, R.G.; Kaye, A.H.; Stylli, S.S.; Mantamadiotis, T. Intratumor MAPK and PI3K signaling pathway heterogeneity in glioblastoma tissue correlates with CREB signaling and distinct target gene signatures. Exp. Mol. Pathol. 2018, 105, 23–31.
Yang, H.; Jiang, Z.; Wang, S.; Zhao, Y.; Song, X.; Xiao, Y.; Yang, S. Long non-coding small nucleolar RNA host genes in digestive cancers. Cancer Med. 2019, 8, 7693–7704.
Ongusaha, P.P.; Kwak, J.C.; Zwible, A.J.; Macip, S.; Higashiyama, S.; Taniguchi, N.; Fang, L.; Lee, S.W. HB-EGF is a potent inducer of tumor growth and angiogenesis. Cancer Res. 2004, 64, 5283–5290.
Liao, W.C.; Liao, C.K.; Tsai, Y.H.; Tseng, T.J.; Chuang, L.C.; Lan, C.T.; Chang, H.M.; Liu, C.H. DSE promotes aggressive glioma cell phenotypes by enhancing HB-EGF/ErbB signaling. PLoS ONE 2018, 13, e0198364.
Xu, X.; Xiong, X.; Sun, Y. The role of ribosomal proteins in the regulation of cell proliferation, tumorigenesis, and genomic integrity. Sci. China Life Sci. 2016, 59, 656–672.
Cal, S.; Lopez-Otin, C. ADAMTS proteases and cancer. Matrix Biol. 2015, 44, 77–85.
Rome, C.; Arsaut, J.; Taris, C.; Couillaud, F.; Loiseau, H. MMP-7 (Matrilysin) Expression in Human Brain Tumors. Mol Carcinog. 2007, 46, 446–452.
Modéer, T.; Dahllöf, G. Development of phenytoin-induced gingival overgrowth in non-institutionalized epileptic children subjected to different plaque control programs. Acta Odontol. Scand. 1987, 45, 81–85.
Tong, H.; Yu, X.; Lu, X.; Wang, P. Downregulation of solute carriers of glutamate in gliosomes and synaptosomes may explain local brain metastasis in anaplastic glioblastoma. IUBMB Life 2015, 67, 306–311.
Misawa, K.; Kanazawa, T.; Imai, A.; Endo, S.; Mochizuki, D.; Fukushima, H.; Misawa, Y.; Mineta, H. Prognostic value of type XXII and XXIV collagen mRNA expression in head and neck cancer patients. Mol. Clin. Oncol. 2014, 2, 285–291.
Spainhour, J.C.G.; Qiu, P. Identification of gene-drug interactions that impact patient survival in TCGA. BMC Bioinform. 2016, 17, 409.
Umeda-Yano, S.; Hashimoto, R.; Yamamori, H.; Okada, T.; Yasuda, Y.; Ohi, K.; Fukumoto, M.; Ito, A.; Takeda, M. The regulation of gene expression involved in TGF-β signaling by ZNF804A, a risk gene for schizophrenia. Schizophr. Res. 2013, 146, 273–278.
Dong, Z.Q.; Guo, Z.Y.; Xie, J. The lncRNA EGFR-AS1 is linked to migration, invasion and apoptosis in glioma cells by targeting miR-133b/RACK1. Biomed. Pharmacother. 2019, 118, 109292.
Liu, S.; Zhang, J.; Tsang, L.L.; Huang, J.; Tu, S.P.; Jiang, X. R-spodin2 enhances canonical Wnt signaling to maintain the stemness of glioblastoma cells. Cancer Cell Int. 2018, 18, 156.
Liu, P.; Du, R.; Yu, X. LncRNA HAND2-AS1 overexpression inhibits cancer cell proliferation in melanoma by downregulating ROCK1. Oncol. Lett. 2019, 18, 1005–1010.
Lamb, R.; Bonuccelli, G.; Ozsvári, B.; Peiris-Pagès, M.; Fiorillo, M.; Smith, D.L.; Bevilacqua, G.; Mazzanti, C.M.; McDonnell, L.A.; Naccarato, A.G.; et al. Mitochondrial mass, a new metabolic biomarker for stem-like cancer cells: Understanding WNT/FGF-driven anabolic signaling. Oncotarget 2015, 6, 30453.
Lando, M.; Fjeldbo, C.S.; Wilting, S.M.; Snoek, B.C.; Aarnes, E.K.; Forsberg, M.F.; Kristensen, G.B.; Steenbergen, R.D.; Lyng, H. Interplay between promoter methylation and chromosomal loss in gene silencing at 3p11-p14 in cervical cancer. Epigenetics 2015, 10, 970–980.
Hüls, G.; Lindemann, H.; Velcovsky, H. Angiotensin converting enzyme (ACE) in the follow-up control of children and adolescents with allergic alveolitis. Monatsschrift Kinderheilkd. 1989, 137, 158–161.
Xia, X.; Cao, F.; Yuan, X.; Zhang, Q.; Chen, W.; Yu, Y.; Xiao, H.; Han, C.; Yao, S. Low expression or hypermethylation of PLK2 might predict favorable prognosis for patients with glioblastoma multiforme. PeerJ 2019, 7, e7974.
Wang, W.; Zhao, Z.; Wu, F.; Wang, H.; Wang, J.; Lan, Q.; Zhao, J. Bioinformatic analysis of gene expression and methylation regulation in glioblastoma. J. Neuro Oncol. 2018, 136, 495–503.
Bagchi, S.; Li, S.; Wang, C.R. CD1b-autoreactive T cells recognize phospholipid antigens and contribute to antitumor immunity against a CD1b+ T cell lymphoma. Oncoimmunology 2016, 5, e1213932.
Piña, Y.; Houston, S.K.; Murray, T.G.; Koru-Sengul, T.; Decatur, C.; Scott, W.K.; Nathanson, L.; Clarke, J.; Lampidis, T.J. Retinoblastoma treatment: Impact of the glycolytic inhibitor 2-deoxy-d-glucose on molecular genomics expression in LHBETATAG retinal tumors. Clin. Ophthalmol. 2012, 6, 817.
Motaln, H.; Koren, A.; Gruden, K.; Ramšak, Ž.; Schichor, C.; Lah, T.T. Heterogeneous glioblastoma cell cross-talk promotes phenotype alterations and enhanced drug resistance. Oncotarget 2015, 6, 40998.
Wilson, W.G.; Alford, B.A.; Schnatterly, P.T. Craniolacunia as the Result of Compression and Decompression of the Fetal Skull. Am. J. Med Genet. 1987, 27, 729–730.
Ding, F.; Tang, H.; Nie, D.; Xia, L. Long non-coding RNA Fer-1-like family member 4 is overexpressed in human glioblastoma and regulates the tumorigenicity of glioma cells. Oncol. Lett. 2017, 14, 2379–2384.
Xia, L.; Nie, D.; Wang, G.; Sun, C.; Chen, G. FER1L4/miR-372/E2F1 works as a ceRNA system to regulate the proliferation and cell cycle of glioma cells. J. Cell. Mol. Med. 2019, 23, 3224–3233.
Zhou, Q.; Yu, Q.; Gong, Y.; Liu, Z.; Xu, H.; Wang, Y.; Shi, Y. Construction of a lncRNA-miRNA-mRNA network to determine the regulatory roles of lncRNAs in psoriasis. Exp. Ther. Med. 2019, 18, 4011–4021.
Kanehisa, M.; Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000, 28, 27–30.
Jassal, B.; Matthews, L.; Viteri, G.; Gong, C.; Lorente, P.; Fabregat, A.; Sidiropoulos, K.; Cook, J.; Gillespie, M.; Haw, R.; et al. The reactome pathway knowledgebase. Nucleic Acids Res. 2020, 48, D498–D503.
Mammoto, T.; Jiang, A.; Jiang, E.; Panigrahy, D.; Kieran, M.W.; Mammoto, A. Role of collagen matrix in tumor angiogenesis and glioblastoma multiforme progression. Am. J. Pathol. 2013, 183, 1293–1305.
Shin, C.H.; Robinson, J.P.; Sonnen, J.A.; Welker, A.E.; Yu, D.X.; VanBrocklin, M.W.; Holmen, S.L. HBEGF promotes gliomagenesis in the context of Ink4a/Arf and Pten loss. Oncogene 2017, 36, 4610–4618.
Dontenwill, M.; Martin, S.; Janouskova, H. Integrins and p53 pathways in glioblastoma resistance to temozolomide. Front. Oncol. 2012, 2, 157.
Li, Q.; Chen, B.; Cai, J.; Sun, Y.; Wang, G.; Li, Y.; Li, R.; Feng, Y.; Han, B.; Li, J.; et al. Comparative analysis of matrix metalloproteinase family members reveals that MMP9 predicts survival and response to temozolomide in patients with primary glioblastoma. PLoS ONE 2016, 11, e0151815.
Mu, N.; Gu, J.; Liu, N.; Xue, X.; Shu, Z.; Zhang, K.; Huang, T.; Chu, C.; Zhang, W.; Gong, L.; et al. PRL-3 is a potential glioblastoma prognostic marker and promotes glioblastoma progression by enhancing MMP7 through the ERK and JNK pathways. Theranostics 2018, 8, 1527.
Mochizuki, S.; Okada, Y. ADAMs in cancer cell proliferation and progression. Cancer Sci. 2007, 98, 621–628.
Tsatas, D.; Kaye, A.H. The role of the plasminogen activation cascade in glioma cell invasion: A review. J. Clin. Neurosci. 2003, 10, 139–145.
Tewarie, I.A.; Senders, J.T.; Kremer, S.; Devi, S.; Gormley, W.B.; Arnaout, O.; Smith, T.R.; Broekman, M.L. Survival prediction of glioblastoma patients—Are we there yet? A systematic review of prognostic modeling for glioblastoma and its clinical potential. Neurosurg. Rev. 2020, 1–11.

Figure 1. Volcano plot depicting the genes in terms of their significance levels and fold changes in expression. Genes with p-values less than 0.002 were considered statistically significant (“red” and “blue” points in the plot). Of these, those with more than a two-fold change in expression are additionally demarcated (genes denoted in “red”). The top 5 upregulated and downregulated genes with the highest changes in expression are labeled. The black (NS) and green (logFC) points denote genes that were not statistically significant and were not considered for downstream analyses.
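
The color coding described in the caption of Figure 1 amounts to a simple thresholding rule, which can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function name `volcano_category`, the use of the log2 fold change as input, and the example values are assumptions. Only the p-value cutoff of 0.002 and the two-fold threshold are taken from the caption.

```python
P_CUTOFF = 0.002      # significance threshold stated in the caption
LOG2FC_CUTOFF = 1.0   # |log2 FC| > 1 corresponds to more than a two-fold change


def volcano_category(p_value: float, log2_fc: float) -> str:
    """Return the color group a gene would fall into in the volcano plot."""
    significant = p_value < P_CUTOFF
    large_change = abs(log2_fc) > LOG2FC_CUTOFF
    if significant and large_change:
        return "red"    # significant, more than two-fold change
    if significant:
        return "blue"   # significant, at most two-fold change
    if large_change:
        return "green"  # large fold change only (logFC), not significant
    return "black"      # not significant (NS)


# Hypothetical (p-value, log2 fold change) pairs, for illustration only.
examples = [(1e-5, 2.3), (1e-3, 0.4), (0.2, -1.8), (0.5, 0.1)]
categories = [volcano_category(p, fc) for p, fc in examples]
print(categories)
```

Under this rule, only the "red" and "blue" groups would be carried forward to downstream analyses.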

Figure 2. Flowchart of the experimental strategy.

Figure 3. Basic architecture of the autoencoder.

Figure 4. Comparative study of the ROC curve in the survival prediction by DeepSGP using 50 attributes. Here, the x-axis represents the false positive rate, and the y-axis corresponds to the true positive rate. Each colored plot indicates the performance of an algorithm, with reference to the respective legend.
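
As a reminder of how the curves in Figure 4 and the AUC values relate, the following minimal sketch builds an ROC curve by sweeping the decision threshold over classifier scores and integrates it with the trapezoidal rule. The function names `roc_points` and `auc`, and the toy labels and scores, are hypothetical and are not taken from the paper.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort by decreasing score so that each step lowers the threshold.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Handle tied scores together so that ties yield a single segment.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example: 1 marks the class treated as positive (an assumption here).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
area = auc(curve)
print(round(area, 3))
```

An AUC of 1.0 corresponds to a curve passing through the top-left corner, while 0.5 corresponds to the diagonal of a random classifier.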

Table 1. List of symbols used.

Symbol	Symbol Description	Symbol	Symbol Description
$|S|$	Cardinality of the set	FC	Fold Change
$C$	Class label	GBM	Glioblastoma Multiforme
$\mathrm{cov}$	Covariance	GO	Gene Ontology
$D$	Decoder	HGG	High-Grade Gliomas
$E$	Encoder	ICA	Independent Component Analysis
$F_{j}$	Relevant features	ICAP	Interaction Capping
$F_{k}$	Subset of features	JMI	Joint Mutual Information
$H (F_{k} F_{j} C)$	Joint entropy	KPCA	Kernel Principal Component Analysis
I	Mutual information	LGG	Low-Grade Gliomas
J	Criterion function	LOO	Leave-One-Out strategy
kNN	k Nearest Neighbor classifier	LR	Logistic Regression
m	Samples	MIFS	Mutual Information Feature Selection
n	Features	MIM	Mutual Information Maximization
$p^{'} (x)$	Probability	MLP	Multi-Layer Perceptron
$p^{'} (x y)$	Joint probability	MRI	Magnetic Resonance Imaging
$S$	Set of features	MRMR	Minimum Redundancy Maximum Relevance
$\hat{X}$	Output of AE	MSE	Mean-Squared Error
X and Y	Random variables	NB	Naive Bayes
$\sigma$	Standard deviation	OS	Overall Survival
AE	Autoencoder	PCA	Principal Component Analysis
AUC	Area Under the Curve	RF	Random Forest
CART	Classification And Regression Tree	RNA-seq	RNA sequencing data
CIFE	Conditional Infomax Feature Extraction	ROC	Receiver Operating Characteristic curve
CONSS	Consensus	SMOTE	Synthetic Minority Oversampling TEchnique
DEG	Differentially Expressed Genes	SVM	Support Vector Machine
DISR	Double Input Symmetrical Relevance	TCGA	The Cancer Genome Atlas
DL	Deep Learning	XGB	XGBoost

Table 2. Characteristics of the top 50 genes chosen by the subset selection algorithms.

#	Algorithm	Max p-Value	Min p-Value
1	CIFE	0.0019	$2.88 \times 10^{- 5}$
2	DISR	0.0019	$4.16 \times 10^{- 26}$
3	ICAP	0.0019	$1.72 \times 10^{- 6}$
4	JMI	0.0019	$2.79 \times 10^{- 11}$
5	MIFS	0.0014	$9.15 \times 10^{- 25}$
6	MIM	0.0019	$2.79 \times 10^{- 11}$
7	MRMR	0.0019	$2.79 \times 10^{- 11}$
8	CONSS	0.0019	0.00 (approx.)

Table 3. Comparative study of the accuracy of the survival prediction by DeepSGP using 50 attributes.

#	Methods	LR	SVM	KNN	NB	MLP	CART	RF	XGB	Avg
1	CIFE	0.76	0.74	0.72	0.72	0.69	0.57	0.67	0.64	0.69
2	DISR	0.72	0.70	0.65	0.77	0.69	0.59	0.72	0.66	0.69
3	ICAP	0.73	0.76	0.73	0.79	0.71	0.55	0.69	0.68	0.71
4	JMI	0.75	0.73	0.72	0.67	0.66	0.53	0.68	0.68	0.68
5	MIFS	0.67	0.71	0.55	0.69	0.67	0.60	0.66	0.59	0.64
6	MIM	0.68	0.68	0.67	0.68	0.64	0.52	0.63	0.62	0.64
7	MRMR	0.74	0.72	0.62	0.79	0.72	0.45	0.75	0.63	0.68
8	CONSS	0.73	0.68	0.70	0.77	0.73	0.61	0.74	0.73	0.71
9	ICA	0.65	0.64	0.51	0.78	0.58	0.47	0.66	0.58	0.61
10	PCA	0.71	0.68	0.46	0.61	0.58	0.65	0.74	0.75	0.65
11	KPCA	0.74	0.71	0.60	0.63	0.65	0.45	0.64	0.68	0.64
12	DeepSGP	0.85	0.86	0.76	0.86	0.89	0.66	0.88	0.85	0.83

Table 4. Comparative study of the AUC in the survival prediction by DeepSGP using 50 attributes.

#	Methods	LR	SVM	KNN	NB	MLP	CART	RF	XGB	Avg
1	CIFE	0.85	0.85	0.82	0.85	0.85	0.60	0.80	0.75	0.80
2	DISR	0.85	0.85	0.73	0.90	0.85	0.62	0.73	0.73	0.78
3	ICAP	0.87	0.86	0.86	0.87	0.87	0.52	0.79	0.80	0.81
4	JMI	0.86	0.83	0.82	0.82	0.86	0.58	0.81	0.78	0.80
5	MIFS	0.73	0.75	0.64	0.74	0.83	0.60	0.76	0.65	0.71
6	MIM	0.79	0.78	0.75	0.80	0.79	0.59	0.76	0.68	0.74
7	MRMR	0.87	0.86	0.72	0.93	0.87	0.54	0.80	0.67	0.78
8	CONSS	0.83	0.79	0.78	0.84	0.83	0.57	0.79	0.75	0.77
9	ICA	0.77	0.75	0.58	0.86	0.77	0.56	0.76	0.63	0.71
10	PCA	0.85	0.78	0.57	0.69	0.85	0.65	0.80	0.78	0.75
11	KPCA	0.78	0.81	0.70	0.75	0.78	0.52	0.72	0.73	0.72
12	DeepSGP	0.97	0.95	0.86	0.94	0.97	0.65	0.95	0.94	0.90

Table 5. Comparative study of the F1-score, sensitivity, and specificity in the survival prediction by DeepSGP using 50 attributes.

Score	CIFE	DISR	ICAP	JMI	MIFS	MIM	ICA	PCA	KPCA	DeepSGP
F1	0.71	0.65	0.73	0.70	0.68	0.67	0.51	0.60	0.60	0.82
Sensitivity	0.75	0.64	0.77	0.75	0.71	0.74	0.53	0.66	0.62	0.82
Specificity	0.69	0.74	0.69	0.68	0.66	0.64	0.64	0.57	0.59	0.84
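
For reference, the three scores reported in Table 5 can be derived from the counts of a binary confusion matrix, as in the minimal sketch below. The function name `binary_scores` and the example counts are hypothetical; which survival group is treated as the positive class is also an assumption, since the table does not state it.

```python
def binary_scores(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute F1, sensitivity, and specificity from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"F1": f1, "Sensitivity": sensitivity, "Specificity": specificity}


# Hypothetical counts, for illustration only (not taken from the paper).
result = binary_scores(tp=40, fp=5, tn=45, fn=10)
print({name: round(value, 2) for name, value in result.items()})
```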

Table 6. The significant genes, with their p-values and roles.

#	Gene Name	p-Value	Reference	Comment
1	MAP2K3	$7.91 \times 10^{- 5}$	[64,65]	related to tumor invasion and progression of gliomas and breast tumors.
2	B3GLCT	$2.01 \times 10^{- 4}$	[66]	related to cell proliferation and invasion.
3	HBEGF	$1.48 \times 10^{- 4}$	[67,68]	widely expressed in tumors compared with normal tissue.
4	AC073072.2	$2.47 \times 10^{- 4}$	[69]	affects tumorigenesis.
5	ADAMTS14	$1.41 \times 10^{- 7}$	[70]	expression signature of human prostate cancer cells.
6	LINC02731	$1.89 \times 10^{- 3}$	Note ¹	associated with schizophrenia.
7	MMP7	$1.69 \times 10^{- 6}$	[71,72]	produced by the glioma itself.
8	SLC17A1	$9.30 \times 10^{- 5}$	[73]	downregulation of membrane expression in the gliosomes.
9	COL22A1	$1.57 \times 10^{- 7}$	[74,75]	high expression in squamous cell carcinoma of the head and neck.
10	INHBE	$6.89 \times 10^{- 5}$	[76]	plays a crucial role in cell growth and differentiation.
11	EGFR-AS1	$1.28 \times 10^{- 13}$	[77]	linked to migration, invasion, and apoptosis in glioma cells.
12	SPON2	$3.85 \times 10^{- 5}$	[78]	maintains stem cell traits in GBM.
13	HAND2-AS1	$2.25 \times 10^{- 5}$	[79]	downregulated in tumor tissues of patients with melanoma.
14	CHCHD2P9	$4.24 \times 10^{- 5}$	[80]	upregulated by WNT1/FGF3 in MCF7 cells.
15	GPR27	$1.19 \times 10^{- 3}$	[81,82]	related to cervical cancer.
16	RPP25	$1.01 \times 10^{- 43}$	[83,84]	gene signature for survival of GBM patients.
17	CD1B	$1.99 \times 10^{- 5}$	[85]	has antitumor potential.
18	CRYGS	$7.37 \times 10^{- 11}$	[86]	related to retinoblastoma.
19	CPA6	$1.97 \times 10^{- 4}$	[87,88]	related to heterogeneous glioblastoma cell.
20	FER1L4	$6.58 \times 10^{- 4}$	[89,90]	high expression predicted a poor prognosis for patients with glioma.
21	AL035425.3	$5.26 \times 10^{- 10}$	[91]	related to psoriasis.

¹ https://rgd.mcw.edu/rgdweb/search/genes.html (accessed on 21 January 2020).

Table 7. Accuracy (Acc) and AUC of the most important genes.

	# of Genes	LR	SVM	KNN	NB	MLP	CART	RF	XGB	Avg
Acc	21	0.70	0.72	0.64	0.67	0.66	0.57	0.66	0.68	0.66
	52	0.71	0.68	0.57	0.69	0.71	0.58	0.67	0.76	0.67
	100	0.75	0.76	0.65	0.76	0.68	0.60	0.64	0.64	0.69
AUC	21	0.84	0.82	0.78	0.78	0.84	0.60	0.72	0.71	0.76
	52	0.80	0.78	0.71	0.79	0.80	0.59	0.73	0.77	0.75
	100	0.87	0.86	0.78	0.85	0.87	0.64	0.75	0.75	0.80

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

