1. Introduction
Over the last two decades, bioinformatics has developed rapidly and now provides efficient computer-aided tools for diagnosing diseases; combined with machine learning, it can enable significant breakthroughs in tumor diagnosis [1]. The rapid development of high-throughput sequencing technology has shown that gene expression profiling can be used to predict various clinical phenotypes [2]. In recent years, survival prediction models have been used to analyze the relationships between the medical characteristics of patients and their survival time [3], and survival analysis has been used to assess cancer prognosis and provide valuable information [4]. However, high-dimensional candidate genomic features often severely degrade the performance of models that predict clinical phenotypes [5,6]. Improving prognostic accuracy therefore remains a key challenge for survival prediction models. The Cox proportional hazards (CPH) model, commonly known as the Cox model, is widely used in survival analysis tasks [7]. It predicts a risk score from the characteristics, or covariates, of a set of patients and handles censored data effectively. Although the Cox model has many advantages, as a linear model it cannot express complex nonlinear relationships between the logarithmic risk ratio and static covariates [8].
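To make the role of the risk score and of censoring concrete, the following is a minimal sketch of the negative log partial likelihood that a linear Cox model minimizes. It is an illustration under stated assumptions, not the implementation used in this work: the function name and argument names are hypothetical, ties are handled in the simple Breslow style, and no regularization is included.

```python
import math

def cox_neg_log_partial_likelihood(beta, X, times, events):
    """Negative log partial likelihood of a linear Cox model (illustrative sketch).

    beta   : list of coefficients, one per covariate
    X      : list of covariate rows, one row per patient
    times  : observed follow-up time of each patient
    events : 1 if the event was observed, 0 if the patient was censored
    """
    # Linear log-risk score for every patient: the inner product of beta and x_i.
    risks = [sum(b * x for b, x in zip(beta, row)) for row in X]
    shift = max(risks)  # subtracted before exponentiating, for numerical stability
    loss = 0.0
    for i, (t_i, observed) in enumerate(zip(times, events)):
        if not observed:
            # Censored patients contribute no term of their own;
            # they only appear in the risk sets of earlier events.
            continue
        # Risk set: every patient still under observation at time t_i.
        denom = sum(math.exp(r - shift) for r, t in zip(risks, times) if t >= t_i)
        loss -= risks[i] - (shift + math.log(denom))
    return loss
```

Because the risk score in this sketch is a plain inner product, the log-risk is linear in the covariates, which is the limitation noted above; nonlinear variants replace that inner product with a more flexible function.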
Therefore, machine-learning-based CPH models have been used to solve complex nonlinear survival analysis problems [9]. The support vector machine (SVM) is a classical machine learning approach that processes high-dimensional features by incorporating ranking and regression constraints [10]. An SVM-based CPH method can thus improve learning from high-dimensional data, although the hazard is not directly incorporated into the model. Deep learning networks have been applied to gene expression data to predict Cox regression survival in breast cancer [11]. A broad analysis of TCGA cancers has also been performed using a variety of deep-learning-based models for the survival prognosis of cancer patients [12]. The random forest is an ensemble learning method that can accurately estimate the survival rate of each patient. Accordingly, the random survival forest was developed as an extension of the random forest method that can analyze right-censored survival data [13].
A deep forest (DF) model is a decision-tree-based ensemble learning method, a deep model of the non-neural-network type, which performs well in many tasks [14]. Deep forests use two types of forest, namely random forests and completely random tree forests, which help to improve the diversity of the learning model. A deep survival forest based on the deep forest was proposed, in which the original random forests are replaced with the corresponding survival analysis models. The elastic net Cox cascade, a tracking algorithm implemented in the deep survival forest, can be regarded as a link between deep forest levels [15].
Many datasets contain a large number of unlabeled samples because genome-wide gene expression profiling is still too expensive for academic laboratories to use for rich gene expression analysis [16]. Thus, to improve the learning ability of the model, semi-supervised learning (SSL), an incremental learning technique, has been investigated to obtain more labeled data from unlabeled samples. Self-supervised learning, an intuitive pseudo-labeling SSL technique, is a general learning framework that relies on a pretext task formulated from unlabeled data. In this study, we employed self-supervised learning techniques designed to learn a useful global model from labeled data. Many recent self-supervised methods have received increasing attention as a way to address the lack of labels. For example, a twin self-supervision semi-supervised learning approach embeds self-supervised strategies into a semi-supervised framework to learn simultaneously from few-shot labeled images and vast numbers of unlabeled images [17]. Liu et al. [18] proposed a self-supervised mean-teacher method for semi-supervised learning, which combines self-supervised mean-teacher pre-training with semi-supervised fine-tuning to improve the representativeness of the mean-teacher. Song et al. [19] proposed a self-supervised semi-supervised learning framework to tackle the problem of sparsely labeled hyperspectral image recognition.
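The idea of obtaining more labeled data from unlabeled samples can be illustrated with a generic pseudo-labeling loop. The sketch below is a toy example under stated assumptions and is not the DFSC procedure: it uses a hypothetical one-dimensional nearest-centroid classifier in place of a deep forest, assumes at least two classes, and the function names, the confidence measure, and the default threshold are all chosen only for illustration.

```python
def fit_centroids(xs, ys):
    """Toy 1-D model: the mean feature value of each class."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {label: sum(values) / len(values) for label, values in groups.items()}

def predict_with_confidence(centroids, x):
    """Return (label, confidence) for x; confidence lies in [0.5, 1.0]."""
    ranked = sorted((abs(x - c), label) for label, c in centroids.items())
    (d_best, label), (d_second, _) = ranked[0], ranked[1]
    total = d_best + d_second
    confidence = 1.0 - d_best / total if total > 0 else 0.5
    return label, confidence

def self_train(xs, ys, unlabeled, threshold=0.8, max_rounds=5):
    """Iteratively move confidently predicted samples into the labeled set."""
    xs, ys, pool = list(xs), list(ys), list(unlabeled)
    for _ in range(max_rounds):
        model = fit_centroids(xs, ys)
        remaining, added = [], 0
        for x in pool:
            label, confidence = predict_with_confidence(model, x)
            if confidence >= threshold:
                # Accept the prediction as a pseudo-label.
                xs.append(x)
                ys.append(label)
                added += 1
            else:
                # Too uncertain: keep the sample unlabeled for the next round.
                remaining.append(x)
        pool = remaining
        if added == 0:
            break  # no confident predictions left, so stop early
    return fit_centroids(xs, ys), pool
```

The confidence threshold is the main design choice in such a loop: a higher value admits fewer but more reliable pseudo-labels, whereas a lower value enlarges the training set at the risk of reinforcing early mistakes.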
Motivated by the lack of relevant research, we attempted to exploit the deep survival forest with self-supervised learning in survival analysis tasks. Recently, several survival analysis methods with genomic feature selection have been investigated to predict the survival time of patients more precisely, and feature selection has become a key technique for improving the performance of learning models [20]. For example, a deep forest model based on feature selection was proposed to reduce feature redundancy and can be adaptively incorporated into the classification model [21]. Zhu et al. [22] presented an ensemble feature selection deep forest method that outperformed traditional machine learning methods. In the prediction of protein–protein interactions, an elastic net deep forest has been used to optimize the initial feature vectors and boost predictive performance [23]. Stable feature selection can efficiently avoid negative influences from added or removed training samples [24]. Thus, we identified disease-causing genes by investigating stable LASSO regularization in survival analysis. In this paper, we propose a self-supervised method using a deep forest algorithm to improve survival prediction performance: the deep forest can learn efficiently from high-dimensional genomic data, and semi-supervised learning, such as self-supervised learning, provides more labeled samples with which to train a global model.
Through extensive testing on real-world TCGA cancer datasets, the results show that the proposed DFSC method achieves high prediction accuracy even when high-dimensional survival data are used. The rest of this article is organized as follows. Section 2 describes our method and experimental dataset. The results are presented and discussed in Section 3. Finally, conclusions are presented in Section 4.
4. Conclusions
In conclusion, our proposed DFSC algorithm can improve the accuracy of survival prediction for cancer patients. DFSC was verified on four experimental datasets and achieved better prediction accuracy than four other state-of-the-art survival prediction models. Semi-supervised learning, an effective alternative method in the experimental process, can alleviate over-fitting and improve the robustness of the model, and combining it with a deep forest model yields better experimental results. In addition, DFSC can also be used to predict survival rates for various diseases with high-dimensional and collinear data. By considering all categories at the same time in the gene selection stage, our proposed extension can identify relevant genes, thereby allowing doctors to make more accurate computer-aided diagnoses.
Establishing a model that explains the relationship between genomic features and patient survival remains a challenge for future work. Advanced machine learning methods have become powerful tools for building effective survival analysis models. In this work, we investigated how to accurately identify genomic signatures associated with cancer patient survival in order to improve prognosis in precision oncology.