Article

Prediction of Tumor Lymph Node Metastasis Using Wasserstein Distance-Based Generative Adversarial Networks Combined with Neural Architecture Search

by Yawen Wang and Shihua Zhang
1 School of Mathematics and Physics, China University of Geosciences, 388 Lumo Road, Hongshan District, Wuhan 430074, China
2 College of Life Science and Health, Wuhan University of Science and Technology, 974 Heping Avenue, Qingshan District, Wuhan 430081, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Mathematics 2023, 11(3), 729; https://doi.org/10.3390/math11030729
Submission received: 13 December 2022 / Revised: 13 January 2023 / Accepted: 22 January 2023 / Published: 1 February 2023
(This article belongs to the Section Mathematical Biology)

Abstract: Long non-coding RNAs (lncRNAs) play an important role in development and gene expression and can be used as genetic indicators for cancer prediction. lncRNA expression profiles tend to have small sample sizes with large feature dimensions; insufficient data, especially the imbalance of positive and negative samples, often leads to inaccurate prediction results. In this study, we developed a predictor, WGAN-psoNN, constructed with the Wasserstein distance-based generative adversarial network (WGAN) and particle swarm optimization neural network (psoNN) algorithms, to predict lymph node metastasis events in tumors from lncRNA expression profiles. To overcome the complicated manual parameter-tuning process, this is the first time the neural architecture search (NAS) method has been used to set network parameters automatically and predict lymph node metastasis events via deep learning. In addition, the algorithm makes full use of WGAN to generate samples, solving the problem of imbalance between positive and negative samples in the dataset. Furthermore, by constructing multiple GAN networks, the Wasserstein distance was used to select the optimal set of generated samples. Comparative experiments were conducted on eight representative cancer-related lncRNA expression profile datasets; the prediction results demonstrate the effectiveness and robustness of the newly proposed method. The model thus dramatically reduces deep learning's demand for data quantity, lowers the difficulty of architecture selection and has the potential to be applied to other classification problems.

1. Introduction

RNA molecules with transcript lengths longer than 200 nt are called long non-coding RNAs (lncRNAs) [1]; they were initially regarded as transcriptional "noise" of the genome with no biological function. However, with the development of technology and further research, lncRNAs have been found to play an important regulatory role in a variety of pathological conditions, including neural differentiation, neurological diseases, hematopoiesis, immune response, cancer and other related diseases [2], and they are often misregulated in many diseases, especially cancer [3,4].
Tumor metastasis is a significant problem in cancer research and is closely related to patient survival [5,6,7]. Lymph node metastasis is a widespread form of tumor metastasis in which tumor cells are shed, travel through the lymphatic system and then grow into tumors of the same kind, spreading the disease [8]. Moreover, because the path of spread is complex, the primary focus is difficult to find, which can easily delay treatment and worsen the prognosis. Current diagnostic methods are mainly clinical, such as endoscopy, magnetic resonance imaging and computed tomography, but even with extensive clinical examination, diagnostic accuracy is still not assured [9,10,11]. Determining the presence of lymph node metastasis is therefore an important research topic. If lncRNA expression profile data could be used to accurately predict lymph node metastasis in tumor samples, this would be important for deciding whether to follow up with effective clinical treatment for patients.
Several studies have demonstrated the remarkable ability of lncRNA expression profiles to predict lymph node metastasis in different cancers, and several algorithms have been developed for this prediction. For example, Sorensen et al. used support vector machines (SVMs) to classify lncRNA data and achieved accurate predictions of lymph node metastasis in breast cancer patients [12]. Deng et al. predicted cancer-related genes through gene co-expression networks [13,14]. Zhang et al. performed biometric feature extraction on lncRNA expression profiles based on differentially expressed lncRNAs and then used SVM to predict lymph node metastasis [15]. Li et al. proposed a dimensionality reduction method and classifier based on local linear reconstruction guided distance metric learning for predicting lymph node metastasis in multiple cancers from lncRNA expression profiles [16]. Note that current diagnostic algorithms for this problem have focused on dimensionality reduction of lncRNA expression profiles and the selection of traditional statistical machine learning classifiers, such as k-nearest neighbors (KNN) [17], naïve Bayes [18] and decision trees [19].
In this problem, the number of lncRNA samples associated with each cancer is relatively limited, because the characteristics associated with lymph node metastasis vary greatly from cancer to cancer and must be learned and predicted separately. We can therefore start by increasing the sample size. This was first done by replicating samples of the minority class or adding noise to balance positive and negative samples [20]. The synthetic minority oversampling technique (SMOTE), proposed by Chawla et al. [21] in 2002, is still the most widely used oversampling algorithm today. It is based on the idea that neighboring points in feature space have similar characteristics: for each minority sample, N of its k nearest neighbors are randomly selected, and synthetic data are produced by adding to the sample its difference from each selected neighbor multiplied by a random factor in the range [0, 1]. Since it samples in feature space rather than data space, its accuracy is higher than that of traditional sampling methods. In 2005, Han et al. [22] proposed two further minority oversampling techniques based on SMOTE, considering neighboring instances and considering only minority cases near the class boundary, respectively. There are also algorithms that generate new samples by weighting the minority classes: the adaptive synthetic sampling method proposed by He et al. in 2008 [23] computes weights from learning difficulty; the majority-weighted minority oversampling technique proposed by Barua et al. in 2014 [24] computes weights from the Euclidean distance of minority samples to the nearest majority-class samples; and in 2015, Xie et al. proposed a minority oversampling technique based on local densities in a low-dimensional space [25], which maps the training samples to a low-dimensional space and assigns weights there.
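For illustration, a minimal SMOTE-style interpolation could be sketched as follows; this is our own sketch of the idea described above, and the function name, neighbor count and placeholder data are illustrative choices, not from the cited studies.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples by interpolating each minority
    sample toward one of its k nearest minority neighbors (SMOTE idea)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    # The first neighbor of each point is the point itself, so skip column 0.
    neighbors = nn.kneighbors(minority, return_distance=False)[:, 1:]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))   # pick a minority sample
        j = rng.choice(neighbors[i])      # pick one of its neighbors
        gap = rng.random()                # random factor in [0, 1]
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

minority = np.random.rand(30, 74)                     # placeholder minority class
augmented = np.vstack([minority, smote_like(minority, 70)])
```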
All of the above methods balance the data by adding synthetic samples to the minority class; other methods generate a new training set directly. In 2004, Zhou and Jiang [26] proposed a neural-ensemble-based decision tree method that trains a neural network and then uses it to generate a new training set; in 2006, Li and Lin [27] proposed a virtual sample generation method based on intervalized kernel density estimation, which determines the probability density function of a sample and then generates new training samples; and in 2009, Li and Fang [28] proposed a nonlinear virtual sample generation technique using group discovery and parametric equations of hyperspheres. The Generative Adversarial Network (GAN), introduced by Goodfellow et al. in 2014 [29], has been the most widely used generative model in recent years. Unlike the methods above, which cannot exploit the inherent characteristics of the samples, a GAN requires no assumptions or restrictions on the sample distribution; instead, it uses deep neural networks (DNNs) to generate synthetic samples with the same distribution as the real ones [30]. The GAN consists of two DNNs acting as generator and discriminator, which learn from the training data in a mutually adversarial manner [31]. Although GANs are used mainly in image generation, where mature and widely used optimization algorithms already exist [30], applications to data augmentation are relatively few; however, Liu et al. showed that data augmentation with GAN improves data quality more effectively than traditional oversampling methods [32]. We therefore also used the Wasserstein distance-based generative adversarial network (WGAN) [33] for data augmentation of the expression profile data.
The lncRNA expression profiles of some diseases have a large number of features related to lymph node metastasis, and the feature counts differ greatly between cancer types. First, we use neural network models to cope with the large number of features in some datasets. Because the feature counts differ so much, they directly affect the structure of the deep learning model to be built: when the number of samples or features is smaller, a simpler architecture is needed to avoid overfitting, so the model structure must be adjusted to the data. Second, a model structure adapted to the dataset's characteristics saves computational cost and improves prediction accuracy. Finally, manually constructing model structures not only relies on the researchers' experience, but is also more likely to miss suitable structures and requires more time. Thus, to adapt to the different characteristics of the disease-related tumor metastasis marker datasets, to explore model structures more broadly and to reduce experimental difficulty and computation time, we adopt the idea of Neural Architecture Search (NAS) to optimize the structure and parameters of the neural networks automatically, choosing Particle Swarm Optimization (PSO) as the search method. Work on NAS generally covers three aspects: the search space, the search strategy and the performance evaluation strategy [34]. The search space is the set of architectures that can be used in principle and must be adapted to the actual task; if a suitable search space can be determined from a priori knowledge, it can be reduced and simplified, although this may also omit suitable architecture modules. The search strategy describes how the architecture space is explored: on the one hand, it should find architectures with good performance, while on the other, premature convergence to regions of suboptimal architectures should be avoided. PSO is a nature-inspired algorithm similar to genetic algorithms [35] that can be used to search for optimal neural network weights and architectures. Performance estimation refers to using less time-consuming training and more readily available evaluation metrics instead of full model training and evaluation on large-scale data.
The main contributions of the paper can be described as follows.
(1) Considering that deep learning frameworks often require manual parameter adjustment, in this study we use a particle swarm optimization algorithm that can learn the structure and parameters of neural networks for different cancer lncRNA expression profile datasets without manual supervision. It therefore has the potential for wider application in constructing complex deep learning models or in personalized adaptation to similar problems whose sample datasets differ significantly.
(2) Given the imbalance between positive and negative samples in our data, we use WGAN to generate samples, effectively solving the data imbalance problem.
In addition, this study helps explore the scope and reliability of next-generation artificial intelligence, represented by deep learning, for biological problems with small samples. It is the first attempt to combine data augmentation and deep learning to improve the accuracy of predicting lymph node metastasis.
The remaining parts are organized as follows. Section 2 presents the material and methods of the new proposed method. Experimental results and discussions are illustrated in Section 3. Finally, Section 4 concludes the work.

2. Materials and Methods

2.1. Materials

The raw data of our lncRNA expression profile samples for eight different cancers were obtained from The Cancer Genome Atlas (TCGA) database. Each sample is a patient's lncRNA expression profile, and the samples were classified into positive samples with lymph node metastasis and negative samples without metastasis according to the tumor lymph node metastasis (TNM) classification index. Specifically, T (tumor) refers to the volume of the tumor and the extent of adjacent tissue involvement, expressed from T1 to T4 as volume and spread increase; N (node) refers to regional lymph node involvement, whose degree from mild to severe is expressed by N1 to N3; and M (metastasis) refers to distant metastasis (usually hematogenous metastasis), whose absence is indicated by M0 and presence by M1. The eight types of cancer we selected from the TCGA database are head and neck squamous cell carcinoma (HNSC), stomach adenocarcinoma (STAD), thyroid carcinoma (THCA), bladder urothelial carcinoma (BLCA), lung adenocarcinoma (LUAD), colon adenocarcinoma (COAD), breast invasive carcinoma (BRCA) and cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC).

2.2. Data Preprocessing

We only kept samples with an M index of 0, because cases with distant organ metastasis often also have local lymphatic metastasis, which would introduce noise when assessing the lymph node metastasis model. We then identified samples with an N index of 1 or greater as lymph node metastasis cases (i.e., positive samples) and samples with an N index of 0 as non-metastatic controls (i.e., negative samples). After screening out samples lacking clear TNM information on lymphatic metastasis, we obtained eight cancer-related lncRNA expression profile datasets. For each dataset, we discarded samples with more than 30% missing values, and the remaining missing values were estimated using the R package impute. As Table 1 shows, the original feature dimension of these lncRNA profiles is 60,483; the next step is to reduce this dimensionality.
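As a rough illustration of this preprocessing step, the following Python sketch filters and imputes an expression matrix. The file name is hypothetical, and the use of scikit-learn's KNNImputer is an assumption on our part; the paper performs the equivalent imputation with the R package impute.

```python
import pandas as pd
from sklearn.impute import KNNImputer

# expr: samples x lncRNA features (60,483 columns), values may be NaN.
expr = pd.read_csv("lncRNA_expression.csv", index_col=0)  # hypothetical file

# Discard samples with more than 30% missing values.
keep = expr.isna().mean(axis=1) <= 0.30
expr = expr.loc[keep]

# Estimate the remaining missing values from nearest-neighbor samples.
expr.iloc[:, :] = KNNImputer(n_neighbors=10).fit_transform(expr)
```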
The dimension reduction method we chose was biometric feature selection [15], which performs three rounds of feature extraction analysis on each cancer-related lncRNA expression profile dataset labeled with the presence or absence of lymph node metastasis (Table 2). This yields a set of biomarkers that is small in number, has low redundancy among features and correlates strongly with the label, resulting in better classification performance when training the classifier on them. We also applied Principal Component Analysis (PCA) and Independent Component Analysis (ICA) for dimensionality reduction to compare with biometric feature selection; judging from the final prediction results, biometric selection is the more appropriate dimensionality reduction method in our experiments.
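A hedged sketch of the PCA/ICA comparison baselines might look as follows; the placeholder matrix and the component count of 74 (matching HNSC in Table 2) are illustrative assumptions, not the authors' exact setup.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

# Placeholder expression matrix; the real HNSC matrix has 60,483 features.
X = np.random.rand(166, 2000)

# Keep as many components as biometric selection retains for this dataset,
# so all three reduction methods feed classifiers of equal input size.
X_pca = PCA(n_components=74).fit_transform(X)
X_ica = FastICA(n_components=74, max_iter=1000).fit_transform(X)
```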

2.3. Methods

The algorithm as a whole is divided into three steps: dataset collection and preprocessing, sample generation and prediction. The details are described below.
Step 1. Data collection and preprocessing. The original sample is pre-processed and then subjected to the biometric feature extraction method to reduce the dimensionality, as Figure 1 shows. Then, the processed data set is randomly divided into training set, validation set and test set according to the ratio of 6:2:2.
Step 2. Sample generation. The WGAN is trained using the training set data; multiple structures can be constructed in the structure selection, and the Wasserstein distance between the generated data and the original data is used as the indicator for the final WGAN selection. For unbalanced datasets, we can also downsample. This step increases the sample size and balances positive and negative samples, improving data quality.
Step 3. Prediction. The synthetic samples generated by the optimal WGAN architecture are used to train the NN classifier; the accuracy on the validation set is used as the indicator for finding the optimal NN architecture with the psoNN model, and the test set is used to evaluate the predictive classification.

2.3.1. Dimension Reduction

At this point, we have obtained sample datasets of lncRNA expression profiles for eight different cancers and performed dimensionality reduction, with the number of sample features per dataset ranging from single digits to over a thousand. To adapt the data to the input of the neural network and to put the feature values in a range that facilitates learning the data distribution, we normalize the expression data. This yields a matrix with element values in the narrow range [0, 1], reducing the influence of differences in the magnitude of lncRNA expression on the prediction results. Each vector making up this matrix can be considered the n feature values of one sample in the dataset. We divide these vectors into a training set, a validation set and a test set in the ratio 6:2:2. The training set is used as input to the WGAN model constructed for data augmentation in the next step, while the validation and test sets are left unprocessed for evaluating and testing model performance.
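A minimal sketch of this normalization and 6:2:2 split, assuming scikit-learn and placeholder data shaped like a small dataset; the variable names and fixed random seed are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Placeholder data; in practice X holds the selected lncRNA features
# and y the labels (1 = lymph node metastasis, 0 = non-metastatic).
X = np.random.rand(166, 74)
y = np.random.randint(0, 2, size=166)

X = MinMaxScaler().fit_transform(X)   # squeeze each feature into [0, 1]

# 6:2:2 split: peel off 40%, then halve it into validation and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)
```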
In some disease-related datasets, there is a huge difference between the number of positive samples (presence of tumor metastasis) and negative samples (absence of tumor metastasis), yielding an extremely unbalanced dataset. We therefore need to process the training set to reduce the impact of this imbalance and avoid the model's tendency to favor the majority class. More importantly, deep learning requires a large number of samples, and the current sample size of most datasets is far from adequate.

2.3.2. Data Generation

We separate the positive and negative training samples and put them into WGAN models separately for data augmentation, then select equal numbers of positive and negative synthetic samples and combine them into a new synthetic dataset. Although this approach may increase training difficulty and even decrease prediction accuracy, because fewer samples are available for training each WGAN, we believe it is a more appropriate way to label the synthetic data than treating the sample label as a discrete attribute. It not only avoids the imbalance of the training set samples, but also avoids the problem of re-labeling the synthetic data.
At the same time, we need to evaluate the similarity between the synthetic data distribution and the original one. The Wasserstein distance, also called the Earth Mover's distance, is defined as the cost required to transform one distribution into another and can be used to measure the distance between two distributions. In [33], researchers first applied the Wasserstein distance as a loss function in a GAN, obtaining the WGAN model, and argued for a correlation between the Wasserstein distance and the quality of the generated data. We therefore introduced the Wasserstein distance as a screening indicator for the hidden layers of the WGAN model and to verify the similarity of our synthetic data to the original data; the new training set it generates can then be used in the next learning step. To keep the treatment of positive and negative training samples consistent, so that the systematic errors introduced in this process remain uniform across the two classes, we first choose a WGAN architecture better suited to the class with fewer samples, and the other class is then trained with the same WGAN architecture to generate its synthetic data. We also found in our experiments that the Wasserstein distance between the generated and original data is generally larger for the class with fewer samples (usually the positive samples in this experiment); classes with fewer samples are harder to train, which supports the rationality of this treatment.
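For a rough sense of the screening metric, the 1-D Wasserstein distance is available in SciPy. Averaging it over features, as in the sketch below, is our own simplification for illustration; in the WGAN itself the distance is estimated by the critic during training.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def mean_feature_wdist(real, fake):
    """Average the 1-D Wasserstein distance over features; a rough
    screening proxy, assuming per-feature comparison is acceptable."""
    return np.mean([wasserstein_distance(real[:, j], fake[:, j])
                    for j in range(real.shape[1])])
```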
We first build a WGAN model with the generator's dropout probability set to 0.1, a weight-clipping parameter of 0.1 for the discriminator updates, RMSProp as the optimizer, no convolutional layers, the generator's layers arranged in ascending size if there are several and the discriminator's in descending size. The number of training iterations, the learning rate and the hidden layers each have multiple candidate values, which are combined to construct multiple WGAN architectures; for the hidden layers in particular, we vary the number of layers and the number of nodes per layer to provide more construction choices. For each dataset, we train multiple architectures, obtain multiple sets of generated data and then select the set likely to be of the best quality according to our metric, synthesizing it into the training dataset for the deep learning classification step.
For WGAN model building, leaky ReLU, dropout and the optimizer settings were chosen because they are standard for this type of problem [36]. There is no convolution layer because the input is not an image. Binary cross-entropy loss was used because it is best suited to measuring the performance of a model with an output probability between 0 and 1. For each disease dataset, we tested different configurations of the number of layers and the number of nodes per layer. We introduce the Wasserstein distance as a metric to initially filter the data generated by WGAN training, since putting every WGAN construct into the subsequent architecture-search neural network for deep learning and classification would greatly increase the computational cost and consume too much experimental time. The Wasserstein distance measures the discrepancy between probability distributions; unlike other common measures such as the KL divergence and total variation, which compare the probability densities at corresponding points and ignore the geometry between distributions, Wasserstein distance filtering accounts for this geometric aspect. At the same time, because the GAN model is updated through the adversarial interplay of discriminator and generator, that is, unlike most models it has two losses to reduce, it is not easy to judge the degree of training or determine the number of iterations; and since we have multiple hidden layer configurations to try, treating the iteration count as yet another search parameter would greatly waste training time. So we also used the Wasserstein distance to supervise the degree of model training: when the Wasserstein distance remains essentially stable and no longer declines, this intuitively indicates that the gradient may have vanished and training can be terminated.
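A minimal WGAN training sketch under the settings described above (fully connected layers only, leaky ReLU, generator dropout 0.1, RMSProp and weight clipping at 0.1, following [33]). The layer sizes, noise dimension and iteration counts are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

n_features, z_dim, hidden = 74, 32, 128

G = nn.Sequential(                      # generator: noise -> sample in [0, 1]
    nn.Linear(z_dim, hidden), nn.LeakyReLU(0.2), nn.Dropout(0.1),
    nn.Linear(hidden, n_features), nn.Sigmoid())
D = nn.Sequential(                      # critic: sample -> unbounded score
    nn.Linear(n_features, hidden), nn.LeakyReLU(0.2),
    nn.Linear(hidden, 1))

opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-4)
opt_d = torch.optim.RMSprop(D.parameters(), lr=5e-4)
clip = 0.1
real = torch.rand(500, n_features)      # stand-in for one training class

for step in range(2000):
    for _ in range(5):                  # several critic steps per generator step
        x = real[torch.randint(0, len(real), (50,))]
        z = torch.randn(50, z_dim)
        # Critic maximizes E[D(real)] - E[D(fake)], so minimize the negative.
        loss_d = D(G(z).detach()).mean() - D(x).mean()
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        for p in D.parameters():        # weight clipping enforces Lipschitz
            p.data.clamp_(-clip, clip)
    z = torch.randn(50, z_dim)
    loss_g = -D(G(z)).mean()            # generator minimizes -E[D(fake)]
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    # -loss_d estimates the Wasserstein distance; monitoring it and stopping
    # once it plateaus matches the termination rule described above.
```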

2.3.3. Prediction

Our proposed psoNN is described in terms of the search space, search strategy and performance evaluation strategy of the NAS algorithm. The search space is based mainly on the fully connected layers of the NN model, the dropout layer for avoiding overfitting and the BatchNormalization layer, set according to the results of simple tests on the dataset; the generation probability of each architectural layer is a hyperparameter of the psoNN algorithm that can be adjusted according to model performance. The search strategy is the PSO algorithm: particle swarm optimization (PSO) is an evolutionary computation technique proposed by Eberhart and Kennedy in 1995, originating in the study of bird flock predation behavior [35].
To apply the PSO algorithm to our problem, we first need to design the particles to be created from the input training data. The particle design uses three types of layers: fully connected, dropout and BatchNormalization, where Batch Normalization (BN) is added between the fully connected layer and the activation function.
The parameters of the proposed psoNN fall into three categories: particle swarm optimization, NN architecture initialization and NN training. We adjusted only two parameters related to particle swarm optimization, the number of training rounds and the population size: too many training rounds are prone to overfitting, while a larger population is theoretically better but must be limited in practice to reduce training time. The parameters related to architecture initialization set the probabilities of fully connected, dropout and BatchNormalization layers, the range of the number of neurons in fully connected layers and the range of the total number of NN layers when generating particles (one particle is one NN). After the particles are initialized by this rule, the parameters that control particle training govern the whole updating process, including the number of training iterations when evaluating particles, the dropout layer discard probability, etc. Finally, we repeat the following ten times independently under these constraints: initialize the particles, find the local best and global best particles, update each local best particle toward the global best, obtain a new global best and repeat the particle update until the search finishes.
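A simplified sketch of the underlying PSO update, assuming particles are continuous hyperparameter vectors rather than the full layer sequences psoNN actually encodes; the inertia and acceleration coefficients are generic textbook values, not the paper's.

```python
import numpy as np

def pso(objective, dim, n_particles=25, n_iter=10,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Canonical PSO maximizing objective over [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    x = rng.random((n_particles, dim))          # particle positions
    v = np.zeros_like(x)                        # particle velocities
    pbest = x.copy()                            # personal bests
    pbest_val = np.array([objective(p) for p in x])
    g = pbest[pbest_val.argmax()].copy()        # global best
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, dim))
        # Velocity: inertia + pull toward personal best + pull toward global best.
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, 0.0, 1.0)
        vals = np.array([objective(p) for p in x])
        better = vals > pbest_val
        pbest[better], pbest_val[better] = x[better], vals[better]
        g = pbest[pbest_val.argmax()].copy()
    return g

# In psoNN, objective would decode the vector into an NN, partially train
# it and return validation accuracy, as Section 3.1 describes.
```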

3. Results and Analysis

3.1. Environment Settings

To evaluate the performance of predicting lymph node metastasis from lncRNA expression profiles, we divided each of the eight disease-related lncRNA expression profile datasets into training, validation and test sets in the ratio 6:2:2. The training set was split into positive and negative sample sets, and each was used to train a WGAN model to generate a new synthetic dataset. The synthetic training set is then input into the psoNN model for learning: each candidate model is partially trained, the validation set is used to predict lymph node metastasis and the model with the highest validation accuracy (i.e., the globally optimal particle) is selected. We consider this model the optimal architecture found by the search, so we fully train it on the synthetic training set and use the validation set to find the number of training iterations with the highest accuracy, which is the number of iterations used when testing model performance on the test set. The performance of our WGAN-psoNN model is the average test-set accuracy over 10 runs of the psoNN model. The validation set, input into the trained model to evaluate accuracy, serves both as the indicator for the architecture search and for flexible tuning, avoiding large differences in the number of training iterations required by different architectures. Moreover, so that the validation set better represents the test set, we treat it exactly as we treat the test set.

3.2. Parameter Settings

In the WGAN model section, we only need to set up multiple choices for each parameter; the model automatically combines them into various architectures and finds the optimal one according to the Wasserstein distance. Taking the BLCA dataset as an example, the training of WGANs with multiple architectures is shown in Figure 2. The Wasserstein distance serves both to supervise the degree of model training and to filter the synthetic data generated by the different WGAN models.
In the psoNN model section, for each dataset we only need to determine the probabilities of using a dropout layer, a BatchNormalization layer or a fully connected layer when constructing the NN, together with the dropout rate. The three probabilities must add up to one, and we can increase the probability of dropout layers or the dropout rate to avoid overfitting.
In this study, there are four parameters in the WGAN model: epochs, learning rate, batch size and hidden layer structure. After experiments, the number of iterations was set between 2000 and 8000 and the learning rate at 0.0005; raising it to 0.001 was prone to vanishing gradients. We tested batch sizes of 50, 100 and 150; the impact on the results was not significant, so we fixed it at 50. The number of generated samples is decided by the number to be synthesized: first, the positive and negative samples in the training set are made roughly equal in number; then we tested synthetic training sets of 5000, 8000 and 10,000 samples to compare which size was appropriate for the next step. The hidden layer structures we designed are relatively simple: [128], [256] and [512], three configurations in total.
Next, the parameters of the psoNN model were determined. Since the model itself searches over different structures, we only need to set the parameters of the search: 100 partial training iterations, 2000 full training iterations, 10 training repetitions, 25 search particles and 10 search rounds. The parameters related to NN model construction are the training batch size of the particle search part and the full training batch size, both 1024; the maximum number of fully connected neurons, 10; and the minimum and maximum numbers of layers, 1 and 5. There are also the probabilities of using dropout layers, BatchNormalization layers and fully connected layers when constructing the NN model.
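For concreteness, the WGAN architecture grid described above could be enumerated as follows; the specific epoch values within the 2000-8000 range are our own illustrative picks.

```python
from itertools import product

epochs_opts = [2000, 4000, 8000]   # iteration counts tried (range 2000-8000)
lr_opts = [0.0005]                 # 0.001 tended to make gradients vanish
batch_opts = [50]                  # 100 and 150 changed little, so fixed at 50
hidden_opts = [[128], [256], [512]]

configs = [dict(epochs=e, lr=lr, batch=b, hidden=h)
           for e, lr, b, h in product(epochs_opts, lr_opts,
                                      batch_opts, hidden_opts)]
# Each config trains one WGAN; the one whose synthetic data has the
# smallest Wasserstein distance to the real data is kept.
```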

3.3. Performance Evaluation

3.3.1. Effectiveness of WGAN

To verify the effectiveness of data augmentation with the WGAN model, we also trained the psoNN model directly on the training set without data augmentation, used the validation set to determine the optimal architecture and number of training iterations and finally used the test set to measure performance; this is the psoNN method in Table 3. Again, these models were trained ten times, and the averaged results are shown in Table 3.

3.3.2. Effectiveness of psoNN

To verify the effectiveness of classification with psoNN, we trained KNN classifiers on the new training sets generated after data augmentation and then used the unprocessed test set to obtain the classification accuracy, compared with that of our WGAN-psoNN algorithm in Table 4. We found that NNs are more suitable for learning from data augmented by WGAN.
To demonstrate the effectiveness of the PSO optimization algorithm, we also experimented with a genetic optimization algorithm (Table 5); the results show that the PSO algorithm is more suitable for our experimental approach.

3.3.3. Effectiveness of WGAN-psoNN

Many machine learning approaches have been applied to predicting lymphatic metastasis from lncRNA expression profiles. To demonstrate the performance of our WGAN-psoNN model on this problem, we compare it with other methods in Figure 3, where all methods predict lymphatic metastasis from lncRNA expression profile data after biometric feature selection. Figure 3 shows the statistical results of repeating the experiment 10 times, using bar charts of the final average accuracy of the proposed WGAN-psoNN classifier, KNN, SVM, RF and LLR-DM. From Figure 3 it can easily be seen that the proposed WGAN-psoNN algorithm is more accurate on HNSC, STAD, THCA, BLCA, LUAD, COAD and CESC, while the LLR-DM algorithm performs better on BRCA.

4. Discussion and Conclusions

4.1. Discussion

In Section 3, to evaluate the performance of our proposed WGAN-psoNN model, we performed lymph node metastasis prediction on lncRNA expression profiles in three ways: with the proposed WGAN-psoNN model, with the psoNN model directly without data augmentation and with SMOTE upsampling followed by the psoNN model, and then compared them. In addition, the lymph node metastasis predictions of the WGAN-psoNN model were compared with those of the k-nearest neighbor, random forest and support vector machine algorithms and the LLR-DM dimensionality reduction and classification method proposed in [16]. From the experimental results, it can be concluded that WGAN can perform effective data augmentation on small samples of lncRNA expression profile data, and prediction via psoNN then has the potential to achieve better accuracy than traditional machine learning algorithms. A discussion and analysis are presented below.
(1) Current work on predicting lymph node metastasis from lncRNA expression profiles with machine learning has focused on feature extraction and on the use and improvement of classical machine learning algorithms. The main reason deep learning is difficult to apply to such problems is that the number of available lncRNA expression profile samples is very limited relative to the amount of data deep learning needs. This phenomenon is not limited to this one problem; the small number of samples in many prediction problems greatly limits the application of deep learning. We verified in Section 3 that WGAN can effectively increase the number of samples while preserving sample quality and outperforms SMOTE, the most commonly used oversampling method, on multiple datasets. It is reasonable to believe that WGAN can serve as a data augmentation tool for more classification problems, further extending deep learning to problems where samples are few or difficult to acquire.
(2) We used the Wasserstein distance to quantitatively measure the distance between the original and generated samples, rather than merely replacing the JS divergence with the Wasserstein distance in the gradient descent step, as most WGANs do. The Wasserstein distance metric can also be applied to other GAN-based models, not only to stop training when the gradient vanishes, but also to select a better GAN architecture. Constructing GAN architectures usually requires manual work based on researchers' experience and judgment of which architecture is better; with the Wasserstein distance metric, we do not need to design architectures specifically for each dataset, but can directly iterate through several predefined architectures, determine the better architecture from the Wasserstein distance between its generated data and the original data and then select the generated dataset closest to the original data. This simplifies the workflow, increases efficiency and increases the likelihood of finding synthetic data closer to the original distribution. In our experiments, we also noticed that if the Wasserstein distance between the generated and original data is larger than 10, it is difficult to achieve good prediction results, which to some extent illustrates the reasonableness of the Wasserstein distance as an evaluation index. It can also help researchers assess the quality of the generated data in advance, rather than relying on the classifier to evaluate it.
We also set several fixed architectures for the WGAN model and selected the optimal generated data according to the Wasserstein distance index, avoiding repeated tuning across multiple, widely differing datasets on similar problems. The architecture can thus be adjusted automatically instead of through repeated manual adjustment when experimenting with multiple datasets that differ significantly from each other.

4.2. Conclusions

In conclusion, we used the WGAN-psoNN model to perform unsupervised learning on the data distribution to increase the number of samples and improve data quality, and we used automatic neural architecture search to avoid the time-consuming and error-prone manual selection of network structures. The experimental results demonstrate the effectiveness of WGAN data augmentation, and the predictions of deep learning models such as psoNN on small-sample data, enabled by data augmentation, can achieve better results than traditional machine learning algorithms. Applying this model to more small-sample or multi-dataset prediction problems is justified and has potential. However, the current accuracy of lymph node metastasis prediction is still unsatisfactory, and we will consider further improving it by optimizing the training of the generative model, evaluating and improving data quality and adjusting the choice of optimization algorithm.
Deep neural network models are still applied relatively rarely to biological problems; our research reflects, to a certain extent, the characteristics and advantages of neural networks compared with other methods. The first is the inclusiveness of the model: we do not need to guess in advance which probability distribution the data conform to, but can fit it directly by training the neural network. The second is good feature extraction: deep learning's ability to extract features has been verified in practice, even reducing, to a certain extent, the importance of feature engineering. The third is the relatively strong scalability of the model: the same model, after some modifications, shows good performance in multiple scenarios.

Author Contributions

Resources, software, data curation and writing (original draft preparation), Y.W.; methodology and writing (review and editing), S.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (62172312) and Open Foundation of Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information at Anhui Agricultural University (BDSYS2021001).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Alvarez-Dominguez, J.R.; Hu, W.; Lodish, H.F. Regulation of eukaryotic cell differentiation by long non-coding RNAs. In Molecular Biology of Long Non-Coding RNAs; Khalil, A., Coller, J., Eds.; Springer: New York, NY, USA, 2013; pp. 15–67. ISBN 978-1-4614-8620-6. [Google Scholar]
  2. Statello, L.; Guo, C.J.; Chen, L.L.; Huarte, M. Gene regulation by long non-coding RNAs and its biological functions. Nat. Rev. Mol. Cell Biol. 2021, 22, 96–118. [Google Scholar] [CrossRef] [PubMed]
  3. Wang, J.; Su, Z.; Lu, S.; Fu, W.; Liu, Z.; Jiang, X.; Tai, S. LncRNA HOXA-AS2 and its molecular mechanisms in human cancer. Clin. Chim. Acta 2018, 485, 229–233. [Google Scholar] [CrossRef] [PubMed]
  4. Huang, D.S.; Zhang, L.; Han, K.; Deng, S.; Yang, K.; Zhang, H. Prediction of protein-protein interactions based on protein-protein correlation using least squares regression. Curr. Protein Pept. Sci. 2014, 15, 553–560. [Google Scholar] [CrossRef]
  5. Tjan-Heijnen, V.; Viale, G. The Lymph Node and the Metastasis. N. Engl. J. Med. 2018, 378, 2045–2046. [Google Scholar] [CrossRef]
  6. Padera, T.P.; Meijer, E.F.; Munn, L.L. The Lymphatic System in Disease Processes and Cancer Progression. Annu. Rev. Biomed. Eng. 2016, 18, 125–158. [Google Scholar] [CrossRef] [PubMed]
  7. Seidman, J.D.; Krishnan, J. Lymphatic Invasion in the Fallopian Tube is a Late Event in the Progression of Pelvic Serous Carcinoma and Correlates with Distant Metastasis. Int. J. Gynecol. Pathol. 2020, 39, 178–183. [Google Scholar] [CrossRef]
  8. Sleeman, J.P.; Thiele, W. Tumor metastasis and the lymphatic vasculature. Int. J. Cancer 2009, 125, 2747–2756. [Google Scholar] [CrossRef] [PubMed]
  9. Christensen, A.F.; Bourke, J.L.; Nielsen, M.B.; Møller, H.; Svendsen, L.B.; Mogensen, A.M.; Vainer, B. Detection rate of periintestinal lymph nodes. Ultraschall. Med. 2006, 27, 360–363. [Google Scholar] [CrossRef]
  10. Obinu, A.; Gavini, E.; Rassu, G.; Maestri, M.; Bonferoni, M.C.; Giunchedi, P. Lymph node metastases: Importance of detection and treatment strategies. Expert Opin. Drug Deliv. 2018, 15, 459–467. [Google Scholar] [CrossRef]
  11. Zeng, Y.R.; Yang, Q.H.; Liu, Q.Y.; Min, J.; Li, H.G.; Liu, Z.F.; Li, J.X. Dual energy computed tomography for detection of metastatic lymph nodes in patients with hepatocellular carcinoma. World J. Gastroenterol. 2019, 25, 1986–1996. [Google Scholar] [CrossRef]
  12. Sorensen, K.P.; Thomassen, M.; Tan, Q.; Bak, M.; Cold, S.; Burton, M.; Larsen, M.J.; Kruse, T.A. Long non-coding RNA expression profiles predict metastasis in lymph node-negative breast cancer independently of traditional prognostic markers. Breast Cancer Res. 2015, 17, 55. [Google Scholar] [CrossRef] [PubMed]
  13. Deng, S.P.; Zhu, L.; Huang, D.S. Predicting hub genes associated with cervical cancer through gene co-expression networks. IEEE/ACM Trans. Comput. Biol. Bioinform. 2016, 13, 27–35. [Google Scholar] [CrossRef] [PubMed]
  14. Deng, S.P.; Zhu, L.; Huang, D.S. Mining the bladder cancer-associated genes by an integrated strategy for the construction and analysis of differential co-expression networks. BMC Genom. 2015, 16, 3–4. [Google Scholar] [CrossRef]
  15. Zhang, S.; Zhang, C.; Du, J.; Zhang, R.; Yang, S.; Li, B.; Wang, P.; Deng, W. Prediction of Lymph-Node Metastasis in Cancers Using Differentially Expressed mRNA and Non-coding RNA Signatures. Front. Cell Dev. Biol. 2021, 9, 605977. [Google Scholar] [CrossRef]
  16. Li, B.; Tian, Y.; Tian, Y.; Zhang, S.; Zhang, X. Predicting Cancer Lymph-node Metastasis from LncRNA Expression Profiles using Local Linear Reconstruction Guided Distance Metric Learning. IEEE/ACM Trans. Comput. Biol. Bioinform. 2022, 99, 1. [Google Scholar] [CrossRef] [PubMed]
  17. Zhang, X.; Wang, J.; Li, J.; Chen, W.; Liu, C. CRlncRC: A machine learning-based method for cancer-related long noncoding RNA identification using integrated features. BMC Med. Genom. 2018, 11, 120. [Google Scholar] [CrossRef]
  18. Zhang, G.; Deng, Y.; Liu, Q.; Ye, B.; Dai, Z.; Chen, Y.; Dai, X. Identifying Circular RNA and Predicting Its Regulatory Interactions by Machine Learning. Front. Genet. 2020, 11, 655. [Google Scholar] [CrossRef]
  19. Sun, M.; Wu, D.; Zhou, K.; Li, H.; Gong, X.; Wei, Q.; Du, M.; Lei, P.; Zha, J.; Zhu, H.; et al. An eight-lncRNA signature predicts survival of breast cancer patients: A comprehensive study based on weighted gene co-expression network analysis and competing endogenous RNA network. Breast Cancer Res. Treat. 2019, 175, 59–75. [Google Scholar] [CrossRef]
  20. DeRouin, E.; Brown, J.; Fausett, L.; Schneider, M. Neural Network Training on Unequally Represented Classes. In Intelligent Engineering Systems through Artificial Neural Networks; ASME Press: New York, NY, USA, 1991; pp. 135–141. [Google Scholar]
  21. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  22. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Advances in Intelligent Computing; Huang, D.S., Zhang, X.P., Huang, G.B., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; Volume 3644, pp. 878–887. ISBN 978-3-540-28226-6. [Google Scholar]
  23. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks, Hong Kong, China, 1–6 June 2008; pp. 1322–1328. [Google Scholar]
  24. Barua, S.; Islam, M.M.; Yao, X.; Murase, K. MWMOTE-majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans. Knowl. Data Eng. 2014, 26, 405–425. [Google Scholar] [CrossRef]
  25. Xie, Z.; Jiang, L.; Ye, T.; Li, X. A synthetic minority oversampling method based on local densities in low-dimensional space for imbalanced learning. In Database Systems for Advanced Applications; Renz, M., Shahabi, C., Zhou, X., Cheema, M., Eds.; Springer: Cham, Switzerland, 2015; Volume 9050, pp. 3–18. ISBN 978-3-319-18122-6. [Google Scholar]
  26. Zhou, Z.; Jiang, Y. Nec4.5: Neural ensemble based C4.5. IEEE Trans. Knowl. Data Eng. 2004, 16, 770–773. [Google Scholar] [CrossRef]
  27. Li, D.C.; Lin, Y.S. Using virtual sample generation to build up management knowledge in the early manufacturing stages. Eur. J. Oper. Res. 2006, 175, 413–434. [Google Scholar] [CrossRef]
  28. Li, D.; Fang, Y. A non-linearly virtual sample generation technique using group discovery and parametric equations of hypersphere. Expert Syst. Appl. 2009, 36, 844–851. [Google Scholar] [CrossRef]
  29. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv 2014, arXiv:1406.2661. [Google Scholar] [CrossRef]
  30. Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; Bharath, A.A. Generative adversarial networks: An overview. IEEE Signal Process. Mag. 2018, 35, 53–65. [Google Scholar] [CrossRef]
  31. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434. [Google Scholar]
  32. Liu, Y.; Zhou, Y.; Liu, X.; Dong, F.; Wang, C.; Wang, Z. Wasserstein GAN-Based Small-Sample Augmentation for New-Generation Artificial Intelligence: A Case Study of Cancer-Staging Data in Biology. Engineering 2019, 5, 156–163. [Google Scholar] [CrossRef]
  33. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein GAN. arXiv 2017, arXiv:1701.07875. [Google Scholar]
  34. Wang, B.; Xue, B.; Zhang, M. Particle Swarm Optimisation for Evolving Deep Neural Networks for Image Classification by Evolving and Stacking Transferable Blocks. In Proceedings of the 2020 IEEE Congress on Evolutionary Computation, Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar]
  35. Kennedy, J.; Eberhart, R. Particle swarm optimization. In Proceedings of the ICNN’95—International Conference on Neural Networks, Perth, WA, Australia, 27 November–1 December 1995; Volume 4, pp. 1942–1948. [Google Scholar]
  36. How to Train a Gan? Tips and Tricks to Make Gans Work. Available online: https://github.com/soumith/ganhacks (accessed on 13 December 2022).
Figure 1. An overview of our proposed method for predicting tumor lymph node metastasis. Step 1: The original sample is pre-processed and then subjected to the biometric feature extraction method to reduce the dimensionality. Then it is divided into training set, validation set and test set. Step 2: The WGANs of different architectures are trained to generate synthetic data and the one with the smallest Wasserstein distance is selected as the new training set. Step 3: Use the psoNN model to predict tumor lymph node metastasis.
Figure 2. We used the Wasserstein distance as the supervision of the model training degree and the basis for selecting synthetic data. (a): lr = 0.005, arc = 512; (b): lr = 0.005, arc = 256; (c): lr = 0.005, arc = 128; (d): lr = 0.0001, arc = 128; (e): lr = 0.0001, arc = 512; (f): lr = 0.0001, arc = 256.
Figure 3. The accuracies using WGAN-psoNN, KNN, SVM, RF and LLR-DM classifiers to predict lymphatic metastasis.
Table 1. The statistics of eight cancer-related lncRNA expression profile datasets.
Dataset | Lymph Node Metastasis Samples | Non-Metastatic Samples | Total | Original Features
HNSC | 78 | 88 | 166 | 60,483
STAD | 126 | 103 | 229 | 60,483
THCA | 127 | 145 | 272 | 60,483
BLCA | 26 | 134 | 160 | 60,483
LUAD | 53 | 231 | 284 | 60,483
COAD | 41 | 242 | 283 | 60,483
BRCA | 148 | 454 | 602 | 60,483
CESC | 27 | 80 | 107 | 60,483
Table 2. The number of sample features obtained by biometric feature selection.
Dataset | Original Features | Features after Biometric Feature Selection
HNSC | 60,483 | 74
STAD | 60,483 | 48
THCA | 60,483 | 129
BLCA | 60,483 | 165
LUAD | 60,483 | 1272
COAD | 60,483 | 373
BRCA | 60,483 | 410
CESC | 60,483 | 6
Table 3. Comparing the prediction accuracy of psoNN and WGAN-psoNN.
Dataset | psoNN | WGAN-psoNN
HNSC | 0.82 | 0.85
STAD | 0.70 | 0.80
THCA | 0.70 | 0.78
BLCA | 0.92 | 0.94
LUAD | 0.88 | 0.91
COAD | 0.86 | 0.93
BRCA | 0.88 | 0.74
CESC | 0.82 | 0.91
Table 4. Comparing the prediction accuracy of KNN, WGAN-KNN and WGAN-psoNN.
Dataset | KNN | WGAN-KNN | WGAN-psoNN
HNSC | 0.71 | 0.76 | 0.85
STAD | 0.72 | 0.74 | 0.80
THCA | 0.63 | 0.73 | 0.78
BLCA | 0.91 | 0.88 | 0.94
LUAD | 0.88 | 0.77 | 0.91
COAD | 0.88 | 0.81 | 0.93
BRCA | 0.73 | 0.74 | 0.74
CESC | 0.77 | 0.72 | 0.91
Table 5. Comparison of the prediction accuracy of genetic optimization algorithms and psoNN.
Dataset | Genetic Optimization Algorithms | psoNN
HNSC | 0.71 | 0.85
STAD | 0.74 | 0.80
THCA | 0.71 | 0.78
BLCA | 0.88 | 0.94
LUAD | 0.75 | 0.91
COAD | 0.72 | 0.93
BRCA | 0.77 | 0.74
CESC | 0.82 | 0.91
