Next Article in Journal
Cellular and Molecular Mechanisms in the Pathogenesis of Classical, Vascular, and Hypermobile Ehlers‒Danlos Syndromes
Previous Article in Journal
The Hot QTL Locations for Potassium, Calcium, and Magnesium Nutrition and Agronomic Traits at Seedling and Maturity Stages of Wheat under Different Potassium Treatments
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

DNILMF-LDA: Prediction of lncRNA-Disease Associations by Dual-Network Integrated Logistic Matrix Factorization and Bayesian Optimization

College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
*
Author to whom correspondence should be addressed.
Genes 2019, 10(8), 608; https://doi.org/10.3390/genes10080608
Submission received: 27 June 2019 / Revised: 22 July 2019 / Accepted: 7 August 2019 / Published: 12 August 2019
(This article belongs to the Section Technologies and Resources for Genetics)

Abstract

:
Identifying associations between lncRNAs and diseases can help understand disease-related lncRNAs and facilitate disease diagnosis and treatment. The dual-network integrated logistic matrix factorization (DNILMF) model has been used for drug–target interaction prediction, and good results have been achieved. We firstly applied DNILMF to lncRNA–disease association prediction (DNILMF-LDA). We combined different similarity kernel matrices of lncRNAs and diseases by using nonlinear fusion to extract the most important information in fused matrices. Then, lncRNA–disease association networks and similarity networks were built simultaneously. Finally, the Gaussian process mutual information (GP-MI) algorithm of Bayesian optimization was adopted to optimize the model parameters. The 10-fold cross-validation result showed that the area under receiving operating characteristic (ROC) curve (AUC) value of DNILMF-LDA was 0.9202, and the area under precision-recall (PR) curve (AUPR) was 0.5610. Compared with LRLSLDA, SIMCLDA, BiwalkLDA, and TPGLDA, the AUC value of our method increased by 38.81%, 13.07%, 8.35%, and 6.75%, respectively. The AUPR value of our method increased by 52.66%, 40.05%, 37.01%, and 44.25%. These results indicate that DNILMF-LDA is an effective method for predicting the associations between lncRNAs and diseases.

1. Introduction

Long non-coding RNAs (lncRNAs) are a class of non-coding RNAs (ncRNAs) that are more than 200 nucleotides (nt) in length and do not encode proteins [1]. lncRNAs were originally thought to be genomic transcriptional noise without biological function [2]. Later, more and more evidence indicated that lncRNAs play an important role in many key biological processes, such as translation and post-translational regulation, cell differentiation, proliferation and apoptosis, and epigenetic regulation [3]. Meanwhile, mutations and dysregulation of lncRNAs can cause a variety of human diseases [4,5], including diabetes [6], AIDS [7], and many types of cancer, such as hepatocellular carcinoma [8], lung cancer [9], prostate cancer [10], breast cancer [11], and bladder cancer [12]. Therefore, predicting the potential associations between lncRNAs and diseases helps to explore the complex pathogenesis and etiology of disease at the molecular level and effectively improves the quality of disease diagnosis, treatment, and prevention.
In recent years, several lncRNAs function–disease relationship databases have been established. lncRNAdb [13], lncRNADisease [14], lnc2Cancer [15], and NONCODE [16] are some examples. However, the known lncRNA–disease relationship is still rare, and the use of biological experiments to explore lncRNA–disease associations is both time-consuming and expensive. Using computational methods to infer the potential associations between lncRNAs and diseases has become an effective prior method for biological experiments.
Recently, many computational models have been proposed to predict potential lncRNA–disease associations, which can roughly be divided into three categories. The first class of methods is based on machine learning to predict potential associations. Chen et al. [17] proposed LRSLDA, a semi-supervised learning method based on Laplacian regular least squares. This method does not require a negative sample. However, the problem of parameter selection for combining two classifiers has not been well solved. LDAP [18] uses a support vector machine classifier to predict potential lncRNA–disease associations based on lncRNA similarity and disease similarity. Yu et al. [19] constructed a global quadruple network and a global tripartite network by integrating various biological information. Based on these two global networks, the novel probability model NBCLDAbased on the naive Bayesian classifier was proposed.
The second category is based on biological network models. Heterogeneous data have become a hot topic in recent years. These models tend to construct heterogeneous networks using disease-associated genes/miRNAs or predict new associations between lncRNAs and diseases using multi-data source information fusion. Liang et al. [20] proposed a new method, TPGLDA, for predicting lncRNA–disease associations using a lncRNA–disease–gene tripartite map. It integrates gene–disease associations and lncRNA–disease associations and can effectively identify potential lncRNA–disease associations. Chen et al. [21] proposed an improved restart random walk model IRWRLDA, which integrates multiple data sources including lncRNA expression similarity, functional similarity, Gaussian interaction profile kernel similarity, and disease semantic similarity to predict lncRNA–disease associations. Gu et al. [22] established a global network random walk model GrwLDA, which predicts potential lncRNA–disease associations by integrating disease semantic similarity, lncRNA functional similarity, and known lncRNA–disease associations. These data fusion-based methods have achieved significant improvements over methods that use a single data source.
The third category is some methods based on matrix completion. MFLDA [23] decomposes data matrices of heterogeneous data sources into low-rank matrices via matrix tri-factorization to explore and exploit their intrinsic and shared structure. However, it cannot predict lncRNAs that are not associated with any disease or diseases that are not associated with any lncRNA. SIMCLDA [24] models the lncRNA–disease associations’ prediction problem as a recommended task and uses the induction matrix completion method to solve it.
The lncRNA–disease association matrix and the drug–target association matrix are generally sparse matrices with less known associations. The sparsity of the lncRNA–disease dataset used in this paper is 97.36%, which was obtained from 1-540/(115*178) (115 lncRNAs, 178 diseases, and 540 known associations, sparsity = 1-540/(115*178)), and the sparsity of the four benchmark datasets is 99.01%, 96.55%, 97.00%, and 93.59%, respectively, in drug–target interaction prediction [25]. With regard to the sparse characteristics of the drug–target matrix, neighborhood regularized logistic matrix factorization (NRLMF) was adopted in [26] to predict drug–target interactions, and the effect was significant. NRLMF has also been successfully applied to the prediction of the associations between miRNA–disease [27] and lncRNA–protein [28,29]. Based on NRLMF, dual-network integrated logistic matrix factorization (DNILMF) introduced a drug similarity network and target similarity network to improve the accuracy of prediction [26]. However, the DNILMF prediction effect was greatly affected by the parameter setting. The method of setting parameters based on experience had significant limits in [26]. Because the Gaussian process mutual information algorithm (GP-MI) [30], an advanced Bayesian optimization method, has been successfully applied to the parameter optimization of the logistic matrix factorization model and brings about positive results [31], this paper adopts the GP-MI algorithm to optimize the parameters for DNILMF.
The advantages of using DNILMF-LDA to predict lncRNA–disease associations are: (1) logistic matrix factorization, especially suitable for binary variables and sparsity problems, is used to model the interaction probability of each lncRNA–disease pair; (2) two different similarity kernel matrices of lncRNAs and diseases are fused into a composite kernel matrix by nonlinear fusion technology, and then, the fused kernel matrices are integrated into the model; (3) the lncRNAs’ and diseases’ similarity networks are introduced in the model; the flowchart of DNILMF-LDA given in Figure 1.

2. Materials

2.1. lncRNA–Disease Associations Matrix

The original lncRNA–disease association dataset was downloaded from the lncRNADisease [14] database, which integrated 687 experimentally-validated lncRNA–disease associations between 246 diseases and 369 lncRNAs. The diseases without disease ontology (http://disease-ontology.org/) and lncRNAs without expression profiles in ArrayExpress [32] (http://www.ebi.ac.uk/arrayexpress/) were filtered out, and 540 experimentally-validated lncRNA–disease associations between 115 lncRNAs and 178 diseases were obtained. The lncRNA–disease association matrix is represented by Y.

2.2. lncRNA Expression Similarity Matrix and Disease Semantic Similarity Matrix

More than 60,000 expression profiles from 16 human tissues were downloaded from ArrayExpress [32]. The Spearman correlation coefficient between any two lncRNAs in 115 lncRNAs was calculated and was used as the expression similarity for this pair of lncRNAs [17]. The expression similarity matrix of all lncRNAs is represented by S l .
The semantic similarity of diseases is often used to predict potential lncRNA–disease associations. The semantic similarity of the disease in this paper was calculated with the method in paper [33]. Each disease was represented by a directed acyclic graph (DAG) containing all relevant annotated items, which came from the National Library of Medicine (http://www.nlm.nih.gov/mesh). The semantic similarity of two diseases is based on both the addresses of these diseases in DAG graphs and their semantic relations with their ancestor diseases. The DOSE package provided us with the method to calculate semantic similarities among diseases [34]. The semantic similarity matrix of the disease is represented by S d .

2.3. Similarity Kernel Matrices

Kernel matrices of lncRNAs and diseases were constructed for nonlinear kernel fusion. The construction of kernel matrices consisted of two steps.
The first step was to convert the lncRNAs’ and diseases’ similarity matrices into kernel matrices. In this step, S l and S d were converted to kernel matrices by:
(1) converting S l and S d to the corresponding symmetric matrices, S s y m = S + S T / 2 ;
(2) transforming the symmetric matrices obtained in the first step into semi-positive definite matrices by adding multiple small identity matrices [35]. The transformed lncRNAs’ and diseases’ kernel matrices are represented by K l and K d , respectively.
The second step was to calculate the Gaussian interaction profile (GIP) kernel matrix of lncRNAs and diseases. Y l i and Y l j represent the interaction profile of lncRNA iand lncRNA j, which are the i th row and j th vector of association matrix Y. The distance between these two vectors was computed as their GIP kernel. In this step, for a given lncRNA–disease associations matrix Y, the GIP kernel K g i p l between lncRNAs was calculated according to Formula (1) [35]:
K g i p l i , l j = exp Y l i Y l j 2 σ
where · represents the Euclidean distance and σ represents the kernel bandwidth of the Gaussian spectrum. In our work, the value of σ was set to one. GIP kernel K g i p d between diseases was calculated using the same method.

2.4. Fusion of Similarity Kernel Matrices

The purpose of similarity kernel matrices’ fusion is to merge K g i p l and K l into a kernel matrix and merge K g i p d and K d into another kernel matrix. The steps of kernel fusion [36,37] are:
(1) Normalize and symmetrize the above four kernel matrices. Taking the fusion steps between K g i p l and K l as an example, the resulting matrices are denoted by P 1 and P 2 .
(2) Construct local similarity matrix L 1 and L 2 of K g i p l and K l by Formula (2):
L 1 i , j = P 1 i , j k N i P 1 i , k , j N i 0 , others
where P 1 i , j represents the i th row and j th column element in matrix P 1 . N i denotes the nearest neighbors of the current target i. The number of nearest neighbors was set to 3 according to experience. The similarity between lncRNAi and non-nearest neighbors was zero. Finally, L 1 and L 2 can be obtained;
(3) Update P 1 and P 2 iteratively by Formulas (3) and (4). Iteration step t was set to two by experience.
P t ( 1 ) = L ( 1 ) P t 1 ( 2 ) L ( 1 ) T
P t ( 2 ) = L ( 2 ) P t 1 ( 1 ) L ( 2 ) T
(4) After the iterations, P t 1 and P t 2 were averaged and normalized as the final kernel matrix of diseases, denoted as S d k . S l k was calculated using the same method.

3. Methods

3.1. Problem Formalization

In this paper, the collection of lncRNAs is represented by L = l i 1 m , and the collection of diseases is represented by D = d j 1 n , where m and n are the number of lncRNAs and diseases, respectively. The associations between lncRNAs and diseases are represented by a binary matrix Y R m × n . When lncRNA l i was experimentally verified to be associated with disease d j , y i j = 1 , otherwise, y i j = 0 . L + = l i | j = 1 n y i j > 0 , 1 i m is the collection of positive lncRNAs, and D + = d i | i = 1 m y i j > 0 , 1 j n is the collection of positive diseases. Thus, L = L / L + is the collection of lncRNAs with no known association with all diseases. D = D / D + is the collection of diseases with no known association with all lncRNAs. S l k R m × m is the final similarity kernel matrix of lncRNAs, and S d k R n × n is the final similarity kernel matrix of diseases. The purpose of this paper is to predict lncRNA–disease interaction probabilities and rank candidate lncRNA–disease pairs based on predicted probabilities. The higher ranked lncRNA–disease pairs are most likely to be correlated.

3.2. Prediction of lncRNA–Disease Associations Using the DNILMF Model

The lncRNAs’ kernel matrix S l k , diseases’ kernel matrix S d k , and lncRNA–disease association matrix Y are the input data for the DNILMF model to infer potential lncRNA–disease associations. lncRNAs and diseases were mapped to the r-dimensional shared potential space, where r < min(m, n). Latent vectors u i R 1 × r and v j R 1 × r represent the characteristics of lncRNA l i and disease d j , respectively. U R m × r and V R n × r are potential vectors for all lncRNAs and diseases. Then, the probabilities P of all lncRNAs and diseases were modeled by the following logistic function:
P = exp U V T 1 + exp U V T
What needs to be emphasized is the calculation of P depends on the lncRNA–disease association network Y. Based on the hypothesis that similar diseases are always associated with functionally similar lncRNAs, the interaction probability of lncRNA–disease is affected not only by the lncRNA–disease association network Y, but also by lncRNAs’ similarity network S l k and diseases’ similarity network S d k . Hence, Y is combined with S l k and S d k for matrix factorization. The interaction probabilities of lncRNAs and diseases are:
P = exp α U V T + β S l k U V T + γ U V T S d k 1 + exp α U V T + β S l k U V T + γ U V T S d k
where α , β , γ are the corresponding weight of Y, S l k and S d k . Their sum is 1, and β = γ .
Since the known lncRNA–disease associations are more important than the unknown lncRNA–disease associations, we set the weight of the known lncRNA–disease pairs to c ( c 1 ) and that of the unknown lncRNA–disease pairs to 1. By assuming all samples are independent, the probability p Y | U , V can be calculated by:
p Y | U , V = i = 1 m j = 1 n P i j c Y i j 1 P i j 1 Y i j
where P i j is the interaction probability between lncRNA l i and disease d j . Setting the zero-mean spherical Gaussian prior in lncRNAs’ and diseases’ potential vectors is done as follows:
p U | σ l 2 = i = 1 m N u i | 0 , σ l 2 I , p V | σ d 2 = j = 1 n N v j | 0 , σ d 2 I
where σ l 2 and σ d 2 are the parameters that control the variance of the Gaussian distribution and I represents the identity matrix. According to Bayesian inference:
p U , V | Y , σ l 2 , σ d 2 p Y | U , V p U | σ l 2 p V | σ d 2
Then, learn the model parameters U and V by maximizing the logarithm of the posterior distribution. The objective function L is:
L = max U , V i , j c Y α U V T + β S l k U V T + γ U V T S d k 1 + c Y Y ln 1 + exp α U V T + β S l k U V T + γ U V T S d k ) ] ) λ u 2 U F 2 λ v 2 V F 2
where λ u = 1 σ l 2 , λ ν = 1 σ d 2 , λ u , and λ ν are regularization coefficient of U and V, · F 2 is the Frobenius norm, and ⊙ is the Hadamard product. Starting from the above objective function, the gradient descent algorithm was used to solve U and V, and the gradient variables of U and V are as follows:
L U = c α I + β S l k T Y V + γ c Y Q S d k T V α I + β S l k T Q V λ u U
L V = c α I + γ S d k T Y T U + β c Y T Q T S l k U α I + γ S d k T Q T U λ v V
where Q = 1 + c Y Y 1 exp α U V T + β S l k U V T + γ U V T S d k + 1 , Q T is the transposed matrix of Q. This work uses the AdaGrad algorithm [38] to accelerate the convergence of U and V.
Based on the matrices U and V, the interaction probabilities of any unknown lncRNA–disease pairs can be calculated by Formula (6). Due to the uncertainty of lncRNA l i L and disease d j D , their potential vectors u i and v j obtained by gradient descent cannot accurately describe their characteristics, so k-nearest neighbor sets N + l i and N + d j of l i and d j were constructed (k was empirically set to 5). Then, replace potential vector u i and v j with the linear combination of the k-nearest neighbors [25,26]. The modified interaction probability is:
p ^ i j = exp u ^ i v ^ j T 1 + exp u ^ i v ^ j T
where:
u ^ i = u i , l i L + 1 u N + l i S i u l u N + l i S i u l u u , l i L
v ^ j = v j , d j D + 1 v N + d j S j v d v N + d j S j v d u v , d j D
S i u l denotes the similarity between unknown lncRNA l i and known lncRNA l u , and u u denotes the latent variable of l u .
The selection of model parameter r , α , β , γ , λ u , λ v can affect the performance of the model somehow. It is difficult to ensure the best performance of the model by using empirical parameter values. In order to improve the performance of the model, the Bayesian optimization algorithm was adopted to optimize the setting of parameter values in this work.

3.3. Bayesian Optimization

The Gaussian process mutual information algorithm (GP-MI) was used to optimize the setting of the parameter values. The optimization process of GP-MI for the DNILMF model parameters is shown in Figure 2.
(1)
Bayesian optimization
For function f : χ R , f is an unknown function to be optimized, and χ R n n N , a tight convex set. In this paper, the DNRLMF model is f, and R n is the parameter search space. The purpose of Bayesian optimization is to find the optimal solution for f through continuous queries x ( x 1 , x 2 , χ ). At iteration t, the new query x t is selected from χ according to the previous query χ t 1 = x 1 , x 2 , x t 1 and observations Y t 1 = y 1 , y 2 , , y t 1 . The relationship between y t and x t is y t = f x t + ϵ t , where ϵ t is the noise variable, ϵ t N 0 , σ 2 .
(2)
Gaussian process
Suppose the function f follows Gaussian process G P m , k [30], where m : χ R is a mean function and k : χ × χ R is a kernel function. Let the mean function be zero, that is m : χ 0 , the kernel function is a square exponential kernel.
According to the previous t 1 times queries χ t 1 and observations Y t 1 , the posterior distribution at iteration t is a Gaussian process with expectation as μ t x and variance as σ t 2 x by Bayesian inference.
(3)
GP-MI algorithm
The most critical aspect of the GP-MI algorithm is the choice of the next query x t χ using μ t x and variance σ t 2 x .
x t = a r g m a x x χ μ t x + ϕ t x
where ϕ t : χ R is the increment function of σ t 2 x :
ϕ t x = log 2 δ σ t 2 x + γ ^ t 1 γ ^ t 1
γ ^ t 1 γ ^ t 2 + σ t 1 2 x t 1 ; δ > 0 is a hyperparameter; and the iteration ending condition is x t + 1 = x t . The pseudocode of the GP-MI is shown in Algorithm 1:
Algorithm 1 GP-MI.
γ ^ 0 0
for t = 1 , 2 , do
 Compute μ t and σ t 2 by χ t = x 1 . . x t 1 and Y t = y 1 . . y t 1 // Bayesian inference
ϕ t x log 2 δ σ t 2 x + γ ^ t 1 γ ^ t 1 // Definition of ϕ t ( x ) for all x χ
x t a r g m a x x χ μ t x + ϕ t x // Selection of the next query location
γ ^ t γ ^ t 1 + σ t 2 x t // Update γ ^ t
 get y t by the DNILMF model and x t // Query x t , y t
end for

4. Experimental Results

4.1. Evaluation of Prediction Performance

In this paper, the prediction performance of the detection model was verified by 10-fold cross-validation (CV). AUC and the area under precision-recall (PR) curve (AUPR) were used as the performance evaluation indexes of the model. AUC is an important index to evaluate the classification model. If AUC = 1, the model has perfect performance; if AUC = 0.5, this means random performance. The higher the values of AUC and AUPR, the better the prediction performance.
During the 10-fold CV process, lncRNA–disease pairs (including known pairs and unknown pairs) were randomly divided into ten groups with almost the same data size by setting random seeds. Each time, one of the ten groups was used as the test data, and the values of the test data in the adjacency matrix Y were set to zero. The resulting matrix was the training data Y t r a i n . In each iteration of 10-fold CV, firstly, calculate the kernel matrix and the GIP kernel matrix of lncRNAs and diseases. Secondly, fuse the kernel matrices of lncRNAs and diseases to get two composite kernel matrices. Then, take the fused kernel matrices and Y t r a i n as the model input and update the value of the potential vectors U, V through gradient descent until the optimal value of the model is achieved. Finally, the AUC and AUPR values were obtained by using the trained model to predict and evaluate the test data. After ten iterations, the AUC values of 10 test sets were obtained, and their mean value was taken as the AUC value of one time 10-fold CV. Under 10-fold CV, the AUC value of the model reached 0.9202, and the AUPR value reached 0.5610.

4.2. Comparison with Other Methods

To further evaluate the performance of our model, we compared it with LRLSLDA, BiwalkLDA, SIMCLDA, and TPGLDA under 10-fold CV. The prediction result of the five models using the same dataset is shown in Table 1. The result showed that both AUC and AUPR values of DNILMF-LDA were the highest among five models, indicating that the performance of our model was better than the others. Figure 3 and Figure 4 respectively show the receiver operating characteristic (ROC) curve and precision-recall (PR) curve of the five models.

4.3. Parameter Analysis

For DNILMF-LDA, the dimension r of shared potential space was from 50–100 with a step length of 10 [31]. The coefficient of the potential matrix product ranged from 0–1 with a step length of 0.1, β = γ = 1 α / 2 ; the regularization coefficients λ u and λ v for potential variables of lncRNAs and diseases ranged from 1–10, with a step size of one [25]. The number of neighbors to construct the neighbor set of unknown lncRNAs and diseases was set to five. The weight of known interaction pair was set to five. According to the results of the literature [31], when δ = 10 100 , the Bayesian optimization was very close to the prediction accuracy of the grid search, but the calculation time decreased by 8.94-times on average. Therefore, we set the value of δ and the noise variance of the Gaussian process kernel function σ 2 to 10 100 and 0.1, respectively. In summary, r = 50 , 100 , α = { 0 . 1 , 1 } , λ u = 1 , 10 , λ v = 1 , 10 , K = 5 , c = 5 , δ = 10 100 , and σ 2 = 0 . 1 .
The parameter optimization results of the DNILMF model by the GP-MI algorithm showed that the prediction performance of the model was good when the model parameters r took any value in the range of 50 , 100 , α = 0.1 , β = γ = 0.45 , λ u = 1 , λ v = 1 . When r = 90 , the AUC value of the model reached its highest at 0.9202. The AUC value is shown in Figure 5 when r took different values. The weight of β and γ was greater than that of c, which indicated the importance of the lncRNAs’ and diseases’ similarity network and also indicated the effectiveness of adding the lncRNA–disease associations network and similarity networks into the model.

4.4. Case Studies on Breast, Lung, and Colon Cancer

We further evaluated the role of the DNILMF-LDA model in predicting lncRNA–disease associations by studying three common and typical cancers: breast cancer, lung cancer, and colon cancer. The top ten candidate lncRNAs calculated by DNILMF-LDA for three cancers and their evidence are listed in Table 2, Table 3 and Table 4. The verification of the prediction results was supported by the lncRNADisease and lnc2Cancer databases [14,15].
Lung cancer is one of the most common and deadly cancers in the world. Among the top 10 candidate lncRNAs calculated by DNILMF-LDA, seven lncRNAs were experimentally verified to be associated with lung cancer. For example, the lncRNA-CDKN2B-AS1 promotes NSCLC cell proliferation and inhibits apoptosis by suppressing KLF2 and P21 expression [39]. In addition, a recent study has shown that upregulated lncRNA-UCA1 plays an important role in the development of lung cancer, and it has great application prospects in clinical diagnosis [40].
Colon cancer is the third most common cancer and the second leading cause of cancer death in men and women [41]. Of the top 10 candidate lncRNAs calculated by DNILMF-LDA, eight lncRNAs were experimentally demonstrated to be associated with colon cancer. Studies have shown that inhibiting the expression of lncRNA-TUG 1 can significantly inhibit the migration ability of colon cancer cells, and the overexpression of TUG 1 may promote the proliferation and migration of colon cancer cells [42].
Breast cancer is the most common cancer in women and the most common cancer in the world. Among the top ten candidate lncRNAs calculated by DNILMF-LDA, seven lncRNAs were experimentally demonstrated to be associated with breast cancer. Studies have shown that upregulated lncRNA-CCAT 1, second in our list of breast cancer, participates in various cellular processes related to cancer occurrence [43].
These case studies reconfirmed the potential of DNILMF-LDA in identifying potential lncRNA–disease associations.

5. Discussion

Studies have shown that lncRNAs play an essential role in biological processes and in the diagnosis, prevention, and treatment of complex diseases. It has become an extraordinary method to combine multiple different similarity matrices in the computational model, and using matrix factorization to predict the potential lncRNA–disease associations is also a hot topic. In this paper, the dual-network integrated logistic matrix factorization model was used to predict the potential lncRNA–disease associations, and the GP-MI algorithm of Bayesian optimization was applied for parameter optimization to ensure the optimal performance of the model.
The main advantages of DNILMF-LDA are: (1) Logistic matrix factorization, especially suitable for binary variables and sparsity problems, was used to model the associations probability of each lncRNA–disease pair. (2) The GIP kernel matrix and similarity matrix of lncRNAs and diseases were obtained, and the nonlinear fusion method was adopted in the process of similarity kernel fusion to reduce the difference between similarity matrices. (3) lncRNAs’ and diseases’ similarity networks were introduced in the model. In this paper, 10-fold CV was used to evaluate the prediction performance of our model. The results showed that compared with the LRLSLDA, BiwalkLDA, SIMCLDA, and TPGLDA models, the AUC value of DNILMF-LDA was higher and the prediction performance of DNILMF-LDA better. In addition, case studies of lung cancer, colon cancer, and breast cancer also suggested that DNILMF-LDA was a better computational method to predict the potential lncRNA–disease associations.
Although DNILMF-LDA has obtained reliable experimental results, there are still some biases. For example, the known experimentally-verified lncRNA–disease associations are still limited, and the predictive performance of DNILMF-LDA will be improved by a more comprehensive dataset.

6. Conclusions

In this paper, our major contributions were as follows: First, logistic matrix factorization was used to model the interaction probability of each lncRNA–disease pair. Second, lncRNA and disease similarity networks were introduced into the model. Third, the imbalance between known and unknown interaction pairs was balanced by giving higher weights to known interactions in the model. Fourth, the method of neighborhood information was used to deal with the problems of new lncRNAs and diseases in the process of prediction. Fifth, multiple source similarity fusion was used to improve the prediction accuracy. We obtained the Gaussian kernel matrix and similarity kernel matrix of lncRNAs and diseases, adopted nonlinear fusion to weaken the differences between similar matrices, and extracted the most important information from different similarity data. Sixth, the GP-MI algorithm in Bayesian optimization was adopted in this paper for parameter optimization.
In the future, we expect to acquire new multi-source datasets and explore better kernel fusion methods. Then, we can improve the prediction performance by fully exploiting multi-source data and advanced fusion technology.

Author Contributions

Conceptualization, Y.L.; methodology, Y.L.; software, Y.L.; writing, original draft preparation, Y.L., J.L., and N.B.; writing, review and editing, Y.L., J.L., and N.B.; supervision, N.B.

Funding

This research received no external funding.

Acknowledgments

The authors thank Xiaofang Xiao for assistance with the experiments.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chang, H.Y. Abstract IA02: Genome regulation by long noncoding RNAs. Cancer Res. 2016, 76, IA02. [Google Scholar]
  2. Yanofsky, C. Establishing the triplet nature of the genetic code. Cell 2007, 128, 815–818. [Google Scholar] [CrossRef] [PubMed]
  3. Merry, C.R.; Niland, C.; Khalil, A.M. Diverse Functions and Mechanisms of Mammalian Long Noncoding RNAs; Springer: New York, NY, USA, 2015; pp. 1–14. [Google Scholar]
  4. Cheetham, S.; Gruhl, F.; Mattick, J.; Dinger, M. Long noncoding RNAs and the genetics of cancer. Br. J. Cancer 2013, 108, 2419. [Google Scholar] [CrossRef] [PubMed]
  5. Taft, R.J.; Pang, K.C.; Mercer, T.R.; Dinger, M.; Mattick, J.S. Non-coding RNAs: Regulators of disease. J. Pathol. 2010, 220, 126–139. [Google Scholar] [CrossRef] [PubMed]
  6. Pasmant, E.; Sabbagh, A.; Vidaud, M.; Bièche, I. ANRIL, a long, noncoding RNA, is an unexpected major hotspot in GWAS. FASEB J. 2011, 25, 444–448. [Google Scholar] [CrossRef]
  7. Zhang, Q.; Chen, C.Y.; Yedavalli, V.S.; Jeang, K.T. NEAT1 long noncoding RNA and paraspeckle bodies modulate HIV-1 posttranscriptional expression. MBio 2013, 4, e00596–e00612. [Google Scholar] [CrossRef] [PubMed]
  8. Wang, J.; Liu, X.; Wu, H.; Ni, P.; Gu, Z.; Qiao, Y.; Chen, N.; Sun, F.; Fan, Q. CREB up-regulates long non-coding RNA, HULC expression through interaction with microRNA-372 in liver cancer. Nucleic Acids Res. 2010, 38, 5366–5383. [Google Scholar] [CrossRef] [Green Version]
  9. Wapinski, O.; Chang, H.Y. Long noncoding RNAs and human disease. Trends Cell Biol. 2011, 21, 354–361. [Google Scholar] [CrossRef]
  10. Cui, Z.; Ren, S.; Lu, J.; Wang, F.; Xu, W.; Sun, Y.; Wei, M.; Chen, J.; Gao, X.; Xu, C.; et al. The prostate cancer-up-regulated long noncoding RNA PlncRNA-1 modulates apoptosis and proliferation through reciprocal regulation of androgen receptor. Urol. Oncol. Semin. Orig. Investig. 2013, 31, 1117–1123. [Google Scholar] [CrossRef]
  11. Sun, J.; Chen, X.; Wang, Z.; Guo, M.; Shi, H.; Wang, X.; Cheng, L.; Zhou, M. A potential prognostic long non-coding RNA signature to predict metastasis-free survival of breast cancer patients. Sci. Rep. 2015, 5, 16553. [Google Scholar] [CrossRef]
  12. Ma, Z.; Xue, S.; Zeng, B.; Qiu, D. lncRNA SNHG5 is associated with poor prognosis of bladder cancer and promotes bladder cancer cell proliferation through targeting p27. Trends Cell Biol. 2018, 15, 1924–1930. [Google Scholar] [CrossRef] [PubMed]
  13. Quek, X.C.; Thomson, D.W.; Maag, J.L.; Bartonicek, N.; Signal, B.; Clark, M.B.; Gloss, B.S.; Dinger, M.E. lncRNAdb v2. 0: Expanding the reference database for functional long noncoding RNAs. Nucleic Acids Res. 2014, 21, D168–D173. [Google Scholar]
  14. Chen, G.; Wang, Z.; Wang, D.; Qiu, C.; Liu, M.; Chen, X.; Zhang, Q.; Yan, G.; Cui, Q. lncRNADisease: A database for long-non-coding RNA-associated diseases. Nucleic Acids Res. 2012, 41, D983–D986. [Google Scholar] [CrossRef] [PubMed]
  15. Gao, Y.; Wang, P.; Wang, Y.; Ma, X.; Zhi, H.; Zhou, D.; Li, X.; Fang, Y.; Shen, W.; Xu, Y.; et al. Lnc2Cancer v2.0: Updated database of experimentally supported long non-coding RNAs in human cancers. Nucleic Acids Res. 2018, 47, D1028–D1033. [Google Scholar] [CrossRef] [PubMed]
  16. Zhao, Y.; Li, H.; Fang, S.; Kang, Y.; Wu, W.; Hao, Y.; Li, Z.; Bu, D.; Sun, N.; Zhang, M.Q.; et al. NONCODE 2016: An informative and valuable data source of long non-coding RNAs. Nucleic Acids Res. 2015, 44, D203–D208. [Google Scholar] [CrossRef] [PubMed]
  17. Chen, X.; Yan, G.Y. Novel human lncRNA-disease association inference based on lncRNA expression profiles. Bioinformatics 2013, 29, 2617–2624. [Google Scholar] [CrossRef]
  18. Lan, W.; Li, M.; Zhao, K.; Liu, J.; Wu, F.X.; Pan, Y.; Wang, J. LDAP: A web server for lncRNA-disease association prediction. Bioinformatics 2016, 33, 458–460. [Google Scholar] [CrossRef]
  19. Yu, J.; Ping, P.; Wang, L.; Kuang, L.; Li, X.; Wu, Z. A Novel Probability Model for lncRNA-Disease Association Prediction Based on the Naïve Bayesian Classifier. Genes 2018, 9, 345. [Google Scholar] [CrossRef]
  20. Ding, L.; Wang, M.; Sun, D.; Li, A. TPGLDA: Novel prediction of associations between lncRNAs and diseases via lncRNA-disease-gene tripartite graph. Sci. Rep. 2018, 8, 1065. [Google Scholar] [CrossRef]
  21. Chen, X.; You, Z.H.; Yan, G.Y.; Gong, D.W. IRWRLDA: Improved random walk with restart for lncRNA-disease association prediction. Oncotarget 2016, 7, 57919. [Google Scholar] [CrossRef]
  22. Gu, C.; Liao, B.; Li, X.; Cai, L.; Li, Z.; Li, K.; Yang, J. Global network random walk for predicting potential human lncRNA-disease associations. Sci. Rep. 2017, 7, 12442. [Google Scholar] [CrossRef]
  23. Fu, G.; Wang, J.; Domeniconi, C.; Yu, G. Matrix factorization-based data fusion for the prediction of lncRNA-disease associations. Bioinformatics 2017, 34, 1529–1537. [Google Scholar] [CrossRef]
  24. Lu, C.; Yang, M.; Luo, F.; Wu, F.X.; Li, M.; Pan, Y.; Li, Y.; Wang, J. Prediction of lncRNA-disease associations based on inductive matrix completion. Bioinformatics 2018, 34, 3357–3364. [Google Scholar] [CrossRef]
  25. Hao, M.; Bryant, S.H.; Wang, Y. Predicting drug–target interactions by dual-network integrated logistic matrix factorization. Sci. Rep. 2017, 7, 40376. [Google Scholar] [CrossRef]
  26. Liu, Y.; Wu, M.; Miao, C.; Zhao, P.; Li, X.L. Neighborhood regularized logistic matrix factorization for drug–target interaction prediction. PLoS Comput. Biol. 2016, 12, e1004760. [Google Scholar] [CrossRef]
  27. Yan, C.; Wang, J.; Ni, P.; Lan, W.; Wu, F.; Pan, Y. DNRLMF-MDA: Predicting microRNA-disease associations based on similarities of microRNAs and diseases. IEEE/ACM Trans. Comput. Biol. Bioinform. 2017, 16, 233–243. [Google Scholar] [CrossRef]
  28. Zhao, Q.; Zhang, Y.; Hu, H.; Ren, G.; Zhang, W.; Liu, H. IRWNRLPI: Integrating random walk and neighborhood regularized logistic matrix factorization for lncRNA-protein interaction prediction. Front. Genet. 2018, 9, 239. [Google Scholar] [CrossRef]
  29. Liu, H.; Ren, G.; Hu, H.; Zhang, L.; Ai, H.; Zhang, W.; Zhao, Q. LPI-NRLMF: lncRNA-protein interaction prediction by neighborhood regularized logistic matrix factorization. Oncotarget 2017, 8, 103975. [Google Scholar] [CrossRef]
  30. Contal, E.; Perchet, V.; Vayatis, N. Gaussian process optimization with mutual information. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 253–261. [Google Scholar]
  31. Ban, T.; Ohue, M.; Akiyama, Y. Efficient hyperparameter optimization by using Bayesian optimization for drug–target interaction prediction. In Proceedings of the 2017 IEEE 7th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), Orlando, FL, USA, 19–21 October 2017; pp. 1–6. [Google Scholar]
  32. Parkinson, H.; Kapushesky, M.; Shojatalab, M.; Abeygunawardena, N.; Coulson, R.; Farne, A.; Holloway, E.; Kolesnykov, N.; Lilja, P.; Lukk, M.; et al. ArrayExpress—A public database of microarray experiments and gene expression profiles. Nucleic Acids Res. 2006, 35, D747–D750. [Google Scholar] [CrossRef]
  33. Wang, D.; Wang, J.; Lu, M.; Song, F.; Cui, Q. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics 2010, 26, 1644–1650. [Google Scholar] [CrossRef] [Green Version]
  34. Yu, G.; Wang, L.G.; Yan, G.R.; He, Q.Y. DOSE: An R/Bioconductor package for disease ontology semantic and enrichment analysis. Bioinformatics 2014, 31, 608–609. [Google Scholar] [CrossRef]
  35. Van Laarhoven, T.; Nabuurs, S.B.; Marchiori, E. Gaussian interaction profile kernels for predicting drug–target interaction. Bioinformatics 2011, 27, 3036–3043. [Google Scholar] [CrossRef]
  36. Hao, M.; Wang, Y.; Bryant, S.H. Improved prediction of drug–target interactions using regularized least squares integrating with kernel fusion technique. Anal. Chim. Acta 2016, 909, 41–50. [Google Scholar] [CrossRef]
  37. Wang, B.; Mezlini, A.M.; Demir, F.; Fiume, M.; Tu, Z.; Brudno, M.; Haibe-Kains, B.; Goldenberg, A. Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods 2014, 11, 333. [Google Scholar] [CrossRef]
  38. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159. [Google Scholar]
  39. Nie, F.Q.; Sun, M.; Yang, J.S.; Xie, M.; Xu, T.P.; Xia, R.; Liu, Y.W.; Liu, X.H.; Zhang, E.B.; Lu, K.H.; et al. Long noncoding RNA ANRIL promotes non-small cell lung cancer cell proliferation and inhibits apoptosis by silencing KLF2 and P21 expression. Mol. Cancer Ther. 2015, 14, 268–277. [Google Scholar] [CrossRef]
  40. Wang, H.M.; Lu, J.H.; Chen, W.Y.; Gu, A.Q. Upregulated lncRNA-UCA1 contributes to progression of lung cancer and is closely related to clinical diagnosis as a predictive biomarker in plasma. Int. J. Clin. Exp. Med. 2015, 8, 11824. [Google Scholar]
  41. Prenner, S.; Levitsky, J. Comprehensive review on colorectal cancer and transplant. Am. J. Transplant. 2017, 17, 2761–2774. [Google Scholar] [CrossRef]
  42. Zhai, H.Y.; Sui, M.H.; Yu, X.; Qu, Z.; Hu, J.C.; Sun, H.Q.; Zheng, H.T.; Zhou, K.; Jiang, L.X. Overexpression of long non-coding RNA TUG1 promotes colon cancer progression. Med. Sci. Monit. 2016, 22, 3281. [Google Scholar] [CrossRef]
  43. Zhang, X.F.; Liu, T.; Li, Y.; Li, S. Overexpression of long non-coding RNA CCAT1 is a novel biomarker of poor prognosis in patients with breast cancer. Int. J. Clin. Exp. Pathol. 2015, 8, 9440. [Google Scholar]
Figure 1. The flowchart of dual-network integrated logistic matrix factorization-lncRNA–disease association (DNILMF-LDA). Step 1: converting the calculated lncRNAs’ similarity matrix and the diseases’ similarity matrix to the corresponding kernel matrix; Step 2: calculating the Gaussian interaction profile kernel matrix of lncRNAs and diseases, respectively; Step 3: fusing two kernel matrices corresponding to the lncRNAs and the diseases respectively into one kernel matrix; Step 4: constructing the DNILMF model with the lncRNA–disease associations matrix, lncRNAs, and diseases kernel matrices as the input data. In order to ensure the optimal performance of the algorithm, the Gaussian process mutual information (GP-MI) algorithm is used to select parameters. GIP, Gaussian interaction profile.
Figure 1. The flowchart of dual-network integrated logistic matrix factorization-lncRNA–disease association (DNILMF-LDA). Step 1: converting the calculated lncRNAs’ similarity matrix and the diseases’ similarity matrix to the corresponding kernel matrix; Step 2: calculating the Gaussian interaction profile kernel matrix of lncRNAs and diseases, respectively; Step 3: fusing two kernel matrices corresponding to the lncRNAs and the diseases respectively into one kernel matrix; Step 4: constructing the DNILMF model with the lncRNA–disease associations matrix, lncRNAs, and diseases kernel matrices as the input data. In order to ensure the optimal performance of the algorithm, the Gaussian process mutual information (GP-MI) algorithm is used to select parameters. GIP, Gaussian interaction profile.
Genes 10 00608 g001
Figure 2. The optimization process of GP-MI for the DNILMF model parameters. At irritation t: Step 1: get x t according to the previous query χ t 1 and observations Y t 1 ; Step 2: if x t = x t 1 or t is equal to the max value, exit the program; if not, put x t , the disease kernel matrix, lncRNA–disease association matrix, and lncRNA kernel matrix into the DNILMF model, and we can get output x t . Then, take x t and y t as the start of the next irritation.
Figure 2. The optimization process of GP-MI for the DNILMF model parameters. At irritation t: Step 1: get x t according to the previous query χ t 1 and observations Y t 1 ; Step 2: if x t = x t 1 or t is equal to the max value, exit the program; if not, put x t , the disease kernel matrix, lncRNA–disease association matrix, and lncRNA kernel matrix into the DNILMF model, and we can get output x t . Then, take x t and y t as the start of the next irritation.
Genes 10 00608 g002
Figure 3. ROC curve of the five models.
Figure 3. ROC curve of the five models.
Genes 10 00608 g003
Figure 4. PR curve of the five models.
Figure 4. PR curve of the five models.
Genes 10 00608 g004
Figure 5. Influence of r on AUC value when α = 0.1 , β = γ = 0.45 , λ u = 1 , λ v = 1 .
Figure 5. Influence of r on AUC value when α = 0.1 , β = γ = 0.45 , λ u = 1 , λ v = 1 .
Genes 10 00608 g005
Table 1. AUC and area under precision-recall (PR) curve (AUPR) values of the five models.
Table 1. AUC and area under precision-recall (PR) curve (AUPR) values of the five models.
MethodAUCAUPR
LRLSLDA0.53210.0344
SIMCLDA0.78950.1605
BiwalkLDA0.83670.1909
TPGLDA0.85270.1185
DNILMF-LDA0.92020.5610
Table 2. The top ten lncRNA candidates for lung cancer.
Table 2. The top ten lncRNA candidates for lung cancer.
ToplncRNAEvidenceDescription
1CCAT226729200lncRNADisease
2CDKN2B-AS126729200lncRNADisease
3PVT128731781lnc2Cancer
4UCA129731641lnc2Cancer
5CCAT127212446lncRNADisease
6SPRY4-IT126302345lncRNADisease
7GAS526634743lncRNADisease
8HULCunconfirmedunconfirmed
9SRA1unconfirmedunconfirmed
10XISTunconfirmedunconfirmed
Table 3. The top ten lncRNA candidates for colon cancer.
Table 3. The top ten lncRNA candidates for colon cancer.
ToplncRNAEvidenceDescription
1SPRY4-IT128099409lnc2Cancer
2HOTTIP26617875lnc2Cancer
3GHET127931286lnc2Cancer
4MINAunconfirmedunconfirmed
5HIF1A-AS229278853lnc2Cancer
6ADAMTS9-AS227596298lnc2Cancer
7TUG128302487lnc2Cancer
8LINC0015229180678lnc2Cancer
9PANDAR28176943lnc2Cancer
10BC040587unconfirmedunconfirmed
Table 4. The top ten lncRNA candidates for breast cancer.
Table 4. The top ten lncRNA candidates for breast cancer.
ToplncRNAEvidenceDescription
1MNX1-AS1unconfirmedunconfirmed
2CCAT126464701lnc2Cancer
3TUSC723558749lnc2Cancer
4BANCR29565494lnc2Cancer
5DNM3OSunconfirmedunconfirmed
6TUG127791993lncRNADisease
7RPL34-AS1unconfirmedlncRNADisease
8MINA25586347lnc2Cancer
9GHET129843220lnc2Cancer
10PTENP129085464lnc2Cancer

Share and Cite

MDPI and ACS Style

Li, Y.; Li, J.; Bian, N. DNILMF-LDA: Prediction of lncRNA-Disease Associations by Dual-Network Integrated Logistic Matrix Factorization and Bayesian Optimization. Genes 2019, 10, 608. https://doi.org/10.3390/genes10080608

AMA Style

Li Y, Li J, Bian N. DNILMF-LDA: Prediction of lncRNA-Disease Associations by Dual-Network Integrated Logistic Matrix Factorization and Bayesian Optimization. Genes. 2019; 10(8):608. https://doi.org/10.3390/genes10080608

Chicago/Turabian Style

Li, Yan, Junyi Li, and Naizheng Bian. 2019. "DNILMF-LDA: Prediction of lncRNA-Disease Associations by Dual-Network Integrated Logistic Matrix Factorization and Bayesian Optimization" Genes 10, no. 8: 608. https://doi.org/10.3390/genes10080608

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop