Abstract

Long noncoding RNAs (lncRNAs) have an important role in various life processes of the body, especially cancer. The analysis of disease prognosis is ignored in current prediction on lncRNA–disease associations. In this study, a multiple linear regression model was constructed for lncRNA–disease association prediction based on clinical prognosis data (MlrLDAcp), which integrated the cancer data of clinical prognosis and the expression quantity of lncRNA transcript. MlrLDAcp could realize not only cancer survival prediction but also lncRNA–disease association prediction. Ultimately, 60 lncRNAs most closely related to prostate cancer survival were selected from 481 alternative lncRNAs. Then, the multiple linear regression relationship between the prognosis survival of 176 patients with prostate cancer and 60 lncRNAs was also given. Compared with previous studies, MlrLDAcp had a predominant survival predictive ability and could effectively predict lncRNA–disease associations. MlrLDAcp had an area under the curve (AUC) value of 0.875 for survival prediction and an AUC value of 0.872 for lncRNA–disease association prediction. It could be an effective biological method for biomedical research.

1. Introduction

Long noncoding RNAs (lncRNAs) are noncoding RNA molecules, including miRNAs [1], lncRNAs [2], tRNAs [3], piRNAs [4], and more than 200 nucleotides. They were initially thought to be nonfunctional RNA fragments and the only by-product of massive transcription [58]. A large number of recent studies have shown that lncRNAs have abundant biological functions, including the silencing of X chromosome [9] and activation and interference of transcription [1012]. At the same time, the abnormal expression of lncRNAs leads to various diseases [1315]. Therefore the investigation on lncRNA–disease associations is of great significance at the molecular level to radically cure the disease.

Many computational methods have been applied to human lncRNA–disease association prediction in recent years. These methods have two prominent features: machine-learning-based feature and network-based feature.

The machine-learning-based feature of lncRNA–disease association prediction is to establish a learning model in the training dataset and then to perform tests in the test dataset using this learning model. For instance, Zhao et al. [16] developed a learning model based on the Bayesian classifier for lncRNA–disease association prediction. The key issue was that the learning model regarded unknown lncRNA–disease associations as negative sets, restricting the performance of the learning model. In fact, the negative sets for lncRNA–disease association prediction were difficult to achieve. To avoid this problem, Chen et al. [17] put forward the method of LRLSLDA to predict lncRNA–disease associations. It was based only on positive sets, not on negative sets. It adopted the strategy of Laplacian regularized least squares, was a semisupervised learning model, and needed selected optimal parameters to obtain optimal prediction results. LRLSLDA had two limitations: (a) they were under the assumption that functionally similar lncRNAs were related to similar diseases and (b) they were restricted to selecting optimal parameters. Moreover, Chen et al. [18] developed another method, KATZLDA, which incorporated known lncRNA–disease associations, lncRNA expression profiles, lncRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity. KATZLDA could adapt to new diseases and lncRNAs without any known associations. However, KATZLDA still was built on lncRNAs with more known associated diseases or/and miRNA interaction partners.

The network-based feature of lncRNAs–disease association prediction is to establish a learning network of lncRNA–disease associations using known associations. For instance, Yang et al. [19] constructed the bipartite network about lncRNA–disease associations and predicted these associations by the method of transmission in the network, which was the first prediction method based on the network model. A coding–noncoding gene–disease bipartite network was constructed to improve the prediction results in which better prediction results were obtained. However, the approach did not take into account the interaction between lncRNAs and coding genes, and the forecast range was limited. The results were relatively general. Sun et al. [20] proposed a new method based on the network model: RWRlncD. It was used to construct the functional similarity network of lncRNAs using the known lncRNA–disease association network and the similarity network. Subsequently, the reactivated random walk was performed in the functional similarity network of lncRNAs to predict the potential lncRNA–disease associations. However, the edge of the test set was used to calculate the functional similarity of lncRNAs before cross validation, which overestimated the verification results. Zhou et al. [21] presented a novel method (named RWRHLD) to distinguish candidate lncRNA–disease associations using the hybrid network and then performed the random walk algorithm on this hybrid network. RWRHLD was used only to predict lncRNAs in known lncRNA–miRNA associations, where an incomplete coverage of lncRNAs cross talk network and lncRNA–disease association network might lead to inaccurate prediction results. Chen et al. [22] improved the traditional random walk with restart and proposed the method of improved random walk with restart for lncRNA–disease association prediction (IRWRLDA). But the existing problem of IRWRLDA was how to obtain integrated lncRNA similarity based on lncRNA functional similarity and lncRNA Gaussian interaction profile kernel similarity. Chen et al. [23] developed two novel LNCSIMs and proposed a new method LRLSLDA–LNCSIM that could improve the predictive ability of LRLSLDA. LRLSLDA–LNCSIM still had the limit that a semantic contribution decay factor was not well solved. Yu et al. [24] employed multidimensional heterogeneous data to construct an lncRNA function similarity network, employed the disease ontology to construct a disease network, and then proposed the BRWLDA to predict lncRNA–disease associations. Although the prediction performance was improved by BRWLDA, the defect of random walk algorithm still existed. Chen et al. [25] developed a method of HGLDA by integrating miRNA–disease associations and lncRNA–miRNA interactions. However, HGLDA could not be used in the lncRNAs without any known miRNA interaction partners. Ganegoda et al. [26] developed the computational model of KRWRH network, which was a heterogeneous network formed by integrating a disease–disease similarity network, lincRNA–lincRNA similarity network, and known lincRNA–disease association network.

The reviews of Chen et al. [27] and the aforementioned discussions showed that few references were made to the combination of clinical prognosis data with lncRNA–disease association in the existing studies on lncRNA–disease association prediction. In the present study, the analysis of disease prognosis was ignored, and the existing prediction model was limited to a single lncRNA prediction. The prognosis information of a disease associated with lncRNAs was rarely involved (such as the survival time of patients, current state of disease, and family history of genetic diseases). In fact, an analysis related to the prognosis of lncRNA–disease association has more realistic meaning and value.

To overcome the aforementioned issues, a multiple linear regression model for lncRNA–disease association prediction based on clinical prognosis data (MlrLDAcp) was constructed to predict the potential associations between lncRNAs and diseases. At the same time, the survival time of patients with prostate cancer was also predicted in MlrLDAcp. The concepts of predictive correlation factor , decay coefficient , operation, and correction were proposed in this study to construct the multiple linear regression model. An algorithm for developing the multiple linear regression model was designed, in which 481 lncRNA transcripts with values less than 0.001 were cut back to 60 most closely related to the survival time of patients with prostate cancer. Finally, the potential multiple linear regression relationship between the prognosis survival time of 176 patients with prostate cancer and 60 lncRNAs was proposed. MlrLDAcp could realize not only cancer survival prediction but also lncRNA–disease association prediction. Compared with previous findings, MlrLDAcp had a predominant survival predictive ability and could effectively predict lncRNA–disease associations.

2. Materials and Methods

2.1. LncRNA Expression Data

The lncRNA expression data of prostate cancer was obtained from the lncRNAtor database (http://lncrnator.ewha.ac.kr). A total of 44 normal samples and 176 prostate cancer samples were obtained, and the prostate cancer samples were denoted in an ascending order according to sample ID (denoted by ). The expression level of each lncRNA transcript in normal and prostate cancer samples was calculated. The differential expressions between normal and prostate cancer samples were calculated using the aforementioned expression quantities. A total of 481 lncRNA transcripts with a significant difference were obtained (denoted by using a P value in the ascending order) by selecting a P value less than 0.001, and the 481 transcript expression quantities on were denoted by . The lncRNA expression training data matrix (denoted by ) was constructed based on the aforementioned data.

2.2. Clinical Prognosis Data of Patients with Prostate Cancer

The clinical prognosis data of 176 prostate cancer samples in Section 2.1 were obtained from the TCGA database (https://cancergenome.nih.gov). Each prostate cancer sample contained the clinical prognostic data of 60 samples. The data were filtered to keep the patient ID (submitter_id), survival state (vital_status, the survival state of denoted by ), time of death of patients (days_to_death, the death time of denoted by ), and final contact time (days_to_last_follow_up, final contact time of denoted by ). Hence, the survival time training matrix was obtained. If the patient was in a state of death, he had only the time of death but no final contact time, and the final contact time was recorded as 0. On the contrary, if the patient was alive, he had only the final contact time but no death time, and the death time was recorded as 0. Therefore, the survival distribution coefficient matrix (if then ; else ) was constructed. Finally, the survival analysis matrix was obtained.

2.3. Abstracting the Issue

This study on lncRNA–disease associations was conducted from the following two aspects:

(a) A part of lncRNAs in the 481 lncRNAs, which were most closely related to prostate cancer, was screened out through the analysis of prognosis survival. Hence, a subset of (denoted by , ) was obtained, and set ( was the number of elements in , ). contained lncRNAs (denoted by ). The expression quantity of on was denoted by . Therefore, .

(b) The potential relationship between and was given using multiple statistical methods, and finally the prognosis prediction of lncRNA–disease associations was realized using predicting .

Definition 1 (predictive correlation factor ). was defined as , where corresponded to in . The value of predictive correlation factor was the coefficient of multiple linear regression . is shown in (1). For the prognosis prediction of lncRNA–disease associations, the formal definition was as follows. Two tasks needed to be completed while establishing : to calculate and to calculate .

2.4. Operation

Definition 2 (line class vector ). was a one-dimensional vector with 481 rows and 1 column (). corresponded to , and that was equal to 1 or 0. When was equal to 1, corresponding to was selected to . Otherwise, corresponding to was not selected to . was the number of components (i.e., 481). was the number of components, the value of which was 1.

Definition 3 (decay coefficient ). denoted the decay duration in operation. , which was the value of the iteration in operation, is shown in (2).

In (2), is the maximal iterations of operation and is the current iterations of operation. When takes the minimal value 1, is close to 1. When takes the maximal value , is 0.1. When takes the value between 1 and , is the reduction value between 1 and 0.1. The greater is, the greater the decay efficiency is.

Definition 4 ( operation). operation was divided into three stages. (a) The variation center of the iteration on was calculated according to (3). (b) The variation range of the iteration on was calculated according to (4). (c) Bitwise inversion was performed within the variation range (). Figure 1 shows the schematic of operation.

In (3) and (4), is the maximal iterations of operation and is the current iterations of operation.

Definition 5 ( correction). The components (their value was 1) in remained valid only within the top period in operation, and the rest were set to 0.

2.5. Stepwise

was performed on . was the set of . The insignificant component in went from 1 to 0 by . After executing Stepwise, was changed to . was the process of further subtracting and retaining the most important components. The execution of Stepwise went through the following six steps. (A temporary container with an initial value of empty and two marker variables and were defined. The initial values of both marker variables were 0.)

Step 1. A component of significant effect on was added to .

Step 2. Whether a new component was added to was judged; if true, then both and were set to 0 and Step 1 was performed; otherwise, was set to 1 and Step 3 was performed.

Step 3. Whether and was judged; if true, then Step 6 was performed; otherwise, Step 4 was performed.

Step 4. A component of insignificant effect on was removed from .

Step 5. Whether a new component was removed from was judged; if true, then both and were set to 0 and Step 4 was performed; otherwise, was set to 1 and Step 1 was performed.

Step 6. The components in , which was not in , were set to 0. The updated was set to , then and were the output.

2.6. Algorithm of Computing

The algorithm flow of computing was as follows. (Figure 2 shows the algorithm flowchart of computing .)

Step 1. Initialization was set. , , , , and . was the current best bulletin board. , the value of which was less than 10−12 in , was set to 1, and the rest was set to 0.

Step 2. operation was executed. If top components of were the same as top components of and , bitwise inversion was performed on the previous components of . After that, correction was performed.

Step 3. was executed. If was better than , then , and was updated.

Step 4. Whether the maximal iterations were reached was judged; if true, then Step 5 was performed; otherwise, Step 2 was performed.

Step 5. was set to .

Step 6. The multivariate linear regression analysis on was performed using . was set to the multiple linear regression coefficients.

Step 7. was the output.

3. Results and Discussion

3.1.

The calculation results of variation center in operation are shown in Figure 3. The figure shows that the variation center value of operation was evenly distributed in 10 iterations. Furthermore, the variation center value was scattered around each interval from 1 to 481. The variation center value had 5 points on each side of the center point (240). This ensured that the components of all 481 LncRNAs had an equal opportunity to perform variations. It was more beneficial to obtain the global optimal solution.

3.2.

Figure 4 shows the calculation results of decay coefficient . The variation operation proposed in this study aimed to enrich the diversity of sample space. Meanwhile, the variation operation should be dynamic rather than fixed. For the aforementioned issue, the decay coefficient was proposed to control the strength of the variation operation. Figure 4 shows that the decay coefficient decreased with the increase in the number of iterations. This was because the variation operation should be strong at the incipient iteration to obtain the global optimization ability. On the contrary, the variation operation should be weak at the late iteration to obtain the local development ability. It was not hard to see that the decay coefficient was the value of decreasing change between 1 and 0.1, and it controlled the lncRNA components that performed the bitwise inversion in 481 lncRNA components.

3.3. Computing

Table 1 shows the detailed calculation process of , including the variation position, interval, and result in the iteration. As shown in Table 1, the variation position and the interval distribution were relatively discrete, reducing the blind area of operation. In addition, the two factors proposed in the calculation to ensure optimal performance were significant differences in (denoted by ) and of (denoted by ), both of which were characterized by the better performance with a smaller value. Two factors should not be considered unilaterally but comprehensively. For example, if only was taken into account, of , which was the set of less than 10−12, was equal to 2249.24, as shown in Figure 5. Obviously, it was not an optimal solution but a poorer solution. Based on the aforementioned considerations, operation, which was proposed in this study, was a combination of and . In each variation process, the smaller part of performed the variation, rather than the whole. Finally, (=2208.47) was the optimal solution. Therefore, was equal to (i.e., the matrix built using the expression quantity of on , ). Then, a multivariate linear regression analysis was performed on , and the regression coefficients were obtained as required (Table 2). The ensemble transcript ID of X14 was ENST00000559477 in the Table 2. The ensemble transcript ID of X16 was ENST00000560882 in the Table 2. The test was performed on the regression model (the results are shown in Table 3). As was less than 0.0001, the regression model had the statistical significance. It also indicated that MlrLDAcp was feasible and effective.

The prediction model (MlrLDAcp) proposed in this study had two potential aspects:

(a) The survival of cancer patients was predicted by combining with the multiple linear regression model of MlrLDAcp.

(b) The association between lncRNAs and diseases was predicted using MlrLDAcp.

The performance of evaluation was expanded from the two aforementioned aspects.

3.4. Survival Predictive Ability

Receiver operating characteristic (ROC) analyses were performed to compare the predictive accuracies of prostate cancer samples between MlrLDAcp and Huang’s method [28] (the state-of-the-art method), to evaluate the survival predictive ability. The 5-year biochemical recurrence survivals of the two methods were compared between TCGA and lncRNAtor databases. Figure 6 shows the experimental results. The value of the area under the curve (AUC) was calculated from the corresponding area under the ROC curve. As shown in Figure 6, MlrLDAcp with an AUC value of 0.875 was better than Huang with an AUC value of 0.833. As a result, the prediction accuracy of 5-year biochemical recurrence survival in MlrLDAcp was improved by 4.2% (versus Huang). These results suggested that MlrLDAcp might have a predominant survival predictive ability.

3.5. Predictive Ability of lncRNA–Disease Associations

The leave-one-out cross validation (LOOCV) was implemented on the gold standard dataset to compare MlrLDAcp and two state-of-the-art methods: LRLSLDA [17] and KRWRH [26], to evaluate the predictive ability of lncRNA–disease associations. The datasets were divided into training sets () and test sets (). The known lncRNA–disease associations in were defined as (, and was the number of known lncRNA–disease associations). In each step of the LOOCV, each was implemented on and , and then the model learning was carried out on . The ROC curve plotted the sensitivity (that was true-positive rate ) versus the 1-specificity (that was false-positive rate ), where TP denoted true positives, FP denoted false positives, TN denoted true negatives, and FN denoted false negatives. The sensitivity was the ratio of positive samples which could be accurately distinguished, and the specificity represented the percentage of negative samples which could be correctly predicted. Figure 7 shows the experimental results. The value of AUC was calculated from the corresponding area under the ROC curve. As shown in Figure 7, MlrLDAcp with an AUC value of 0.872 was better than KRWRH with an AUC value of 0.838 and LRLSLDA with an AUC value of 0.822. As a result, the prediction accuracy of lncRNA–disease associations in MlrLDAcp increased by 3.4% (versus KRWRH) and 5.0% (versus LRLSLDA). These results suggested that MlrLDAcp might have a preferable ability to predict lncRNA–disease associations.

4. Conclusions

In this study, a model of MlrLDAcp was constructed. MlrLDAcp took the expression quantity of lncRNAs transcript as an independent variable and the clinical prognosis data as a dependent variable. Using MlrLDAcp, 60 lncRNAs, which were most closely related to cancer prognosis information (survival time), were selected from 481 alternative lncRNAs. MlrLDAcp could realize not only the cancer survival prediction but also the lncRNA–disease association prediction.

Further research directions about lncRNA–disease association prediction are as follows.

(a) The lncRNA–disease association prediction should take into account clinical prognostic data in future investigations. The lncRNAs associated with diseases may have a clinical value as therapeutic targets. Hence, the clinical prognostic data is quite valuable to lncRNA–disease association prediction. The clinical implications and the mechanism underlying the association of lncRNAs with diseases are definitely worth exploring further.

(b) How to build an effective computational model to construct an lncRNA similarity function, which can reasonably integrate the similarity scores of different biological information, is worthy of further research.

(c) With the increase in lncRNA–disease correlation, the prediction accuracy can be further improved. Furthermore, most computing models rely heavily on unobtainable negative samples, which is an urgent problem to be solved.

(d) The new network-based computing model should be implemented on heterogeneous networks instead of single networks. Hence, more heterogeneous networks, such as lncRNA–disease network, disease similarity network, lncRNA functional similarity network, and lncRNA interactive networks, should be integrated in the future.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by NSFC (2017-2020, No. 51679058), the funds for the 2013–2016 China Higher Specialized Research Fund (PhD supervisor category) (No. 20132304110018), and the Fundamental Research Funds in Heilongjiang Provincial Universities (No. 135109249, No. 135109241).