Computational and Mathematical Methods in Medicine

Volume 2018, Article ID 6747453, 12 pages

https://doi.org/10.1155/2018/6747453

## A Novel Approach for Predicting Disease-lncRNA Associations Based on the Distance Correlation Set and Information of the miRNAs

^{1}College of Information Engineering, Xiangtan University, Xiangtan 411105, China

^{2}Key Laboratory of Intelligent Computing & Information Processing, Xiangtan University, Xiangtan 411105, China

Correspondence should be addressed to Lei Wang; wanglei@xtu.edu.cn

Received 6 December 2017; Revised 4 April 2018; Accepted 17 April 2018; Published 26 June 2018

Academic Editor: Michele Migliore

Copyright © 2018 Haochen Zhao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Accumulating laboratory studies indicate that many long noncoding RNAs (lncRNAs) play important roles in various biological processes and are associated with many complex human diseases. Developing powerful computational models that predict lncRNA-disease associations from heterogeneous biological datasets is therefore important. However, few approaches calculate and analyze lncRNA-disease associations on the basis of miRNA information. In this article, a new computational method based on the distance correlation set is developed to predict lncRNA-disease associations (DCSLDA). Compared with existing state-of-the-art methods, the major novelty of DCSLDA lies in the introduction of the lncRNA-miRNA-disease network and the distance correlation set; thus DCSLDA can predict potential lncRNA-disease associations without requiring any known disease-lncRNA associations. Simulation results show that DCSLDA significantly outperforms previous models, achieving a reliable AUC of 0.8517 in leave-one-out cross-validation. Furthermore, when DCSLDA is used to prioritize candidate lncRNAs for three important cancers, 17 of the predicted associations in the top 0.5% of results are verified by independent studies and biological experiments. Hence, it is anticipated that DCSLDA could be a valuable addition to the biomedical research field.

#### 1. Introduction

For a long time, RNA was considered to be merely transcriptional noise and an intermediary between a DNA sequence and its encoded protein [1, 2]. However, sequence analyses show that more than 98% of the human genome does not encode protein sequences [3]. Furthermore, a growing number of experimental studies indicate that ncRNAs play important roles in numerous critical biological processes such as chromosome dosage compensation, epigenetic regulation, and cell growth [4]. In particular, lncRNAs, an important class of ncRNAs longer than 200 nucleotides [5], have been found to be associated with a wide range of human diseases, such as breast cancer [6], colorectal cancer [7], lung cancer [8], and cardiovascular diseases [9]. Hence, the search for novel disease-lncRNA associations has attracted many researchers and is considered one of the most active topics in disease and lncRNA research. Identifying disease-lncRNA associations can not only accelerate the understanding of complex human disease mechanisms at the lncRNA level but also support biomarker identification for human disease diagnosis, treatment, and prevention [10]. So far, many studies have generated a large amount of lncRNA-related biological data about sequence, expression, function, and so on [11–13]. However, compared with the rapidly increasing number of newly discovered lncRNAs, only a few lncRNA-disease associations have been reported. Hence, efficient computational approaches for predicting potential lncRNA-disease associations are urgently needed, although developing them is challenging. In recent years, several computational methods have been proposed to predict novel lncRNA-disease associations; by calculating the association probability of lncRNA-disease pairs, they can significantly decrease the time and cost of biological experiments. For example, Chen et al. 
presented the first prediction method (genomic locus based) and constructed a lncRNA-disease association database as well [14]. Liang et al. proposed a genetic mediator and key regulator model to unveil the subtle relationships between lncRNAs and lung cancer. Liu et al. developed a computational framework that infers lncRNA-disease associations by combining human lncRNA expression profiles, gene expression profiles, and human disease-associated gene data, and applied this framework to available human long intergenic noncoding RNA (lincRNA) expression data. Chen et al. developed a semi-supervised learning method based on the framework of Laplacian Regularized Least Squares, LRLSLDA, to infer potential lncRNA-disease associations; it did not need negative samples and obtained a reliable AUC of 0.7760 in leave-one-out cross-validation [15]. In 2014, Sun et al. constructed a lncRNA functional similarity network and applied random walk with restart (RWR) to infer potential lncRNA-disease associations [16]. In the same year, Li et al. presented a bioinformatics method based on genomic location to predict the lncRNAs associated with vascular disease [17]. Then, Zhao et al. developed a computational method based on the naïve Bayesian classifier to identify cancer-related lncRNAs by integrating genome, regulome, and transcriptome data [18]. In 2015, Zhou et al. proposed a novel rank-based method named RWRHLDA to prioritize candidate lncRNA-disease associations; it integrated a miRNA-associated lncRNA-lncRNA crosstalk network, a disease-disease similarity network, and the known lncRNA-disease association network into a heterogeneous network and ran a random walk with restart on that network [19].

Despite the advent of many biological datasets, such as LncRNADisease [14], lncRNAdb [20], and NONCODE [13], the number of known lncRNA-disease associations is still very limited. In 2015, Chen et al. developed a method named HGLDA, based on miRNA information [21], which predicted lncRNA-disease associations by integrating disease-miRNA associations with lncRNA-miRNA interactions and did not rely on known lncRNA-disease associations. In contrast to HGLDA, this article develops a novel model based on the distance correlation set, which predicts potential lncRNA-disease associations by integrating known miRNA-disease associations collected from the HMDD database [22] with known miRNA-lncRNA associations collected from the starBase database [23]. Compared with HGLDA, the advantage of DCSLDA lies in the introduction of the similarity of disease pairs and lncRNA pairs and of the distance correlation set. In addition, to optimize the prediction performance of DCSLDA, new methods for calculating the similarity of disease-disease pairs and lncRNA-lncRNA pairs are developed. To evaluate the prediction performance of DCSLDA, leave-one-out cross-validation (LOOCV) is implemented separately on the known lncRNA-disease associations and the known lncRNA-cancer associations. Simulation results demonstrate that DCSLDA is superior to the state-of-the-art methods and achieves a reliable AUC of 0.8517 in LOOCV when the pregiven threshold parameter is set to 6. To further evaluate DCSLDA, case studies of breast cancer, colorectal cancer, and lung cancer are carried out; among the top 0.5% of predictions, 9, 6, and 2 potential associations, respectively, are confirmed by recent experimental reports. 
Hence, given this prediction performance, DCSLDA can serve as a useful and efficient computational tool for biomedical research.

#### 2. Materials and Methods

##### 2.1. Disease-miRNA Associations

We downloaded known disease-miRNA associations from the Human MicroRNA Disease Database (HMDD) in July 2017 (see Supplementary file 1); the download comprised 10381 experimentally verified disease-miRNA associations covering 572 miRNAs and 383 diseases. After merging miRNAs that produce the same mature miRNA and eliminating duplicates, we obtained *dataset1*, which contains 5430 disease-miRNA associations between 383 human diseases and 495 miRNAs. Let *N*_{D} be the number of different diseases and *M1* be the number of different miRNAs in *dataset1*, let *D* = {*d*_{1}, *d*_{2}, …, *d*_{N_D}} represent the set of these *N*_{D} diseases, and let *S*_{M1} = {*m1*_{1}, *m1*_{2}, …, *m1*_{M1}} represent the set of these *M1* miRNAs; then for any given *d*_{i} ∈ *D* and *m1*_{j} ∈ *S*_{M1}, we can define the *Association Strong Correlation* (*ASC1*) between *d*_{i} and *m1*_{j} as follows:
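As a rough illustration of this preprocessing, the Python sketch below merges miRNA names through a caller-supplied alias table, removes duplicate records, and fills a 0/1 matrix. It assumes that *ASC1* behaves as a binary indicator (1 when a disease-miRNA pair is recorded in *dataset1*, 0 otherwise). The function name `build_association_matrix`, the `alias` mapping, and the toy records are hypothetical and are not taken from the original pipeline.

```python
def build_association_matrix(pairs, alias=None):
    """Build a binary disease-by-miRNA matrix from (disease, miRNA) records.

    `alias` is a hypothetical mapping from a precursor name to the name
    used after merging; names missing from it are kept unchanged.
    """
    alias = alias or {}
    # Merging names and collecting into a set removes duplicate records.
    merged = {(disease, alias.get(mirna, mirna)) for disease, mirna in pairs}

    diseases = sorted({d for d, _ in merged})
    mirnas = sorted({m for _, m in merged})
    row = {d: i for i, d in enumerate(diseases)}
    col = {m: j for j, m in enumerate(mirnas)}

    # Assumed indicator: 1 for a recorded association, 0 otherwise.
    asc = [[0] * len(mirnas) for _ in diseases]
    for d, m in merged:
        asc[row[d]][col[m]] = 1
    return diseases, mirnas, asc


# Toy records (illustrative only, not real HMDD rows).
records = [
    ("breast cancer", "hsa-mir-133a-1"),
    ("breast cancer", "hsa-mir-133a-2"),
    ("lung cancer", "hsa-mir-21"),
    ("lung cancer", "hsa-mir-21"),
]
alias = {"hsa-mir-133a-1": "hsa-mir-133a", "hsa-mir-133a-2": "hsa-mir-133a"}
diseases, mirnas, asc = build_association_matrix(records, alias)
```

A dense list of lists is adequate at this scale (383 by 495); a sparse representation would be a reasonable alternative for larger datasets.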

##### 2.2. miRNA-lncRNA Associations

We downloaded the known miRNA-lncRNA associations from starBase v2.0 in July 2017; this resource provided the most comprehensive set of experimentally confirmed lncRNA-miRNA interactions based on large-scale CLIP-seq data. After data preprocessing (elimination of duplicate values, erroneous data, disorganized data, and so on), we obtained *dataset2*, which contains 10195 lncRNA-miRNA associations between 275 miRNAs and 1127 lncRNAs (see Supplementary file 2). Let *M2* be the number of different miRNAs and *N*_{L} be the number of different lncRNAs in *dataset2*, let *S*_{M2} = {*m2*_{1}, *m2*_{2}, …, *m2*_{M2}} represent the set of these *M2* miRNAs, and let *L* = {*l*_{1}, *l*_{2}, …, *l*_{N_L}} represent the set of these *N*_{L} lncRNAs; then, for any given *m2*_{i} ∈ *S*_{M2} and *l*_{j} ∈ *L*, we can define the *ASC2* between *m2*_{i} and *l*_{j} as follows:
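Because both datasets are keyed by miRNA, they can be joined on the miRNAs they share. The Python sketch below counts, for each lncRNA-disease pair, how many miRNAs link the two. It only illustrates the join and is not the DCSLDA scoring itself; the function name `shared_mirna_counts` and the toy identifiers are hypothetical, and the sketch assumes miRNA names in both inputs already follow the same naming convention.

```python
from collections import defaultdict


def shared_mirna_counts(disease_mirna, mirna_lncrna):
    """Count the miRNAs that connect each (lncRNA, disease) pair.

    disease_mirna: iterable of (disease, miRNA) records.
    mirna_lncrna:  iterable of (miRNA, lncRNA) records.
    """
    diseases_of = defaultdict(set)
    for disease, mirna in disease_mirna:
        diseases_of[mirna].add(disease)

    lncrnas_of = defaultdict(set)
    for mirna, lncrna in mirna_lncrna:
        lncrnas_of[mirna].add(lncrna)

    # Only miRNAs present in both datasets can link a lncRNA to a disease.
    linking = defaultdict(set)
    for mirna in diseases_of.keys() & lncrnas_of.keys():
        for lncrna in lncrnas_of[mirna]:
            for disease in diseases_of[mirna]:
                linking[lncrna, disease].add(mirna)

    return {pair: len(mirnas) for pair, mirnas in linking.items()}


# Toy records (illustrative identifiers only).
toy_disease_mirna = [("d1", "mA"), ("d1", "mB"), ("d2", "mB"), ("d2", "mC")]
toy_mirna_lncrna = [("mA", "l1"), ("mB", "l1"), ("mD", "l2")]
counts = shared_mirna_counts(toy_disease_mirna, toy_mirna_lncrna)
```

In the toy data, lncRNA `l2` never appears in the result because its only miRNA, `mD`, has no disease record.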

##### 2.3. lncRNA-Disease Associations

To evaluate the performance of DCSLDA, the latest lncRNA-disease associations were downloaded from the LncRNADisease database, which integrated more than 1000 lncRNA-disease entries and 475 lncRNA interaction entries, covering 321 lncRNAs and 221 diseases from ~500 publications. From this dataset we removed duplicate associations as well as any association whose disease or lncRNA was not contained in *dataset1* or *dataset2*, which left 203 high-quality lncRNA-disease associations (see Supplementary file 3).
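The filtering step can be sketched in Python as follows, reading the criterion as: an association is kept only when its disease occurs in *dataset1* and its lncRNA occurs in *dataset2*. The function name `filter_benchmark` and the toy records are hypothetical, and the sketch assumes identifiers have already been normalized to a common spelling, which the text does not describe.

```python
def filter_benchmark(lnc_disease_pairs, dataset1_diseases, dataset2_lncrnas):
    """Drop duplicates and pairs that fall outside the two reference sets.

    lnc_disease_pairs: iterable of (lncRNA, disease) records.
    dataset1_diseases: set of disease names present in dataset1.
    dataset2_lncrnas:  set of lncRNA names present in dataset2.
    """
    seen = set()
    kept = []
    for lncrna, disease in lnc_disease_pairs:
        pair = (lncrna, disease)
        if pair in seen:
            continue  # duplicate association
        seen.add(pair)
        if disease in dataset1_diseases and lncrna in dataset2_lncrnas:
            kept.append(pair)
    return kept


# Toy records (illustrative only, not real LncRNADisease rows).
raw = [
    ("H19", "breast cancer"),
    ("H19", "breast cancer"),      # duplicate
    ("MALAT1", "lung cancer"),
    ("XIST", "rare disease"),      # disease missing from dataset1
    ("UNKNOWN1", "lung cancer"),   # lncRNA missing from dataset2
]
kept = filter_benchmark(
    raw,
    dataset1_diseases={"breast cancer", "lung cancer"},
    dataset2_lncrnas={"H19", "MALAT1", "XIST"},
)
```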

##### 2.4. Disease Functional Similarity Based on miRNAs

To calculate the functional similarity between diseases, we borrowed a concept from social network analysis. In a social network, the similarity between any two nodes can be calculated by comparing and integrating the similarities of the nodes associated with them. In this section, based on the assumption that similar diseases tend to show a similar pattern of interaction and noninteraction with miRNAs, we calculated the disease similarity in the disease-miRNA interactive network. As illustrated in Figure 1, the calculation of disease functional similarity based on miRNAs includes three steps.

First, we constructed the miRNA-disease interactive network from the known miRNA-disease associations (*dataset1*). Its topology can be abstracted as an undirected graph *G*_{1} = (*V*_{1}, *E*_{1}), where *V*_{1} is the set of vertices (the diseases and miRNAs in *dataset1*) and *E*_{1} is the set of edges; for any two nodes *a*, *b* ∈ *V*_{1}, there is an edge between *a* and *b* in *E*_{1} if and only if *a* is a disease, *b* is a miRNA, and the two are associated in *dataset1*.

Second, since different miRNAs in *dataset1* may relate to different numbers of diseases, it is not suitable to assign the same contribution value to every miRNA. Hence, we define the contribution value of each miRNA as follows:

Finally, we defined the functional similarity between diseases *d*_{i} and *d*_{j} by integrating the miRNAs related to *d*_{i}, *d*_{j}, or both of them as follows:

where *FSD* is the disease functional similarity matrix calculated based on miRNAs, and the two edge counts are the numbers of edges in *E*_{1} related to *d*_{i} and to *d*_{j}, respectively. Figure 1 includes a worked example of this calculation.
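To make the idea of miRNA-weighted disease similarity concrete, the Python sketch below uses one plausible weighting. It is an assumption for illustration, not the formulas of this article: each miRNA linked to *k* diseases receives contribution 1/*k*, and the similarity of two diseases is the summed contribution of their shared miRNAs divided by the summed contribution of all miRNAs related to either disease (a weighted Jaccard index). The names `mirna_contribution` and `disease_similarity` are hypothetical.

```python
from collections import defaultdict


def mirna_contribution(pairs):
    """Assumed weighting: a miRNA linked to k diseases contributes 1/k."""
    linked = defaultdict(set)
    for disease, mirna in pairs:
        linked[mirna].add(disease)
    return {mirna: 1.0 / len(ds) for mirna, ds in linked.items()}


def disease_similarity(pairs):
    """Weighted-Jaccard similarity between diseases (illustrative only)."""
    contrib = mirna_contribution(pairs)
    neighbours = defaultdict(set)
    for disease, mirna in pairs:
        neighbours[disease].add(mirna)

    fsd = {}
    for a in neighbours:
        for b in neighbours:
            shared = neighbours[a] & neighbours[b]
            either = neighbours[a] | neighbours[b]
            # Every disease here has at least one miRNA, so the
            # denominator is always positive.
            fsd[a, b] = (sum(contrib[m] for m in shared)
                         / sum(contrib[m] for m in either))
    return fsd


# Toy network: d1-{m1, m2}, d2-{m2, m3}, d3-{m3}.
edges = [("d1", "m1"), ("d1", "m2"), ("d2", "m2"), ("d2", "m3"), ("d3", "m3")]
fsd = disease_similarity(edges)
```

Down-weighting miRNAs that are linked to many diseases keeps widely shared miRNAs from dominating the score; other weightings are equally possible under the same assumption.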