Abstract

Aberrant expression of microRNAs (miRNAs) can be used for the diagnosis, prognosis, and treatment of human diseases. Identifying the relationships between miRNAs and human diseases is important for further investigating disease pathogenesis. However, experimental identification of the associations between diseases and miRNAs is time-consuming and expensive. Computational methods are efficient approaches to determine the potential associations between diseases and miRNAs. This paper presents a new computational method based on the SimRank and density-based clustering recommender model for miRNA-disease association prediction (SRMDAP). An AUC of 0.8838 in leave-one-out cross-validation, together with case studies, indicated the strong performance of SRMDAP in predicting miRNA-disease associations. SRMDAP can also make predictions for diseases without any known related miRNAs and for miRNAs without any known related diseases.

1. Introduction

MicroRNAs (miRNAs) are small endogenous noncoding RNAs approximately 22 nt long. Since the discovery of the first two miRNAs, lin-4 and let-7, thousands of miRNAs have been identified in eukaryotic cells [1, 2]. A series of studies have shown that miRNAs play an important role in many biological processes, such as cell growth and apoptosis, proliferation, differentiation, and signal transduction [3–6]. Given that miRNAs are involved in the normal function of cells, aberrant miRNA expression has been associated with many types of human diseases, ranging from common diseases to cancers [7–9]. Therefore, identifying disease-related miRNAs helps clarify the molecular mechanisms of disease pathogenesis, supports diagnosis, and can improve treatment and prevention.

To date, many biological experimentations have been performed to determine a large number of miRNA-disease associations. Many studies have built databases, such as HMDD [10], miR2Disease [11], dbDEMC [12], miRCancer [13], and PhenomiR [14], to serve as a solid data foundation for predicting miRNA-disease associations. HMDD is a database manually retrieved from the literature [10]. The latest version is HMDD v2.0, which integrates 10,368 miRNA-disease associations of approximately 572 miRNA genes and 378 diseases from 3,511 papers. MiR2Disease documents 1,939 manually curated miRNA-disease associations between 299 human miRNAs and 94 human diseases [11]. The dbDEMC stores differentially expressed miRNAs in human cancers obtained from microarray data [12]. The updated version dbDEMC 2.0 contains 2,224 differentially expressed miRNAs in 36 cancer types [15]. The miRCancer stores miRNA-cancer associations obtained by text mining method [13]. PhenomiR provides information about differentially regulated miRNA expression in diseases and other biological processes [14].

However, using experimental methods to identify the disease-related miRNAs is time-consuming and costly. Based on existing data, computational methods have been developed as a valuable supplement to the experimental methods to save experimental time and cost. Computational methods can calculate and rank the similarity scores of all miRNAs for a given disease. Top-ranked miRNAs are treated as the most promising candidate disease miRNAs for further experimental studies. Similarity calculation is the key issue in computational methods [16]. According to how the similarity score is calculated, most computational methods fall into two categories [17, 18], namely, network-based methods [19–28] and machine-learning-based methods [24, 29–34]. Network-based methods predict miRNA-disease associations under the hypothesis that miRNAs with similar functions tend to be associated with phenotypically similar diseases [10]. Jiang et al. [19] constructed a human phenome-miRNAome functional association network and used a hypergeometric distribution scoring system to select candidate disease miRNAs. However, high prediction accuracy may not be obtained, because only the local information of each miRNA is used and the method depends strongly on predicted miRNA-target interactions. Chen et al. [21] adopted global network similarity measures and developed RWRMDA to infer the associations between diseases and miRNAs by implementing random walk on the miRNA-miRNA functional similarity network. Based on the weighted k most similar neighbors, Xuan et al. [22] proposed HDMP to infer disease-related miRNAs. HDMP evaluates miRNA functional similarity by incorporating the information content of disease terms, disease phenotype similarity, and weight information of the miRNA family or cluster. However, RWRMDA and HDMP cannot be applied to diseases without any known related miRNAs. Based on social network analysis, Zou et al. [24] proposed the KATZ method, which computes the similarity score from walks of different lengths between the miRNA and disease nodes. However, KATZ performs relatively poorly when known associations are sparse. Gu et al. [25] calculated miRNA similarity and disease similarity from known miRNA-disease associations through the Jaccard similarity measure. They combined this association-based miRNA similarity with miRNA functional similarity and miRNA family information to construct a miRNA similarity network, and used the association-based disease similarity to construct a disease similarity network. Then, they applied a network consistency projection method to predict the disease-related miRNAs.

Machine-learning-based methods extract features from data to initially obtain effective features of miRNAs and diseases and then utilize machine learning models to predict miRNA-disease associations. Jiang et al. [29] showed a support vector machine (SVM) classifier method by integrating the feature vectors of miRNA-target and phenotype similarity. Xu et al. [31] introduced an approach based on the miRNA-target-dysregulated network to prioritize novel disease miRNAs. This method also constructs a support vector machine classifier based on the features and changes in miRNA expression. However, these two computational methods are mainly limited by the difficulty or impossibility of obtaining negative training samples, and this drawback would largely influence the predictive accuracy. To solve this problem, Chen and Yan [30] developed a semisupervised method of regularized least squares for miRNA-disease association (RLSMDA). RLSMDA integrates known disease-miRNA associations, disease similarity dataset, and miRNA functional similarity network to infer potential disease-related miRNAs. The main drawback of RLSMDA is the intricate adjustment of parameters. Xiao et al. [35] used graph-regularized nonnegative matrix factorization framework to predict potential miRNA-disease associations using weighted nearest neighbor profiles to incorporate miRNA similarity and disease matrices. Chen et al. [34] presented a computational method DRMDA based on stacked autoencoder, greedy layer-wise unsupervised pretraining algorithm and SVM, and this method was implemented to predict potential miRNA-disease associations. However, DRMDA results are not highly accurate, because of the difficulty in obtaining negative samples and optimizing the complex parameters.

Similarity calculation mainly considers miRNA-miRNA similarity measurement. Several computational methods use the known miRNA-disease associations in calculating miRNA-miRNA similarity [19–26, 29, 30]. In these methods, miRNA-miRNA similarity is derived from disease-disease similarity and known experimental miRNA-disease associations. However, these methods may overestimate predictive accuracy. Because the miRNA-miRNA similarity depends heavily on the known miRNA-disease associations, cross-validation is not performed correctly unless the known information of the tested element is removed from the similarity calculation at each round, and these methods fail to do so. Other limitations include the inability to predict isolated miRNAs and the lack of disease semantic similarity [36]. An isolated miRNA is a miRNA with no known associated disease; that is, no known relationship exists between this miRNA and any disease. Thus, miRNA-disease associations cannot be used to calculate the similarity of an isolated miRNA. Instead of using experimentally verified miRNA-disease associations, other computational methods calculate miRNA similarity using the interactions of miRNAs with other biomolecules [31, 36–38]. For example, Liu et al. [36] calculated miRNA similarity using miRNA-target gene and miRNA-long noncoding RNA associations. However, the predictive performance of these methods is limited.

Based on the assumption that miRNAs with similar functions are normally associated with phenotypically similar diseases and vice versa, we addressed the aforementioned limitations by establishing a novel computational method based on SimRank [39] and a density-based clustering [40] recommender model for miRNA-disease association prediction (SRMDAP). SRMDAP constructs a miRNA similarity subnetwork by using SimRank to calculate the network topological similarity between miRNAs on the miRNA-messenger RNA (mRNA) interaction network. The disease similarity subnetwork is constructed in the same way from the disease-gene network. Then, SRMDAP uses the density-based clustering recommender model to integrate the miRNA similarity subnetwork, the disease similarity subnetwork, and experimentally verified miRNA-disease associations to predict potential associations between miRNAs and diseases. In this work, a leave-one-out cross-validation experiment and case studies of two important cancers, namely, kidney and colorectal neoplasms, indicated the strong predictive performance of SRMDAP. SRMDAP can also make predictions for isolated diseases and isolated miRNAs.

2. Methods

2.1. Data

Three datasets were used in our approach. Experimentally verified miRNA-mRNA interactions were downloaded from the miRTarBase database to construct the miRNA similarity network [41] (http://mirtarbase.mbc.nctu.edu.tw/, Release 6.0: Sept-15-2015). Meanwhile, experimentally verified disease-related mRNAs were downloaded from the DisGeNET database [42] (http://www.disgenet.org/web/DisGeNET/menu/home, DisGeNET 4.0: October 2016) to construct a disease similarity network. Experimentally verified miRNA-disease network was downloaded from the HMDD v2.0 database [43] (http://www.cuilab.cn/hmdd, Jun-14-2014 Version).

2.2. Data Processing
2.2.1. MiRNA-Disease Association Network

The disease names of the DisGeNET and HMDD databases were mapped to the MeSH description (https://www.ncbi.nlm.nih.gov/mesh). Diseases in the HMDD database not found in the DisGeNET database and repeated associations were removed. Then, we obtained 5,048 known miRNA-disease associations, including 475 miRNAs and 334 diseases, as the benchmark dataset. Formally, we denoted the miRNA set as $M = \{m_1, m_2, \ldots, m_{n_m}\}$ and the disease set as $D = \{d_1, d_2, \ldots, d_{n_d}\}$. The variables $n_m$ and $n_d$ denote the number of miRNAs and diseases, respectively. Matrix $A$ of size $n_m \times n_d$ represents the adjacency matrix of miRNA-disease associations. $A(i,j) = 1$ denotes that miRNA $m_i$ is associated with disease $d_j$; otherwise, $A(i,j) = 0$.
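As a minimal sketch of the adjacency matrix described above, the following builds a 0/1 matrix from a list of (miRNA, disease) pairs and collapses repeated associations. The function name `build_adjacency` and the three-by-two toy data are illustrative assumptions, not the benchmark dataset.

```python
def build_adjacency(mirnas, diseases, associations):
    """Return a len(mirnas) x len(diseases) 0/1 matrix of known associations."""
    m_index = {m: i for i, m in enumerate(mirnas)}
    d_index = {d: j for j, d in enumerate(diseases)}
    A = [[0] * len(diseases) for _ in mirnas]
    for m, d in associations:
        # Repeated pairs simply set the same entry to 1 again.
        A[m_index[m]][d_index[d]] = 1
    return A

# Hypothetical toy input (not taken from HMDD).
mirnas = ["mir-a", "mir-b", "mir-c"]
diseases = ["disease-x", "disease-y"]
pairs = [
    ("mir-a", "disease-x"),
    ("mir-a", "disease-y"),
    ("mir-b", "disease-x"),
    ("mir-b", "disease-x"),  # duplicate association
]
A = build_adjacency(mirnas, diseases, pairs)
```

In this toy case `mir-c` has an all-zero row, which is what the text later calls an isolated miRNA.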

2.2.2. MiRNA Similarity Network

SimRank [39] was employed to calculate the disease and miRNA similarities based on the miRNA-mRNA interaction network and disease-related mRNA associations. SimRank is a model that measures the degree of similarity between any two objects on the basis of graph topology, and it has been successfully applied to web page ranking [44], recommender systems [45], outlier detection [46], network graph clustering [47], and approximate query processing [48], among others. The SimRank model defines the similarity of two nodes recursively: two nodes are similar when the nodes pointing to them are similar. SimRank defines the similarity of two nodes as follows:

$$s(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s\big(I_i(a), I_j(b)\big) \tag{1}$$

where $s(a,b)$ is the similarity between nodes $a$ and $b$ and $C$ is a decay factor. $I(a)$ denotes the set of all nodes that point to node $a$, and $|I(a)|$ is the number of elements of $I(a)$.
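The recursive definition can be illustrated with a small fixed-point iteration. This is a sketch of the standard SimRank update rather than the authors' implementation; the decay factor of 0.8, the iteration count, and the toy miRNA-mRNA graph (treated as undirected, so each node's in-neighbors are its partners) are assumptions made for illustration.

```python
def simrank(in_neighbors, C=0.8, iterations=10):
    """Iterate the SimRank recursion over all node pairs."""
    nodes = list(in_neighbors)
    # Start from the identity: a node is fully similar only to itself.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                elif not in_neighbors[a] or not in_neighbors[b]:
                    new[a][b] = 0.0
                else:
                    total = sum(sim[x][y]
                                for x in in_neighbors[a]
                                for y in in_neighbors[b])
                    new[a][b] = C * total / (
                        len(in_neighbors[a]) * len(in_neighbors[b]))
        sim = new
    return sim

# Hypothetical graph: m1 and m2 share target g1; m3 targets only g2.
graph = {
    "m1": ["g1"], "m2": ["g1"], "m3": ["g2"],
    "g1": ["m1", "m2"], "g2": ["m3"],
}
sim = simrank(graph)
```

Here `m1` and `m2` share their only target, so their similarity equals the decay factor, while `m3` lies in a separate component and stays at zero similarity to both.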

The adjacency matrix of the miRNA-mRNA interaction bipartite network is represented as $A_{mg}$, where the entry in row $i$ and column $j$ is 1 if miRNA $m_i$ is associated with mRNA $g_j$, and 0 otherwise. The matrix $A_{mg}$ is normalized by column to determine the matrix $W_m$, and the similarity matrix can be calculated as follows:

$$S_m = C \cdot \left(W_m^{T} S_m W_m\right) + (1 - C) \cdot I \tag{2}$$

where $S_m$ is the miRNA similarity matrix and $S_m(i,j)$ is the similarity between miRNAs $m_i$ and $m_j$. $W_m^{T}$ is the transpose matrix of $W_m$, $C$ is a decay factor, and $I$ is the unit matrix.
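The matrix form can be sketched as a repeated update of the similarity matrix using a column-normalized adjacency matrix, its transpose, a decay factor, and the unit matrix. Several choices below are assumptions: the bipartite network is embedded in one square adjacency matrix over all miRNA and mRNA nodes so that the products are dimensionally consistent, the decay factor is 0.8, and the iteration starts from the unit matrix. The source does not state how these details were handled.

```python
def column_normalize(adj):
    """Divide each column by its sum (all-zero columns stay zero)."""
    n = len(adj)
    col_sums = [sum(adj[r][c] for r in range(n)) for c in range(n)]
    return [[adj[r][c] / col_sums[c] if col_sums[c] else 0.0
             for c in range(n)] for r in range(n)]

def transpose(X):
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def simrank_matrix(adj, C=0.8, iterations=20):
    """Iterate S <- C * (W^T S W) + (1 - C) * I."""
    n = len(adj)
    W = column_normalize(adj)
    Wt = transpose(W)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    S = identity
    for _ in range(iterations):
        WSW = matmul(matmul(Wt, S), W)
        S = [[C * WSW[i][j] + (1 - C) * identity[i][j] for j in range(n)]
             for i in range(n)]
    return S

# Hypothetical node order: m1, m2, m3, g1, g2 (edges m1-g1, m2-g1, m3-g2).
adj = [
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
]
S = simrank_matrix(adj)
S_mirna = [row[:3] for row in S[:3]]  # miRNA-miRNA block
```

Unlike the pairwise recursion, this update does not pin the diagonal to 1, so the resulting values are best read as relative similarities.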

2.2.3. Disease Similarity Network

We can obtain the similarity matrix of diseases using the same process as for the miRNA similarity network. The adjacency matrix of the disease-gene network is represented as $A_{dg}$, where the entry in row $i$ and column $j$ is 1 if disease $d_i$ is associated with gene $g_j$, and 0 otherwise. Matrix $A_{dg}$ is normalized by column to obtain the matrix $W_d$, and the similarity matrix can be calculated as follows:

$$S_d = C \cdot \left(W_d^{T} S_d W_d\right) + (1 - C) \cdot I \tag{3}$$

where $S_d$ is the disease similarity matrix and $S_d(i,j)$ is the similarity between diseases $d_i$ and $d_j$. $W_d^{T}$ is the transpose matrix of $W_d$, $C$ is a decay factor, and $I$ is the unit matrix. A simple example of constructing miRNA and disease similarity is provided in Figure 1.

2.3. Prediction Method

In this work, a density-based clustering recommendation model is developed based on the miRNA and disease similarity network to predict potential miRNA-disease associations. The flowchart of SRMDAP is shown in Figure 2.

For example, the calculation for predicting the association of miRNA $m_i$ and disease $d_j$ is as follows. First, given the assumption that miRNAs with similar functions are normally associated with phenotypically similar diseases and vice versa [10, 49], the closer the neighbors of miRNA $m_i$ are to disease $d_j$, the closer miRNA $m_i$ will be to disease $d_j$ in the miRNA similarity network. Using miRNA $m_i$ as the cluster center and a greedy method, we added the most similar neighbor nodes to form new clusters until the cluster density no longer increased. The cluster density of cluster $c$ is defined as follows:

$$f(c) = \frac{W_{in}(c)}{W_{in}(c) + W_{out}(c) + \lambda\,|c|} \tag{4}$$

where $W_{in}(c)$ and $W_{out}(c)$ denote the sum of the weights of the inner and external edges of cluster $c$, respectively [50]. The term $\lambda\,|c|$ is a penalty item, and $|c|$ is the number of members of cluster $c$. In our experiments, $\lambda$ was set to a fixed value. Then, using $N(m_i)$, which denotes the closest neighbors of miRNA $m_i$, the predictive score between miRNA $m_i$ and disease $d_j$ is calculated as follows:

$$P_m(m_i, d_j) = \sum_{m_k \in N(m_i)} S_m(m_i, m_k)\, A(m_k, d_j) \tag{5}$$

where $P_m(m_i, d_j)$ is the predictive score between miRNA $m_i$ and disease $d_j$ calculated by the neighbors of miRNA $m_i$; $S_m(m_i, m_k)$ is the similarity of miRNA $m_i$ and miRNA $m_k$; and $A(m_k, d_j)$ is the association between miRNA $m_k$ and disease $d_j$. Equation (5) calculates the predictive score based on the nearest neighbors of miRNA $m_i$ and the associations between the neighbors and disease $d_j$.
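The greedy cluster growth can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: density is taken as inner weight divided by the sum of inner weight, external weight, and a size penalty; candidates are tried in decreasing order of similarity to the seed; and the penalty coefficient of 0.1 and the five-node similarity matrix are hypothetical.

```python
def cluster_density(S, cluster, penalty):
    """Inner weight / (inner + external weight + penalty * cluster size)."""
    inside = set(cluster)
    n = len(S)
    w_in = sum(S[a][b] for a in inside for b in inside if a < b)
    w_out = sum(S[a][b] for a in inside for b in range(n) if b not in inside)
    return w_in / (w_in + w_out + penalty * len(inside))

def grow_cluster(S, seed, penalty=0.1):
    """Add the most similar nodes to the seed while density keeps rising."""
    cluster = [seed]
    density = cluster_density(S, cluster, penalty)
    candidates = sorted((x for x in range(len(S)) if x != seed),
                        key=lambda x: -S[seed][x])
    for candidate in candidates:
        new_density = cluster_density(S, cluster + [candidate], penalty)
        if new_density <= density:
            break  # density no longer increases: stop growing
        cluster.append(candidate)
        density = new_density
    return cluster

# Hypothetical symmetric similarity matrix: nodes 0-2 are tightly linked,
# node 3 is weakly linked to them and strongly linked to node 4.
S = [
    [0.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 0.0, 0.7, 0.1, 0.0],
    [0.8, 0.7, 0.0, 0.1, 0.0],
    [0.1, 0.1, 0.1, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.9, 0.0],
]
cluster = grow_cluster(S, seed=0)
neighbors_of_seed = cluster[1:]
```

With these numbers the density rises from 0 to about 0.32 and then 0.80 as nodes 1 and 2 join, and would fall to about 0.68 if node 3 were added, so growth stops with nodes 1 and 2 as the seed's closest neighbors. A positive penalty also keeps the denominator nonzero for a single isolated node.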

Second, in the same way, based on the assumption that diseases with similar functions often have similar semantic descriptions and vice versa [20], the closer the neighbors of disease $d_j$ are to miRNA $m_i$, the closer disease $d_j$ will be to miRNA $m_i$ in the disease similarity network; the predictive score between miRNA $m_i$ and disease $d_j$ is calculated as follows:

$$P_d(m_i, d_j) = \sum_{d_l \in N(d_j)} S_d(d_j, d_l)\, A(m_i, d_l) \tag{6}$$

where $N(d_j)$ is the set of closest neighbors of disease $d_j$.

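Finally, the final predictive score between miRNA $m_i$ and disease $d_j$ is calculated by integrating $P_m(m_i, d_j)$ and $P_d(m_i, d_j)$ as follows:

$$P(m_i, d_j) = \alpha\, P_m(m_i, d_j) + (1 - \alpha)\, P_d(m_i, d_j) \tag{7}$$

where $\alpha$ is an integration parameter to balance the contributions from miRNA and disease similarities. The entry of $P$ in row $i$ and column $j$ is the prediction value of miRNA $m_i$ to disease $d_j$.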
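The two neighbor-based scores and their integration can be sketched as below. The exact functional form is an assumption: each score is taken as a similarity-weighted sum over neighbor associations without normalization, and the integration parameter is assumed to weight the miRNA-side score. The value 0.4 is the one the text reports as best; the matrices and neighbor lists are hypothetical toy data.

```python
def mirna_side_score(S_m, A, neighbors_m, i, j):
    """Score for (miRNA i, disease j) from the neighbors of miRNA i."""
    return sum(S_m[i][k] * A[k][j] for k in neighbors_m[i])

def disease_side_score(S_d, A, neighbors_d, i, j):
    """Score for (miRNA i, disease j) from the neighbors of disease j."""
    return sum(S_d[j][l] * A[i][l] for l in neighbors_d[j])

def final_score(S_m, S_d, A, neighbors_m, neighbors_d, i, j, alpha=0.4):
    """Weighted combination of the two neighbor-based scores."""
    p_m = mirna_side_score(S_m, A, neighbors_m, i, j)
    p_d = disease_side_score(S_d, A, neighbors_d, i, j)
    return alpha * p_m + (1 - alpha) * p_d

# Hypothetical toy data: 3 miRNAs, 2 diseases.
S_m = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]
S_d = [[1.0, 0.5], [0.5, 1.0]]
A = [[1, 0], [1, 1], [0, 0]]
neighbors_m = {0: [1], 1: [0], 2: [1]}  # e.g. taken from cluster growth
neighbors_d = {0: [1], 1: [0]}

score = final_score(S_m, S_d, A, neighbors_m, neighbors_d, i=0, j=1)
```

For miRNA 0 and disease 1, the miRNA side contributes 0.8 (its neighbor miRNA 1 is linked to disease 1) and the disease side contributes 0.5 (miRNA 0 is linked to the neighboring disease 0), giving 0.4 × 0.8 + 0.6 × 0.5 = 0.62.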

When the predictive score between an isolated disease $d_j$ and miRNA $m_i$ is calculated, all associations of the isolated disease $d_j$ are ignored, and the contribution of the neighbors of miRNA $m_i$ to the predictor is zero. Thus, the miRNA-neighbor score $P_m(m_i, d_j)$ equals 0. The final predictive score between isolated disease $d_j$ and miRNA $m_i$ is determined by the disease-neighbor score $P_d(m_i, d_j)$, which is the predictive score between the similarity neighbors of disease $d_j$ and miRNA $m_i$. Therefore, SRMDAP can predict associated miRNAs for an isolated disease. Similarly, when the predictive score between a new miRNA $m_i$ and disease $d_j$ is calculated, $P_m(m_i, d_j)$ is the predictive score between the similarity neighbors of miRNA $m_i$ and disease $d_j$, and only $P_m(m_i, d_j)$ is used as the predictive score between the new miRNA and related diseases.
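The isolated-disease case can be checked with a short sketch. It assumes unnormalized similarity-weighted sums for both neighbor scores and a weight of 0.4 on the miRNA side; the toy matrices are hypothetical. Once the disease's column of the association matrix is all zeros, the miRNA-neighbor score vanishes and the ranking is driven by the disease-neighbor score alone.

```python
def neighbor_scores(S_m, S_d, A, neighbors_m, neighbors_d, i, j):
    """Return (miRNA-neighbor score, disease-neighbor score) for pair (i, j)."""
    p_m = sum(S_m[i][k] * A[k][j] for k in neighbors_m[i])
    p_d = sum(S_d[j][l] * A[i][l] for l in neighbors_d[j])
    return p_m, p_d

# Hypothetical data: disease 1 is isolated (its column in A is all zeros).
S_m = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]
S_d = [[1.0, 0.5], [0.5, 1.0]]
A = [[1, 0], [1, 0], [0, 0]]
neighbors_m = {0: [1], 1: [0], 2: [1]}
neighbors_d = {0: [1], 1: [0]}
alpha = 0.4  # assumed to weight the miRNA-neighbor score

results = {}
for i in range(3):
    p_m, p_d = neighbor_scores(S_m, S_d, A, neighbors_m, neighbors_d, i, 1)
    results[i] = (p_m, p_d, alpha * p_m + (1 - alpha) * p_d)
```

miRNAs 0 and 1, which are linked to the similar disease 0, receive a nonzero score for the isolated disease, whereas miRNA 2 receives none.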

To find a suitable value of $\alpha$, we tested values from 0.1 to 0.9 and calculated the average area under the curve (AUC) in the framework of leave-one-out cross-validation. The results showed that SRMDAP achieved the highest average AUC when $\alpha$ was 0.4 (Figure 3).

3. Results

3.1. Characteristics of the miRNA-Disease Association Network

In our study, 5,048 known miRNA-disease associations involving 475 miRNAs and 334 diseases were included. The characteristics of the known miRNA-disease association network are summarized in Table 1. The degree of a disease (or miRNA) is the number of miRNAs (or diseases) related to it. The average degrees of diseases and miRNAs were 15.11 and 10.63, respectively. The degree distributions of diseases and miRNAs in the known miRNA-disease association network (Figure 4) followed a power law. Most of the miRNAs and diseases had a degree of 1. Hepatocellular carcinoma had the maximum disease degree, with 208 miRNAs related to this malignancy. Meanwhile, hsa-mir-21 had the maximum miRNA degree, with 112 diseases related to this miRNA.
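The reported average degrees follow directly from the dataset size, since every association contributes one to the degree of its disease and one to the degree of its miRNA. The short sketch below reproduces that arithmetic and shows how node degrees would be read off an adjacency matrix; the small matrix is a hypothetical example.

```python
n_associations, n_mirnas, n_diseases = 5048, 475, 334

# Each association adds 1 to one disease degree and 1 to one miRNA degree.
avg_disease_degree = n_associations / n_diseases
avg_mirna_degree = n_associations / n_mirnas

def degrees(A):
    """Row sums give miRNA degrees; column sums give disease degrees."""
    mirna_deg = [sum(row) for row in A]
    disease_deg = [sum(col) for col in zip(*A)]
    return mirna_deg, disease_deg

# Hypothetical 3 x 2 adjacency matrix.
toy = [[1, 1], [1, 0], [0, 0]]
mirna_deg, disease_deg = degrees(toy)
```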

3.2. Performance Evaluation of SRMDAP

We implemented leave-one-out cross-validation (LOOCV) on the known miRNA-disease associations to evaluate the predictive performance of SRMDAP. For a given disease $d$, each known association between a miRNA and disease $d$ was left out in turn as a test sample, and the other known associations between miRNAs and disease $d$ were used as the training set. The remaining miRNAs without evidence of a relation to disease $d$ composed the candidate miRNA set. We calculated the relevance scores of the test miRNA and the candidate miRNAs with disease $d$ and ranked them by their scores. If the test miRNA ranked above a given threshold, then the SRMDAP model was considered to have successfully predicted this miRNA-disease association. The threshold was varied to draw the receiver operating characteristic (ROC) curve, and the AUC was calculated to quantify the predictive performance. The ROC curve plots the relationship between the true positive rate (TPR, sensitivity) and the false positive rate (FPR, 1 − specificity) at different thresholds. Sensitivity represents the percentage of test miRNA-disease associations ranked above a given threshold. Meanwhile, specificity represents the percentage of candidate (unknown) miRNA-disease associations ranked below the threshold.
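The leave-one-out procedure for a single disease can be sketched as a loop that hides one known association at a time and ranks the held-out miRNA against the candidates. The scoring function here is a simple stand-in (a similarity-weighted sum over the remaining known miRNAs) and not the full SRMDAP model; the similarity matrix and association column are hypothetical.

```python
def loocv_ranks(A, j, score_fn):
    """Rank of each held-out known miRNA among the candidates for disease j."""
    known = [i for i in range(len(A)) if A[i][j] == 1]
    candidates = [i for i in range(len(A)) if A[i][j] == 0]
    ranks = []
    for i in known:
        A_train = [row[:] for row in A]
        A_train[i][j] = 0  # hide the test association
        test_score = score_fn(A_train, i, j)
        better = sum(1 for c in candidates
                     if score_fn(A_train, c, j) > test_score)
        ranks.append(better + 1)
    return ranks

# Hypothetical miRNA similarity that does not depend on A.
S_m = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.4],
    [0.1, 0.1, 0.4, 1.0],
]

def toy_score(A_train, i, j):
    """Stand-in scorer: similarity-weighted sum over other known miRNAs."""
    return sum(S_m[i][k] * A_train[k][j] for k in range(len(A_train)) if k != i)

A = [[1], [1], [0], [0]]  # miRNAs 0 and 1 are known for disease 0
ranks = loocv_ranks(A, 0, toy_score)
```

Because the similarity matrix is independent of the association matrix, hiding an association does not leak into the similarity values, which is the overestimation issue the introduction raises for association-derived similarities.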

The TPR and FPR were calculated as follows:

$$\mathrm{TPR} = \frac{TP}{TP + FN} \tag{8}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{9}$$

where TP, FP, TN, and FN indicate true positive, false positive, true negative, and false negative, respectively. Given a threshold, TP and FP are the numbers of known and unknown associations above the threshold, respectively. TN and FN are the numbers of unknown and known associations below the threshold, respectively. An AUC value of 1 indicates perfect performance of the prediction method, whereas an AUC value of 0.5 indicates random performance.
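As a minimal sketch of how these rates turn into a ROC curve and an AUC, the code below sweeps the threshold down a ranked list and integrates the resulting points with the trapezoidal rule. It assumes distinct scores (no tie handling) and uses a hypothetical four-item example.

```python
def tpr(tp, fn):
    return tp / (tp + fn)

def fpr(fp, tn):
    return fp / (fp + tn)

def roc_points(scores, labels):
    """(FPR, TPR) after each position of the ranked list; scores assumed distinct."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fpr(fp, n_neg - fp), tpr(tp, n_pos - tp)))
    return points

def auc(points):
    """Trapezoidal area under the ROC points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical scores; label 1 marks a known association.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
points = roc_points(scores, labels)
area = auc(points)
```

In this example three of the four (known, unknown) pairs are ordered correctly, which matches the computed area of 0.75.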

To our knowledge, RLSMDA [30], KATZ [24], and Liu et al.’s method [36] are three state-of-the-art computational methods for predicting miRNA-disease associations. We compared SRMDAP with these methods and ran LOOCV for all three. SRMDAP achieved the highest AUC of 0.8838 when $\alpha = 0.4$. With the optimal parameters described by their authors, the AUC values of RLSMDA, KATZ, and Liu’s method were 0.8584, 0.8522, and 0.7983, respectively. Comparative results of overall ROC curves and AUCs of all methods are shown in Figure 5.

To obtain a reliable judgment, we tested 18 human diseases associated with at least 70 miRNAs, because diseases related to only a few miRNAs are not sufficient to evaluate the performance of the prediction methods. Table 2 shows that SRMDAP achieved its highest AUC of 0.8874 with lung neoplasms and its lowest AUC of 0.7367 with renal cell carcinoma. The average AUC value for the 18 diseases was 0.8056. The average AUC values for the 18 diseases obtained from RLSMDA, KATZ, and Liu’s method were 0.6671, 0.6901, and 0.5178, respectively. The average AUC achieved by SRMDAP was thus about 14, 12, and 29 percentage points higher than those of the other three methods, respectively. The AUC values of SRMDAP for the 18 diseases were all higher than those of RLSMDA, KATZ, and Liu’s method. These results indicate that the prediction performance of SRMDAP was superior to that of RLSMDA, KATZ, and Liu’s method.

3.3. Case Studies

To further evaluate the SRMDAP’s ability to discover potential miRNA-disease associations, we selected two important diseases (kidney neoplasms and colorectal neoplasms) as case studies. We analyzed the top 50 candidates in detail. Prediction results were supported by dbDEMC [15] database and literature.

Kidney neoplasm, which forms in tissues of the kidneys, is one of the top 10 cancer killers. This malignancy is still difficult to diagnose and treat. Based on 2010–2014 cases and deaths, the annual number of new cases of kidney and renal pelvis cancer was 15.6 per 100,000 persons. The five-year survival rate in the United States is 74.1% [51]. MiRNAs showing altered expression in the kidney are promising biomarkers for diagnosis. For example, miR-141 and miR-200b are underexpressed in renal cell carcinoma (a type of kidney neoplasm) compared with normal kidney and oncocytoma tissue samples. The expression profiles of miR-141 or miR-200b might provide an ancillary tool for the correct discrimination of kidney neoplasms [52]. Candidate miRNAs were ranked by SRMDAP. The top 50 potential miRNAs associated with kidney neoplasms and the evidence for these associations are listed in Table 3. Among the top 50 predicted candidates, 49 miRNAs have been confirmed by dbDEMC, and only hsa-mir-7 is not confirmed by dbDEMC. However, downregulation of miR-7 with a synthesized inhibitor inhibited cell migration in vitro, suppressed cell proliferation, and induced renal cancer cell apoptosis. Thus, miR-7 could be characterized as an oncogene in renal cell carcinoma [53].

Colorectal neoplasm is the third most common cancer and the fourth most common cancer-related cause of death worldwide, with more than 1.2 million new cases and 600,000 deaths annually [54]. MiRNAs can serve as biomarkers for colorectal cancer diagnosis, prognosis, and prediction of treatment response because of their several unique characteristics [55]. For example, serum miR-21, miR-29a, and miR-125b levels could discriminate patients with early colorectal neoplasms from healthy controls [56]. The top 50 potential miRNAs associated with colorectal neoplasms and the evidence for these associations are listed in Table 4. Among the top 50 predicted candidates, 49 miRNAs were confirmed by dbDEMC. Only one miRNA (hsa-mir-663a) was not confirmed by dbDEMC.

3.4. Prediction of Isolated Diseases and Isolated miRNAs

An isolated disease is a disease without any known related miRNAs, such as a newly discovered disease. To test the capability of SRMDAP to predict miRNAs for isolated diseases, we removed all known verified miRNAs related to the disease being predicted. This operation ensured that only the similarity information and the miRNA associations of other diseases were used to predict candidate miRNAs for the given disease. Then, these candidate miRNAs were ranked according to their scores. The average AUC of SRMDAP for isolated diseases was 0.7990. For colorectal neoplasms, we removed the 143 known miRNAs related to colorectal neoplasms and ranked candidate miRNAs based on the predictive result of SRMDAP. Among the top 50 predicted candidates, 49 miRNAs have been confirmed by dbDEMC. The remaining candidate, hsa-mir-494, is supported by the literature [PMID: 25270723]: hsa-mir-494 is an independent prognostic marker for colorectal neoplasm patients, and this miRNA promotes cell migration and invasion in colorectal neoplasms by directly targeting PTEN [57]. The predicted results for colorectal neoplasms are listed in Table 5.

As previously stated, an isolated miRNA is a miRNA without any known related disease, such as a newly discovered miRNA. To demonstrate the ability of SRMDAP to predict diseases for such miRNAs, we removed the known verified disease-miRNA associations of the miRNA being predicted. This procedure ensures that only the known disease-miRNA associations and similarity information of other miRNAs are used to predict candidate diseases. Then, these candidate diseases were ranked according to their scores. The average AUC of SRMDAP for isolated miRNAs was 0.8464. The predicted results for hsa-mir-106b are listed in Table 6. For hsa-mir-106b, we removed its 31 related disease associations and ranked candidate diseases based on the predictive result of SRMDAP. Among the top 10 predicted candidates, all diseases have been confirmed by dbDEMC, miR2Disease, or HMDD. These results demonstrate that SRMDAP can be applied to isolated diseases and isolated miRNAs.

4. Discussion

The success of SRMDAP can largely be attributed to several factors. First, the similarity measurement in SRMDAP does not depend on experimentally supported miRNA-disease associations to calculate the functional similarity of miRNAs and diseases. Thus, overestimation of the predictive accuracy was avoided. On top of this, we proposed a density-based recommender model to integrate the miRNA similarity subnetwork and the disease similarity subnetwork using experimentally verified miRNA-disease associations. Second, SRMDAP incorporates miRNA-mRNA information, disease-gene information, and experimentally verified miRNA-disease associations. This characteristic improved prediction accuracy. Third, only one parameter was used to balance the contributions from the miRNA similarity subnetwork and the disease similarity subnetwork, and this parameter was easy to adjust. Fourth, the LOOCV experiment and the case studies of kidney and colorectal neoplasms demonstrated that SRMDAP had strong predictive performance. Finally, SRMDAP could make predictions for isolated diseases and isolated miRNAs, because disease similarity and miRNA similarity were obtained independently of the known miRNA-disease associations.

Although SRMDAP contains several innovative concepts, this process has several limitations in its current version. First, a similarity measurement is of vital importance. Hence, miRNA similarity measurement should use more interaction information of miRNAs with other biomolecules. Disease similarity measurement should consider not only functional similarities but also semantic similarities. A fusion of more information sources can benefit the similarity measurement. Second, considering that the SRMDAP is constructed on the basis of known miRNA-disease associations, the performance of SRMDAP can be improved by obtaining more available experimentally verified miRNA-disease associations.

5. Conclusions

Identifying the most promising miRNA-disease associations helps biological experiments save time and cost. In this work, we developed SRMDAP to predict miRNA-disease associations, using SimRank to establish a miRNA similarity subnetwork and a disease similarity subnetwork. We integrated these similarity networks with known experimentally verified miRNA-disease associations using the density-based clustering recommender model. SRMDAP obtained an AUC of 0.8838 in LOOCV. In the case studies of kidney and colorectal neoplasms, 49 of the top 50 predicted miRNAs were confirmed by dbDEMC for each disease. SRMDAP also performed well in predicting isolated diseases and miRNAs. For colorectal neoplasms and hsa-mir-106b, all top 50 predicted miRNAs and all top 10 predicted diseases have been confirmed by dbDEMC, miR2Disease, HMDD, or the literature. These results demonstrated that SRMDAP outperformed the other tested methods.

Conflicts of Interest

There are no conflicts of interest to declare.

Acknowledgments

This work was supported by the Natural Science Foundation of China under Grant no. 61672223 and the Natural Science Foundation of Hunan Province under Grant no. 2016jj4029.