Distributed Artificial Intelligence Models for Knowledge Discovery in Bioinformatics
Research Article (Open Access)
Joel Perdiz Arrais, José Luís Oliveira, "RecRWR: A Recursive Random Walk Method for Improved Identification of Diseases", BioMed Research International, vol. 2015, Article ID 747156, 7 pages, 2015. https://doi.org/10.1155/2015/747156
RecRWR: A Recursive Random Walk Method for Improved Identification of Diseases
Abstract
High-throughput methods such as next-generation sequencing or DNA microarrays lack precision, as they return hundreds of genes for a single disease profile. Several computational methods applied to networks of physical protein interactions have been successfully used to identify the best disease candidates for each expression profile. An open problem for these methods is the ability to combine and take advantage of the wealth of publicly available biomedical data. We propose an enhanced method to improve selection of the best disease targets for a multilayer biomedical network that integrates PPI data annotated with stable knowledge from OMIM diseases and GO biological processes. We present a comprehensive validation that demonstrates the advantage of the proposed approach, Recursive Random Walk with Restarts (RecRWR). The results show that RecRWR outperforms the compared methods in identifying disease candidates, especially under high levels of biological noise, while benefiting from all available data.
1. Introduction
A major research domain in molecular biology is the study of the causal association between genomic variations and clinical phenotypes [1–3]. Classical methods use a manual approach where one or a limited number of genomic targets are individually tested. However, due to the resources needed to systematically perform this procedure and due to the difficulty in controlling all experimental variables, improved strategies were required. The possibility to use computational methods to identify the best disease candidates to be further validated was a major breakthrough [4–8]. A common constraint of most methods is the need for training data, which is scarce and difficult to validate.
A recent research trend consists of exploiting the topological properties of protein-protein interaction (PPI) networks combined with other biological data to uncover the underlying mechanisms of genetic diseases. Barabási et al. [9] and Joy et al. [10] discuss the role of proteins with high betweenness as mediators of relevant metabolic processes. Ma and Zeng [11] explore the use of closeness centrality to quickly identify the top central metabolites in large-scale networks. Approaches proposed by Erten et al. [12] and Arrais and Oliveira [13] explore the potential of nodes with high degree for the prioritization of disease-associated genes.
While the previous methods focus on evaluating the weights given to each node, a complementary strategy consists of evaluating the proximity of two given nodes in the network. Some common methods to conduct this task are the shortest path, log-likelihood, propagation matrix, and the RWR (Random Walk with Restarts). Previous studies confirm that the RWR outperforms other methods [14–18]. One common limitation of these studies is that they assume the graph is single concept, meaning that every node is treated equally. However, as we demonstrate in this study, those methods perform poorly when the graph integrates nodes from distinct data types.
In this paper, we propose a novel method to improve selection of the best disease targets for multiconcept graphs. Towards this aim we build a multilayer biomedical graph that stores PPI data, annotated with stable knowledge from OMIM diseases and Biological Process from Gene Ontology. The inherent improvements of the proposed method are (a) the use of multilayer networks formed with PPI data and by the terms’ associations; (b) the combination of this data to establish new associations among nodes; and (c) the use of degree-based methods to evaluate node weights.
Finally, we present comprehensive validation that demonstrates the superiority of the proposed approach, Recursive Random Walk with Restarts (RecRWR).
2. Methods
The method proposed herein uses a graph representation of biomedical knowledge centred on proteins enriched with biomedical terms. The first step consists of selecting and curating the required data and using it to construct the graph. For performance reasons this network is represented as an adjacency matrix. Based on this ground-based biomedical graph we apply a modified version of the Hubs and Authorities (HITS) [13] algorithm adapted to this particular subject, in order to obtain a normalized and more accurate association among relations. Although here we are interested in protein-disease association, it is important to stress that the approach can be extended to the study of general many-to-many associations of biomedical terms. Finally, we formulate how the proposed method, RecRWR, can be applied to this subject.
2.1. Multiconcept Graph Modelling
A graph-based representation is used to store the relations among the biomedical terms. Since we are integrating three distinct data sources, three interconnected subgraphs are obtained.
(i) PPI data are retrieved from the STRING database [19], where the average confidence level is considered. A filter is applied to select only human data.
(ii) Disease data are extracted from the OMIM morbid map [20], where the genotype-phenotype associations are preserved. The morbid map is also used to extract the mapping relation for known protein diseases.
(iii) The Biological Process Directed Acyclic Graph structure from Gene Ontology (GO) [21] is extracted and replicated. The GO-GO mapping is also retrieved and stored.
For each of the previous data sources a curated set of terms is extracted: $T = (t_1, t_2, \ldots, t_m)$, with $t_k$ representing the content of the $k$th term from the interval $[1, m]$. Each term is a tuple of three elements that can be represented as $t_k = (a_k, b_k, s_k)$, where the element $a_k$ has an association with the element $b_k$, with a confidence score $s_k$.
The set of vectors $T$ is modelled as a non-oriented weighted graph $G = (V, E)$.
(i) Each vertex $v \in V$ is obtained by identifying the unique entries $a_k$ or $b_k$ of all the association tuples contained in vector $T$. The vertices are labelled by their unique identifier.
(ii) Each edge $e \in E$ connects two vertices, representing an association between the terms represented by the vertices and contained in vector $T$.
(iii) The weight of each edge corresponds to the score $s_k$ between the two nodes.
The graph is then mapped to an adjacency matrix representation that consists of an $n \times n$ matrix $A$, with $n = |V|$:

$$A_{ij} = \begin{cases} s_k & \text{if } t_k \text{ associates } v_i \text{ and } v_j, \\ 0 & \text{otherwise.} \end{cases}$$

Because the graph is undirected the adjacency matrix is symmetric and therefore $A_{ij} = A_{ji}$.
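The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_graph` and `to_matrix`, the identifiers, and the scores are hypothetical and do not come from the original implementation.

```python
from collections import defaultdict


def build_graph(tuples):
    """Return (index, adjacency) for an undirected weighted graph.

    tuples: iterable of (a, b, score) association tuples.
    index: maps each unique identifier to a vertex number.
    adjacency: sparse symmetric map with adjacency[i][j] == score.
    """
    index = {}
    adjacency = defaultdict(dict)
    for a, b, score in tuples:
        i = index.setdefault(a, len(index))
        j = index.setdefault(b, len(index))
        adjacency[i][j] = score
        adjacency[j][i] = score  # undirected: mirror every edge
    return index, adjacency


def to_matrix(index, adjacency):
    """Expand the sparse map into a dense, symmetric adjacency matrix."""
    n = len(index)
    matrix = [[0.0] * n for _ in range(n)]
    for i, neighbours in adjacency.items():
        for j, score in neighbours.items():
            matrix[i][j] = score
    return matrix


# Hypothetical identifiers and scores, for illustration only.
associations = [
    ("protein_A", "protein_B", 0.9),
    ("protein_A", "disease_X", 1.0),
    ("protein_B", "process_Y", 0.7),
]
index, adjacency = build_graph(associations)
matrix = to_matrix(index, adjacency)
```

Keeping the sparse map alongside the dense matrix makes the trade-off between the two representations explicit: the map grows with the number of edges, the matrix with the square of the number of nodes.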
The compiled graph has 60,000 nodes with an average degree of 5. The memory space required to represent the graph is $O(|V| + |E|)$, which in practice amounts to about 6.0 MB, excluding the hash tables required for node mapping. The adjacency matrix requires a memory space of $O(|V|^2)$, or 7.2 GB.
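As a rough check of these figures, the arithmetic can be reproduced as follows. The entry size of 2 bytes is an assumption chosen because it is consistent with the 7.2 GB quoted for the adjacency matrix; the text does not state the actual storage type, and the function names are hypothetical.

```python
def adjacency_matrix_bytes(num_nodes, bytes_per_entry):
    """Memory needed by a dense num_nodes x num_nodes matrix."""
    return num_nodes * num_nodes * bytes_per_entry


def adjacency_list_entries(num_nodes, average_degree):
    """Number of neighbour entries stored by an adjacency-list layout."""
    return num_nodes * average_degree


NUM_NODES = 60_000
BYTES_PER_ENTRY = 2  # assumption, not stated in the text

dense_gb = adjacency_matrix_bytes(NUM_NODES, BYTES_PER_ENTRY) / 1e9
sparse_entries = adjacency_list_entries(NUM_NODES, 5)
print(dense_gb)        # 7.2
print(sparse_entries)  # 300000
```

The dense matrix stores 3.6 billion cells, almost all of them zero, while the sparse layout stores only 300,000 neighbour entries, which is why the sparse representation fits in a few megabytes.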
2.2. RecRWR: Recursive Random Walk with Restarts
Next we formulate the RecRWR algorithm, including a detailed pseudocode description (Algorithm 1). The three main components are
(i) Random Walk with Restarts;
(ii) recursive cross subgraph mapping;
(iii) node replacement.

2.3. Random Walk with Restarts
The final probability vector of the random walker is defined as

$$p^{t+1} = (1 - r)\,W p^{t} + r\,p^{0},$$

where $W$ is the column-normalized adjacency matrix, $r$ is the restart probability, and $p^{t}$ is a vector in which the $i$th element holds the probability of being at node $i$ at time step $t$. The vector $p^{0}$ holds the probabilities of the initial states and is constructed such that equal probabilities are assigned to the seed nodes, with the probabilities summing to 1. For a given list of seed nodes $S$, this is obtained by setting $p^{0}_{i} = 1/|S|$ if $i \in S$ and $p^{0}_{i} = 0$ otherwise.
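A minimal pure-Python sketch of the standard Random Walk with Restarts iteration is given below to make the update concrete. It is not the authors' implementation: the restart probability of 0.7, the tolerance, and the toy path graph are assumptions chosen only for illustration.

```python
def column_normalize(matrix):
    """Divide every column by its sum so that each column sums to 1."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def seed_vector(n, seeds):
    """Equal probability on every seed node, zero elsewhere."""
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)
    return p0


def random_walk_with_restarts(matrix, seeds, restart=0.7, tol=1e-9,
                              max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until it stabilizes."""
    n = len(matrix)
    w = column_normalize(matrix)
    p0 = seed_vector(n, seeds)
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(w[i][j] * p[j] for j in range(n))
               + restart * p0[i] for i in range(n)]
        change = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if change < tol:
            break
    return p


# Toy path graph 0 - 1 - 2 - 3 with unit weights (illustrative only).
path = [[0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0]]
scores = random_walk_with_restarts(path, seeds=[0])
```

On this toy graph the scores decrease with distance from the seed node, which is the proximity behaviour that the ranking relies on.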
2.4. Recursive Cross Subgraph Mapping
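We extend the previous formulation to a symmetric matrix $W$ composed of $k \times k$ submatrices, where $k$ corresponds to the number of data sources. The submatrix $W_{ab}$ that corresponds to the mapping between the subgraphs $a$ and $b$ is obtained by

$$W_{ab} = \left(m_{a}\, m_{b}^{\top}\right) \circ W,$$

where $\circ$ denotes the elementwise product and $m_{a}$ and $m_{b}$ are binary vectors with $n$ elements that represent the masks of the source and target subgraphs, with $a, b \in \{1, \ldots, k\}$.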
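The result of each iteration of the Random Walk with Restarts is given by

$$p^{t+1} = (1 - r)\,W p^{t} + r\,p^{0},$$

with $W$, $r$, and $p^{0}$ as in Section 2.3, and the algorithm stabilizes when the following condition is met:

$$m_{d}^{\top}\,\lvert p^{t+1} - p^{t} \rvert < \varepsilon,$$

where $m_{d}$ is the disease mask vector, $p^{t}$ is the weight vector at time $t$, and the absolute value is taken elementwise. The product results in a scalar that corresponds to the sum of the differences between two iterations. The condition is true when this value is less than a given constant $\varepsilon$.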
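The masking and the stopping rule can be illustrated with the short sketch below. It assumes that a mask is applied elementwise to rows (source subgraph) and columns (target subgraph) and that the convergence test sums absolute differences over disease nodes only; the layer labels, function names, and the default threshold are hypothetical.

```python
def subgraph_mask(layers, layer):
    """Binary vector with 1 for every node that belongs to `layer`."""
    return [1 if name == layer else 0 for name in layers]


def cross_block(matrix, source_mask, target_mask):
    """Keep only entries whose row is in the source and column in the target."""
    n = len(matrix)
    return [[matrix[i][j] * source_mask[i] * target_mask[j]
             for j in range(n)] for i in range(n)]


def masked_change(mask, new, old):
    """Sum of absolute differences between two iterations over masked nodes."""
    return sum(m * abs(a - b) for m, a, b in zip(mask, new, old))


def has_converged(mask, new, old, epsilon=1e-6):
    """True when the masked change falls below the threshold epsilon."""
    return masked_change(mask, new, old) < epsilon


# Hypothetical four-node graph: two proteins, one disease, one GO term.
layers = ["ppi", "ppi", "disease", "go"]
weights = [[0.0, 0.9, 1.0, 0.0],
           [0.9, 0.0, 0.0, 0.7],
           [1.0, 0.0, 0.0, 0.0],
           [0.0, 0.7, 0.0, 0.0]]
ppi_mask = subgraph_mask(layers, "ppi")
disease_mask = subgraph_mask(layers, "disease")
ppi_to_disease = cross_block(weights, ppi_mask, disease_mask)
```

Restricting the convergence test to the disease mask means that fluctuations on protein or GO nodes do not delay termination once the disease scores have settled.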
2.5. Node Replacement
The recursive iteration of the cross subgraph mapping returns a new term. A node replacement strategy is used to replace the genes to be used. The selection of the node index to be replaced by the node index is given by the minimum value of and where the candidate node is given by .
3. Results and Discussion
In this section we explore and evaluate the performance of the proposed method. We present a systematic evaluation using synthetic datasets based on well-known disease profiles. We also show how the results of RecRWR can be used to explore the mechanisms that breast cancer shares with related diseases.
3.1. Validation Procedure
For each selected phenotype entry in the OMIM database we created a dataset with the associated genotypes. We selected 100 phenotype diseases with at least 10 associated genotypes each. Then, we iteratively replaced genes from the original dataset with random ones, in 20% increments, so that the dataset is progressively converted to a fully random dataset. We use each of these protein datasets as seed nodes on the graph. We end up with a test space of 600 gene sets (6 randomness levels times 100 diseases).
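The progressive replacement procedure could be sketched as follows. This is an illustrative reading of the description above, not the authors' script: the function name, the gene identifiers, the background pool, and the fixed random seed are all assumptions.

```python
import random


def noisy_seed_sets(genes, background, percent_step=20, seed=0):
    """Return {percent_random: gene_list} for 0%, 20%, ..., 100% noise."""
    rng = random.Random(seed)
    original = set(genes)
    pool = [g for g in background if g not in original]
    if len(pool) < len(genes):
        raise ValueError("background too small to replace every gene")
    # One random substitute per position, drawn once so noise is cumulative.
    replacements = rng.sample(pool, len(genes))
    order = list(range(len(genes)))
    rng.shuffle(order)  # order in which positions are replaced
    datasets = {}
    for percent in range(0, 101, percent_step):
        count = len(genes) * percent // 100
        noisy = list(genes)
        for position in order[:count]:
            noisy[position] = replacements[position]
        datasets[percent] = noisy
    return datasets


# Hypothetical identifiers for one disease with 10 associated genes.
genes = [f"gene_{i}" for i in range(10)]
background = genes + [f"random_{i}" for i in range(50)]
datasets = noisy_seed_sets(genes, background)
```

Each disease yields six gene sets (0% to 100% noise in 20% steps), so 100 diseases give the 600 gene sets used in the evaluation.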
3.2. Information Paradox
Previous uses of RWR in molecular biology typically concentrate on PPI networks. One would expect that including additional data would improve the overall result. Figure 1 compares the relative frequency of the ranks for each of the analysed datasets, for two of the tested methods (RWR over only PPI data and RWR over the whole network) and for four levels of randomness. This comparison shows that including external annotations on the original PPI network brings no improvement. Indeed, for the original dataset, with no random effect, there are no perceptible differences between the two methods. The gap becomes sharper when we test progressive levels of randomness. For instance, when 20% of the genes in the dataset are random, RWR over PPI ranks the disease in the top 3 for 55% of the datasets, while with RWR over all data this frequency drops to 48%. For 60% randomness, RWR over PPI ranks the disease in the top 5 for 35% of the datasets, while with RWR over all data the frequency drops to 23%. These results were the primary motivation for the work presented in this paper, as they show that the plain RWR method is not suitable for dealing with multiple types of biological data.
3.3. RecRWR Results on Synthetic Datasets
We evaluate the performance of the RecRWR method using the receiver operating characteristic (ROC) curves where each curve contains the results for each level of randomness. A higher AUC (area under curve) corresponds to a better overall performance. Figure 2 and Table 1 compile the obtained results.
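For readers unfamiliar with the metric, the AUC can be computed directly as the fraction of (positive, negative) pairs in which the positive item receives the higher score, with ties counted as one half. The sketch below uses made-up scores and labels and is not tied to the data behind Figure 2 or Table 1.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))


# Illustrative scores: label 1 marks the true disease, 0 any other disease.
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(example_auc)  # 0.75
```

The pairwise form is quadratic in the number of items, which is acceptable for a sketch; a rank-based computation would be preferable for large candidate lists.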

Figure 2: ROC curves of the evaluated methods, panels (a) to (d).
With 0% randomness the AUC is approximately the same for the three methods, the proposed one having the lowest minimum value, which can be perceived visually. This means that in the absence of biological noise the protein annotation data does not contribute to improving the final result. However, if randomness is introduced the proposed method shows a strong improvement.
With 20% randomness the RecRWR AUC is 0.9834, which compared to 0.9453 for the RWR corresponds to a 4.0% improvement. For RecRWR itself, 20% randomness has no real impact (−0.22%) on the obtained AUC.
For 40% and 60% randomness the difference is even higher (7.1% and 7.6%, respectively), demonstrating the greater capability of the proposed method.
It is also relevant to note that a TPR (true positive rate, y-axis on the graphs from Figure 2) of 1.0, meaning that the disease is always correctly identified, is consistently reached at a lower FPR (false positive rate, x-axis).
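The quoted improvement is a relative difference between the two AUC values, which can be verified with a short calculation. The function name is hypothetical; the two AUC values are the ones reported above.

```python
def relative_improvement(new, baseline):
    """Percentage gain of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


gain = relative_improvement(0.9834, 0.9453)
print(round(gain, 1))  # 4.0
```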
3.4. RecRWR Results on Breast Cancer
Breast cancer (MIM:114480) is considered a complex disorder with 23 known genotypes that are shared with other cancer-related disorders. We have used RecRWR over the common expression profiles of breast cancer to explore the network of diseases that share common mechanisms. The diseases most closely related to breast cancer are hepatocellular carcinoma, bladder cancer, and lung cancer.
From analysing the network of associations, we can see that the proteins most related to breast cancer are responsible for important cellular functions, such as DNA repair, cell cycle arrest and its regulation, induced cell death (apoptosis), and tumor suppression. Also, the GO terms most closely associated with breast cancer are protein binding and apoptotic process. This suggests that the probable causes of breast cancer are related to the impairment of these functions. For instance, a genetic mutation causing loss of function of a tumor suppressor gene product (such as the cellular tumor antigen p53, P04637) would result in unrestrained cellular proliferation. Conversely, the transformation of a proto-oncogene (a gene that participates in a cell-growth pathway) into an oncogene (a gene that can induce cancer in animals) requires a gain-of-function mutation that allows its permanent activation. An example of this is the epidermal growth factor receptor (EGFR, P00533), also present in Figure 3. EGFR is involved in the conversion of extracellular stimuli to cellular responses. Also, transcription errors are usually immediately corrected by DNA repair proteins, like the DNA repair and recombination protein RAD54-like (Q92698), shown in the network. A mutation in this gene would result in defective proteins, and subsequently the correction of transcription and translation errors would cease. Finally, the protein caspase-8 also seems to be a possible cause of breast cancer. Since caspase-8 is involved in the apoptotic process, impairment of this protein would result in the absence of apoptosis, and defective cells would not be destroyed.
The shortest path between the two diseases is mediated by the cellular tumor antigen p53. There are, however, other connections between the two nodes. For instance, the proteins receptor tyrosine-protein kinase erbB-2 (P04626), GTPase KRas (P01116), and caspase-8 (Q14790) also connect the two cancers. The influence of caspase-8 mutations on the onset of cancer was explained above. ERBB2 is a proto-oncogene, with the potential of being converted into an oncogene and inducing cancer. The GTPase KRas is involved in a great variety of important biological processes, including regulation of both cell proliferation and gene expression, signal transduction, and cell signalling. The majority of the proteins analysed here are part of the same KEGG pathways: pathways in cancer (hsa05200), neurotrophin signalling pathway (hsa04722), and focal adhesion (hsa04510). The first pathway consists of an integration of the various cancer pathways. The neurotrophin signalling pathway is responsible for the differentiation and survival of neural cells. However, this second pathway is heavily regulated by other intracellular signalling cascades, in which some of the proteins presented in Figure 3 participate. The focal adhesion pathway plays important roles in the proliferation, differentiation, and survival of cells and in gene expression. If any of the proteins involved in this pathway is compromised, cellular communication becomes defective, which can also result in cancer.
4. Conclusion
In this paper, we have proposed a graph-based approach to address the problem of selecting the best disease targets for multiconcept graphs. Towards this aim we build a multilayer biomedical graph that stores PPI data, annotated with stable knowledge from OMIM diseases and Biological Process from Gene Ontology. The inherent improvements of the proposed method are the use of multilayer networks formed with the PPI data and by the terms’ associations; combination of this data to establish new associations among nodes; and use of degree-based methods for evaluating node weights.
Finally, we have presented comprehensive validation that demonstrates the superiority of the proposed approach, Recursive Random Walk with Restarts (RecRWR). The obtained results outline the superiority of the proposed approach in identifying disease candidates, especially with high levels of biological noise and benefiting from all data available.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Authors’ Contribution
Joel P. Arrais and José Luís Oliveira contributed equally to the work presented here.
Acknowledgments
This work has received support from the RD-CONNECT European Project (EC contract no. 305444). Research Unit IEETA is funded by National Funds through FCT, Foundation for Science and Technology, in the context of the Project PEst-OE/EEI/UI0127/2014.
References
[1] C. Giallourakis, C. Henson, M. Reich, X. Xie, and V. K. Mootha, “Disease gene discovery through integrative genomics,” Annual Review of Genomics and Human Genetics, vol. 6, pp. 381–406, 2005.
[2] H. G. Brunner and M. A. van Driel, “From syndrome families to functional genomics,” Nature Reviews Genetics, vol. 5, no. 7, pp. 545–551, 2004.
[3] C. Auffray, Z. Chen, and L. Hood, “Systems medicine: the future of medical genomics and healthcare,” Genome Medicine, vol. 1, no. 1, article gm2, 2009.
[4] J. Chen, H. Xu, B. J. Aronow, and A. G. Jegga, “Improved human disease candidate gene prioritization using mouse phenotype,” BMC Bioinformatics, vol. 8, article 392, 2007.
[5] S. Aerts, D. Lambrechts, S. Maity et al., “Gene prioritization through genomic data fusion,” Nature Biotechnology, vol. 24, no. 5, pp. 537–544, 2006.
[6] S. Rossi, D. Masotti, C. Nardini et al., “TOM: a web-based integrated approach for identification of candidate disease genes,” Nucleic Acids Research, vol. 34, pp. W285–W292, 2006.
[7] Y. Moreau and L.-C. Tranchevent, “Computational tools for prioritizing candidate genes: boosting disease gene discovery,” Nature Reviews Genetics, vol. 13, no. 8, pp. 523–536, 2012.
[8] J. P. Arrais, J. Fernandes, J. Pereira, and J. L. Oliveira, “GeneBrowser 2: an application to explore and identify common biological traits in a set of genes,” BMC Bioinformatics, vol. 11, article 389, 2010.
[9] A.-L. Barabási, N. Gulbahce, and J. Loscalzo, “Network medicine: a network-based approach to human disease,” Nature Reviews Genetics, vol. 12, no. 1, pp. 56–68, 2011.
[10] M. P. Joy, A. Brock, D. E. Ingber, and S. Huang, “High-betweenness proteins in the yeast protein interaction network,” Journal of Biomedicine & Biotechnology, vol. 2005, no. 2, pp. 96–103, 2005.
[11] H.-W. Ma and A.-P. Zeng, “The connectivity structure, giant strong component and centrality of metabolic networks,” Bioinformatics, vol. 19, no. 11, pp. 1423–1430, 2003.
[12] S. Erten, G. Bebek, R. M. Ewing, and M. Koyutürk, “DADA: degree-aware algorithms for network-based disease gene prioritization,” BioData Mining, vol. 4, no. 1, article 19, 2011.
[13] J. P. Arrais and J. L. Oliveira, “Using biomedical networks to prioritize gene–disease associations,” Open Access Bioinformatics, vol. 2011, no. 3, pp. 123–130, 2011.
[14] K. Macropol, T. Can, and A. K. Singh, “RRW: repeated random walks on genome-scale protein networks for local cluster discovery,” BMC Bioinformatics, vol. 10, article 283, 2009.
[15] L. Yu, L. Gao, and K. Li, “A method based on local density and random walks for complexes detection in protein interaction networks,” Journal of Bioinformatics and Computational Biology, vol. 8, supplement 1, pp. 47–62, 2010.
[16] D.-H. Le and Y.-K. Kwon, “GPEC: a Cytoscape plug-in for random walk-based gene prioritization and biomedical evidence collection,” Computational Biology and Chemistry, vol. 37, pp. 17–23, 2012.
[17] M. Re, M. Mesiti, and G. Valentini, “A fast ranking algorithm for predicting gene functions in biomolecular networks,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 6, pp. 1812–1818, 2012.
[18] S. Köhler, S. Bauer, D. Horn, and P. N. Robinson, “Walking the interactome for prioritization of candidate disease genes,” The American Journal of Human Genetics, vol. 82, no. 4, pp. 949–958, 2008.
[19] D. Szklarczyk, A. Franceschini, M. Kuhn et al., “The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored,” Nucleic Acids Research, vol. 39, no. 1, pp. D561–D568, 2011.
[20] A. Hamosh, A. F. Scott, J. S. Amberger, C. A. Bocchini, and V. A. McKusick, “Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders,” Nucleic Acids Research, vol. 33, pp. D514–D517, 2005.
[21] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.
Copyright
Copyright © 2015 Joel Perdiz Arrais and José Luís Oliveira. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.