MOfinder: A Novel Algorithm for Detecting Overlapping Modules from Protein-Protein Interaction Network
Since organism development and many critical cell biology processes are organized in modular patterns, many algorithms have been proposed to detect modules. In this study, a new method, MOfinder, was developed to detect overlapping modules in a protein-protein interaction (PPI) network. We demonstrate that our method is more accurate than five other methods. We then applied MOfinder to the yeast and human PPI networks and explored the overlapping information. Using the overlapping modules of the human PPI network, we constructed a module-module communication network. Functional annotation showed that immune-related and cancer-related proteins tended to occur together in the same modules, which offers some clues for immune therapy for cancer. Our study of overlapping modules suggests a new perspective on the analysis of PPI networks and improves our understanding of disease.
1. Introduction
PPI networks have been widely used to understand biology at the system level [1–3]. However, PPI data sets suffer from high false positive and false negative rates. A network module, a group of proteins that are connected with each other to carry out a function, is more reliable than individual interactions because the loss or gain of a single interaction will not break down the module structure. Modules have been applied to predict protein function and disease genes and to trace the evolutionary history of networks [8–10].
To perform complex biochemical or developmental functions, modules have to work together. Thus several proteins are used to pass information from one module to another. For example, three modules in S. cerevisiae, namely the Set3C complex, the protein phosphatase type 2A (PP2A) complex, and cell polarity budding, share a protein: Zds1. Zds1 can bind PP2A to control mitotic progression, and it also participates in the Set3C complex during budding and represses the meiotic process, so Zds1 may serve as a bridge between mitosis and meiosis. Here we define these three modules as overlapping modules and the shared protein as an overlapping node. The overlapping modules can form a module-module communication network. Constructing such a network can help in understanding the coordinated relationship between different biological processes.
The problem of identifying modules has been studied in bioinformatics, applied mathematics, and physics. Many methods have been developed to identify modules within a network, and they have been reviewed and evaluated [15–19]. These approaches can be classified into two types. (1) Local seed-based methods, which start from a node or clique (fully connected subgraph) and then follow an expanding search strategy. MCODE was the first method for module detection; it expands high-scoring seed nodes by a local search procedure, but it detects only a few modules. CFinder was the first algorithm for overlapping community detection; it implements the Clique Percolation Method (CPM), in which k-cliques are explored by rotating about their component (k-1)-cliques. CFinder is too slow when applied to dense PPI networks, and in particular it cannot detect spoke-like modules (noncliques). To overcome this problem, Zhang et al. combine the Line Graph Transformation (LGT) with CPM to detect overlapping network modules and build an overlapping module network. Wu et al. propose COACH (a core-attachment based method), which predicts complexes by detecting protein-complex cores and then adding attachments. The Local Protein Community Finder (LPCF) uses two local clustering algorithms to find a community close to a queried protein. (2) Global clustering methods. NeMo combines a neighbour-sharing score with hierarchical agglomerative clustering to identify both dense network and dense bipartite network structures in a single approach. Reichardt and Bornholdt propose a method to detect overlapping (fuzzy) communities that maps the graph onto a zero-temperature q-Potts model with nearest-neighbor interactions. Zhang et al. combine the modularity function Q, spectral relaxation, and the fuzzy c-means clustering method to detect overlapping community structure. Wang et al. propose the BCD (Betweenness-Commonality Decomposition) algorithm, which uses edge commonality and edge betweenness. Other methods, such as the nonnegative matrix factorization (NMF) technique, have also been used for uncovering overlapping (fuzzy) communities [27, 28]. Besides these topology-based methods, Chen and Yuan integrate 265 microarray datasets to detect functional modules in the yeast protein-protein interaction network.
Here we describe MOfinder, an alternative method we have developed that can effectively identify functional modules, especially overlapping modules, from a PPI network. MOfinder allows flexibility and user customization through adjustable parameters. We compared the performance of MOfinder with that of other available methods. We explored the overlapping information of modules in the yeast and human PPI networks. We used all the overlapping modules detected from the human PPI network to generate a graph of module-module communication, and we analyzed the functional properties of the overlapping modules.
2. Materials and Methods
2.1. Data Sources
The human PPI data sets were downloaded from HPRD (release 8). The yeast PPI data sets were collected from DIP. Cancer Genes (“Tumor Suppressor” and “Oncogene”) and Immunome [33, 34] were used to annotate cancer- and immune-related proteins.
2.2. Definition of Clustering Coefficient
The clustering coefficient of node $i$ is defined as $C_i = 2e_i / (k_i(k_i - 1))$, where $k_i$ is the degree of $i$ and $e_i$ is the number of links between all neighbors of $i$.
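As a minimal sketch of this definition (twice the number of links among a node's neighbors, divided by k(k - 1) for a node of degree k), the following Python function computes the clustering coefficient from an adjacency dictionary. The function name, the toy graph, and the convention of returning 0 for nodes of degree below 2 are assumptions made for illustration; they are not taken from the MOfinder implementation.

```python
from itertools import combinations


def clustering_coefficient(adj, node):
    """Return 2*e / (k*(k-1)) for `node`; 0.0 when its degree is below 2 (assumed convention)."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    # e: number of edges that connect two neighbors of `node`
    e = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * e / (k * (k - 1))


# Toy undirected graph as a dict of neighbor sets (hypothetical protein names).
toy = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
```

In the toy graph, node A has three neighbors with a single link (B to C) among them, so its coefficient is 2·1/(3·2) = 1/3.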
2.3. Definition of Functional Module
Given a predicted module, its $P$ value with respect to a GO term is computed from the hypergeometric distribution in (1) and corrected by Bonferroni correction. A functional module is defined as a module enriched in at least one GO term (Bonferroni-corrected $P$ value < 0.01):

$$P = 1 - \sum_{i=0}^{m-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}, \tag{1}$$

where $n$ is the size of the predicted complex, $m$ is the number of its proteins that share a GO term, $N$ is the total number of proteins, and $M$ is the number of them that have the same GO term.
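The enrichment test can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used in the study: the function and parameter names are hypothetical, the upper tail is summed directly (which equals one minus the lower tail), and the number of tests passed to the Bonferroni correction is left to the caller because the text does not specify it.

```python
from math import comb


def enrichment_p_value(module_size, hits, population, annotated):
    """Hypergeometric probability that a module of `module_size` proteins drawn from
    `population` contains at least `hits` of the `annotated` proteins."""
    total = comb(population, module_size)
    upper = min(module_size, annotated)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, module_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected P value, capped at 1."""
    return min(1.0, p_value * n_tests)


def is_functional(corrected_p_values, threshold=0.01):
    """A module counts as functional if any GO term passes the corrected threshold."""
    return any(p < threshold for p in corrected_p_values)
```

For example, a module of 3 proteins that are all annotated with a term carried by only 3 of 10 proteins gives `enrichment_p_value(3, 3, 10, 3)`, which is 1/120.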
2.4. Functional Similarity of Modules
Assuming $G_A$ and $G_B$ are two sets of GO terms that annotate modules $A$ and $B$, respectively, the following Jaccard index was used to calculate the functional similarity between modules $A$ and $B$:

$$S(A, B) = \frac{|G_A \cap G_B|}{|G_A \cup G_B|}.$$
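A short Python sketch of the Jaccard index on GO term sets follows. The function name is hypothetical, the GO identifiers in the usage note are placeholders, and returning 0 when neither module has any annotation is an assumption (the text does not say how that case is scored).

```python
def functional_similarity(go_terms_a, go_terms_b):
    """Jaccard index of two GO term collections: |intersection| / |union|."""
    a, b = set(go_terms_a), set(go_terms_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two unannotated modules score 0
    return len(a & b) / len(union)
```

Two modules annotated with `{"GO:x", "GO:y"}` and `{"GO:y", "GO:z"}` share one of three distinct terms, so their similarity is 1/3.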
3. Results
3.1. MOfinder Algorithm
MOfinder is based on the AMD (Approximate Minimum Degree Ordering) algorithm [35, 36], which has been used for network clustering in electrical engineering. The AMD algorithm is usually used to order a sparse matrix prior to Cholesky factorization (or LU factorization with diagonal pivoting), and it can transform the sparse matrix so that the nonzero elements lie close to the diagonal. The approach used by MOfinder is summarized in Figure 1. MOfinder first converts the PPI file into a sparse matrix, in which a nonzero element represents a protein-protein interaction. It then performs a global AMD of the sparse matrix, after which the densely connected elements (modules) are clustered along the diagonal. Besides the global AMD, which produces the global ordering, a local AMD is performed to give the approximate minimum degree ordering within a window: MOfinder uses a sliding window along the diagonal to fetch a local sparse matrix and performs the local AMD on it. The clustering coefficient (CC) value of the submatrix in the sliding window is then calculated; if the CC value is not less than the cut-off, MOfinder saves the submatrix as a module. The sliding window then moves one step along the diagonal to find new modules, and the process is repeated until the sliding window reaches the end. Lastly, MOfinder removes redundant modules (if module A is contained in module B, A is removed) and saves the results. The pseudocode of the MOfinder algorithm is given in Algorithm 1.
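Only the final step, removing modules that are contained in other modules, is specified precisely enough in the text to sketch without guessing. The Python function below illustrates that step alone; the function name and the protein identifiers are hypothetical, the handling of exact duplicates (one copy is kept) is an assumption, and the AMD ordering and sliding-window stages are not reproduced here.

```python
def remove_redundant(modules):
    """Drop every module that is a proper subset of another module.

    Exact duplicates are collapsed to a single copy (assumed behaviour).
    """
    unique = {frozenset(m) for m in modules}
    # Visit larger modules first, so any superset of a module is examined before it.
    ordered = sorted(unique, key=lambda m: (-len(m), sorted(m)))
    kept = []
    for module in ordered:
        if not any(module < other for other in kept):
            kept.append(module)
    return kept


candidates = [
    {"P1", "P2", "P3"},
    {"P1", "P2"},        # contained in the first module, so it is removed
    {"P3", "P4"},        # overlaps the first module but is not contained in it
    {"P1", "P2"},        # exact duplicate of an already redundant module
]
modules = remove_redundant(candidates)
```

Overlapping modules such as the first and third candidates are both retained, because sharing proteins is not the same as being contained in another module.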
3.2. MOfinder Is a Flexible Method
MOfinder has two adjustable parameters: the CC cut-off value and the size of the sliding window. Different parameter values produce different results. To optimize the parameters, performance was assessed in terms of the accuracy of the identified modules with respect to annotated function. MOfinder was tested over a broad range of values for the CC cut-off (0.2–1) and the sliding window size (20–450) using PPI data from yeast and human.
First, the percentage of functional modules was plotted against a range of CC cut-off values; for each CC cut-off value, all sliding window sizes (20–450, step = 10) were tested and the resulting percentages of functional modules were plotted as a group of points. As shown in Figure 2, the percentage of functional modules increases with the CC cut-off value, and four distinct and stable ranges of the CC cut-off value can be observed. Although the highest percentage of functional modules is achieved in the last range, a CC cut-off value in this range identifies only densely connected complexes and ignores other modules. Additionally, in this range MOfinder generates only a small number of modules (e.g., it predicts, on average, 36 modules from the human PPI network when CC cut-off = 0.84). Since the purpose is to detect modules rather than complexes, we recommend setting the threshold in the third range. The best choice for the CC cut-off value is 0.67 because the number of predicted modules decreases as the CC cut-off value increases (data not shown).
Second, we investigated how the size of the sliding window affects performance. Figure 3 shows the number of functional modules obtained with the 0.67 cut-off value over all tested sliding window sizes (20–450, step = 10). The number of functional modules first increases and then decreases, so the sliding window should be set to 350, which maximizes the number of functional modules. To achieve the best performance, we recommend the parameter set CC cut-off value = 0.67 and sliding window size = 350.
3.3. Performance Evaluation
MOfinder was tested using PPI data from yeast and human and compared with five other available algorithms: MCODE (default parameters), CFinder (parameter set as suggested), COACH (default parameters), NeMo (default parameters), and LPCF (community size set to 3–11, which is comparable to MOfinder). The percentage of functional modules was used to indicate accuracy, and MOfinder was the top performing algorithm with respect to accuracy in yeast (93.9%) (Figure 4(a)) and human (81.5%) (Figure 4(b)). We also compared the major module size of the six methods in yeast (Table 1) and in human (see supplementary Table 1 in Supplementary Material available online at http://dx.doi.org/10.1155/2012/103702). The most common module size is 3 for MCODE, 4 for CFinder, 3 for COACH, 4 for NeMo, 10 for LPCF, and 5 for MOfinder. Although the number of modules and the number of proteins assigned to modules were smaller for MOfinder than for some of these methods, the percentage of functional modules was highest for MOfinder.
3.4. Overall Overlapping Properties in Yeast and Human
We applied MOfinder to the yeast and human PPI networks with default parameters (CC cut-off = 0.67, sliding window = 350) and then explored the distribution of overlapping size. As shown in Figure 5, the overlapping size distribution differs between yeast and human. Most of the modules in the yeast PPI network share one protein (Figure 5(a)), but in the human PPI network the most common overlapping size is 4 (Figure 5(b)). Some overlapping parts might be overcounted. For example, if three modules (A, B, and C) share a protein D, protein D is counted 3 times (A-B, A-C, and B-C). To avoid this overcounting, we deleted the repeats, so that protein D is counted only once. Figure 6(a) shows that the resulting distribution of overlapping size in yeast changes markedly: the most common overlapping size becomes 4, which is similar to human (Figure 6(b)). These observations suggest that although modules in yeast tend to share fewer proteins than modules in human, the small overlapping parts (size 1 and size 2) are reused more often in yeast than in human, and thus the distribution of overlapping size becomes similar in yeast and human after removing repeats.
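The counting scheme described above can be illustrated with a small Python sketch. It assumes that "deleting the repeats" means counting each distinct set of shared proteins once, however many module pairs share it; that reading, the function name, and the toy modules are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def overlap_size_distribution(modules, deduplicate=False):
    """Histogram of overlap sizes over all module pairs.

    With deduplicate=True, each distinct shared protein set is counted once
    (assumed interpretation of removing repeats).
    """
    overlaps = [frozenset(a) & frozenset(b) for a, b in combinations(modules, 2)]
    overlaps = [shared for shared in overlaps if shared]  # keep overlapping pairs only
    if deduplicate:
        overlaps = set(overlaps)
    return Counter(len(shared) for shared in overlaps)


# Three toy modules that all share the single protein "D".
toy_modules = [{"D", "a1", "a2"}, {"D", "b1"}, {"D", "c1"}]
raw = overlap_size_distribution(toy_modules)
unique = overlap_size_distribution(toy_modules, deduplicate=True)
```

Here `raw` counts the size-1 overlap three times (once per module pair), whereas `unique` counts it once.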
Since proteins in one module work together to perform functions, two modules that overlap with each other are expected to have similar functions, and the larger the overlapping size, the more likely they are to share the same function. To verify this, we used GO annotation similarity to represent functional similarity. Figure 7 shows that the average functional similarity increases with overlapping size. This trend is observed in both yeast (Figure 7(a)) and human (Figure 7(b)).
3.5. Overlapping Modules in the Human Interactome
MOfinder identified 221 modules, of which 152 overlapped with at least one other module. These overlapping modules were used to construct a module-module communication network (Figure 8(a)). In the communication network, each node is a module, and two modules are connected if they share at least one protein. To explore the function of this network, we used DAVID 6.7 [40, 41] to search for enrichment of Gene Ontology (GO) terms and KEGG pathways. We found that GO terms and pathways related to cancer and immune response were enriched in the network proteins, so we mapped the cancer- and immune-related proteins to the modules. As shown in Figure 8(a), of the 47 modules containing immune-related proteins, 33 included cancer-related proteins, and this ratio (33/47) was greater than expected by chance (62 of 152 modules have cancer-related proteins; binomial test). Therefore, the modules containing immune-related proteins tended to include cancer-related proteins and vice versa (33/62 was greater than the expected 47/152; binomial test).
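Both steps of this paragraph can be sketched in Python: building the module-module network from shared proteins, and testing whether 33 of 47 exceeds the background rate of 62/152. The function names and toy modules are hypothetical, and the use of a one-sided upper-tail binomial test is an assumption, since the text does not state the sidedness of the test.

```python
from itertools import combinations
from math import comb


def module_network(modules):
    """Edges (i, j, shared) for every pair of modules sharing at least one protein."""
    edges = []
    for (i, a), (j, b) in combinations(enumerate(modules), 2):
        shared = set(a) & set(b)
        if shared:
            edges.append((i, j, shared))
    return edges


def binomial_upper_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, i) * p**i * (1 - p) ** (trials - i)
        for i in range(successes, trials + 1)
    )


# Counts taken from the text: 33 of 47 immune-related modules also contain
# cancer-related proteins; 62 of 152 modules do so overall.
p_value = binomial_upper_tail(33, 47, 62 / 152)

# Toy modules (hypothetical protein names): only the first two share a protein.
edges = module_network([{"P1", "P2"}, {"P2", "P3"}, {"P4"}])
```

The resulting `p_value` is an illustration of the calculation under the stated assumptions, not a reproduction of the value reported in the study.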
To explore the communication between functional modules, we mapped the functional annotation to each module and evaluated the functional similarity between each pair of overlapping modules. The functional similarity is shown as edge color in Figure 8: values between 0 and 1 are painted with a pink/blue color gradient, and modules without GO annotation have gray edges. Figure 8(b) gives the functional annotation of modules from the largest cluster in Figure 8(a). Some overlapping modules have the same function, such as the three modules involved in the acetylation of peptidyl-lysine, while other overlapping modules have distinct functions; for instance, a module involved in the change of mast cell overlaps with another module that takes part in reactions mediated by protein kinases. Figure 9 shows an example of two overlapping modules. One module functions in B-cell activation and contains five proteins: Q15464, O75791, O43561, Q13094, and P08575. The other module (P08575, P20963, P06729, and P06127) is involved in T-cell activation. These two modules share a protein, P08575 (receptor-type tyrosine-protein phosphatase C, CD45), which plays a critical role in receptor-mediated signalling in both B and T cells [42, 43]. The shared node suggests pathway crosstalk between the two modules. Consistent with this hypothesis, several studies have illustrated T-cell-dependent B-cell activation.
The module-module communication network included 341 overlapping nodes (nodes belonging to two or more modules). Several studies have shown that modular overlaps are potential drug targets because they are key determinants of cooperation between network modules. We therefore investigated the potential druggability of the overlapping nodes: 56 of them were established drug targets and another 43 belonged to druggable families, giving 99 druggable proteins in all. The ratio of druggable proteins (99/341) was significantly higher than expected (2000–3000 druggable proteins in human; binomial test).
For both the yeast and human interactomes, MOfinder surpasses the other five methods in accuracy. Furthermore, MOfinder is fast in practice for large networks. For example, when applied to a yeast network including nearly 40,000 interactions (from I2D), the running time of MOfinder was only 15 seconds. Since the size of biological networks continues to grow, MOfinder is likely to meet the needs of biological analysis. However, MOfinder has two possible limitations. One is that MOfinder detects only small modules (fewer than 12 proteins), although its major module size (5) is close to the average size of MIPS complexes (6). MOfinder detects 125 modules from the yeast PPI network, which is fewer than COACH and LPCF, and it ranks fourth in terms of the proteins covered by predicted modules. These observations suggest the other limitation: MOfinder is too strict to detect loosely connected modules, partly because the CC cut-off value is set to 0.67. We suppose that setting the CC cut-off to a smaller value can increase the number of detected modules, especially loosely connected modules (including pathways). However, the biological significance of the different clustering coefficient thresholds is still an open question.
Yeast is a simple single-celled eukaryote, so overlapping modules in yeast generally use one protein for communication. In contrast, human, a multicellular organism, employs a more complex system, and thus the overlapping size in human is larger than that in yeast. We also found that the overall distribution of overlapping size is similar between yeast and human after removing repeats, and that the functional steps in Figure 2 occur at similar places in yeast and human. These observations may reflect evolutionary conservation across eukaryotes. Although overlaps may lead to redundant modules that overlap heavily with each other, excluding overlapping size 4 (a heavy overlap, because the major module size is 5) from Figures 5, 6, and 7 does not change the overall pattern of results.
Overlapping modules work together to carry out complicated tasks, such as signal transduction. Constructing a module-module communication network to explore how these modules communicate with each other can therefore help in understanding biological complexity. Although we built such a network only for human, a similar approach can be applied to other species. We found that immune- and cancer-related proteins tend to occur in the same modules. The association between immune cells and cancer has been discussed, and several clinical studies and experiments have shown that the immune system is a new weapon against cancer. Antitumor adaptive immune responses can suppress tumor growth, and several immunotherapy drugs could cure cancers. We provide evidence for their close relationship at the system level.
In this paper, we describe a novel algorithm for the identification of overlapping modules in PPI networks. MOfinder performs competitively with other methods and uses two adjustable parameters that enable it to identify modules flexibly. MOfinder is a cross-platform package implemented in C/C++, and it can be downloaded and installed free of charge (http://bsb.kiz.ac.cn/mofinder/). The application of MOfinder to the human PPI network gives clues for fighting cancer using the immune system. In addition, the overlapping nodes, which are in charge of intermodule crosstalk, could help to identify potential drug targets.
This work was supported by the National Basic Research Program of China (Grant no. 2009CB941300) for J.-F. Huang. Q. Yu and G. H. Li contributed equally to this work.
The supplementary material provides the statistics for the major module size and other features when the six methods are applied to the human PPI data.
R. Sharan, I. Ulitsky, and R. Shamir, “Network-based prediction of protein function,” Molecular Systems Biology, vol. 3, p. 88, 2007.
L. H. Hartwell, J. J. Hopfield, S. Leibler, and A. W. Murray, “From molecular to modular cell biology,” Nature, vol. 402, no. 6761, pp. C47–C52, 1999.
W. W. M. Pim Pijnappel, D. Schaft, A. Roguev et al., “The S. cerevisiae SET3 complex includes two histone deacetylases, Hos2 and Hst1, and is a meiotic-specific repressor of the sporulation gene program,” Genes and Development, vol. 15, no. 22, pp. 2991–3004, 2001.
S. Zhang, H. W. Liu, X. M. Ning, and X. S. Zhang, “A hybrid graph-theoretic method for mining overlapping functional modules in large sparse protein interaction networks,” International Journal of Data Mining and Bioinformatics, vol. 3, no. 1, pp. 68–84, 2009.
M. Zarei, D. Izadi, and K. A. Samani, “Detecting overlapping community structure of networks based on vertex-vertex correlations,” Journal of Statistical Mechanics-Theory and Experiment, vol. 2009, article P11013, 2009.
L. Salwinski, C. S. Miller, A. J. Smith, F. K. Pettit, J. U. Bowie, and D. Eisenberg, “The database of interacting proteins: 2004 update,” Nucleic Acids Research, vol. 32, pp. D449–D451, 2004.
P. R. Amestoy, T. A. Davis, and I. S. Duff, “An approximate minimum degree ordering algorithm,” SIAM Journal on Matrix Analysis and Applications, vol. 17, no. 4, pp. 886–905, 1996.
A. O. M. Saleh and M. A. Laughton, “Cluster-analysis of power system networks for array-processing solutions,” IEE Proceedings C, vol. 132, no. 4, pp. 172–178, 1985.
G. Dennis, B. T. Sherman, D. A. Hosack et al., “DAVID: database for annotation, visualization, and integrated discovery,” Genome Biology, vol. 4, no. 5, p. P3, 2003.
D. C. Parker, “T cell-dependent B cell activation,” Annual Review of Immunology, vol. 11, pp. 331–360, 1993.