Journal of Biomedicine and Biotechnology
Volume 2008 (2008), Article ID 860270, 10 pages
http://dx.doi.org/10.1155/2008/860270
Research Article

An Algorithm for Finding Functional Modules and Protein Complexes in Protein-Protein Interaction Networks

1School of Computer Science and Engineering, Inha University, Incheon 402-751, South Korea
2Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, China

Received 1 September 2007; Revised 12 November 2007; Accepted 24 December 2007

Academic Editor: Daniel Howard

Copyright © 2008 Guangyu Cui et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Biological processes are often performed by a group of proteins rather than by individual proteins, and proteins in the same biological group form a densely connected subgraph in a protein-protein interaction network. Therefore, finding a densely connected subgraph provides useful information for predicting the function or protein complex of uncharacterized proteins in that subgraph. We have developed an efficient algorithm and program for finding cliques and near-cliques in a protein-protein interaction network. Analysis of the interaction network of yeast proteins using the algorithm demonstrates that 59% of the near-cliques identified by our algorithm have at least one function shared by all the proteins within a near-clique, and that 56% of the near-cliques show a good agreement with the experimentally determined protein complexes catalogued in MIPS.

1. Introduction

Proteins in a highly connected subgraph of a protein interaction network usually share a common function [1]. Therefore, a highly connected subgraph such as a clique or near-clique in a protein interaction network can be used to predict the function of uncharacterized proteins in that subgraph. Finding a clique of maximum size in a graph is an NP-hard problem [2]. There are several heuristic algorithms for the maximum clique problem [2, 3], but most of them focus on finding a complete subgraph (i.e., a clique) and cannot be used to find near-cliques.

Several topological analysis methods have been developed for identifying biologically meaningful groups from protein interaction networks or for assessing the reliability of protein interactions. A recent program called CFinder [4, 5] finds overlapping cliques in protein interaction networks. It allows a protein to belong to more than one clique, but cannot find near-cliques. Our study shows that the near-cliques can reveal higher functional coherence than the overlapping cliques.

The primary focus of this study is to find functional groups by identifying cliques and near-cliques in protein interaction networks. This study attempts to answer two questions: "Can we efficiently find all cliques and near-cliques?" and "Does a dense subgraph such as a clique or near-clique indeed represent a functional module or protein complex?" This study demonstrates that the answers to both questions are "yes." This paper presents an algorithm for finding near-cliques and its application to the interaction network of yeast proteins.

2. Algorithms for Finding Near-Cliques

A clique is a complete graph 𝐺=(𝑁,𝐸) in which every node is connected to every other node in the graph. In our previous work, we developed a heuristic algorithm and implemented the algorithm in a program called InterViewer [6], which identifies all edge-disjoint cliques (i.e., cliques that do not share an edge).
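The two notions used above can be illustrated with a minimal Python sketch. It is not the InterViewer heuristic; the function names and the toy network are hypothetical and serve only to show what "clique" and "edge-disjoint" mean on a small graph.

```python
from itertools import combinations


def is_clique(nodes, edges):
    """Return True if every pair of distinct nodes is joined by an edge."""
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset(p) in edge_set for p in combinations(nodes, 2))


def share_an_edge(clique_a, clique_b):
    """Two cliques share an edge exactly when they share two or more nodes,
    because any two nodes of a clique are adjacent."""
    return len(set(clique_a) & set(clique_b)) >= 2


# Hypothetical toy network: triangle a-b-c and triangle c-d-e meeting at c.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("c", "d"), ("d", "e"), ("c", "e")]

print(is_clique(["a", "b", "c"], edges))                 # True
print(is_clique(["a", "b", "d"], edges))                 # False
print(share_an_edge(["a", "b", "c"], ["c", "d", "e"]))   # False
```

The two triangles share node c but no edge, so they count as edge-disjoint cliques in the sense used here.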

Our experience with protein interaction networks suggests that a near-clique, like a clique, often represents a biologically meaningful unit such as a functional module or protein complex. A near-clique is almost a clique but falls short of one because a few edges are missing. We consider near-cliques of the following basic types, which are biologically meaningful clusters (see Figure 1).

Figure 1: Near-cliques of types A, B, and C. Proteins outside a clique are represented as shaded nodes.

Type A
When a protein outside a clique interacts with two or more proteins in the clique, the protein and the clique form a near-clique.

Type B
When a clique shares a protein with other cliques, the cliques form a near-clique.

Type C
When two or more cliques interact with a common protein outside them and the protein has at least two interactions with each clique, the cliques and the protein form a near-clique.

The near-cliques of types A and C can be refined using the indegree and outdegree of a node (the near-clique of type B is unchanged). For a node 𝑥 in a subgraph 𝐺 of the whole network, indegree(𝑥,𝐺) is the number of edges connecting 𝑥 to other nodes in 𝐺, and outdegree(𝑥,𝐺) is the number of edges connecting 𝑥 to nodes of the network that are not in 𝐺. We use the definition of a community in a strong sense [7] to find more near-cliques in a graph.

Definition 1. A subgraph 𝐺 is a community in a strong sense if indegree(𝑥,𝐺)>outdegree(𝑥,𝐺) for every 𝑥 in 𝐺.
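Definition 1 can be checked directly on a small graph. The sketch below assumes an undirected network stored as a dictionary of neighbour sets with no self-loops; the helper names and the toy edges are hypothetical.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbours}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def indegree(x, group, adj):
    """Number of edges from x to other nodes inside the group."""
    return len(adj[x] & (set(group) - {x}))


def outdegree(x, group, adj):
    """Number of edges from x to nodes outside the group."""
    return len(adj[x] - set(group))


def is_strong_community(group, adj):
    """Definition 1: indegree > outdegree for every node of the group."""
    return all(indegree(x, group, adj) > outdegree(x, group, adj)
               for x in group)


# Hypothetical toy network: triangle a-b-c with one extra edge c-d.
adj = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
print(is_strong_community({"a", "b", "c"}, adj))   # True

# A second outside neighbour of c makes indegree equal to outdegree.
adj2 = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"),
                        ("c", "d"), ("c", "e")])
print(is_strong_community({"a", "b", "c"}, adj2))  # False
```

In the second network a single node (c) is enough to break the strict inequality for the whole group.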

The original definition of a strong community misses many near-cliques because of a single node that violates the condition. For example, in Figure 2(a), node 𝑥 cannot belong to a near-clique since indegree(𝑥,𝐺)=3<outdegree(𝑥,𝐺)=4. Likewise, node 𝑥 in Figure 2(b) cannot belong to a near-clique because indegree(𝑥,𝐺)<outdegree(𝑥,𝐺). We therefore remove nodes with only one edge, together with those edges, from the graph before searching for near-cliques. In the graph of Figure 2(a), nodes 𝑝, 𝑞, 𝑟, and 𝑠 and their edges are removed. After their removal, node 𝑥 and the existing clique form a near-clique of type A. A cluster that satisfies indegree(𝑥,𝐺) ≥ 0.5|𝐺| for every 𝑥 in 𝐺, where |𝐺| is the number of nodes in 𝐺, also forms a near-clique. The example shown in Figure 2(b) becomes a near-clique since it satisfies indegree(𝑥,𝐺) ≥ 0.5|𝐺|, even though indegree(𝑥,𝐺)<outdegree(𝑥,𝐺).
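The pruning step can be sketched as follows. The network is a hypothetical one built to reproduce the counts quoted for Figure 2(a) (indegree 3, outdegree 4); it is not the figure itself. Pruning is done in a single pass, which is one reading of the text.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbours}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def prune_degree_one(adj):
    """Remove nodes that have exactly one edge, together with that edge."""
    leaves = {n for n, nbrs in adj.items() if len(nbrs) == 1}
    return {n: nbrs - leaves for n, nbrs in adj.items() if n not in leaves}


# Hypothetical network: a 4-clique a,b,c,d; node x touches a, b, c and
# four degree-one nodes p, q, r, s.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
         ("c", "d"), ("x", "a"), ("x", "b"), ("x", "c"),
         ("x", "p"), ("x", "q"), ("x", "r"), ("x", "s")]
adj = build_adjacency(edges)
group = {"a", "b", "c", "d", "x"}

print(len(adj["x"] & group), len(adj["x"] - group))        # 3 4
pruned = prune_degree_one(adj)
print(len(pruned["x"] & group), len(pruned["x"] - group))  # 3 0
```

After pruning, x has no edges leaving the group, so it can join the clique as a type A near-clique.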

Figure 2: (a) After removing nodes 𝑝, 𝑞, 𝑟, and 𝑠 and their edges, node 𝑥 forms a near-clique of type A with the remaining nodes. (b) This graph becomes a near-clique 𝐺 of type C since indegree(𝑥,𝐺) ≥ 0.5|𝐺|. (c) A near-clique that exceeds the maximum size (e.g., a near-clique with more than 50 nodes) is split into smaller near-cliques (in this example, 3 small near-cliques).

Therefore, a near-clique 𝐺 of basic types A and C should satisfy at least one of the following conditions.

(1) indegree(𝑥,𝐺) ≥ outdegree(𝑥,𝐺) for every 𝑥 in 𝐺.
(2) indegree(𝑥,𝐺) ≥ 0.5|𝐺| for every 𝑥 in 𝐺.

After finding all edge-disjoint cliques first, we identify near-cliques as follows. A more detailed description of finding near-cliques is outlined in Algorithms 1 and 2. In the algorithms, cIdx represents the index of a clique.

Algorithm 1: AssignNearCliqueIdx.

Algorithm 2: ExtendNearClique.

(1) Assign every node of a clique the index of the clique containing the node.
(2) When a node of a clique already has an assigned clique index, assign that index to all nodes of the clique, and merge the two cliques into a near-clique of type B.
(3) When a node 𝑥 outside a clique forms a basic near-clique 𝐺 of type A through interactions with two or more proteins in the clique, and either indegree(𝑥,𝐺) ≥ outdegree(𝑥,𝐺) or indegree(𝑥,𝐺) ≥ 0.5|𝐺| is true, assign the index of the clique to the node.
(4) When two or more cliques form a near-clique 𝐺 through two or more interactions with a common protein outside the cliques, and either indegree(𝑥,𝐺) ≥ outdegree(𝑥,𝐺) or indegree(𝑥,𝐺) ≥ 0.5|𝐺| is true, merge the cliques and the protein into a near-clique of type C.

A near-clique is then formed by collecting the nodes that share the same clique index, considering only nodes with 𝑐𝐼𝑑𝑥>0.
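Steps (1) to (3) can be sketched in Python as below. This is not the published Algorithms 1 and 2; the function names and the toy network are hypothetical, the comparison operator in step (3) is assumed to be "greater than or equal to", the test is applied to the candidate node only, and step (4) is omitted for brevity.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbours}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def assign_indices(cliques):
    """Steps (1)-(2): index every clique node; cliques that share a node
    end up with one common index (near-clique of type B)."""
    c_idx = {}
    for idx, clique in enumerate(cliques, start=1):
        seen = {c_idx[n] for n in clique if n in c_idx}
        target = min(seen) if seen else idx
        for n in list(c_idx):          # relabel groups being merged
            if c_idx[n] in seen:
                c_idx[n] = target
        for n in clique:
            c_idx[n] = target
    return c_idx


def extend_type_a(c_idx, adj):
    """Step (3): an unindexed node with two or more neighbours in a group
    joins it if indegree >= outdegree or indegree >= 0.5 * |G| (assumed)."""
    groups = {}
    for n, i in c_idx.items():
        groups.setdefault(i, set()).add(n)
    result = dict(c_idx)
    for x in adj:
        if x in c_idx:
            continue
        for i, members in groups.items():
            inside = len(adj[x] & members)
            outside = len(adj[x]) - inside
            size = len(members) + 1    # |G| counts the group plus x
            if inside >= 2 and (inside >= outside or inside >= 0.5 * size):
                result[x] = i
                break
    return result


# Hypothetical network: triangles a-b-c and c-d-e share c; x touches a, b, p.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e"),
         ("c", "e"), ("x", "a"), ("x", "b"), ("x", "p")]
adj = build_adjacency(edges)
c_idx = extend_type_a(assign_indices([{"a", "b", "c"}, {"c", "d", "e"}]), adj)
print(sorted(n for n, i in c_idx.items() if i == 1))
```

Group membership is frozen before step (3) runs, so the outcome does not depend on the order in which outside nodes are visited.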

Since the most relevant processes in biological networks correspond to groups of proteins of moderate size [8], we limit near-cliques to a maximum size specified by the user. That is, when a near-clique bigger than the maximum size is found (e.g., a near-clique with more than 50 nodes), it is split into smaller near-cliques (3 near-cliques in Figure 2(c)). A big near-clique is split as follows. When our program finds a big near-clique with the minimum clique size set to 𝑘, we rerun the program on the big near-clique with the minimum clique size set to 𝑘+1 to find a new clique and the near-clique built on that clique. After removing the new near-clique from the original, big near-clique, we run the program again with the minimum clique size set to 𝑘. The big near-clique shown in Figure 2(c) is split into 3 small near-cliques with at least 4 proteins each.

3. Results and Comparison with Experimental Data

We tested the algorithms on the data with 8,397 interactions between 4,380 yeast proteins, which is the combined data of Ito et al. [9], Uetz et al. [10], and MIPS (http://mips.gsf.de) with redundant data removed. To every protein in the near-cliques, we assigned the functional categories of the Functional Catalog (FunCat) version 2.0 [11], which includes 97 functional categories. There are six levels of hierarchy in the FunCat structure.

In the data with 8,397 interactions between 4,380 yeast proteins, we found 100 near-cliques with the minimum size of a clique set to 3 and the maximum size of a near-clique set to 40. Only one near-clique contained more than 40 proteins, so it was split into 17 small near-cliques, resulting in a total of 116 near-cliques. Figure 3 shows an example of the network of yeast protein interactions with 6 near-cliques. Proteins in each near-clique share at least one function with other proteins within the near-clique.

Figure 3: Six near-cliques found in yeast protein interaction networks. Proteins in each near-clique share at least one function with other proteins within the near-clique.

As shown in Table 1, 68 (59%) out of the 116 near-cliques have at least one function shared by all the proteins in the near-cliques (100% sharing), and 39 near-cliques have a function shared by more than 50% of the proteins in the near-cliques. Supporting data are available at http://wilab.inha.ac.kr/ppi/homepage.mht. Only 9 near-cliques have no function shared by more than 50% of the proteins in the near-cliques. As shown in Figure 4, the functional coherence of each near-clique is high. The functional coherence was computed as the ratio of the number of proteins having a specific functional category to the group size (i.e., the number of proteins in the group).
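The coherence ratio described above is simple to compute. The sketch below assumes annotations are given as a mapping from protein to a set of FunCat codes; the protein names P1 to P4 and their annotations are invented for illustration and are not data from this study.

```python
from collections import Counter


def functional_coherence(annotations):
    """Map each functional category to the fraction of the group's
    proteins that carry it (count of annotated proteins / group size)."""
    size = len(annotations)
    counts = Counter(cat for cats in annotations.values() for cat in set(cats))
    return {cat: n / size for cat, n in counts.items()}


# Hypothetical group of four proteins with hypothetical annotations.
group = {
    "P1": {"32.01", "01.07.01"},
    "P2": {"32.01", "01.07.01"},
    "P3": {"32.01", "01.07.01"},
    "P4": {"32.01"},
}

coherence = functional_coherence(group)
shared_by_all = sorted(c for c, r in coherence.items() if r == 1.0)
shared_by_majority = sorted(c for c, r in coherence.items() if r > 0.5)

print(coherence["32.01"], coherence["01.07.01"])  # 1.0 0.75
print(shared_by_all)                              # ['32.01']
print(shared_by_majority)                         # ['01.07.01', '32.01']
```

A group counts toward the "100% sharing" column when at least one category reaches a ratio of 1.0.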

Table 1: Functional groups identified from the yeast protein interaction data. 68 groups have at least one function shared by all the proteins in the groups (100% sharing), and 39 groups have a function shared by more than 50% of the proteins in the groups. Only 9 groups have no function shared by >50% of the proteins in the group. This table shows only one function with the highest functional coherence in each group. All the functions shared by more than 50% of the proteins in each group are available at http://wilab.inha.ac.kr/ppi/homepage.mht.

Figure 4: The functional coherence in each of the 116 groups, computed as the ratio of the number of proteins having a specific functional category to the number of proteins in the group. The black, white, and grey bars represent functional categories with ratios ≥ 0.5; at most 3 such ratios are shown for each group.

Interestingly, most near-cliques found by our algorithm belong to multifunctional categories. For example, two functional categories are common to all the proteins in a near-clique of Figure 5. As shown in Table 2, the near-clique identified as group 93 by our program is involved in both stress response (functional category 32.01) and biosynthesis of vitamins, cofactors, and prosthetic groups (functional category 01.07.01).

Table 2: Functional annotation of group 93 shown in Figure 5. The code represents functional category.

Figure 5: Group 93 identified as a near-clique by our algorithm.

Near-cliques may correspond to protein complexes in addition to functional modules. So, we compared the near-cliques identified by our algorithms with known yeast protein complexes, which are cataloged in the MIPS Saccharomyces cerevisiae genome database (http://mips.gsf.de/genre/proj/yeast). For each near-clique, we found a best-matching protein complex by minimizing the probability of a random overlap between the two, using the following equation [4, 5]:

$$P_{\mathrm{overlap}} = \frac{\binom{n_2}{k}\binom{N-n_2}{n_1-k}}{\binom{N}{n_1}}, \qquad (1)$$

where $n_1$ and $n_2$ are the sizes of a known protein complex and a computed module, $k$ is the number of their common proteins, and $N$ is the size of the network.
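Equation (1) has the form of a hypergeometric probability and is easiest to evaluate in log space, since the binomial coefficients become very large for a network of several thousand proteins. The sketch below assumes the reading of Equation (1) as C(n2, k) C(N - n2, n1 - k) / C(N, n1); the function names are hypothetical, and the example sizes are illustrative rather than taken from Table 3.

```python
from math import exp, lgamma


def log_comb(n, r):
    """Natural log of the binomial coefficient C(n, r), for 0 <= r <= n."""
    return lgamma(n + 1) - lgamma(r + 1) - lgamma(n - r + 1)


def ln_p_overlap(n1, n2, k, N):
    """ln of Equation (1): n1 = size of the known complex, n2 = size of the
    computed module, k = shared proteins, N = number of proteins in the
    network. Assumes k <= min(n1, n2) and n1 - k <= N - n2."""
    return log_comb(n2, k) + log_comb(N - n2, n1 - k) - log_comb(N, n1)


# Small case that can be checked by hand: C(3,2) * C(7,1) / C(10,3) = 21/120.
print(round(exp(ln_p_overlap(3, 3, 2, 10)), 3))   # 0.175

# Illustrative case: a 5-protein module identical to a 5-protein complex
# in a network of 4,380 proteins gives a strongly negative log-probability.
print(ln_p_overlap(5, 5, 5, 4380) < 0)            # True
```

The more negative the logarithm, the less likely the overlap arose by chance, which is why the best-matching complex is the one that minimizes it.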

As shown in Table 3, 65 near-cliques (56% of the total 116 near-cliques) identified by our algorithm show a good agreement (ln(𝑃overlap) < −14) with the protein complexes cataloged in MIPS.

Table 3: The near-cliques matched with experimentally determined protein complexes cataloged in MIPS. The overlap column represents the number of proteins common to the near-cliques and the protein complexes.

Table 4: Comparison of our method and CFinder in terms of the number of functional categories shared by all the proteins in the groups.

To compare the functional coherence of the groups found by our program with that of cliques found by CFinder, we tested both programs on the same dataset. 75.9% of the groups identified by our program have at least two functional categories shared by all the proteins in the groups, whereas 63.1% of the groups identified by CFinder have at least two functional categories shared by all the proteins in the groups (Table 4). This result indicates that our program finds groups with stronger functional coherence than CFinder.

Table 5 shows the actual running times of our program and CFinder on three datasets of yeast protein interactions. Our program is faster than CFinder on all datasets, and the difference in speed becomes more obvious as the dataset becomes bigger.

Table 5: Running times of the programs on 3 data sets of yeast protein interactions on a Pentium IV 3.0 GHz processor with 512 MB memory.

4. Conclusion

Identifying hidden topological structures of protein interaction networks often unveils biologically relevant functional groups and structural complexes. We developed an efficient heuristic algorithm for finding cliques and near-cliques in protein interaction networks. From the interaction data of yeast proteins, the algorithm identified 116 near-cliques. Comparison with the experimental data showed that 59% of the near-cliques have at least one function shared by all the proteins within a near-clique, and that 56% of the near-cliques show a good agreement with known protein complexes, which are cataloged in the MIPS Saccharomyces cerevisiae genome database.

Acknowledgments

This work was supported by the Korea Science and Engineering Foundation (KOSEF) under Grant no. F01-2007-000-10140-0 and Grant no. F01-2007-000-10140-0 through the Systems Bio-Dynamics Research Center.

References

  1. D. Bu, Y. Zhao, L. Cai, et al., “Topological structure analysis of the protein-protein interaction network in budding yeast,” Nucleic Acids Research, vol. 31, no. 9, pp. 2443–2450, 2003.
  2. R. Battiti and M. Protasi, “Reactive local search for the maximum clique problem,” Algorithmica, vol. 29, no. 4, pp. 610–637, 2001.
  3. K. Katayama, A. Hamamoto, and H. Narihisa, “An effective local search for the maximum clique problem,” Information Processing Letters, vol. 95, no. 5, pp. 503–511, 2005.
  4. B. Adamcsek, G. Palla, I. Farkas, I. Derényi, and T. Vicsek, “CFinder: locating cliques and overlapping modules in biological networks,” Bioinformatics, vol. 22, no. 8, pp. 1021–1023, 2006.
  5. G. Palla, I. Derényi, I. Farkas, and T. Vicsek, “Uncovering the overlapping community structure of complex networks in nature and society,” Nature, vol. 435, no. 7043, pp. 814–818, 2005.
  6. B.-H. Ju and K. Han, “Complexity management in visualizing protein interaction networks,” Bioinformatics, vol. 19, supplement 1, pp. i177–i179, 2003.
  7. F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi, “Defining and identifying communities in networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 9, pp. 2658–2663, 2004.
  8. V. Spirin and L. A. Mirny, “Protein complexes and functional modules in molecular networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 21, pp. 12123–12126, 2003.
  9. T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki, “A comprehensive two-hybrid analysis to explore the yeast protein interactome,” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 8, pp. 4569–4574, 2001.
  10. P. Uetz, L. Giot, G. Cagney, et al., “A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae,” Nature, vol. 403, no. 6770, pp. 623–627, 2000.
  11. A. Ruepp, A. Zollner, D. Maier, et al., “The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes,” Nucleic Acids Research, vol. 32, no. 18, pp. 5539–5545, 2004.