Abstract
Biological processes are often performed by a group of proteins rather than by individual proteins, and proteins in the same biological group form a densely connected subgraph in a protein-protein interaction network. Therefore, finding a densely connected subgraph provides useful information for predicting the function or protein complex of uncharacterized proteins in that subgraph. We have developed an efficient algorithm and program for finding cliques and near-cliques in a protein-protein interaction network. Analysis of the interaction network of yeast proteins using the algorithm demonstrates that 59% of the near-cliques identified by our algorithm have at least one function shared by all the proteins within a near-clique, and that 56% of the near-cliques show good agreement with the experimentally determined protein complexes catalogued in MIPS.
1. Introduction
Proteins in a highly connected subgraph of a protein interaction network usually share a common function [1]. Therefore, a highly connected subgraph such as a clique or near-clique in a protein interaction network can be used to predict the function of uncharacterized proteins in that subgraph. Finding a maximum clique in a graph is an NP-hard problem [2]. There are several heuristic algorithms for the maximum clique problem [2, 3], but most of them focus on finding a complete subgraph (i.e., a clique) and cannot be used to find near-cliques.
Several topological analysis methods have been developed for identifying biologically meaningful groups from protein interaction networks or for assessing the reliability of protein interactions. A recent program called CFinder [4, 5] finds overlapping cliques in protein interaction networks. It allows a protein to belong to more than one clique, but cannot find near-cliques. Our study shows that the near-cliques can reveal higher functional coherence than the overlapping cliques.
The primary focus of this study is to find functional groups by identifying cliques and near-cliques in protein interaction networks. This study attempts to answer two questions: "Can we efficiently find all cliques and near-cliques?" and "Does a dense subgraph such as a clique or near-clique indeed represent a functional module or protein complex?" This study demonstrates that the answer to both questions is "yes." This paper presents an algorithm for finding near-cliques and its application to the interaction network of yeast proteins.
2. Algorithms for Finding Near-Cliques
A clique is a complete graph in which every node is connected to every other node in the graph. In our previous work, we developed a heuristic algorithm and implemented the algorithm in a program called InterViewer [6], which identifies all edge-disjoint cliques (i.e., cliques that do not share an edge).
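As a minimal illustration of the clique definition above, the following Python sketch lists the edges a node set is missing to be complete. The adjacency-dictionary representation and the names `missing_edges` and `is_clique` are assumptions made for this example; it is not the heuristic used in InterViewer.

```python
from itertools import combinations

def missing_edges(adj, nodes):
    """Return the node pairs in `nodes` that are not connected by an edge."""
    return [(u, v) for u, v in combinations(sorted(nodes), 2) if v not in adj[u]]

def is_clique(adj, nodes):
    """A node set is a clique when no pair of its nodes lacks an edge."""
    return not missing_edges(adj, nodes)

# Hypothetical toy graph: triangle a-b-c plus node d attached to c only.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

print(is_clique(adj, {"a", "b", "c"}))           # True
print(missing_edges(adj, {"a", "b", "c", "d"}))  # [('a', 'd'), ('b', 'd')]
```

A set with only a few missing pairs, such as the four-node set here, is the kind of subgraph the following discussion calls a near-clique.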
Our experience with protein interaction networks suggests that a near-clique, as well as a clique, often represents a biologically meaningful unit such as a functional module or protein complex. A near-clique is almost a clique but falls short of one because a few edges are missing. We consider near-cliques of the following basic types, which are biologically meaningful clusters (see Figure 1).
[Figure 1: panels (a), (b), and (c)]
Type A
When a protein outside a clique interacts with two or more proteins in the clique, the protein and the clique form a near-clique.
Type B
When a clique shares a protein with other cliques, the cliques form a near-clique.
Type C
When two or more cliques interact with a common protein outside them and the protein has at least two interactions with each clique, the cliques and the protein form a near-clique.
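The type A rule can be sketched in a few lines of Python. This is an illustrative sketch only: the adjacency dictionary, the function name `type_a_candidates`, and the parameter `min_links` are hypothetical and do not come from the authors' program.

```python
def type_a_candidates(adj, clique, min_links=2):
    """Return outside nodes that have at least `min_links` neighbours in `clique`.

    The result maps each candidate node to its number of links into the clique.
    """
    clique = set(clique)
    candidates = {}
    for node, neighbours in adj.items():
        if node in clique:
            continue
        links = len(neighbours & clique)
        if links >= min_links:
            candidates[node] = links
    return candidates

# Hypothetical toy graph: clique {a, b, c}; d touches two clique members, e only one.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "e"},
    "d": {"a", "b"},
    "e": {"c"},
}
clique = {"a", "b", "c"}

print(type_a_candidates(adj, clique))  # {'d': 2}
```

The same link count, applied to each of several cliques around one outside protein, gives the test for type C.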
The near-cliques of types A and C can be refined using the indegree and outdegree of a node (there is no change to the near-clique of type B). For a node $i$ in subgraph $V \subset G$, $k_i^{in}(V)$ is the number of edges connecting node $i$ to other nodes in $V$, and $k_i^{out}(V)$ is the number of edges connecting node $i$ to nodes that are in $G$ but not in $V$. We use the definition of a community in a strong sense [7] to find more near-cliques in a graph.
Definition 1. A subgraph $V$ is a community in a strong sense if $k_i^{in}(V) > k_i^{out}(V)$ for every $i$ in $V$.
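The strong-community test of Definition 1 can be checked directly from an adjacency structure. The sketch below assumes an undirected graph without self-loops stored as a dictionary of neighbour sets; the function names are hypothetical.

```python
def degree_split(adj, subgraph, node):
    """Return (k_in, k_out): edges from `node` into and out of `subgraph`."""
    k_in = len(adj[node] & subgraph)
    k_out = len(adj[node] - subgraph)
    return k_in, k_out

def is_strong_community(adj, subgraph):
    """True when every node of `subgraph` has more inside edges than outside edges."""
    subgraph = set(subgraph)
    return all(
        k_in > k_out
        for k_in, k_out in (degree_split(adj, subgraph, n) for n in subgraph)
    )

# Hypothetical toy graph: triangle a-b-c, with e hanging off c and two leaves on e.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "e"},
    "e": {"c", "f", "g"},
    "f": {"e"},
    "g": {"e"},
}

print(is_strong_community(adj, {"a", "b", "c"}))       # True
print(is_strong_community(adj, {"a", "b", "c", "e"}))  # False: e has 1 edge inside, 2 outside
```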
The original definition of a strong community misses many near-cliques because of a single node in the communities. For example, a node in Figure 2(a) that fails the condition $k_i^{in}(V) > k_i^{out}(V)$ cannot belong to a near-clique, and the same holds for a node in Figure 2(b). Thus, nodes with only one edge connected to them, together with their edges, are removed from the graph when we search for near-cliques. In the graph of Figure 2(a), the nodes with a single edge and their edges are removed. After removing them, the node in question and the existing clique form a near-clique of type A. A cluster $V$ also forms a near-clique if a second, size-based condition holds for every node $i$ in $V$; this condition depends on the number of nodes in $V$. The example shown in Figure 2(b) becomes a near-clique since it satisfies this second condition even though it does not satisfy $k_i^{in}(V) > k_i^{out}(V)$.
[Figure 2: panels (a), (b), and (c)]
Therefore, a near-clique of basic types A and C should satisfy at least one of the following conditions.
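(1) $k_i^{in}(V) > k_i^{out}(V)$ for every $i$ in $V$.
(2) The size-based condition on $V$ introduced above, which the example in Figure 2(b) satisfies.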
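The effect of removing nodes with a single edge can be seen on a small example. The sketch below is an assumption-laden illustration: it prunes in a single pass (the text does not say whether pruning is repeated), uses a hypothetical toy graph rather than the graph of Figure 2(a), and uses made-up function names.

```python
def prune_degree_one(adj):
    """Remove, in one pass, every node that has exactly one edge, along with that edge."""
    leaves = {n for n, neighbours in adj.items() if len(neighbours) == 1}
    return {n: neighbours - leaves for n, neighbours in adj.items() if n not in leaves}

def degree_split(adj, subgraph, node):
    """Return (k_in, k_out) for `node` with respect to `subgraph`."""
    return len(adj[node] & subgraph), len(adj[node] - subgraph)

# Hypothetical toy graph: clique {a, b, c}; d links to a and b and carries three leaves.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "x", "y", "z"},
    "x": {"d"},
    "y": {"d"},
    "z": {"d"},
}
group = {"a", "b", "c", "d"}

print(degree_split(adj, group, "d"))     # (2, 3): d fails k_in > k_out
pruned = prune_degree_one(adj)
print(degree_split(pruned, group, "d"))  # (2, 0): d now passes
```

After pruning, node d meets the strong-community inequality, so it can join the clique as a type A near-clique.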
After finding all edge-disjoint cliques, we identify near-cliques as follows. A more detailed description of finding near-cliques is outlined in Algorithms 1 and 2. In the algorithms, cIdx represents the index of a clique.
(1) Assign every node of a clique the index of the clique containing the node.
(2) When a node of a clique already has an assigned clique index, assign that index to all nodes of the clique, and merge the two cliques into a near-clique of type B.
(3) When a node outside a clique forms a basic near-clique of type A through interactions with two or more proteins in the clique, and either condition (1) or condition (2) is true, assign the index of the clique to the node.
(4) When two or more cliques form a near-clique through two or more interactions each with a common protein outside the cliques, and either condition (1) or condition (2) is true, merge the cliques and the protein into a near-clique of type C.
A near-clique is then formed by selecting the nodes that share the same clique index (cIdx).
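Steps (1) and (2) amount to labeling nodes with a clique index and relabeling when two cliques share a node. The following Python sketch illustrates only these two steps under stated assumptions: the input is a list of already-found edge-disjoint cliques, the dictionary `c_idx` stands in for cIdx, and the function name is hypothetical. It is not a transcription of Algorithms 1 and 2.

```python
def merge_cliques_sharing_nodes(cliques):
    """Group cliques that share a node (type B) under one clique index."""
    c_idx = {}  # node -> clique index
    for idx, clique in enumerate(cliques, start=1):
        # Indices already carried by nodes of this clique, if any.
        existing = {c_idx[n] for n in clique if n in c_idx}
        target = min(existing) if existing else idx
        # Relabel every node of the groups that this clique joins together.
        for node, current in list(c_idx.items()):
            if current in existing:
                c_idx[node] = target
        for node in clique:
            c_idx[node] = target
    groups = {}
    for node, index in c_idx.items():
        groups.setdefault(index, set()).add(node)
    return groups

# Hypothetical cliques: the first two share node c, the third is separate.
cliques = [{"a", "b", "c"}, {"c", "d", "e"}, {"f", "g", "h"}]
groups = merge_cliques_sharing_nodes(cliques)
print(groups)
```

Here the first two cliques end up under index 1 as a type B near-clique and the third keeps index 3. Steps (3) and (4) would extend the same labeling to outside nodes once the degree conditions are checked.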
Since the most relevant processes form groups of proteins of moderate size in biological networks [8], we obtain near-cliques smaller than the maximum size specified by a user. That is, when a near-clique bigger than the maximum size is found (e.g., a near-clique with more than 50 nodes), it is split into smaller near-cliques (3 near-cliques in Figure 2(c)). We split a big near-clique as follows. When our program finds a big near-clique under a given minimum clique size, we rerun the program on the big near-clique with an adjusted minimum clique size to find a new clique and the near-clique built around it. After removing the new near-clique from the original big near-clique, we run the program again on the remainder. The big near-clique shown in Figure 2(c) is split into 3 small near-cliques with at least 4 proteins each.
3. Results and Comparison with Experimental Data
We tested the algorithms on the data with 8,397 interactions between 4,380 yeast proteins, which is the combined data of Ito et al. [9], Uetz et al. [10], and MIPS (http://mips.gsf.de) with redundant data removed. To every protein in the near-cliques, we assigned the functional categories of the Functional Catalog (FunCat) version 2.0 [11], which includes 97 functional categories. There are six levels of hierarchy in the FunCat structure.
In the data with 8,397 interactions between 4,380 yeast proteins, we found 100 near-cliques with the minimum size of a clique set to 3 and the maximum size of a near-clique set to 40. Only one near-clique contained more than 40 proteins, so it was split into 17 small near-cliques, giving a total of 116 near-cliques. Figure 3 shows an example of the network of yeast protein interactions with 6 near-cliques. Proteins in each near-clique share at least one function with other proteins within the near-clique.
As shown in Table 1, 68 (59%) of the 116 near-cliques have at least one function shared by all the proteins in the near-clique (100% sharing), and 39 near-cliques have a function shared by more than 50% of the proteins in the near-clique. Supporting data are available at http://wilab.inha.ac.kr/ppi/homepage.mht. Only 9 near-cliques have no function shared by 50% of the proteins in the near-clique. As shown in Figure 4, the functional coherence of each near-clique is high. The functional coherence was computed as the ratio of the number of proteins having a specific functional category to the group size (i.e., the number of proteins in the group).
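The functional coherence ratio described above is straightforward to compute once each protein has a set of functional categories. The sketch below is illustrative: the protein names and their category assignments are invented for the example, and only the two FunCat codes are taken from the text.

```python
from collections import Counter

def functional_coherence(annotations):
    """Map each category to the fraction of proteins in the group that carry it."""
    size = len(annotations)
    counts = Counter(cat for cats in annotations.values() for cat in set(cats))
    return {cat: count / size for cat, count in counts.items()}

# Hypothetical group of four proteins with made-up category assignments.
group = {
    "P1": {"32.01", "01.07.01"},
    "P2": {"32.01", "01.07.01"},
    "P3": {"32.01"},
    "P4": {"32.01", "01.07.01"},
}

coherence = functional_coherence(group)
shared_by_all = sorted(cat for cat, ratio in coherence.items() if ratio == 1.0)
print(coherence["32.01"], coherence["01.07.01"])  # 1.0 0.75
print(shared_by_all)                              # ['32.01']
```

A group counts toward the 100% sharing column when at least one category has ratio 1.0, and toward the next column when its best ratio exceeds 0.5.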
Interestingly, most near-cliques found by our algorithm belong to multifunctional categories. For example, two functional categories are common to all the proteins in a near-clique of Figure 5. As shown in Table 2, the near-clique identified as group 93 by our program is involved in both stress response (functional category 32.01) and biosynthesis of vitamins, cofactors, and prosthetic groups (functional category 01.07.01).
Near-cliques may correspond to protein complexes in addition to functional modules. We therefore compared the near-cliques identified by our algorithms with known yeast protein complexes, which are cataloged in the MIPS Saccharomyces cerevisiae genome database (http://mips.gsf.de/genre/proj/yeast). For each near-clique, we found a best-matching protein complex by minimizing the probability of a random overlap between the two, using the following equation [4, 5]:
$$P = \frac{\binom{n_1}{k}\binom{N-n_1}{n_2-k}}{\binom{N}{n_2}},$$
where $n_1$ and $n_2$ are the sizes of a known protein complex and a computed module, $k$ is the number of their common proteins, and $N$ is the size of the network.
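A random-overlap probability of this kind is commonly written in hypergeometric form, and the Python sketch below assumes that form. The symbol names `n1`, `n2`, `k`, and `N` follow the description above (complex size, module size, common proteins, network size) and, like the function name, are assumptions of this example.

```python
from math import comb

def random_overlap_probability(n1, n2, k, N):
    """Probability that random sets of sizes n1 and n2 drawn from N items share exactly k."""
    if k < 0 or k > min(n1, n2) or n2 - k > N - n1:
        return 0.0
    return comb(n1, k) * comb(N - n1, n2 - k) / comb(N, n2)

# Two groups of 3 proteins that coincide completely in a network of 10 proteins.
p = random_overlap_probability(3, 3, 3, 10)
print(p == 1 / 120)  # True

# The probabilities over all possible overlaps sum to 1.
total = sum(random_overlap_probability(5, 4, k, 20) for k in range(0, 5))
print(abs(total - 1.0) < 1e-12)  # True
```

A smaller value means the observed overlap is less likely to arise by chance, which is why the best-matching complex is the one that minimizes it.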
As shown in Table 3, 65 near-cliques (56% of the 116 near-cliques) identified by our algorithm show good agreement with the protein complexes cataloged in MIPS, as measured by the random-overlap probability.
To compare the functional coherence of the groups found by our program with that of cliques found by CFinder, we tested both programs on the same dataset. Of the groups identified by our program, 75.9% have at least two functional categories shared by all the proteins in the group, whereas the corresponding figure for the groups identified by CFinder is 63.1% (Table 4). This result indicates that our program finds groups with stronger functional coherence than those found by CFinder.
Table 5 shows the actual running times of our program and CFinder on three datasets of yeast protein interactions. Our program is faster than CFinder on all datasets, and the difference in speed becomes more obvious as the dataset becomes bigger.
4. Conclusion
Identifying hidden topological structures of protein interaction networks often unveils biologically relevant functional groups and structural complexes. We developed an efficient heuristic algorithm for finding cliques and near-cliques in protein interaction networks. From the interaction data of yeast proteins, the algorithm identified 116 near-cliques. Comparison with the experimental data showed that 59% of the near-cliques have at least one function shared by all the proteins within a near-clique, and that 56% of the near-cliques show good agreement with known protein complexes, which are cataloged in the MIPS Saccharomyces cerevisiae genome database.
Acknowledgments
This work was supported by the Korea Science and Engineering Foundation (KOSEF) under Grant no. F01-2007-000-10140-0 and Grant no. F01-2007-000-10140-0 through the Systems Bio-Dynamics Research Center.