Biological processes are often performed by a group of proteins rather than by individual proteins, and proteins in the same biological group form a densely connected subgraph in a protein-protein interaction network. Therefore, finding a densely connected subgraph provides useful information for predicting the function or protein complex of uncharacterized proteins in that subgraph. We have developed an efficient algorithm and program for finding cliques and near-cliques in a protein-protein interaction network. Analysis of the interaction network of yeast proteins using the algorithm demonstrates that 59% of the near-cliques identified by our algorithm have at least one function shared by all the proteins within a near-clique, and that 56% of the near-cliques show a good agreement with the experimentally determined protein complexes cataloged in MIPS.
1. Introduction
Proteins in a highly connected subgraph of a protein interaction network usually share a
common function [1]. Therefore, a highly connected subgraph such as a clique or
near-clique in a protein interaction network can be used to predict the
function of uncharacterized proteins in that subgraph. Finding
a clique of maximum size in a graph is an NP-hard problem [2]. There are several
heuristic algorithms for the maximum clique problem [2, 3], but most of them focus
on finding a complete subgraph (i.e., a clique) and cannot be used to find near-cliques.
Several topological analysis methods have been
developed for identifying biologically meaningful groups from protein
interaction networks or for assessing the reliability of protein interactions.
A recent program called CFinder [4, 5] finds overlapping cliques in protein
interaction networks. It allows a protein to belong to more than one clique, but cannot find near-cliques. Our study shows that the near-cliques can reveal
higher functional coherence than the overlapping cliques.
The primary focus of this study is to find functional
groups by identifying cliques and near-cliques in protein interaction networks.
This study attempts to answer two questions: “Can we
efficiently find all cliques and near-cliques?” and “Does a dense
subgraph such as a clique or near-clique indeed represent a functional module or
protein complex?” This study demonstrates that the answer to both
questions is “yes.” This paper presents an algorithm for finding
near-cliques and its application to the interaction network of yeast proteins.
2. Algorithms for Finding Near-Cliques
A clique is a complete graph in which every
node is connected to every other node in the graph. In our previous work, we
developed a heuristic algorithm and implemented the algorithm in a program
called InterViewer [6], which identifies all edge-disjoint cliques (i.e.,
cliques that do not share an edge).
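As a minimal illustration of these two notions, the sketch below checks whether a set of nodes is a clique and whether two cliques are edge-disjoint. It is not the InterViewer heuristic; the adjacency-dict representation, the function names `is_clique` and `edge_disjoint`, and the toy graph are assumptions made for this example.

```python
from itertools import combinations

def is_clique(adj, nodes):
    """True if every pair of distinct nodes in `nodes` is joined by an edge."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))

def edge_disjoint(clique1, clique2):
    """True if two cliques share no edge.

    Any two nodes of a clique are adjacent, so two cliques share an edge
    exactly when they have at least two nodes in common.
    """
    return len(set(clique1) & set(clique2)) < 2

# Hypothetical toy network: undirected graph as a dict of neighbor sets.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

print(is_clique(adj, ["a", "b", "c"]))   # a, b, c are pairwise connected
print(is_clique(adj, ["a", "b", "d"]))   # b and d do not interact
```

Two edge-disjoint cliques may still share a single node, which is the situation that the near-clique of type B builds on.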
Our experience with protein interaction networks
suggests that a near-clique, like a clique, often represents a biologically
meaningful unit such as a functional module or protein complex. A near-clique is
almost a clique but falls short of one because of a few missing edges. We consider
near-cliques of the following basic types, which are biologically meaningful clusters (see Figure 1).
Figure 1: Near-cliques of types A, B, and C. Proteins outside a
clique are represented as shaded nodes.
Type A
When a
protein outside a clique interacts with two or more proteins in the clique, the
protein and the clique form a near-clique.
Type B
When a
clique shares a protein with other cliques, the cliques form a near-clique.
Type C
When two or
more cliques interact with a common protein outside them and the protein has at
least two interactions with each clique, the cliques and the protein form a
near-clique.
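The three basic types can be expressed as simple membership tests. The sketch below is only an illustration of the definitions above, not the authors' program; the helper names (`build_adjacency`, `links_into`, `is_type_a`, `is_type_b`, `is_type_c`) and the toy network are hypothetical.

```python
def build_adjacency(edges):
    """Build an undirected graph as a dict of neighbor sets."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def links_into(adj, node, clique):
    """Number of interactions between `node` and members of `clique`."""
    return len(adj[node] & set(clique))

def is_type_a(adj, node, clique):
    """Type A: an outside protein with two or more interactions into the clique."""
    return node not in clique and links_into(adj, node, clique) >= 2

def is_type_b(clique1, clique2):
    """Type B: two cliques that share a protein."""
    return bool(set(clique1) & set(clique2))

def is_type_c(adj, node, cliques):
    """Type C: an outside protein with at least two interactions into each of two or more cliques."""
    return len(cliques) >= 2 and all(
        node not in c and links_into(adj, node, c) >= 2 for c in cliques
    )

# Hypothetical toy network: two triangles plus outside proteins x and y.
adj = build_adjacency([
    ("a", "b"), ("a", "c"), ("b", "c"),              # clique 1
    ("d", "e"), ("d", "f"), ("e", "f"),              # clique 2
    ("x", "a"), ("x", "b"), ("x", "d"), ("x", "e"),  # x touches both cliques twice
    ("y", "a"),                                      # y touches clique 1 only once
])
clique1, clique2 = {"a", "b", "c"}, {"d", "e", "f"}

print(is_type_a(adj, "x", clique1))             # x has two partners in clique 1
print(is_type_a(adj, "y", clique1))             # y has only one partner
print(is_type_c(adj, "x", [clique1, clique2]))  # x bridges both cliques
```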
The near-cliques of types A and C can be refined using
the indegree and outdegree of a node (there is no change to the near-clique of
type B). For a node $v$ in a subgraph $S$ of a graph $G$, the indegree $d_{in}(v)$ is the number
of edges connecting node $v$ to other nodes
in $S$, and the outdegree $d_{out}(v)$ is the number
of edges connecting node $v$ to nodes
that are in $G$ but not in $S$. We use the definition of a community in a strong
sense [7] to find more near-cliques in a graph.
Definition 1. A subgraph $S$ is a community
in a strong sense if $d_{in}(v) > d_{out}(v)$ for every $v$ in $S$.
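The check in Definition 1 can be sketched as follows, assuming the usual reading of a strong community [7]: every node has more edges to nodes inside the subgraph than to nodes outside it. The names `in_out_degree` and `is_strong_community`, the adjacency-dict representation, and the toy graph are illustrative assumptions.

```python
def in_out_degree(adj, subgraph, v):
    """Return (indegree, outdegree) of node v with respect to `subgraph`.

    indegree  = edges from v to other nodes inside the subgraph
    outdegree = edges from v to nodes of the graph outside the subgraph
    Assumes an undirected graph without self-loops.
    """
    members = set(subgraph)
    d_in = len(adj[v] & members)
    d_out = len(adj[v] - members)
    return d_in, d_out

def is_strong_community(adj, subgraph):
    """True if indegree > outdegree for every node of the subgraph."""
    return all(
        d_in > d_out
        for d_in, d_out in (in_out_degree(adj, subgraph, v) for v in subgraph)
    )

# Hypothetical toy graph: triangle a-b-c, with a linked to outside nodes x and y.
adj = {
    "a": {"b", "c", "x", "y"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "x": {"a"},
    "y": {"a"},
}

print(in_out_degree(adj, {"a", "b", "c"}, "a"))   # two edges inside, two outside
print(is_strong_community(adj, {"a", "b", "c"}))  # fails because of node a
```

In this toy graph a single node with as many outside edges as inside edges is enough to make the whole triangle fail the test, which is the kind of case the refinement in the text is meant to handle.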
The original definition of a strong community misses
many near-cliques because a single node in a candidate community can violate the condition. In each of the
examples of Figures 2(a) and 2(b), one node cannot belong
to a near-clique because it fails the condition of Definition 1. Thus, nodes with only one edge connected to them, together with
those edges, are removed from the graph when we search for near-cliques in the
graph. In the graph of Figure 2(a), four such nodes and their edges
are removed. After removing them, the remaining outside node and the
existing clique form a near-clique of type A. A cluster can also form a
near-clique by satisfying a second condition, which is stated for every node in the cluster in terms of the number
of nodes in the cluster.
The example shown in Figure 2(b) becomes a near-clique since
it satisfies this second condition even though it does
not satisfy the condition of Definition 1.
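The pruning step described above can be sketched as follows. The graph is loosely modeled on the situation described for Figure 2(a), with hypothetical node labels. The sketch removes degree-one nodes in a single pass (whether the authors' program repeats the pruning is not stated, so the single pass is an assumption), and the strong-community test uses the usual reading of [7], namely more inside edges than outside edges for every node.

```python
def remove_degree_one_nodes(adj):
    """Return a copy of the graph without nodes that have exactly one edge."""
    leaves = {v for v, nbrs in adj.items() if len(nbrs) == 1}
    return {v: nbrs - leaves for v, nbrs in adj.items() if v not in leaves}

def is_strong_community(adj, subgraph):
    """True if every node has more neighbors inside the subgraph than outside."""
    members = set(subgraph)
    return all(len(adj[v] & members) > len(adj[v] - members) for v in members)

# Hypothetical graph: clique {a, b, c}; outside node x interacts with a and b
# and also with four proteins (p, q, r, s) that have no other partner.
adj = {
    "a": {"b", "c", "x"},
    "b": {"a", "c", "x"},
    "c": {"a", "b"},
    "x": {"a", "b", "p", "q", "r", "s"},
    "p": {"x"}, "q": {"x"}, "r": {"x"}, "s": {"x"},
}
candidate = {"a", "b", "c", "x"}

print(is_strong_community(adj, candidate))     # x has 2 inside and 4 outside edges
pruned = remove_degree_one_nodes(adj)
print(is_strong_community(pruned, candidate))  # after pruning, x has no outside edge
```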
Figure 2: (a) After removing the four nodes that have a single edge, together with their
edges, the remaining outside node forms a near-clique of type A with the remaining nodes. (b) This
graph becomes a near-clique of type C since it satisfies the second condition, which is based on the number of nodes in the cluster. (c) A near-clique that is too big (e.g., a near-clique
with more than 50 nodes) is split into smaller near-cliques (in this example, 3
small near-cliques).
Therefore, a near-clique of basic type
A or C should satisfy at least one of
two conditions: the condition of Definition 1, or the second condition based on the number of nodes in the cluster.
After finding all edge-disjoint cliques, we
identify near-cliques as follows. The procedure is outlined in more detail
in Algorithms 1 and 2. In the algorithms, cIdx
represents the index of a clique.
Algorithm 1: AssignNearCliqueIdx.
Algorithm 2: ExtendNearClique.
(1) Assign every node of a clique the index of the clique containing the node.
(2) When a node of a clique already has an assigned clique index, assign that index to all nodes of
the clique, and merge the two cliques into a near-clique of type B.
(3) When a node outside a clique forms a basic near-clique of type A through
interactions with two or more proteins in the clique, and at least one of the two conditions above is true, assign
the index of the clique to the node.
(4) When two or more cliques form a near-clique through two or
more interactions each with a common protein outside the cliques, and at least one of the two conditions above is true, merge
the cliques and the protein into a near-clique of type C.
A near-clique is then formed by collecting the nodes that carry the same clique index (cIdx).
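Steps (1) and (2) and the final grouping by clique index can be sketched as below. This is an illustrative reading of the steps, not Algorithm 1 itself: the function names `assign_clique_indices` and `group_by_index` are hypothetical, the handling of a clique that touches several already-labeled cliques (all of them are relabeled to the smallest index) is an assumption, and steps (3) and (4) are omitted because they depend on the two degree conditions.

```python
def assign_clique_indices(cliques):
    """Label every node with a clique index, merging cliques that share a node."""
    c_idx = {}
    for idx, clique in enumerate(cliques):
        # Indices already carried by nodes of this clique (type B situation).
        shared = {c_idx[v] for v in clique if v in c_idx}
        target = min(shared) if shared else idx
        # Relabel previously merged cliques so they all carry the same index.
        for v in list(c_idx):
            if c_idx[v] in shared:
                c_idx[v] = target
        # Step (1): give every node of the clique its (possibly merged) index.
        for v in clique:
            c_idx[v] = target
    return c_idx

def group_by_index(c_idx):
    """Collect the nodes that carry the same clique index."""
    groups = {}
    for v, idx in c_idx.items():
        groups.setdefault(idx, set()).add(v)
    return groups

# Hypothetical edge-disjoint cliques: the first two share protein c.
cliques = [["a", "b", "c"], ["c", "d", "e"], ["x", "y", "z"]]
groups = group_by_index(assign_clique_indices(cliques))
for idx, members in sorted(groups.items()):
    print(idx, sorted(members))
```

Keeping one index per node makes the merge of type B a relabeling operation, so the near-cliques fall out of a single pass over the index table.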
Since the most relevant processes form a group of
proteins of moderate size in biological networks [8], we obtain near-cliques
smaller than the maximum size specified by a user. That is, when a near-clique
bigger than the maximum size is found (e.g., near-clique with more than 50
nodes), it is split into smaller near-cliques (3 near-cliques in Figure 2(c)).
The way we split a big near-clique is as follows. When our program finds a big
near-clique at a given minimum clique size, we rerun the program on the big near-clique alone with a
different minimum clique size to find a new
clique and the near-clique containing it. After removing the new near-clique
from the original, big near-clique, we run the program again, once more specifying the minimum
clique size. The big near-clique shown in Figure 2(c) is split
into 3 small near-cliques with at least 4 proteins each.
3. Results and Comparison with Experimental Data
We tested the algorithms on the data with 8,397
interactions between 4,380 yeast proteins, which is the combined data of Ito et
al. [9], Uetz et al. [10], and MIPS (http://mips.gsf.de) with redundant data
removed. To every protein in the near-cliques, we assigned the functional
categories of the Functional Catalog (FunCat) version 2.0 [11], which
includes 97 functional categories. There are six
levels of hierarchy in the FunCat structure.
In the data with 8,397 interactions between 4,380
yeast proteins, we found 100 near-cliques with the minimum clique size set
to 3 and the maximum near-clique size set to 40. Only one near-clique
contained more than 40 proteins, so it was split into 17 small near-cliques,
resulting in a total of 116 near-cliques. Figure 3 shows an example of the network
of yeast protein interactions with 6 near-cliques. Proteins in each near-clique
share at least one function with other proteins within the near-clique.
Figure 3: Six near-cliques found in yeast protein interaction networks.
Proteins in each near-clique share at least one function with other proteins
within the near-clique.
As shown in Table 1, 68 (59%) of the 116
near-cliques have at least one function shared by all the proteins in the
near-clique (100% sharing), and 39 near-cliques have a function shared by more
than 50% of the proteins in the near-clique. Supporting data are
available at http://wilab.inha.ac.kr/ppi/homepage.mht. Only
9 near-cliques have no function shared by more than 50% of the
proteins in the near-clique. As shown in Figure 4, the functional coherence of
each near-clique is high. The functional coherence was computed as the ratio
of the number of proteins having a specific functional category to the group
size (i.e., the number of proteins in the group).
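The ratio described above is straightforward to compute. The following sketch assumes that each protein carries a set of FunCat category codes; the protein names and their annotations are made up for illustration (only the two category codes are taken from the text), and `functional_coherence` and `sharing_level` are hypothetical helper names.

```python
from collections import Counter

def functional_coherence(annotations, group):
    """Map each functional category to the fraction of proteins in `group` that carry it."""
    counts = Counter(
        category
        for protein in group
        for category in set(annotations.get(protein, ()))
    )
    return {category: n / len(group) for category, n in counts.items()}

def sharing_level(coherence):
    """Classify a group by its best-shared functional category."""
    best = max(coherence.values(), default=0.0)
    if best == 1.0:
        return "100% sharing"
    if best > 0.5:
        return "more than 50% sharing"
    return "no function shared by more than 50%"

# Hypothetical annotations for a three-protein group.
annotations = {
    "P1": {"32.01", "01.07.01"},
    "P2": {"32.01"},
    "P3": {"32.01", "01.07.01"},
}
group = ["P1", "P2", "P3"]

coherence = functional_coherence(annotations, group)
print(coherence)                 # ratio per functional category
print(sharing_level(coherence))  # category 32.01 is carried by every protein
```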
Table 1: Functional groups identified from the yeast protein interaction data. 68
groups have at least one function shared by all the proteins in the groups
(100% sharing), and 39 groups have a function shared by more than 50% of the
proteins in the groups. Only 9 groups have no function shared by 50% of the
proteins in the group. This table shows only one function with the highest
functional coherence in each group. All the functions shared by more than 50%
of the proteins in each group are available at
http://wilab.inha.ac.kr/ppi/homepage.mht.
Figure 4: The functional coherence in each of the 116 groups,
computed as the ratio of the number of proteins having a specific functional
category to the number of proteins in the group. The black, white, and grey
bars represent functional categories with the ratios and the maximum
number of such ratios is limited to 3 in each group.
Interestingly, most near-cliques found by our
algorithm are associated with multiple functional categories. For example, two functional
categories are common to all the proteins in the near-clique of Figure 5. As
shown in Table 2, the near-clique identified as group 93 by our program is
involved in both stress response (functional category 32.01) and biosynthesis
of vitamins, cofactors, and prosthetic groups (functional category 01.07.01).
Table 2: Functional annotation of group 93 shown in Figure 5.
The code represents the functional category.
Figure 5: Group 93 identified as a near-clique by our
algorithm.
Near-cliques may correspond to protein complexes in
addition to functional modules. So, we compared the near-cliques identified by
our algorithms with known yeast protein complexes, which are cataloged in the
MIPS Saccharomyces cerevisiae genome database
(http://mips.gsf.de/genre/proj/yeast). For each near-clique, we found a
best-matching protein complex by minimizing the probability of a random overlap
between the two, using the following equation [4, 5]:
$$P = \frac{\binom{n_1}{k}\binom{N-n_1}{n_2-k}}{\binom{N}{n_2}},$$
where $n_1$ and $n_2$ are the sizes
of a known protein complex and a computed module, $k$ is the number
of their common proteins, and $N$ is the size of
the network.
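One common way to turn these four quantities into a random-overlap probability is the hypergeometric point probability, and the sketch below assumes that form; it is an illustration rather than the authors' exact implementation. The function names `overlap_probability` and `best_matching_complex`, the complex names, and the toy data are hypothetical.

```python
from math import comb

def overlap_probability(n1, n2, k, n_total):
    """Probability that a random module of size n2 shares exactly k proteins
    with a complex of size n1 in a network of n_total proteins
    (hypergeometric point probability, assumed form)."""
    return comb(n1, k) * comb(n_total - n1, n2 - k) / comb(n_total, n2)

def best_matching_complex(module, complexes, n_total):
    """Return (name, probability) of the complex whose overlap with `module`
    is least likely to arise by chance."""
    scored = {
        name: overlap_probability(len(members), len(module),
                                  len(module & members), n_total)
        for name, members in complexes.items()
    }
    best = min(scored, key=scored.get)
    return best, scored[best]

# Hypothetical module and complexes in a network of 100 proteins.
module = {"a", "b", "c"}
complexes = {
    "complex_X": {"a", "b", "c", "d"},   # overlaps in 3 proteins
    "complex_Y": {"a", "e", "f"},        # overlaps in 1 protein
}
print(best_matching_complex(module, complexes, 100))
```

A smaller probability means the overlap is harder to explain by chance, so the complex with the minimum value is taken as the best match.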
As shown in Table 3, 65 near-cliques (56% of the total
116 near-cliques) identified by our algorithm show good agreement with the
protein complexes cataloged in MIPS.
Table 3: The near-cliques matched with experimentally determined
protein complexes cataloged in MIPS. The overlap column
represents the number of proteins common to the near-cliques and
the protein complexes.
Table 4: Comparison of our method and CFinder in terms of the
number of functional categories shared by all the proteins in the
groups.
To compare the functional coherence of the groups
found by our program with that of cliques found by CFinder, we tested both
programs on the same dataset. 75.9% of the groups identified by our program
have at least two functional categories shared by all the proteins in the
groups, whereas 63.1% of the groups identified by CFinder have at least two
functional categories shared by all the proteins in the groups (Table 4). This result
indicates that our program finds groups with stronger functional coherence than
CFinder.
Table 5 shows the actual running times of our program
and CFinder on three datasets of yeast protein interactions. Our program is
faster than CFinder on all datasets, and the difference in speed becomes more
obvious as the dataset becomes bigger.
Table 5: Running times of the programs on three datasets
of yeast protein interactions on a Pentium IV 3.0 GHz processor with 512 MB
of memory.
4. Conclusion
Identifying
hidden topological structures of protein interaction networks often unveils
biologically relevant functional groups and structural complexes. We developed
an efficient heuristic algorithm for finding cliques and near-cliques in
protein interaction networks. From the interaction data of yeast proteins, the
algorithm identified 116 near-cliques. Comparison with the experimental data
showed that 59% of the near-cliques have at least one function shared by all
the proteins within a near-clique, and that 56% of the near-cliques show a good
agreement with known protein complexes, which are cataloged in the MIPS
Saccharomyces cerevisiae genome database.
Acknowledgments
This work was
supported by the Korea Science and Engineering Foundation (KOSEF) under Grant
no. F01-2007-000-10140-0 and Grant no. F01-2007-000-10140-0 through the Systems
Bio-Dynamics Research Center.