Abstract

With the availability of more and more genome-scale protein-protein interaction (PPI) networks, research interest is gradually shifting to the systematic analysis of these large data sets. A key topic is the prediction of protein complexes in PPI networks by identifying clusters that are densely connected within themselves but sparsely connected with the rest of the network. In this paper, we present a new topology-based algorithm, HKC, to detect protein complexes in genome-scale PPI networks. HKC mainly uses the concepts of highest k-core and cohesion to predict protein complexes by identifying overlapping clusters. Experiments on two data sets and two benchmarks show that our algorithm achieves a relatively high F-measure and performs better than some other methods.

1. Introduction

With the development of high-throughput methods (such as mass spectrometry [1] and two-hybrid [2]), more and more genome-scale protein-protein interaction (PPI) networks are now available, enabling us to systematically analyze the behaviors and properties of biological molecules. Among the various lines of research on PPI networks, a key topic is the prediction of protein complexes. A protein complex is a group of proteins that interact with each other at the same time and place, forming a single multimolecular machine [3]. These complexes are a cornerstone of many biological processes. PPI networks are modular [3, 4] and contain modules that are densely connected within themselves but sparsely connected with the rest of the network. These modules are called clusters, and they may represent protein complexes [5, 6]. Thus, protein complexes can be detected by identifying clusters in PPI networks. Detecting protein complexes in PPI networks is of vital importance to the understanding of the structural and functional properties of PPI networks and can also help to predict the function of unknown proteins.

Due to the high level of noise as well as the topological features of PPI networks, traditional clustering techniques in a metric space cannot successfully predict protein complexes in PPI networks [3]; thus, various graph analysis approaches have been proposed to solve this problem. These approaches can be classified into the following three types. (1) Agglomerative methods: Bader and Hogue [6] proposed a graph theoretic clustering algorithm to detect molecular complexes in PPI networks. The method is based on vertex weighting by local neighborhood density and outward traversal from a locally dense seed protein to isolate the dense regions according to given parameters. The problem with this method is that, using only vertex weighting, it may find sparsely connected subgraphs (such as a rope-like subgraph or a mixed subgraph consisting of several connected clusters) instead of dense clusters. (2) Graph partition methods: King et al. [7] partitioned networks and found an approximately optimal solution in the space of partitions using a cost-based local search algorithm. This technique is nondeterministic, gives a different result in each run, and consumes a large amount of memory. Chen and Yuan [8] extended a betweenness-based partition algorithm (the Girvan-Newman algorithm [9], GN for short), used it to partition PPI networks into subgraphs, and then obtained function modules by filtering these subgraphs. The chief drawback of the GN algorithm is that it is time consuming. Generally, graph partition methods can only find nonoverlapping clusters, while protein complexes tend to overlap with each other. (3) Methods based on cliques (see Section 2.2): Palla et al. [10] suggested the clique percolation method (CPM) to find overlapping communities. The core concept of CPM is the k-clique community, which is defined as the union of all k-cliques (complete subgraphs of size k) that can be reached from each other through a series of adjacent k-cliques (where adjacency means sharing k-1 nodes). Some other researchers [11, 12] predicted protein complexes by identifying cliques or near-cliques in PPI networks. Clique-based approaches are too stringent about the topological structure and cannot detect protein complexes with other types of topological structure.

In this paper, we present a new topology-based algorithm, HKC (named after the two concepts it mainly uses, highest k-core and cohesion), to detect protein complexes in genome-scale PPI networks by identifying overlapping clusters. It first calculates a score for each node in the PPI network based on the concept of highest k-core and then uses the nodes with high scores and high degrees as seeds. From each seed, it obtains a core from the highest k-core of the direct neighborhood of the seed and expands this core to include nodes that are highly likely to belong to the same cluster according to the criteria of node score and cohesion. Finally, protein complexes are predicted by filtering all the found clusters according to predefined features. We apply HKC to two data sets (the MIPS and SGD-MC data sets) and evaluate the results on two benchmarks (the complexcat and Gavin benchmarks). The experiments show that our algorithm has relatively high precision and recall, that is, most of the predicted clusters match well with known protein complexes, and at the same time most of the known protein complexes are recalled. HKC exhibits better performance compared with some other methods; besides, it is general and can be used on any type of biological interaction network and even on nonbiological networks.

2. Materials and Method

2.1. Materials

Protein-protein interactions of the model organism Saccharomyces cerevisiae (yeast) have been studied thoroughly, and the data on yeast protein complexes are the most comprehensive so far, so we test HKC on yeast PPI data and use yeast complex data to validate its effectiveness.

We use the following two data sets as input networks. (1) The MIPS data set is derived from the Munich Information Center for Protein Sequences (MIPS) [13] yeast PPI data set (containing 15,456 records) downloaded from the Comprehensive Yeast Genome Database (CYGD) [14]; after removing repeated interactions and ignoring edge direction, the final input PPI network contains 4,554 proteins and 12,526 interactions. (2) The SGD-MC data set is downloaded from the Literature Curation Data in the SGD database [15]; the original data set contains 252,247 records of two types, high-throughput and manually curated interactions, and we use only the manually curated interactions. After deleting all repeated interactions and ignoring edge direction, the final SGD-MC data set contains 4,448 proteins and 29,068 interactions.

To evaluate the algorithm, we collect two data sets as benchmarks. (1) The complexcat benchmark is obtained by processing the MIPS yeast protein complex catalog in CYGD [14]. The MIPS yeast protein complex catalog was last modified in 2006 and has been manually curated from the literature; therefore, it is more reliable than data obtained by high-throughput methods and has been used in many studies because of its quality [6, 7, 16]. However, it is not appropriate to use this data set directly as the benchmark, since it contains many complexes composed of a single protein, which do not fit the definition of clusters, and also contains complexes identified by systematic analysis [17, 18], which are not as reliable as results from small-scale experiments. After removing the complexes identified by systematic analysis (those of type 550) and the complexes consisting of only one protein, the final complexcat benchmark contains 217 protein complexes. (2) The Gavin benchmark is derived from the experimental results of Gavin et al. [19]. They used affinity purification and mass spectrometry to obtain 491 complexes that differentially combine with additional attachment proteins or protein modules to enable a diversification of potential functions. We adopt the core proteins (those present in 2/3 of the isoforms) of the 491 complexes and 36 additional known complexes, and after removing those of size less than 3, we finally obtain 204 protein complexes in the Gavin benchmark.

2.2. Concepts
2.2.1. PPI Network

PPI networks can be intuitively modeled as a static graph $G = (V, E)$, where $V$ is the set of nodes (proteins), and $E$ is the set of edges (protein-protein interactions).

An undirected edge is drawn between each pair of nodes for which there is evidence of a protein-protein interaction.

2.2.2. Basic Concepts in Graph Theory

Degree
The degree of a node $n$ is the number of edges attached to it and is denoted as $\deg(n)$.

Graph Density
There is no standard graph theory definition of density, but definitions are normally based on the connectivity level of a graph [6]. For undirected simple graphs, the graph density is defined as the number of real edges in the graph divided by the number of all possible edges:
$$\mathrm{den}(G) = \frac{2|E|}{|V|(|V|-1)},$$
where $\mathrm{den}(G)$ is the density of graph $G$, $|E|$ is the number of real edges in $G$, and $|V|$ is the number of nodes in $G$.
For undirected graphs with loops (a loop is an edge that connects a node to itself), the number of all possible edges is $|V|(|V|+1)/2$, thus the density of graph $G$ is defined as
$$\mathrm{den}(G) = \frac{2|E|}{|V|(|V|+1)}.$$
So, the range of graph density is between 0 and 1.
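The two density definitions can be illustrated with a short Python sketch. The function name `graph_density` and its arguments are illustrative choices and are not taken from the HKC implementation.

```python
def graph_density(num_nodes: int, num_edges: int, loops: bool = False) -> float:
    """Ratio of real edges to all possible edges in an undirected graph."""
    if loops:
        possible = num_nodes * (num_nodes + 1) / 2  # loops are allowed
    else:
        possible = num_nodes * (num_nodes - 1) / 2  # simple graph
    return num_edges / possible if possible else 0.0


# A clique of 4 nodes contains all 6 possible edges.
print(graph_density(4, 6))  # 1.0

# Densities of the two input networks described in Section 2.1.
print(graph_density(4554, 12526))  # MIPS
print(graph_density(4448, 29068))  # SGD-MC
```

Both input networks are very sparse as a whole, which is why density is only informative when it is computed on small local subgraphs.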

Clique
A clique is a fully connected subgraph, that is, a set of nodes that are all neighbors of each other [20]. For instance, Figure 1(c) is a clique consisting of 4 nodes.

k-Core
A k-core of a graph $G$ is a maximal subgraph in which each node has degree at least $k$; it is denoted as $G_k$. For example, in Figure 1, the graph in (b) is a 2-core of the graph in (a).

Highest k-Core
The highest k-core of a graph $G$ is the k-core with the maximal $k$ value among all the k-cores of $G$ and is denoted as $G_{k_{\max}}$. It is the central, most densely connected subgraph. The highest k-core of $G$ can be found in the following way [6]: suppose the lowest degree of the nodes in $G$ is $d$; delete all nodes with degree $d$; if all remaining nodes have degree at least $k$ ($k > d$), we obtain the $k$-core of $G$; if some of the remaining nodes have degree lower than or equal to $d$, continue deleting the nodes with the least degree until all remaining nodes have a degree higher than $d$, or until all nodes have been deleted. In this way we can find all k-cores, and the one with the maximal $k$ value is the highest k-core. For example, Figure 1(a) shows a graph $G$; if we delete the node of degree 1 (i.e., node 6), we obtain a subgraph in which each node has degree at least 2, that is, the 2-core of $G$, as shown in Figure 1(b); if we then delete the node of degree 2 (node 1), we obtain a subgraph in which each node has degree at least 3, the 3-core of $G$, as shown in Figure 1(c), which is also the highest k-core of graph $G$.
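The peeling procedure described above can be sketched in Python as follows. This is a minimal illustration, not the authors' Java implementation: the function name `highest_k_core` and the adjacency-dictionary representation are assumptions, and the example edge list is hypothetical (the text does not list the edges of Figure 1); it is only built to have the same shape, a 4-clique plus one node of degree 2 and one node of degree 1.

```python
def highest_k_core(adj):
    """Return (k_max, nodes) for the highest k-core of an undirected simple graph.

    `adj` maps every node to the set of its neighbours (symmetric).
    """
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    best_k, best_nodes = 0, set(adj)             # the 0-core is the whole graph
    k = 0
    while adj:
        k += 1
        # Repeatedly delete nodes whose degree is below k.
        low = [u for u in adj if len(adj[u]) < k]
        while low:
            for u in low:
                for v in adj.pop(u):
                    adj[v].discard(u)
            low = [u for u in adj if len(adj[u]) < k]
        if adj:                                   # a non-empty k-core exists
            best_k, best_nodes = k, set(adj)
    return best_k, best_nodes


# Hypothetical graph: clique {2, 3, 4, 5}, node 1 attached to 2 and 3,
# node 6 attached to 5 only.
example = {
    1: {2, 3},
    2: {1, 3, 4, 5},
    3: {1, 2, 4, 5},
    4: {2, 3, 5},
    5: {2, 3, 4, 6},
    6: {5},
}
k_max, core = highest_k_core(example)
print(k_max, sorted(core))  # 3 [2, 3, 4, 5]
```

Raising $k$ one step at a time visits every k-core in turn, so the last non-empty one is the highest k-core.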

2.2.3. Cohesion

During the expansion of a core, the score information alone is not enough to decide efficiently whether a node should be included in the core. We define a new concept, cohesion, to measure the connectivity between a node $n$ and an existing cluster $c$, and we denote it as $\mathrm{cohesion}(n, c)$. Cohesion is calculated in the following way:
$$\mathrm{cohesion}(n, c) = \frac{E_{nc}}{N_c},$$
where $E_{nc}$ is the number of edges between node $n$ and cluster $c$, and $N_c$ is the number of nodes in cluster $c$. $\mathrm{cohesion}(n, c)$ is a real number between 0 and 1. The larger $\mathrm{cohesion}(n, c)$ is, the more tightly node $n$ connects to cluster $c$, and the more likely they belong to a larger cluster. Therefore, cohesion can be used as a criterion during the core expansion process. A node can only be included in a core when the cohesion between this node and the core is greater than or equal to a specified threshold.
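A minimal Python sketch of this measure is given below. It assumes that cohesion is the number of edges between the node and the cluster divided by the number of nodes in the cluster, which is the reading suggested by the description above; the function name and the example graph are illustrative.

```python
def cohesion(node, cluster, adj):
    """Edges between `node` and `cluster`, divided by the cluster size.

    Assumed reading of the definition: E_nc / N_c, a value in [0, 1]
    for a node outside the cluster in a simple graph.
    """
    cluster = set(cluster)
    if not cluster:
        return 0.0
    edges_to_cluster = len(adj[node] & cluster)
    return edges_to_cluster / len(cluster)


# Hypothetical graph: clique {2, 3, 4, 5}, node 1 attached to 2 and 3,
# node 6 attached to 5 only.
adj = {
    1: {2, 3},
    2: {1, 3, 4, 5},
    3: {1, 2, 4, 5},
    4: {2, 3, 5},
    5: {2, 3, 4, 6},
    6: {5},
}
print(cohesion(1, {2, 3, 4, 5}, adj))  # 0.5
print(cohesion(6, {2, 3, 4, 5}, adj))  # 0.25
```

With a cohesion threshold of 0.5, node 1 would be a candidate for joining the clique while node 6 would not.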

2.3. The Algorithm

The algorithm HKC consists of the following three steps: scoring, cluster finding, and filtering.

2.3.1. Scoring

The first step of HKC is to score all nodes in the PPI network. For each node, we first find the highest k-core of its direct neighborhood (the subgraph consisting of the node itself and all nodes connected to it), which we denote as $H$. Then we score $H$ using the properties of the highest k-core, so that larger and denser cores get higher scores:
$$\mathrm{score}(H) = N_H \times \mathrm{den}(H) \times k_{\max},$$
where $N_H$ denotes the number of nodes in $H$, $\mathrm{den}(H)$ refers to the density of $H$, and $k_{\max}$ is the maximal $k$ value corresponding to the highest k-core $H$; the larger these three values are, the larger and denser the corresponding highest k-core is.

As the highest k-core is the most densely connected central core in the local area, and in order to make the node score better reflect the connectivity in the local area, we assign $\mathrm{score}(H)$ to each node in $H$. A node $n$ may be contained in more than one highest k-core and may thus be given more than one score; the final score of the node is defined as the maximal score of all highest k-cores in which node $n$ is contained (see the inner loop of Procedure Scoring in Pseudocode 1).

Pseudocode 1
Input:
 PPI network (an undirected simple graph): G
 Node score thresholds: T1 and T2
 Cohesion threshold: T3
Output:
 The predicted protein clusters: Clusters
Call Scoring
Call ClusterFinding
Call Filtering
// step1: scoring
Procedure Scoring
 for all node n in G do
  score(n) = 0
 end for
 for all node n in G do
  compute the degree of n
  N = neighborhood(n) // neighborhood returns the
                      // direct neighborhood of n (including n)
  H = hkc(N) // hkc returns the highest k-core of N
  kmax = the maximal k value corresponding to H
  NH = the number of nodes in H
  den(H) = the density of H
  score(H) = NH * den(H) * kmax // compute the score of H
  for all node m in H do
   score(m) = max(score(m), score(H)) // compute the score of node m
  end for
 end for
end procedure

In this way, all nodes in densely connected subgraphs receive high scores and those with few neighbors receive low scores; therefore, the node score helps us distinguish the nodes in clusters from those not in clusters.
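The scoring step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the score of a highest k-core is taken here to be the product of its size, its density, and its $k$ value (an assumed reading of the scoring description), the helper names are hypothetical, and the example graph is made up.

```python
def highest_k_core(adj):
    """Return (k_max, nodes) of the highest k-core of a symmetric adjacency dict."""
    adj = {u: set(vs) for u, vs in adj.items()}
    best_k, best_nodes = 0, set(adj)
    k = 0
    while adj:
        k += 1
        low = [u for u in adj if len(adj[u]) < k]
        while low:
            for u in low:
                for v in adj.pop(u):
                    adj[v].discard(u)
            low = [u for u in adj if len(adj[u]) < k]
        if adj:
            best_k, best_nodes = k, set(adj)
    return best_k, best_nodes


def score_nodes(adj):
    """Score every node by the best highest k-core that contains it."""
    scores = {u: 0.0 for u in adj}
    for n in adj:
        hood = adj[n] | {n}                          # direct neighbourhood of n
        sub = {u: adj[u] & hood for u in hood}       # induced subgraph
        k_max, core = highest_k_core(sub)
        n_h = len(core)
        edges = sum(len(sub[u] & core) for u in core) // 2
        den = 2 * edges / (n_h * (n_h - 1)) if n_h > 1 else 0.0
        core_score = n_h * den * k_max               # assumed scoring formula
        for m in core:                               # keep the maximal score
            scores[m] = max(scores[m], core_score)
    return scores


# Hypothetical graph: clique {2, 3, 4, 5}, node 1 attached to 2 and 3,
# node 6 attached to 5 only.
adj = {
    1: {2, 3},
    2: {1, 3, 4, 5},
    3: {1, 2, 4, 5},
    4: {2, 3, 5},
    5: {2, 3, 4, 6},
    6: {5},
}
print(score_nodes(adj))
# {1: 6.0, 2: 12.0, 3: 12.0, 4: 12.0, 5: 12.0, 6: 2.0}
```

In this toy graph the four clique members share the top score, the node attached to two of them scores lower, and the pendant node scores lowest, which is the ordering the seed selection relies on.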

2.3.2. Cluster Finding

The second step of this algorithm is to find clusters in the scored graph. The process of finding a cluster is as follows: first choose the seed, then obtain the core based on the seed, and then expand the core to include the noncore nodes. The final cluster obtained in this way is thus a circular subgraph with densely connected core and less densely connected noncore, as shown in Figure 2, which fits our expectation of the topological structure of protein complexes.

The detailed process of finding a cluster can be divided into the following three steps.

(1) Choose the seed. As node scores represent the local density, a node with a very high score must have a very dense neighborhood; besides, among nodes with the same score (e.g., according to our scoring scheme, nodes in the same highest k-core may have the same score), the one with the highest degree can be deemed the most densely connected. Therefore, among all the unseen nodes with the highest score we choose the one with the highest degree as the seed (see the sorting and seed selection at the beginning of Procedure ClusterFinding in Pseudocode 2); this way of selecting seeds helps ensure that the seed is in the center of the corresponding cluster.

(2) Get the core based on the selected seed. First get the highest k-core of the direct neighborhood of the seed; after removing all nodes that have already been seen and those with scores that are too low, we obtain the core (see the construction of C in Pseudocode 2). In order to avoid repetitive computation, all nodes in the core (including the seed) are marked as seen and cannot be used as the core or the seed of another cluster.

(3) Expand the core. As the node score indicates the local connectivity in the neighborhood of a node, it can be used as a criterion during core expansion. When expanding a core, nodes with scores that are too low cannot be included in the core, in order to guarantee the local density; on the other hand, to avoid excessively expanding a small core so that it swallows a denser and larger cluster, nodes with excessively high scores cannot be included either. Furthermore, if the node score thresholds were the only criterion for deciding whether a node should be included in the core, connected nodes with similar scores would be detected as one cluster even when they do not actually make up a densely connected subgraph, such as those with a rope-like shape. So we adopt cohesion as another criterion.

The process of expanding a core is as follows (see the expansion loop in Pseudocode 2): for any unexpanded node $n$ in the core, look among its neighbors for nodes that satisfy the following conditions: (1) the score of the node is greater than or equal to $\mathrm{score}(n) \times T_1$ and less than or equal to $\mathrm{score}(n) \times T_2$; (2) the cohesion between the node and the core is greater than or equal to the threshold $T_3$. Here $\mathrm{score}(n) \times T_1$ and $\mathrm{score}(n) \times T_2$ are the lower bound and upper bound of the node score, respectively, $T_1$ is a real number between 0 and 1, and $T_2$ is an integer greater than 1; $T_3$ is the cohesion threshold, a real number ranging from 0 to 1. Add the found nodes to the core and continue to expand the core until all nodes in the core (including the newly added nodes) have been expanded.

Pseudocode 2
// step2: Cluster finding
Procedure ClusterFinding
 S = Sort(G) // sort all nodes descendingly according to node score; for nodes
             // with the same score put the one with higher degree ahead
 Clusters = an empty list of clusters
 for all node n in S (in sorted order) do
  if n is not seen then
   C = hkc(neighborhood(n))
   for all node q in C do
    if q is seen or score(q) < score(n)*T1 then
     remove q from C
    end if
   end for
   mark all nodes in C seen
   for all node m in C (including nodes added below) do
    if m is not expanded then
     for all i in neighbors of m do
      co = the cohesion between i and C
      if i is not in C and
       score(i) >= score(m)*T1 and
       score(i) <= score(m)*T2 and
       co >= T3 then
        add i to C
      end if
     end for
     mark m expanded
    end if
   end for
   add C to Clusters
  end if
 end for
end procedure

After finding a cluster using the above method, we choose another seed and repeat the cluster-finding process until no remaining node can serve as a seed. In this way we obtain all clusters in the PPI network.

It is worth noting that while expanding a core, we never consider whether a node has been seen or not. As a result, a noncore part of a cluster can include nodes that have been seen as the core of another cluster, that is to say the different clusters detected by HKC can have overlaps between cores and noncores. In this way, we can find overlapping clusters, which better coincide with the fact that different protein complexes have overlaps.
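The seed selection, core extraction, and core expansion described in this section can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: node scores are passed in as a precomputed dictionary (the values in the example are illustrative), cohesion is assumed to be the number of edges to the cluster divided by the cluster size, and all function names are hypothetical.

```python
def highest_k_core_nodes(adj):
    """Node set of the highest k-core of a symmetric adjacency dict."""
    adj = {u: set(vs) for u, vs in adj.items()}
    best = set(adj)
    k = 0
    while adj:
        k += 1
        low = [u for u in adj if len(adj[u]) < k]
        while low:
            for u in low:
                for v in adj.pop(u):
                    adj[v].discard(u)
            low = [u for u in adj if len(adj[u]) < k]
        if adj:
            best = set(adj)
    return best


def find_clusters(adj, scores, t1, t2, t3):
    """Seed selection, core extraction and core expansion (simplified sketch)."""
    # Highest score first; ties are broken by higher degree.
    order = sorted(adj, key=lambda u: (-scores[u], -len(adj[u])))
    seen, clusters = set(), []
    for seed in order:
        if seed in seen:
            continue
        hood = adj[seed] | {seed}
        sub = {u: adj[u] & hood for u in hood}
        core = {q for q in highest_k_core_nodes(sub)
                if q not in seen and scores[q] >= scores[seed] * t1}
        seen |= core                      # core nodes cannot seed other clusters
        cluster = set(core)
        to_expand = list(cluster)
        while to_expand:
            m = to_expand.pop()
            for i in adj[m]:
                if i in cluster:
                    continue              # note: `seen` is deliberately ignored
                co = len(adj[i] & cluster) / len(cluster)  # assumed cohesion
                if scores[m] * t1 <= scores[i] <= scores[m] * t2 and co >= t3:
                    cluster.add(i)
                    to_expand.append(i)
        clusters.append(cluster)
    return clusters


# Hypothetical graph and illustrative scores.
adj = {
    1: {2, 3},
    2: {1, 3, 4, 5},
    3: {1, 2, 4, 5},
    4: {2, 3, 5},
    5: {2, 3, 4, 6},
    6: {5},
}
scores = {1: 6.0, 2: 12.0, 3: 12.0, 4: 12.0, 5: 12.0, 6: 2.0}
clusters = find_clusters(adj, scores, t1=0.5, t2=10, t3=0.5)
print(sorted(clusters[0]))  # [1, 2, 3, 4, 5]
```

Because the expansion never checks whether a candidate node has been seen, later clusters can absorb core nodes of earlier ones, which is how overlapping clusters arise; on such a tiny graph this also produces near-duplicate clusters, which the filtering step is meant to remove.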

2.3.3. Filtering

The clusters found in step two include many clusters of size one or two. These small clusters are insignificant, since they can be obtained by randomly selecting nodes in a PPI network; thus, we filter out the clusters that contain fewer than three nodes. We then score all clusters using the product of cluster density and cluster size, so that larger and denser clusters get higher scores (see the second loop of Procedure Filtering in Pseudocode 3). As the algorithm allows overlaps between cores and noncores, the results of step two may contain highly similar clusters, which must be filtered out in postprocessing. We use the overlap ratio (OR, see Section 3.1 for more details) to measure the similarity between clusters: we compare every pair of clusters and, when their overlap ratio is higher than 0.95, delete the one with the lower score (see the pairwise comparison loop in Pseudocode 3). Finally, we rank all remaining clusters in descending order of cluster score.

Pseudocode 3
// step3: Filtering
Procedure Filtering
 for all cluster c in Clusters do
  if the size of c is less than 3 then
   remove c from Clusters
  end if
 end for
 for all cluster c in Clusters do
  den(c) = the density of c
  s = size of c
  score(c) = den(c) * s // compute the score of cluster c
 end for
 for all pairs of clusters c1, c2 in Clusters do
  o = the overlap ratio of c1 and c2
  if o > 0.95 then
   if score(c1) > score(c2) then delete c2 from Clusters
   else delete c1 from Clusters
   end if
  end if
 end for
 Clusters = sort Clusters descendingly according to cluster scores
end procedure
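A possible Python sketch of the filtering step is shown below. It is a greedy variant rather than a literal transcription of the pairwise loop: clusters are visited in descending score order and a cluster is dropped when its overlap ratio with an already kept cluster exceeds the limit. The overlap ratio is taken as twice the number of shared nodes divided by the sum of the two sizes, which reproduces the numerical examples given in Section 3; all names are illustrative.

```python
def cluster_density(cluster, adj):
    """Density of the subgraph induced by `cluster` (simple graph)."""
    n = len(cluster)
    if n < 2:
        return 0.0
    edges = sum(len(adj[u] & cluster) for u in cluster) / 2
    return 2 * edges / (n * (n - 1))


def overlap_ratio(a, b):
    """2 * |a & b| / (|a| + |b|), a value between 0 and 1."""
    return 2 * len(a & b) / (len(a) + len(b))


def filter_clusters(clusters, adj, min_size=3, max_overlap=0.95):
    """Drop small clusters and near-duplicates; return (score, cluster) pairs."""
    kept = [set(c) for c in clusters if len(c) >= min_size]
    scored = [(cluster_density(c, adj) * len(c), c) for c in kept]
    scored.sort(key=lambda pair: -pair[0])           # highest score first
    result = []
    for score, c in scored:
        # Keep c only if it is not too similar to a higher-ranked kept cluster.
        if all(overlap_ratio(c, other) <= max_overlap for _, other in result):
            result.append((score, c))
    return result


# Hypothetical graph and candidate clusters.
adj = {
    1: {2, 3},
    2: {1, 3, 4, 5},
    3: {1, 2, 4, 5},
    4: {2, 3, 5},
    5: {2, 3, 4, 6},
    6: {5},
}
candidates = [{1, 2, 3, 4, 5}, {1, 2, 3, 4, 5}, {5, 6}, {2, 3, 4, 5}]
for score, cluster in filter_clusters(candidates, adj):
    print(score, sorted(cluster))
```

In this example the cluster of size two and the exact duplicate are removed, while the 4-clique is kept because its overlap ratio with the five-node cluster (about 0.89) stays below 0.95.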

2.4. Implementation

The algorithm has been implemented in Java, and we plan to convert it into a Cytoscape plug-in. The source code of the algorithm is freely available for noncommercial purposes upon request. All network figures were drawn with Cytoscape [21].

3. Results and Discussions

3.1. Evaluation of the Algorithm

To evaluate the performance of our algorithm, we compare the predicted clusters with the protein complexes in two different benchmarks: the complexcat and Gavin benchmarks. Each of the predicted clusters is compared with the benchmark complexes. The similarity between a predicted cluster and a benchmark complex is measured by the overlap ratio ($OR$), which is defined as follows:
$$OR = \frac{2i}{a + b},$$
where $i$ is the number of proteins shared by a predicted cluster and a benchmark complex, $a$ is the number of proteins in the predicted cluster, and $b$ is the number of proteins in the benchmark complex. $OR$ ranges between 0 and 1: $OR = 0$ means the predicted cluster has no proteins in common with the benchmark complex; $OR = 1$ means it perfectly matches the benchmark complex. The higher $OR$ is, the more biologically meaningful the detected cluster is likely to be. A detected cluster is deemed to match a benchmark complex only when their overlap ratio reaches a given threshold. We call a cluster an effective cluster if at least one benchmark complex matches it. In the same way, a matched complex is a benchmark complex that has at least one detected cluster matching it. A reasonable threshold should ensure that the detected cluster shares a large proportion of proteins with the matching benchmark complex, while not being too stringent. In this paper, we adopt 0.4 as the threshold.

In [6], an overlap score, defined as $\omega = i^2/(a \times b)$ (where $i$ is the number of proteins shared by the predicted cluster and the known complex, and $a$ and $b$ are their respective sizes), is used to determine how well a predicted cluster matches a known complex in the benchmark set, and a predicted cluster is assumed to more or less match a known complex when its overlap score is at least 0.2. We did not adopt this scoring scheme because it is biased: it gives a relatively high score when a small predicted cluster is matched with a large known complex, or a large predicted cluster with a small complex. For example, when a predicted cluster of size 2 shares 2 proteins with a known complex of size 10, its overlap score equals 0.2, and it would thus be considered as matching the known complex; however, it is hardly appropriate to deem such a small cluster a match for a known complex much larger than itself. In contrast, our definition of overlap ratio is more balanced and only gives a high score when the matching cluster and complex have similar sizes. In the above example, the overlap ratio is only 0.33, less than the threshold 0.4, so the pair would not be deemed a match.
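The difference between the two measures can be checked numerically with a short Python sketch. The overlap ratio is written here as twice the number of shared proteins divided by the sum of the two sizes, the form that reproduces the values quoted in this paper (0.33 here, and 0.93 and 0.857 in Section 3.3); the function names are illustrative.

```python
def overlap_ratio(shared: int, size_pred: int, size_known: int) -> float:
    """Overlap ratio: 2 * shared / (predicted size + known size)."""
    return 2 * shared / (size_pred + size_known)


def overlap_score(shared: int, size_pred: int, size_known: int) -> float:
    """Overlap score used in [6]: shared^2 / (predicted size * known size)."""
    return shared ** 2 / (size_pred * size_known)


# A cluster of size 2 fully contained in a known complex of size 10.
print(overlap_score(2, 2, 10))               # 0.2
print(round(overlap_ratio(2, 2, 10), 2))     # 0.33

# The containment and near-match examples discussed in Section 3.3.
print(round(overlap_ratio(14, 14, 16), 2))   # 0.93
print(round(overlap_ratio(6, 7, 7), 3))      # 0.857
```

The small-in-large case reaches the 0.2 level of the overlap score but stays below the 0.4 threshold of the overlap ratio.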

To compare the performance of different algorithms, we use three criteria, precision, recall, and F-measure, defined as follows:
$$\text{precision} = \frac{EC}{AC}, \qquad \text{recall} = \frac{MC}{BC}, \qquad F\text{-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}},$$
where $EC$ is the number of effective clusters found by the algorithm, $AC$ is the number of all clusters predicted by the algorithm, $MC$ is the number of matched complexes in the benchmark set, and $BC$ is the total number of benchmark complexes. Note that $EC$ may not equal $MC$, since one predicted cluster may match several benchmark complexes as long as their overlap ratios reach the given threshold, and in the same way one benchmark complex may correspond to several predicted clusters. Precision describes the accuracy of the algorithm's results; recall denotes the percentage of benchmark complexes that are recovered by the algorithm. F-measure, the harmonic mean of precision and recall, balances the two and can thus be used to measure the overall performance of algorithms.
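These criteria can be computed with a few lines of Python. The sketch below assumes that a cluster and a complex match when their overlap ratio is greater than or equal to the threshold, and it uses small made-up protein sets; the function names are illustrative.

```python
def overlap_ratio(a: set, b: set) -> float:
    """2 * |a & b| / (|a| + |b|)."""
    return 2 * len(a & b) / (len(a) + len(b))


def evaluate(predicted, benchmark, threshold=0.4):
    """Return (precision, recall, f_measure) for predicted clusters."""
    def matches(p, b):
        return overlap_ratio(p, b) >= threshold   # assumed: ">=" counts as a match

    ec = sum(1 for p in predicted if any(matches(p, b) for b in benchmark))
    mc = sum(1 for b in benchmark if any(matches(p, b) for p in predicted))
    precision = ec / len(predicted) if predicted else 0.0
    recall = mc / len(benchmark) if benchmark else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up example: two of three clusters are effective,
# two of three complexes are matched.
predicted = [{"a", "b", "c"}, {"d", "e"}, {"x", "y", "z"}]
benchmark = [{"a", "b", "c", "d"}, {"d", "e", "f"}, {"p", "q", "r"}]
print(evaluate(predicted, benchmark))
```

Effective clusters and matched complexes are counted separately, so the two counts can differ when one cluster matches several complexes or several clusters match the same complex.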

3.2. Experiments and Comparison

We use the MIPS and SGD-MC data sets in turn as the input PPI network and run HKC with 120 parameter combinations. $T_1$ ranges from 0.3 to 0.7 with a step of 0.1, $T_2$ ranges from 5 to 20 with a step of 5, and $T_3$ ranges from 0.4 to 0.9 with a step of 0.1. We evaluate the results using the two benchmarks, complexcat and Gavin, and then choose the parameters that give the highest F-measure. The best results and the corresponding optimized parameters for HKC are shown in Table 1.
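The parameter grid described above can be enumerated as in the following Python sketch, which also shows how the best combination could be selected. The callback `f_measure_of` is hypothetical: it stands for running the algorithm with one parameter combination and evaluating the result against a benchmark.

```python
from itertools import product

# Score thresholds and cohesion threshold, generated from integers
# to avoid floating-point drift in the step values.
T1_VALUES = [round(0.3 + 0.1 * i, 1) for i in range(5)]  # 0.3, 0.4, ..., 0.7
T2_VALUES = [5 * i for i in range(1, 5)]                 # 5, 10, 15, 20
T3_VALUES = [round(0.4 + 0.1 * i, 1) for i in range(6)]  # 0.4, 0.5, ..., 0.9

grid = list(product(T1_VALUES, T2_VALUES, T3_VALUES))
print(len(grid))  # 120


def best_parameters(grid, f_measure_of):
    """Return the (T1, T2, T3) combination with the highest F-measure.

    `f_measure_of` is a hypothetical callback mapping one parameter
    combination to the F-measure obtained with it.
    """
    return max(grid, key=f_measure_of)
```

The count of 120 follows from 5 values of $T_1$, 4 values of $T_2$, and 6 values of $T_3$.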

To show the influence of the different parameters on the algorithm performance, we plot the average F-measure versus $T_1$, $T_2$, and $T_3$, respectively, as shown in Figure 3. Note that here the F-measure on the $y$-axis is the average of all F-measures with one parameter fixed among the 120 groups of experimental results evaluated by the complexcat benchmark. From Figure 3, we can see that among the three parameters $T_3$ has the greatest influence on the average F-measure, and that the average F-measure reaches its maximum at a lower value of $T_3$ for the MIPS data set than for the SGD-MC data set. This is understandable, as the SGD-MC network (which contains 4,448 proteins and 29,068 interactions) is much denser than the MIPS network (which contains 4,554 proteins and 12,526 interactions), and during the core expansion process the nodes in the SGD-MC network have higher cohesion with the core than the nodes in the MIPS network. Therefore, when expanding the core, the cohesion threshold for the SGD-MC network should be higher than that for the MIPS network. Furthermore, the figures show that a good range for $T_1$ is in the middle, between 0.4 and 0.6, the best value for $T_2$ is 10, and the best range for $T_3$ is between 0.5 and 0.8.

To show the performance of HKC, we compare it with MCODE [6], as shown in Table 1. We run the MCODE plug-in in Cytoscape with 840 parameter combinations (the same as those used in [6]) on the MIPS and SGD-MC data sets, respectively, and then use the two benchmarks to evaluate the results. The optimized parameters (see Table 1) are chosen based on the highest F-measure. The result reported for HKC is the best one among the 120 parameter combinations, and the corresponding optimized parameters are also shown in Table 1. From this table, we can see that for the MIPS data set the recall of HKC on the complexcat benchmark is 0.429, considerably higher than that of MCODE (0.194); for the SGD-MC data set, HKC recalls as much as 58% of the protein complexes in the complexcat benchmark, notably more than MCODE (around 22%). The experimental results show that, whichever benchmark is adopted, for both the MIPS and SGD-MC data sets the recall and F-measure of HKC are remarkably higher than those of MCODE, and the overall performance is substantially improved.

As shown in Figure 4, whatever the OR threshold is, HKC extracts many more effective clusters than MCODE from the MIPS data set, and the number of complexes matched by HKC is also much higher than the number matched by MCODE.

To show the overall performance improvement of our algorithm, we also plot precision versus recall for all results with different parameters in Figure 5. As can be seen from the figure, in all four cases the data points produced by HKC are located in the upper right portion of the plot, corresponding to high values of F-measure, while most of the data points produced by MCODE are located in the lower left part of the plot. The figure illustrates that both the precision and the recall of HKC are higher than those of MCODE for most parameter combinations, showing the overall improvement in performance. From this figure, we can also see that the data points produced by HKC are much more concentrated than those produced by MCODE, indicating that our algorithm depends less heavily on parameter selection.

3.3. Discussions

Among the 237 clusters found by HKC in MIPS data set, 8 clusters perfectly match with known protein complexes in complexcat benchmark. Figure 6(a) gives an example of one perfectly matched cluster: cluster 23 (consisting of 11 proteins and 54 interactions) perfectly matches with the TRAPP (transport protein particle) complex (catalog 260.60 in the complexcat benchmark), which plays an essential role in the vesicular transport from endoplasmic reticulum to Golgi.

Figure 6(b) shows an example of a containment match. Cluster 12 (consisting of 14 proteins and 92 interactions) is totally contained in a known complex of size 16, the SAGA complex (catalog 510.190.10.20.10 in the complexcat benchmark), and their overlap ratio is 0.93. The two proteins YCL010c and YGL066w that are not recovered by cluster 12 each have only one interaction with the cluster and do not exhibit good graph theoretic properties. Actually, based on the currently available information, we cannot assert that YCL010c is contained in the SAGA complex; according to [22], it is only a probable subunit of the SAGA complex.

Figure 6(c) gives an example of a well-matched cluster: cluster 103 matches the complex of cytoplasmic translation initiation factor 3 (eIF3, catalog 500.10.40 in the complexcat benchmark). Each of them contains 7 proteins, they share 6 proteins, and their overlap ratio is 0.857. As shown in the figure, protein YNL062c is contained in the benchmark complex but is not included in cluster 103 predicted by HKC. Furthermore, it has only one interaction with cluster 103 and does not show the topological properties expected of a cluster member. We searched for it in the Gene Ontology (GO) database [23] and found that according to the most recent GO annotation (release date 2011-05-14), YNL062c is not contained in the eIF3 complex (GO:0005852), but is a subunit of tRNA (1-methyladenosine) methyltransferase with Gcd14p, required for the modification of the adenine at position 58 in tRNAs, especially tRNAi-Met. This indicates that the complexcat benchmark we use here may contain errors, because it was last modified in 2006 and many new protein complexes have been identified through experiments since then. In a way, it is possible to correct errors in the benchmark by carefully examining the differences between predicted clusters and their corresponding benchmark complexes with high overlap ratios.

Figure 6(d) shows a novel cluster (ranked 61) detected by HKC, which involves 5 proteins and 10 interactions. Cluster 61 does not match any known protein complex in the complexcat benchmark, but it is highly homogeneous in the cellular component ontology and the biological process ontology. Search results from the Gene Ontology database show that proteins YGL153w, YLR191w, and YNL214w form the docking complex that facilitates the import of peroxisomal matrix proteins, and that YGL153w is a central component of the peroxisomal protein import machinery. The other two proteins in this cluster, YDR244w and YDR142c, are the PTS1 signal recognition factor and the PTS2 signal recognition factor, respectively, and they participate in the same biological process as the docking complex, namely, protein docking during peroxisome matrix protein import (GO:0016560).

The above illustrative examples show that HKC can not only effectively detect protein complexes in genome-scale PPI networks but may also, through the comparison of predicted clusters with their matching benchmark complexes, help to correct errors in the benchmark. Furthermore, HKC can discover novel protein complexes that can be used as candidates for experimental verification, and it thus helps to reduce the time and cost of experiments. Among the 237 clusters produced by our algorithm on the MIPS data set, 147 clusters do not match known complexes in the complexcat benchmark, and we list all 49 of these clusters with score ≥4 and size ≥5 as novel predictions in Table 2, which can serve as a starting point for future experimental validation.

4. Conclusions and Future Work

A genome-scale PPI network is usually very large, consisting of thousands of proteins and tens of thousands of interactions; for example, the SGD-MC data set we use as an input PPI network in this paper contains 4,448 proteins and 29,068 interactions. Extracting protein complexes from such a large and complicated network is a challenging task. To solve this problem, many computational methods have been proposed, including graph theoretic clustering algorithms. In this paper, we presented a new topology-based algorithm, HKC, which mainly uses the concepts of highest k-core and cohesion to predict protein complexes by identifying overlapping clusters. The experiments on two data sets and two benchmarks showed that HKC can effectively extract protein complexes from genome-scale PPI networks and performs better than some other methods. Besides, HKC is general and can be used on any type of biological interaction network.

A huge amount of work remains to be done in PPI network analysis, and much research remains to be done on protein complex prediction in particular. One of the problems with current PPI networks is that they consist of interactions that do not necessarily happen at the same time and place. Instead, the interactions in PPI networks may be unstable, transient, or conditional, and may also happen in different subcellular locations. However, by definition, a protein complex is a group of proteins that interact with each other at the same time and place. As a result, to increase the precision and recall of complex prediction algorithms, further information about the time, place, or conditions of interactions should be taken into consideration.

Acknowledgments

This work was supported in part by grants from the National Natural Science Foundation of China (no. 60835005) and the scientific research project of the National University of Defense Technology (jc09-03-04).