Abstract
Protein complexes play a critical role in understanding the biological processes and the functions of cellular mechanisms. Most existing protein complex detection algorithms cannot reflect the dynamics of protein complexes. In this paper, a novel algorithm named Improved Cuckoo Search Clustering (ICSC) is proposed to detect protein complexes in weighted dynamic protein-protein interaction (PPI) networks. First, we constructed weighted dynamic PPI networks and detected protein complex cores in each dynamic subnetwork. Then, the ICSC algorithm was used to cluster the protein attachments to the cores. The experimental results on both the DIP dataset and the Krogan dataset demonstrated that the ICSC algorithm is more effective in identifying protein complexes than other competing methods.
1. Introduction
Proteins are indispensable to cellular life. Biological functions of cells are carried out by protein complexes rather than single proteins [1]. Detecting these protein complexes can help to predict protein functions and explain biological processes, which has great significance in biology, pathology, and proteomics [2]. Therefore, the study of protein complexes has become one of the most important subjects. Many experimental methods combined with computational strategies have been proposed to predict and identify protein complexes, such as affinity purification and mass spectrometry [3–5]. However, they are costly and have difficulty in capturing the instantaneous and dynamic changes of protein complexes [6].
High-throughput techniques have generated a large amount of protein-protein interaction (PPI) data, gene expression data, and protein structure data, which enable scholars to find protein complexes based on the topological properties of PPI networks and structural information of proteins [7]. Bader and Hogue proposed the MCODE [8] method to detect protein complexes based on the proteins’ connectivity and density in PPI networks. Liu et al. [9] presented a method called CMC to identify protein complexes based on maximal cliques. Protein complexes integrate multiple gene products to perform cellular functions and may overlap. Nepusz et al. [10] developed a clustering algorithm, ClusterONE, to detect overlapping protein complexes. Gavin et al. [11] suggested that there are two types of proteins in complexes: core components and attachments. According to the core-attachment structure of protein complexes, Leung et al. [12] designed the CORE algorithm, which calculates the p-value to detect cores. Wu et al. [13] proposed the COACH algorithm to detect dense subgraphs as core components. Biological processes are dynamic, and PPIs change over time [14]. Therefore, it is necessary to shift the study of protein complexes from static PPI networks to the dynamic characteristics of PPI networks [15]. Wang et al. constructed a dynamic PPI network based on time series gene expression data to detect protein complexes [16]. Zhang et al. proposed the CSO [17] algorithm by constructing ontology attributed PPI networks based on GO annotation information. Some classical clustering algorithms such as Markov clustering (MCL) [18] and fuzzy clustering [19, 20] were also developed to detect protein complexes.
More recently, with the emergence of biological simulation technology, bio-inspired algorithms have provided a new perspective for solving the protein complex detection problem [21]. In 2016, Lei et al. proposed the FMCL [22] clustering model based on Markov clustering and the firefly algorithm, which automatically adjusted the parameters by introducing the firefly algorithm. In the same year, Lei et al. proposed the FOCA [6] clustering model, which was based on the fruit flies’ foraging behavior and the core-attachment structure of protein complexes. These studies showed that protein complex detection methods based on bio-inspired algorithms achieve relatively better performance.
The Cuckoo Search (CS) algorithm is an intelligent optimization algorithm which has been successfully applied to global optimization problems, clustering, and other fields [21]. In this study, according to the core-attachment structure of protein complexes and the CS mechanism, a new clustering method named Improved Cuckoo Search Clustering (ICSC) is proposed to detect protein complexes in weighted dynamic PPI networks, in which the corresponding relationships between the CS algorithm and the clustering procedure of PPI data are established.
2. Methods
2.1. Constructing Weighted Dynamic PPI Network
The static PPI network data produced by high-throughput experiments generally contain a high rate of false positive and false negative interactions [9], which makes it inaccurate to predict protein complexes and impossible to reflect the real dynamic changes of PPIs in a cell. To address this problem, some scholars used computational methods to evaluate the interactions [23]. On the other hand, protein dynamic information such as gene expression data, subcellular localization data, and transcription regulation data was integrated to reveal the dynamics of PPIs [24–26]. Tang et al. [27] constructed a time course PPI network (TCPIN) by using gene expression data over three successive metabolic cycles. The expression values of genes were compared with a single threshold to determine whether a gene was expressed. As a result, some essential genes were filtered out by the single threshold because of their low expression levels. Wang et al. [28] developed a three-sigma method to define an active threshold for each gene and then constructed a dynamic PPI network (DPIN) by using active proteins based on the static PPI network in combination with gene expression data. Many previous studies have reported that the three-sigma principle has better prediction performance. In this study, we use the three-sigma principle to construct the DPIN. The gene expression data include three successive metabolic cycles; each cycle has 12 timestamps, so the DPIN includes 12 subnetworks.
A protein is considered to be active in a dynamic PPI subnetwork only if its gene expression value is greater than or equal to the active threshold [28]:

$$\mathrm{Active\_Th}(p) = \mu(p) + 3\sigma(p)\,\bigl(1 - F(p)\bigr),$$

where $\mu(p)$ is the arithmetic mean of the gene expression values of protein $p$ over timestamps 1 to $n$ and $\sigma(p)$ is the standard deviation of its gene expression values. $F(p)$ is defined as follows:

$$F(p) = \frac{1}{1 + \sigma^{2}(p)}.$$
A static PPI network is usually described as an undirected graph $G = (V, E)$, which consists of a set of nodes $V$ and a set of edges $E$; the nodes in $V$ represent the proteins, and the edges in $E$ represent the connections between pairs of proteins $u$ and $v$. $G_t = (V_t, E_t)$ denotes the dynamic PPI subnetwork at timestamp $t$ ($t = 1, 2, \ldots, 12$). Protein $u$ interacts with protein $v$ in a dynamic PPI subnetwork $G_t$ only if they are active at the same timestamp $t$ and connect with each other in the static PPI network.
As shown in Figure 1, the three-sigma principle was applied to calculate the active threshold for each protein and to determine the active timestamps. After that, 12 dynamic subnetworks were constructed.
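The construction described above can be illustrated with a minimal Python sketch. It assumes the three-sigma threshold takes the form mean + 3·sd·(1 − F) with F = 1/(1 + sd²), which is the form commonly attributed to [28]; the function names, the use of the population standard deviation, and the toy expression profiles are hypothetical and serve only as an illustration.

```python
import statistics

def active_threshold(expr):
    """Assumed three-sigma threshold: mu + 3*sigma*(1 - F), F = 1/(1 + sigma^2)."""
    mu = statistics.fmean(expr)
    sigma = statistics.pstdev(expr)  # population standard deviation (assumption)
    f = 1.0 / (1.0 + sigma ** 2)
    return mu + 3.0 * sigma * (1.0 - f)

def dynamic_subnetworks(static_edges, expression):
    """Return one edge set per timestamp. An edge of the static network is kept
    at timestamp t only if both of its proteins are active at t."""
    n_times = len(next(iter(expression.values())))
    thresholds = {p: active_threshold(x) for p, x in expression.items()}
    subnetworks = []
    for t in range(n_times):
        active = {p for p, x in expression.items() if x[t] >= thresholds[p]}
        subnetworks.append({(u, v) for u, v in static_edges
                            if u in active and v in active})
    return subnetworks

# Toy data: three hypothetical proteins observed at six timestamps.
expression = {
    "A": [0.2, 0.8, 0.8, 0.2, 0.2, 0.2],
    "B": [0.2, 0.8, 0.2, 0.2, 0.2, 0.2],
    "C": [0.8, 0.8, 0.2, 0.2, 0.2, 0.2],
}
static_edges = {("A", "B"), ("B", "C")}
subnets = dynamic_subnetworks(static_edges, expression)
```

In this toy example all three proteins are active only at the second timestamp, so only that subnetwork retains both static interactions.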
The clustering coefficient has been used as an effective tool to analyze the topology of PPI networks [29]. Radicchi et al. proposed the edge clustering coefficient (ECC) [30]. In a PPI network, the ECC of an edge connecting proteins $u$ and $v$ can be expressed as follows:

$$ECC(u, v) = \frac{z_{u,v}}{\min(d_u - 1,\; d_v - 1)},$$

where $z_{u,v}$ is the number of triangles built on edge $(u, v)$; $d_u$ and $d_v$ are the degrees of proteins $u$ and $v$, respectively. The edge clustering coefficient is a local variable which characterizes the closeness of two proteins $u$ and $v$.
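As an illustration, the ECC can be computed from adjacency sets as in the following sketch. It assumes the form z/min(d_u − 1, d_v − 1), where z is the number of triangles on the edge; returning 0 when the denominator is zero is a convention chosen here, and the function name and toy graph are hypothetical.

```python
def edge_clustering_coefficient(adj, u, v):
    """ECC(u, v) = z_uv / min(d_u - 1, d_v - 1), where z_uv is the number of
    triangles on edge (u, v). Returns 0.0 if the denominator is zero."""
    triangles = len(adj[u] & adj[v])  # common neighbours close a triangle
    denom = min(len(adj[u]) - 1, len(adj[v]) - 1)
    return triangles / denom if denom > 0 else 0.0

# Toy undirected graph stored as adjacency sets.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
ecc_ab = edge_clustering_coefficient(adj, "A", "B")
ecc_ad = edge_clustering_coefficient(adj, "A", "D")
```

Edge (A, B) lies on the single triangle A-B-C and obtains the maximal value, whereas the pendant edge (A, D) obtains zero.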
The Pearson correlation coefficient (PCC) was calculated to evaluate how strongly two interacting proteins are coexpressed [31]. The PCC value of a pair of genes $x$ and $y$, which encode the corresponding paired proteins $u$ and $v$ interacting in the PPI network, is defined as

$$PCC(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}},$$

where $\bar{x}$ and $\bar{y}$ are the mean gene expression values of proteins $u$ and $v$, respectively. The value of PCC ranges from −1 to 1; if $PCC(x, y)$ is a positive value, there is a positive correlation between proteins $u$ and $v$.
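The standard Pearson formula can be sketched in a few lines of Python. The function name is hypothetical, and returning 0 for a constant expression profile is a convention assumed here rather than one stated in the text.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # undefined for constant profiles; 0 is an assumed convention
    return cov / (sd_x * sd_y)

pcc_pos = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
pcc_neg = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```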
A protein complex is a group of proteins which show high coexpression patterns and share a high degree of functional similarity, so we integrate GO slim data from the point of view of protein functions. If two interacting proteins $u$ and $v$ have some common GO terms, their functions are more similar. Let GSM denote this correlation, which is computed from the GO terms shared by $u$ and $v$ and the number of GO terms for proteins $u$ and $v$, respectively. In the dynamic PPI subnetwork $G_t$, the weight between proteins $u$ and $v$ is defined from the ECC, PCC, and GSM of the pair.
At this point, the weighted dynamic PPI network has been constructed.
2.2. Cuckoo Search Algorithm
The CS algorithm is a bio-inspired metaheuristic optimization algorithm proposed in 2009 [32], which is based on the obligate brood parasitic behavior of some cuckoo species in combination with Lévy flight behavior.
During the breeding period, certain species of cuckoos lay their eggs in host nests. The cuckoos usually look for host birds which have a similar incubation period and brood period. Moreover, their eggs are similar to each other in many aspects of color, shape, size, and cicatricle. The cuckoo flight strategy demonstrates the typical characteristics of Lévy flights. Lévy flights comprise sequences of randomly orientated straight-line movements. Specifically, frequently occurring but relatively short straight-line movements, randomly alternating with occasional longer movements, can maximize the efficiency of resource search [33].
Specifically, when generating a new solution $x_i^{(t+1)}$ for a cuckoo $i$, a Lévy flight is performed by using the following equation:

$$x_i^{(t+1)} = x_i^{(t)} + \alpha \oplus \text{L\'evy}(\lambda),$$

where $\alpha > 0$ is the step size, which should be related to the scales of the problem of interest. In most cases, we can use $\alpha = 1$; $\oplus$ denotes the Hadamard (entrywise) product operator. The Lévy flight is a type of random walk which has a power-law step length distribution with a heavy tail, with the value of $\lambda$ between 1 and 3.
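A generic continuous Lévy flight update can be sketched as follows. The text does not say how the Lévy steps are sampled; Mantegna's algorithm is used here as a common choice in Cuckoo Search implementations and is an assumption, as are the function names and the exponent `beta` (with heavy-tail index λ = 1 + β, so β = 1.5 lies within the stated range).

```python
import math
import random

def levy_step(beta=1.5, rng=random):
    """Draw one heavy-tailed step length with Mantegna's algorithm (assumed
    sampler; the source does not specify one)."""
    sigma_u = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
               / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
               ) ** (1 / beta)
    u = rng.gauss(0.0, sigma_u)
    v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)

def levy_flight(x, alpha=1.0, beta=1.5, rng=random):
    """New solution x + alpha * Levy, applied entrywise to the vector x."""
    return [xi + alpha * levy_step(beta, rng) for xi in x]

rng = random.Random(0)  # seeded generator for reproducibility
new_solution = levy_flight([0.0, 0.0, 0.0], alpha=1.0, rng=rng)
```

Most draws are short moves, with occasional long jumps, which matches the search pattern described above. ICSC itself works on a discrete clustering problem, so this continuous update only illustrates the underlying CS mechanism.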
2.3. The ICSC Algorithm
Our ICSC is developed to detect protein complexes in weighted dynamic PPI networks through the use of an improved CS algorithm. It has been widely accepted that protein complexes are organized in the core-attachment structure.
The core is a small subgraph in a PPI network with high density. As shown in Figure 2(a), four highly connected subgraphs constitute cores, denoted by core1, core2, core3, and core4 (red round proteins in the dashed circle). Several peripheral connection protein nodes are attachments (blue square proteins) in this PPI network. The blue square proteins and black diamond proteins are all non-core proteins.
In the ICSC algorithm, each cuckoo is viewed as a non-core protein (marked with black round markers in Figure 2(b)), and a nest is viewed as the core proteins (marked with black circles in Figure 2(b)), while the cuckoo population represents a group of clustering results. A non-core protein becomes an attachment if a cuckoo finds an appropriate nest in which to lay eggs. Figure 2 illustrates the corresponding relationships between the ICSC algorithm and the clustering procedure of a PPI network. Algorithm 1 gives the procedure of the proposed ICSC algorithm. The ICSC method operates in three phases. First, some dense subgraphs are selected as initial nests. Then, the cuckoos are generated based on these nests. Finally, the improved Cuckoo Search strategy is applied to generate protein complexes. The complexes in different dynamic subnetworks may have a high level of similarity, so a refinement procedure is applied in order to filter out redundancies and generate the final set of protein complexes.

The “Initial nest” subfunction (Algorithm 1) generates the initial nests. The initial nests can be seen as the core proteins of each protein complex. The weight in a dynamic PPI subnetwork takes the PCC, ECC, and GSM into account, so the weight threshold wth can be used to find protein pairs which have high functional similarity and high coexpression. For a node pair $(u, v)$ in $G_t$, if the weight of the pair is larger than wth, the pair is denoted as one initial nest; the average weight of $G_t$ is also used in this step. Protein complex cores often correspond to the small, dense, and reliable subgraphs in PPI networks, but the node pairs may overlap with each other. The node clustering coefficient (NCC) is therefore used to filter out the overlapping nests. It is defined as follows:

$$NCC(v) = \frac{2 n_v}{k_v (k_v - 1)},$$

where $k_v$ is the degree of node $v$ and $n_v$ is the number of links connecting the neighbors of node $v$ to each other. Because the PPI network has a large number of nodes and edges, many nodes may have the same value of the node clustering coefficient. In this study, the weighted node clustering coefficient (WNCC) is defined to distinguish the importance of nodes in the dynamic PPI network. Two initial nests that meet the merging condition are merged into one nest. The WNCC of node $v$ is defined using We, the weight of an edge, together with $k_v$ and $n_v$, which have the same meanings as in NCC.
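The unweighted node clustering coefficient, NCC(v) = 2·n_v / (k_v·(k_v − 1)), can be sketched as follows. Only the standard NCC is shown; the weighted variant is not reproduced because its exact form is not given here. The function name, the toy graph, and the choice of 0 for nodes with fewer than two neighbors are assumptions.

```python
def node_clustering_coefficient(adj, v):
    """NCC(v) = 2 * n_v / (k_v * (k_v - 1)), where k_v is the degree of v and
    n_v counts the links among the neighbours of v."""
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than 2 neighbours
    # Count each neighbour-neighbour link once (a < b avoids double counting).
    links = sum(1 for a in neighbours for b in neighbours
                if a < b and b in adj[a])
    return 2.0 * links / (k * (k - 1))

adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
ncc_a = node_clustering_coefficient(adj, "A")
ncc_b = node_clustering_coefficient(adj, "B")
```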
After nest detection in the previous steps, the nests are fixed. The next step is to find the cuckoos around the nests. In $G_t$, if a protein is not in any nest, it is denoted as a cuckoo.
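This step amounts to a set difference, as in the following minimal sketch; the function name and the toy nests are hypothetical.

```python
def find_cuckoos(proteins, nests):
    """Return the proteins of a subnetwork that belong to no nest."""
    in_nest = set().union(*nests)  # all proteins covered by some nest
    return set(proteins) - in_nest

proteins = {"A", "B", "C", "D", "E"}
nests = [{"A", "B"}, {"B", "C"}]
cuckoos = find_cuckoos(proteins, nests)
```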
For a “cuckoo” in $G_t$, there are many “nests” around it; the similarity between a “cuckoo” and a “nest” is measured by the closeness between them, which is defined in terms of the set of all neighbors of the cuckoo, the number of vertices in the nest connected with the cuckoo, and the number of vertices in the nest. In order to keep the diversity of the population, roulette wheel selection is used. For a given cuckoo, the nests that satisfy the closeness condition are selected to construct the roulette wheel.
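Roulette wheel selection itself can be sketched as follows. The scores stand in for the closeness values of the candidate nests; the closeness formula is not reproduced, and the function name, nest labels, and score values are hypothetical.

```python
import random

def roulette_select(candidates, scores, rng=random):
    """Pick one candidate with probability proportional to its score."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("at least one score must be positive")
    pick = rng.random() * total  # uniform in [0, total)
    cumulative = 0.0
    for candidate, score in zip(candidates, scores):
        cumulative += score
        if pick < cumulative:
            return candidate
    return candidates[-1]  # guards against floating-point round-off

rng = random.Random(42)
counts = {"nest1": 0, "nest2": 0}
for _ in range(1000):
    counts[roulette_select(["nest1", "nest2"], [1.0, 3.0], rng)] += 1
```

Unlike always choosing the closest nest, this selection occasionally assigns a cuckoo to a less close nest, which is how the diversity of the population is maintained.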
The objective function is defined over a clustering result determined by a nest, in which each element represents a cluster. It is computed from the number of edges in a cluster, the number of nodes in the cluster, and the number of edges with one node in the cluster and the other node outside it. Finally, identical or highly overlapping protein complexes are filtered out.
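The final filtering step could be implemented along the following lines. The text does not specify the criterion, so this sketch assumes that two complexes are redundant when their overlapping score (shared nodes squared, divided by the product of the two sizes) reaches a cut-off and that the larger complex is kept; the cut-off of 0.8 and all names are hypothetical.

```python
def overlap_score(a, b):
    """|a & b|^2 / (|a| * |b|) for two node sets."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) ** 2 / (len(a) * len(b))

def filter_redundant(complexes, max_overlap=0.8):
    """Keep complexes from largest to smallest and drop any complex whose
    overlap with an already kept one reaches max_overlap (assumed cut-off)."""
    kept = []
    for c in sorted(map(frozenset, complexes), key=len, reverse=True):
        if all(overlap_score(c, k) < max_overlap for k in kept):
            kept.append(c)
    return kept

complexes = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B", "C", "D"}, {"X", "Y"}]
final_complexes = filter_redundant(complexes)
```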
2.4. Time Complexity Analysis of ICSC Algorithm
The time complexity is used to estimate the efficiency of the ICSC algorithm. The external loop runs for at most maxiter iterations, and each iteration produces np solutions. Generating the solutions involves three main operations: generating the cuckoos, calculating the closeness, and calculating the objective function. Let nv be the number of proteins in $G_t$ and ne be the number of interactions in $G_t$. The time complexity of generating the cuckoos is O(nv). The time complexity of calculating the closeness depends on nc, the number of cuckoos, and nn, the number of nests. The overall time complexity of the ICSC algorithm combines the costs of these three operations over maxiter iterations and np solutions per iteration.
3. Experiments and Results
The proposed ICSC algorithm was implemented in Matlab R2015b and executed on a PC with a quad-core 3.30 GHz processor and 8 GB of RAM.
3.1. Experimental Dataset
In this study, four PPI datasets, DIP [34] (version of 20160114), Krogan et al. [35], MIPS [36], and Gavin et al. [11], were employed to evaluate our algorithm. All the datasets are from Saccharomyces cerevisiae and contain false positive and false negative interactions. Self-interactions and repeated interactions are removed during data preprocessing. After preprocessing, the DIP dataset consists of 5028 proteins and 22302 interactions, the Krogan dataset consists of 2674 proteins and 7075 interactions, the MIPS dataset consists of 4546 proteins and 12319 interactions, and the Gavin dataset consists of 1430 proteins and 6531 interactions.
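The preprocessing step can be sketched as follows, assuming the interactions are available as pairs of protein identifiers and that an interaction listed in both directions counts as a repeat; the function name and the toy identifiers are hypothetical.

```python
def preprocess(interactions):
    """Remove self-interactions and repeated interactions from an undirected
    interaction list; each edge is stored once in sorted order."""
    edges = set()
    for u, v in interactions:
        if u == v:
            continue  # drop self-interaction
        edges.add((u, v) if u < v else (v, u))
    return edges

raw = [("A", "B"), ("B", "A"), ("A", "A"), ("B", "C"), ("A", "B")]
edges = preprocess(raw)
proteins = {p for edge in edges for p in edge}
```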
Gene expression data was retrieved from GEO (Gene Expression Omnibus, GSE3431) [37]. After preprocessing, the dataset contains 7074 genes in 3 cell life cycles, each cycle having 12 time points. The GSE3431 dataset contains 4876 proteins in the DIP dataset (coverage rate: 4876/5028 = 96.98%), 2644 proteins in the Krogan dataset (the coverage rate: 2644/2674 = 98.88%), 4446 proteins in the MIPS dataset (the coverage rate: 4446/4546 = 97.80%), and 1418 proteins in the Gavin dataset (the coverage rate: 1418/1430 = 99.16%).
The GO database is currently one of the most comprehensive ontology databases in bioinformatics. GO slims are cut-down versions of the GO ontologies [17], which are available at http://www.yeastgenome.org/downloaddata/curation. GO slim data provide GO terms to explain gene product features in terms of biological process (BP), molecular function (MF), and cellular component (CC). We used GO slims to annotate the PPI data.
The standard protein complex CYC2008 [38] is used to evaluate our clustering results, which includes 408 protein complexes and covers 1492 proteins.
In this study, the three-sigma principle is used to construct the dynamic PPI networks based on four static PPI networks (SPIN), DIP, Krogan, MIPS, and Gavin, in combination with the GSE3431 gene expression dataset. There are 12 timestamps per cycle in GSE3431, so each dynamic PPI network contains 12 subnetworks, as shown in Table 1. These 12 subnetworks have different sizes.
3.2. Evaluation Metrics
Three commonly used metrics, sensitivity (SN), specificity (SP), and F-measure [8, 25, 39], are used to measure the efficiency of the proposed ICSC algorithm and evaluate the performance of the clustering results:

$$SN = \frac{TP}{TP + FN}, \qquad SP = \frac{TP}{TP + FP}, \qquad F\text{-measure} = \frac{2 \times SN \times SP}{SN + SP},$$

where TP is the number of predicted protein complexes which are matched with the 408 standard protein complexes, FP is the number of predicted protein complexes which are not matched with any of the 408 standard protein complexes, and FN is the number of standard protein complexes which are not matched with any predicted protein complex [8, 25]. The overlapping score OS is used to evaluate the matching degree between predicted protein complexes and standard protein complexes:

$$OS(pc, sc) = \frac{|V_{pc} \cap V_{sc}|^{2}}{|V_{pc}| \times |V_{sc}|},$$

where $V_{pc}$ and $V_{sc}$ denote the node sets of predicted protein complex pc and standard protein complex sc, respectively. The threshold of OS is set to 0.2 [8, 40]; that is, if OS(pc, sc) is greater than 0.2, the predicted protein complex pc is considered to match standard protein complex sc. OS(pc, sc) = 1 shows that the predicted protein complex pc is perfectly matched with the standard protein complex sc. The p-value [41], which illustrates the probability that a protein complex is enriched by a given functional group, was used to evaluate the biological significance of the predicted protein complexes in this study:

$$p\text{-value} = 1 - \sum_{i=0}^{k-1} \frac{\binom{F}{i}\binom{N-F}{C-i}}{\binom{N}{C}},$$

where N, C, and F are the sizes of the whole PPI network, a protein complex, and a functional group in the network, respectively, and $k$ is the number of proteins of the functional group in the protein complex [41]. For a protein complex, the smaller the p-value is, the higher the biological significance is. The protein complex is considered to be insignificant if its p-value is greater than 0.01.
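The evaluation metrics can be sketched in Python as follows. The sketch assumes the usual forms from the protein complex detection literature: OS = |V_pc ∩ V_sc|² / (|V_pc|·|V_sc|), SN = TP/(TP + FN), SP = TP/(TP + FP), F-measure as their harmonic mean, and a hypergeometric p-value. The function names and the toy complexes are hypothetical.

```python
from math import comb

def overlap_score(pc, sc):
    """OS(pc, sc) = |pc & sc|^2 / (|pc| * |sc|)."""
    pc, sc = set(pc), set(sc)
    return len(pc & sc) ** 2 / (len(pc) * len(sc))

def evaluate(predicted, standard, threshold=0.2):
    """Return (SN, SP, F-measure) for predicted vs. standard complexes."""
    tp = sum(1 for pc in predicted
             if any(overlap_score(pc, sc) > threshold for sc in standard))
    fp = len(predicted) - tp
    fn = sum(1 for sc in standard
             if not any(overlap_score(pc, sc) > threshold for pc in predicted))
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * sn * sp / (sn + sp) if sn + sp else 0.0
    return sn, sp, f

def enrichment_p_value(N, C, F, k):
    """Probability of seeing at least k proteins of a functional group of
    size F in a complex of size C drawn from a network of N proteins."""
    return 1.0 - sum(comb(F, i) * comb(N - F, C - i) for i in range(k)) / comb(N, C)

predicted = [{"A", "B", "C"}, {"X", "Y"}]
standard = [{"A", "B", "C", "D"}, {"P", "Q"}, {"R", "S"}]
sn, sp, f_measure = evaluate(predicted, standard)
p = enrichment_p_value(N=10, C=3, F=4, k=3)
```

In this toy case one of the two predictions matches a standard complex (OS = 0.75) and two standard complexes remain unmatched, so SN = 1/3, SP = 1/2, and the F-measure is 0.4.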
3.3. Parameter Analysis
The proposed algorithm ICSC has three parameters, the maximum iterations maxiter, the cuckoo populations’ size np, and the weight threshold wth. The maximum number of iterations maxiter measures the convergence performance of the algorithm, and the populations’ size np can guarantee the diversity of the population. The convergence curve of ICSC algorithm on the first subnetwork of the dynamic PPI network was shown in Figure 3. The horizontal axis is the number of iterations, and the vertical axis is the objective function value. Figure 3 illustrates that the ICSC algorithm converges with 30 iterations. The populations’ size np is from 5 to 30; the objective function reaches its maximum value at . In this study, we set , .
In the ICSC method, a cuckoo chooses the most suitable nest to form a protein complex; the quality of the nest directly determines the accuracy of the protein complexes, and the value of the weight threshold wth directly affects the quality of the nest. If the value of wth is too small, a small number of protein pairs are selected in a nest, and the clustering results are not accurate. On the contrary, if the value of wth is too large, many meaningless protein complexes are predicted. Therefore, it is critical to select an appropriate value of wth. The Matching Rate (MR) is defined to verify the influence of different values of wth. Nest is the set of initial nests of the dynamic PPI network, SC is the set of standard protein complexes CYC2008, and MR(Nest, SC) is defined in terms of the following quantities: NI is the number of nests which are included in the standard protein complexes, |Nest| denotes the number of nests in Nest, SI is the number of standard protein complexes which are included in Nest, and |SC| denotes the number of protein complexes in SC. Experiments on the four dynamic PPI networks with wth from 0.2 to 1.2 were carried out to verify the influence of the parameter wth. The results are shown in Figure 4. In the Krogan and Gavin datasets, the MR tends to be stable when wth is greater than or equal to 0.8. In the DIP dataset, the MR reaches its maximum value at wth = 0.6 and then gradually declines, with the downward trend running from 0.6 to 0.8. The MR curve in the MIPS dataset is similar to that of DIP. Therefore, the value of wth is set to 0.8 in this study.
3.4. Clustering Results
The performance of ICSC is compared with six previously proposed methods: MCODE, MCL, CORE, CSO, ClusterONE, and COACH. All six methods were run on the dynamic PPI networks constructed by the three-sigma principle based on the DIP, Krogan, MIPS, and Gavin datasets. The clustering results are shown in Table 2, where PC is the total number of predicted protein complexes, MPC is the number of predicted protein complexes which were matched, and MSC is the number of matched standard protein complexes. Perfect is the number of predicted protein complexes that perfectly match a standard complex; that is, OS(pc, sc) = 1. AS represents the average size of the predicted protein complexes. From the comparison in Table 2, it is clear that ICSC performs better than the other six methods in terms of sensitivity (SN) and MPC. The F-measure of ICSC is the highest on DIP, Krogan, and MIPS, while on Gavin it is slightly lower than that of ClusterONE. The Perfect values of ICSC on DIP and MIPS are 64 and 50, respectively, which are far higher than those of the other algorithms.
In Table 2, the Perfect value of ICSC on DIP is 64. The degree distribution of the perfectly matched protein complexes is given in Table 3. Here, the degree refers to the number of protein nodes contained in a protein complex. There are 408 protein complexes in the standard set CYC2008; 172 complexes contain 2 protein nodes, accounting for 42.16%. However, MCODE, CSO, and COACH cannot predict this part of the protein complexes. The 149 protein complexes with a degree greater than or equal to 4 account for 36.52% of all standard protein complexes, only a small part of which can be predicted by MCL, CORE, and ClusterONE. It is clear that the ICSC algorithm achieved the best performance in these two aspects.
In order to clearly show the clustering results, we visualize the 265th standard protein complex of CYC2008, the “nuclear exosome complex,” in Figure 5. As shown in Figure 5(a), there are 12 proteins in this standard protein complex. The clustering results of the five methods MCODE (b), MCL (c), CORE (d), ClusterONE (e), and ICSC (f) are all from the Krogan dataset. The blue nodes are proteins that are correctly predicted, the red nodes are proteins that are not identified, and the green nodes are proteins that are wrongly identified. The MCODE method successfully predicted only six proteins. Although MCL successfully predicted all 12 proteins in the protein complex, it also produced 3 incorrect proteins. The accuracy of CORE is the lowest; only 2 proteins are successfully predicted. Our method ICSC accurately predicted 9 proteins and achieved the best performance in identifying protein complexes.
Figure 5 panels: (a) Standard; (b) MCODE; (c) MCL; (d) CORE; (e) ClusterONE; (f) ICSC.
To evaluate the biological significance and functional enrichment of the protein complexes identified by ICSC, we randomly selected five predicted protein complexes and calculated their p-values on biological process ontologies based on the Krogan dataset by using GO::TermFinder (http://www.yeastgenome.org/cgibin/GO/goTermFinder.pl). The results are shown in Table 4. The proteins in bold are well matched with standard protein complexes. From Table 4, four protein complexes have larger OS values and lower p-values, which illustrates that the ICSC algorithm is effective and that these protein complexes are reliable and biologically meaningful.
4. Conclusion
Protein complexes are involved in multiple biological processes, and thus detection of protein complexes is essential to understanding cellular mechanisms. There are many methods to identify protein complexes, but most cannot reflect the dynamics of protein complexes. In this study, we have presented a novel protein complex identification method, ICSC, according to the core-attachment structure of protein complexes. First, a weighted dynamic PPI network is constructed, which integrates the gene expression data and GO term information. Then, we find functional cores and cluster protein attachments based on the CS algorithm. Compared with the other competing clustering methods, ICSC can effectively identify protein complexes and has higher precision and accuracy.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This paper is supported by the National Natural Science Foundation of China (61672334, 61502290, and 61401263) and the Industrial Research Project of Science and Technology in Shaanxi Province (2015GY016).