Abstract

The overall goal is to establish a reliable human protein-protein interaction (PPI) network and to develop computational tools that characterize the network and the role of individual proteins in the context of the network topology and their expression status. A novel feature of our approach is that we assigned a confidence measure to each derived interacting pair and accounted for this confidence in our network analysis. We integrated experimental data to infer the human PPI network. Our model treated the true interacting status (yes versus no) of any given pair of human proteins as a latent variable whose value was not observed. The experimental data were manifestations of the interacting status and provided evidence as to the likelihood of the interaction. The confidence of an interaction depended on the strength and consistency of the evidence.

1. Introduction

Individual proteins cannot perform their biological functions alone; they act in biological processes by interacting with other proteins [1]. An interaction between two proteins usually means either that they perform a biological function cooperatively or that there is direct physical contact between them [2]. Most of the important molecular processes in the cell, such as DNA replication, are carried out by a large number of protein complexes, and these complexes are formed through interactions between proteins. The study of PPIs is therefore considered a central problem in the proteomics of living cells. Because interactions between proteins are dynamic, the impact of the surrounding environment should also be taken into account. The study of the human PPI network can not only enhance the understanding of disease but also provide a theoretical foundation for finding new treatments.

With the continuous development of high-throughput experimental technology, a growing number of interactions between human proteins have been confirmed by a variety of experimental methods, and many kinds of biological interaction networks have been investigated [37]. However, current high-throughput experimental techniques suffer from high error rates: different experimental methods can produce different results, and even different research groups using the same method cannot guarantee exactly the same result. It is therefore necessary to integrate the data from different biological experiments, and even from different species, to construct a highly credible PPI network. In this paper, a Bayesian hierarchical model of the human PPI network was constructed from a variety of sources of protein interaction data. A Monte Carlo expectation maximization algorithm was used to estimate the parameters of the model. The confidence of each protein interaction was then calculated from the Bayesian model, yielding a human PPI network with a high confidence level.

Thereafter, the role of intrinsically disordered proteins (IDPs) was investigated in the high-confidence PPI network. First, functional modules were obtained by clustering the high-confidence PPI network on the basis of its topology. We then identified the functional modules that were significantly associated with IDPs, analysed the effect of IDPs in these modules, and searched for associations between these modules and diseases.

2. Materials and Methods

2.1. Data Collection

Table 1 lists the experimental data used for the construction of the human PPI network [820]. Note that the literature or text mining approach represents most of the low-throughput experimental studies of individual protein-protein interactions. The result of the same experiment may be recorded in multiple databases; we eliminate this type of redundancy. It should be emphasized that the MPC experiments report protein complexes rather than pairwise protein-protein interactions. Since proteins located in the same complex might not interact with one another directly, we account for this factor in our model.

2.2. Statistical Modeling of Various Data Sources

The overall scheme of our approach is illustrated in Figure 1. We consider an empirical Bayes approach to integrate various sources of evidence. Let $Z_{ij}$ be the binary indicator such that $Z_{ij} = 1$ means that human proteins $i$ and $j$ have a direct physical interaction and $Z_{ij} = 0$ otherwise. Hence, $Z_{ij}$ is the true interacting status, which is not observed. To infer $Z_{ij}$, we consider an individual model for each type of observed data and integrate the evidence to compute the probability that $Z_{ij} = 1$.

2.2.1. Human Y2H Data

A number of mechanisms can lead to the expression of the reporter gene in a Y2H experiment, which means that an observed interaction does not necessarily reflect a true interaction. In our model, we consider the following mechanisms: (a) true interaction; (b) self-activation; and (c) unknown process. For each pair of proteins, a binary indicator equals 1 if the two proteins are observed to interact in a Y2H experiment and 0 otherwise. This indicator can equal 1 only if at least one of the three mechanisms above is active. A second binary indicator equals 1 if a protein is a self-activation protein and 0 otherwise.
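The statement that a Y2H signal appears only if at least one of the three mechanisms is active can be formalized in several ways. The sketch below assumes a noisy-OR form, in which each active mechanism independently triggers the reporter with its own probability. The function name and the parameters `p_true`, `p_self`, and `p_unknown` are hypothetical and are not taken from the model equations of this paper.

```python
def p_y2h_positive(z, self_act_i, self_act_j, p_true, p_self, p_unknown):
    """Probability of a positive Y2H readout under an assumed noisy-OR model.

    z            -- 1 if the pair truly interacts, 0 otherwise
    self_act_i/j -- 1 if the respective protein is a self-activator
    p_true       -- assumed chance that a true interaction is detected
    p_self       -- assumed chance that self-activation fires the reporter
    p_unknown    -- assumed chance that an unknown process fires the reporter
    """
    # Probability that NONE of the active mechanisms fires the reporter.
    p_silent = 1.0 - p_unknown          # the unknown process is always possible
    if z:
        p_silent *= 1.0 - p_true        # mechanism (a): true interaction
    if self_act_i or self_act_j:
        p_silent *= 1.0 - p_self        # mechanism (b): self-activation
    return 1.0 - p_silent
```

Under this assumption, a non-interacting pair without self-activators is observed as positive only through the unknown process, so its positive rate equals `p_unknown`.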

2.2.2. Human MPC Data

An MPC experiment reveals protein complexes instead of individual pairwise PPIs. We say that protein B is an $n$-step neighbour of protein A if the shortest path between A and B in the PPI network has length $n$. We conjecture that the bait mostly fishes out its 1-step neighbours, whereas 2-step neighbours and distant proteins (at least three steps away) are observed only occasionally. Hence, we define separate parameters for the bait proteins for these three cases. For each bait protein k, we consider the set of proteins in the corresponding complex together with the sets of 1-step and 2-step neighbours of the bait under a given value of Z. The probability of observing the complex is then written in terms of the sizes of these sets.
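The definition of an $n$-step neighbour can be made concrete with a breadth-first search from the bait. The sketch below assumes the network is stored as a dictionary mapping each protein to the set of its direct partners; the function name and the toy network in the usage note are illustrative only.

```python
from collections import deque

def step_neighbours(adjacency, bait, max_step=2):
    """Return {n: set of n-step neighbours of bait} for n = 1..max_step.

    adjacency maps each protein to the set of proteins it interacts with.
    The distance is the shortest-path length found by breadth-first search.
    """
    distance = {bait: 0}
    queue = deque([bait])
    while queue:
        current = queue.popleft()
        if distance[current] == max_step:
            continue  # do not expand beyond the requested number of steps
        for partner in adjacency.get(current, ()):
            if partner not in distance:
                distance[partner] = distance[current] + 1
                queue.append(partner)
    layers = {n: set() for n in range(1, max_step + 1)}
    for protein, d in distance.items():
        if d >= 1:
            layers[d].add(protein)
    return layers
```

For a chain-like toy network in which A binds B and C, B binds D, and D binds E, the 1-step neighbours of A are B and C, D is its only 2-step neighbour, and E counts as a distant protein.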

2.2.3. Literature Data on Human PPI

We consider the interaction status of each pair of proteins as reported in the literature, and we account for the false positive rate and the false negative rate of these reports.
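To illustrate how false positive and false negative rates turn reports into evidence, the sketch below applies Bayes' rule to a single protein pair. It assumes that the reports are conditionally independent given the true status and uses hypothetical names (`prior`, `fp_rate`, `fn_rate`); it is a simplified stand-in and not the full model of this paper.

```python
def posterior_interaction(prior, reports, fp_rate, fn_rate):
    """Posterior probability that a pair truly interacts given 0/1 reports.

    prior   -- assumed prior probability of a true interaction
    reports -- iterable of 0/1 values (1 = interaction reported)
    fp_rate -- P(report = 1 | no true interaction)
    fn_rate -- P(report = 0 | true interaction)
    """
    weight_true = prior
    weight_false = 1.0 - prior
    for reported in reports:
        if reported:
            weight_true *= 1.0 - fn_rate
            weight_false *= fp_rate
        else:
            weight_true *= fn_rate
            weight_false *= 1.0 - fp_rate
    return weight_true / (weight_true + weight_false)
```

With no reports the function returns the prior, and every additional positive report moves the posterior towards 1 at a speed governed by the ratio of `1 - fn_rate` to `fp_rate`.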

2.2.4. Data from Other Organisms

We also collect data from other organisms, with the corresponding unobserved interaction status denoted by X. Similar models can be applied to these data for the inference of X. To connect X to Z, we consider models that depend on the joint sequence identity between the two pairs of corresponding proteins and on the sequence identity between individual proteins. The connecting probabilities are functions of the joint or individual sequence identities, each with its own parameters, and can be modeled with a parametric structure.

2.3. Construction of Hierarchical Bayesian Model

So far we have introduced the distribution models for the experimental data and genomic features, which are conditional on the values of Z and X. To complete the model, we also need to specify the distributions of Z and X, which can be modeled with independent Bernoulli distributions. With the observed data and the unobserved variables, we can infer the posterior probability of Z using the EM algorithm. Note that there are multiple organisms and multiple data sets for some of the organisms. Different parameters are used to account for the differences between the data sets.

The complete log likelihood function of our model is given in (10); each of its factors can be substituted by the corresponding expression in (3) to (9), and the parameter vector collects the parameters of these component models.

2.4. Monte Carlo Expectation Maximization for Parameter Estimation

In the model, it was not possible to estimate the true values of the latent variables and the model parameters directly. To estimate them effectively, this paper used the Monte Carlo expectation maximization algorithm for parameter estimation with incomplete data, as illustrated in Algorithm 1.

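Algorithm 1: Monte Carlo expectation maximization for parameter estimation.

Initialize the parameters
while (diff > 0.01)
    // E-step
    while (more samples are needed)
        Sample each latent variable in turn from its conditional distribution
    Calculate the Q function from the samples
    // M-step
    while (1)
        Update the parameter vector by one greedy hill climbing move
        if (the stopping condition of the hill climbing is met)
            break
    diff = change between successive iterations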

In the E-step of Algorithm 1, we use Gibbs sampling to sample the latent variables in turn, and the sampling process is repeated until estimates of the missing data are obtained. In the M-step of Algorithm 1, the parameter vector is estimated by greedy hill climbing. The iteration stops when diff no longer exceeds 0.01.
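The alternation of a sampling-based E-step and a parameter update can be illustrated on a much smaller toy problem. The sketch below is a deliberately simplified stand-in and not Algorithm 1 itself: each pair has one latent status and several noisy reports, the latent variables are conditionally independent so they are sampled directly from their exact posterior instead of by Gibbs sampling, and the M-step uses closed-form frequency updates in place of greedy hill climbing. All names (`simulate`, `mcem`, `pi`, `alpha`, `beta`) are hypothetical.

```python
import random

def clamp(p, eps=1e-6):
    """Keep a probability strictly inside (0, 1) to avoid degenerate weights."""
    return min(max(p, eps), 1.0 - eps)

def simulate(n_pairs, n_reports, pi, alpha, beta, seed=1):
    """Toy data: latent status ~ Bernoulli(pi); reports with FP rate alpha, FN rate beta."""
    rng = random.Random(seed)
    data = []
    for _pair in range(n_pairs):
        interacts = rng.random() < pi
        p_report = (1.0 - beta) if interacts else alpha
        data.append([1 if rng.random() < p_report else 0 for _r in range(n_reports)])
    return data

def mcem(data, n_iter=30, n_samples=5, seed=0):
    """Monte Carlo EM for the toy model; returns estimates (pi, alpha, beta)."""
    rng = random.Random(seed)
    pi, alpha, beta = 0.5, 0.1, 0.1  # initial guesses
    for _iteration in range(n_iter):
        z_sum = 0
        fp_hits = fp_trials = fn_hits = fn_trials = 0
        # Monte Carlo E-step: draw the latent status of each pair n_samples times.
        for reports in data:
            k, r = sum(reports), len(reports)
            w1 = pi * (1.0 - beta) ** k * beta ** (r - k)
            w0 = (1.0 - pi) * alpha ** k * (1.0 - alpha) ** (r - k)
            posterior = w1 / (w1 + w0)
            for _draw in range(n_samples):
                if rng.random() < posterior:
                    z_sum += 1
                    fn_hits += r - k   # negative reports on a sampled true pair
                    fn_trials += r
                else:
                    fp_hits += k       # positive reports on a sampled false pair
                    fp_trials += r
        # M-step: closed-form maximizers of the sampled complete-data likelihood.
        pi = clamp(z_sum / (n_samples * len(data)))
        if fp_trials:
            alpha = clamp(fp_hits / fp_trials)
        if fn_trials:
            beta = clamp(fn_hits / fn_trials)
    return pi, alpha, beta
```

A typical use is `mcem(simulate(2000, 5, 0.3, 0.05, 0.2))`, which should return estimates close to the simulated values; a fixed number of iterations is used here for simplicity, whereas Algorithm 1 stops on a convergence criterion.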

3. Results

All the protein names were mapped to Entrez IDs. In total, we obtained 32540 proteins with 144603 interactions among them.

3.1. Construction of the Human PPI Network with Reliable Confidence Measure

Four models were established separately using the high-throughput Y2H experimental data, the high-throughput MPC experimental data, the human PPI data, and all the PPI data. The comparison of these four models is given in Table 2.

After estimating the parameter vector by Monte Carlo EM, we recalculated the posterior probability of Z given the estimated parameters and the observed values. A pair of proteins was considered a reliable interaction if its posterior probability passed the chosen threshold. This yielded 48361 PPIs with a reliable confidence measure among 23286 proteins.

3.2. Characterization of Network and Roles of IDPs Based on Network Topology

We analysed the role of IDPs in the human PPI network with reliable confidence measures. An IDP was defined as a protein with a continuous intrinsically disordered region longer than 40 amino acids. Based on disorder predictions, 8735 IDPs were identified among the 23286 proteins.
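The IDP criterion can be expressed as a simple scan over per-residue disorder calls. The sketch below assumes that a disorder predictor has already produced one boolean flag per residue; the predictor itself and the function names are not specified by the text and are illustrative only.

```python
def longest_disordered_run(flags):
    """Length of the longest run of consecutive disordered residues."""
    longest = current = 0
    for disordered in flags:
        current = current + 1 if disordered else 0
        longest = max(longest, current)
    return longest

def is_idp(flags, min_length=40):
    """True if the protein has a continuous disordered region longer than min_length."""
    return longest_disordered_run(flags) > min_length
```

Note that the comparison is strict, so a region of exactly 40 residues does not qualify, and two shorter regions separated by an ordered residue are not merged.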

First, the human PPI network was partitioned into subnetworks, or modules, by SCAN, which obtains modules based on the similarity of nodes in terms of their common neighbours. We then used modularity and similarity-based modularity as metrics. Modularity is a statistical measure of the quality of a network clustering, defined as

$$Q = \sum_{i=1}^{k} \left[ \frac{l_i}{L} - \left( \frac{d_i}{2L} \right)^2 \right],$$

where $k$ is the number of clusters, $L$ is the number of edges in the network, $l_i$ is the number of edges within module $i$, and $d_i$ is the sum of the degrees of all the nodes in module $i$. The best clustering is obtained by maximizing $Q$. Similarity-based modularity supplements the modularity by taking the similarity between nodes into account. As shown in Figure 2, the modularity decreased monotonically from a position near zero and could not be maximized, whereas the similarity-based modularity reached its maximum when the threshold equalled 0.61. At this threshold, the reliable human PPI network was partitioned into 241 modules. For each module, a p-value for its association with IDPs was calculated from the number of all the proteins and the number of all the IDPs and compared with the chosen significance level. Of the 241 modules, 33 were significantly associated with IDPs.
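The two quantities used in this step can be sketched in a few lines. The modularity function below assumes the standard Newman-Girvan form, $Q = \sum_i [l_i/L - (d_i/2L)^2]$, which matches the symbol definitions given in the text. The enrichment function assumes that the module p-value is a one-sided hypergeometric tail probability; this is a common choice for such tests but is an assumption here, as are all function and argument names.

```python
from math import comb

def modularity(edges, membership):
    """Newman-Girvan modularity of a partition of an undirected graph.

    edges      -- list of (u, v) pairs, each undirected edge listed once
    membership -- dict mapping every node to its module label
    """
    total_edges = len(edges)
    internal = {}   # l_i: edges with both ends inside module i
    degree = {}     # d_i: summed degree of the nodes in module i
    for u, v in edges:
        mu, mv = membership[u], membership[v]
        degree[mu] = degree.get(mu, 0) + 1
        degree[mv] = degree.get(mv, 0) + 1
        if mu == mv:
            internal[mu] = internal.get(mu, 0) + 1
    return sum(internal.get(m, 0) / total_edges - (d / (2 * total_edges)) ** 2
               for m, d in degree.items())

def enrichment_p_value(n_proteins, n_idps, module_size, module_idps):
    """P(at least module_idps IDPs in a random module), hypergeometric tail."""
    upper = min(module_size, n_idps)
    favourable = sum(comb(n_idps, k) * comb(n_proteins - n_idps, module_size - k)
                     for k in range(module_idps, upper + 1))
    return favourable / comb(n_proteins, module_size)
```

With the figures reported here, a call would take the form `enrichment_p_value(23286, 8735, module_size, module_idps)`, where the last two arguments are the size of a module and the number of IDPs it contains.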

However, because the functional modules were obtained from the network topology alone, we also analysed the modules in relation to known diseases. The overlap between the PPIs in HeLa cells and a functional module that was highly related to IDPs is shown in Figure 3. The weight of each edge is the posterior probability of the true value of Z. When a node with more than 5 neighbours was defined as a hub node in this subnetwork, 69% of the hub nodes were IDPs. This supports the view that IDPs readily become hub nodes of the protein interaction network owing to their structural flexibility, and it suggests an important role for IDPs in regulation in the cervical cancer HeLa cell line.
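The hub statistic can be reproduced from an adjacency structure and a set of IDP identifiers. The sketch below uses the hub definition stated above (more than 5 neighbours); the data layout and the function name are assumptions made for illustration.

```python
def hub_idp_fraction(adjacency, idps, min_neighbours=5):
    """Fraction of hub nodes that are IDPs, or None if there are no hubs.

    adjacency      -- dict mapping each protein to the set of its neighbours
    idps           -- set of proteins classified as IDPs
    min_neighbours -- a hub has strictly more neighbours than this value
    """
    hubs = [node for node, partners in adjacency.items()
            if len(partners) > min_neighbours]
    if not hubs:
        return None
    return sum(1 for node in hubs if node in idps) / len(hubs)
```

Returning `None` for a subnetwork without hubs avoids a division by zero and keeps that case distinguishable from a subnetwork whose hubs contain no IDPs.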

4. Discussion

Our model is novel in the following respects. First, it integrates Y2H and MPC data in a cohesive and unified model that connects the two types of data through the unobserved true status of direct physical interaction, Z. Second, the model allows a natural calculation of the confidence of each interacting pair via the posterior probability. This confidence is a critical measurement for downstream analysis and is accounted for in it. To our knowledge, no previous study has considered uncertainty in PPI network analysis.

The inference of the interacting probability involves a large number of latent variables. The combinatorial effects make it impractical to compute the expectation of the missing variables analytically during the E-step. It is likely that the various data sets carry different amounts of information regarding the true interaction status. Hence, the inference can be made by weighing data of various types appropriately instead of treating them equally. This can be achieved by setting parameter constraints.

Competing Interests

The authors confirm that there is no conflict of interests related to the content of this article.

Authors’ Contributions

Yang Hu and Ying Zhang contributed equally to this work.