Abstract

Statistical methods have been intensively applied in genomic signal processing (Dougherty et al. 2005). For the budding yeast Saccharomyces cerevisiae, with around 6000 proteins, genome-wide protein-protein interaction (PPI) data (Fromont-Racine et al. 2000, Ito et al. 2001, Newman et al. 2000, and Uetz et al. 2000, among others) and protein subcellular localization (PSL) data (Huh et al. 2003) have recently become available; for the latter, the presence of 4152 proteins was experimentally tested in each of 22 subcellular compartments. Recent work shows that multiple biological sources are helpful for both PSL and PPI prediction. This paper studies the statistical feasibility of modeling PPI from PSL, since PSLs may play different marginal or joint roles in the complex regulatory network. However, our results indicate that PSL may be controversial for this purpose as an independent source.

1. Statistical Methods

1.1. Two-Way PPI Count Contingency Table

We extracted 2712 PPIs from MIPS [1], which were available at http://hto-b.usc.edu/~msms/AssessInteraction/MIPSMatchYPD.txt as of 2005 and were used by Lin and Zhao [2] in a study of PPI network robustness. We use the 1641 PPIs with complete PSL information in Huh et al. [3]. For example, protein $A$ has a 22-dimensional PSL vector $L_A = (L_{A,1}, L_{A,2}, \ldots, L_{A,22})$, where $L_{A,i} = 1$ represents the presence of protein $A$ at PSL $i$ and $L_{A,i} = 0$ represents its absence at PSL $i$. For proteins $A$ and $B$, we create the 44-dimensional PSL vector $L_{AB} = (L_A, L_B)$ along with its exchanged counterpart $L_{BA} = (L_B, L_A)$ for naive balance. Since a log-linear model with a large number ($2^{44}$) of cross-classified cells may lack power when the total PPI count is relatively small ($< 10\,000$), we instead explore an alternative two-way ($22^2$) contingency table whose rows (compartments $i = 1, \ldots, 22$) and columns (compartments $j = 1, \ldots, 22$) jointly assign each PPI to cell $(i, j)$, with one protein in compartment $i$ and the other in compartment $j$ (Figure 1). Note that one PPI may be counted more than once because a protein can occupy multiple PSLs. Cytoplasm and nucleus likely play crucial roles, since these two compartments hold most PPI entries and other compartment pairs have far fewer entries. A negative binomial model accounts for overdispersion and shows that ER to Golgi, lipid particle, and nucleus may have significant effects in this two-way contingency table.
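The construction of the two-way count table can be sketched as follows. This is a minimal illustration, not the code used for Figure 1: the protein names, the three-compartment PSL vectors, and the function `ppi_count_table` are hypothetical, and counting both orderings of each pair is an assumption made here to mirror the exchanged vectors.

```python
from itertools import product


def ppi_count_table(ppis, psl, n_compartments):
    """Count PPIs into a two-way table.

    Cell (i, j) receives one count for every ordered protein pair in which
    the first protein is present in compartment i and the second in j.
    """
    table = [[0] * n_compartments for _ in range(n_compartments)]
    for a, b in ppis:
        # Assumption: both orderings are counted, mirroring L_AB and L_BA.
        for first, second in ((a, b), (b, a)):
            for i, j in product(range(n_compartments), repeat=2):
                if psl[first][i] and psl[second][j]:
                    table[i][j] += 1
    return table


# Hypothetical toy data: 3 compartments instead of 22.
psl = {
    "P1": [1, 0, 0],
    "P2": [1, 1, 0],  # occupies two compartments
    "P3": [0, 0, 1],
}
ppis = [("P1", "P2"), ("P2", "P3")]

table = ppi_count_table(ppis, psl, 3)
for row in table:
    print(row)
```

Because P2 occupies two compartments, each PPI involving P2 contributes to several cells, which is the redundant counting noted above.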

1.2. PSL Correlation Pattern

For the 44-dimensional joined PSL vectors we calculate all $C_{44}^{2} + C_{44}^{1}\ (= 990)$ pairwise Pearson correlation coefficients

$$
\mathrm{Corr}(L_{AB}, L_{AB}) =
\begin{pmatrix}
\mathrm{Corr}(L_A, L_A) & \mathrm{Corr}(L_A, L_B) \\
\mathrm{Corr}(L_B, L_A) & \mathrm{Corr}(L_B, L_B)
\end{pmatrix}.
\tag{1}
$$

This calculation was carried out for the following four disjoint sets of protein pairs: (1) interacting protein pairs from PPIs (set 1), (2) non-PPI protein pairs among proteins with PPI (set 2), (3) non-PPI protein pairs among proteins without PPI (set 3), and (4) non-PPI protein pairs combining one protein without PPI and one protein with PPI (set 4). As in Section 1.1, by selecting those PPIs (MIPS) with PSL information (Huh et al. [3]), we obtained 2883 proteins without PPI, 3282 exchanging PSL vectors for set 1, 1 591 338 exchanging PSL vectors for set 2, 8 308 806 exchanging PSL vectors for set 3, and 7 317 054 exchanging PSL vectors for set 4. Colocalization occurs in 30% of non-PPI protein pairs and 69% of PPI protein pairs; thus the remaining 31% of PPIs may be transient. The pseudoimages and contour plots for the PSL correlations (1) are given in Figure 2, where the upper-left quadrant in panel 1 shows a significant between-protein colocalization pattern for set 1, and no clear between-protein colocalization pattern occurs for sets 2, 3, and 4. These observations motivate us to study whether the between-protein PSL pattern could discriminate protein pairs with PPI from those without PPI.
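The correlation computation in (1) can be sketched on toy data as below. The two-compartment PSL vectors, protein names, and helper functions are hypothetical; only the count $C_{44}^{2} + C_{44}^{1} = 990$ and the use of exchanged vectors come from the text.

```python
from math import comb, sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (None if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def joined_vectors(pairs, psl):
    """Build L_AB and its exchanged counterpart L_BA for every protein pair."""
    rows = []
    for a, b in pairs:
        rows.append(psl[a] + psl[b])
        rows.append(psl[b] + psl[a])
    return rows


def correlation_matrix(rows):
    """Pairwise Pearson correlations between the coordinates of the joined vectors."""
    d = len(rows[0])
    cols = [[r[k] for r in rows] for k in range(d)]
    return [[pearson(cols[s], cols[t]) for t in range(d)] for s in range(d)]


# Distinct entries of a symmetric 44 x 44 matrix: off-diagonal plus diagonal.
n_unique = comb(44, 2) + comb(44, 1)
print(n_unique)

# Hypothetical toy data: 2 compartments, so the joined vectors have 4 coordinates.
psl = {"P1": [1, 0], "P2": [1, 1], "P3": [0, 1], "P4": [1, 0]}
pairs = [("P1", "P2"), ("P3", "P4"), ("P2", "P3")]
corr = correlation_matrix(joined_vectors(pairs, psl))
```

With exchanging, the two diagonal blocks of (1) coincide, so the within-protein correlation of compartments 1 and 2 is the same whether it is read from the $L_A$ block or the $L_B$ block.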

1.3. Retrospective Logistic Regression

We propose a realistic model for quantifying PPI tendency from the fused PSLs of proteins $A$ and $B$ (with exchanging). The PSL and PPI information is expressed as

$$
\begin{aligned}
(L_{A,1}, L_{A,2}, \ldots, L_{A,22}, L_{B,1}, L_{B,2}, \ldots, L_{B,22}) &\sim I_{AB}, \\
(L_{B,1}, L_{B,2}, \ldots, L_{B,22}, L_{A,1}, L_{A,2}, \ldots, L_{A,22}) &\sim I_{BA},
\end{aligned}
\tag{2}
$$

where $I_{AB}\ (= I_{BA})$ is the binary PPI indicator (response), and the logistic regression model is proposed to be

$$
\operatorname{logit}\bigl(\Pr(\mathrm{PPI} \mid A, B)\bigr) = \beta_0 + \sum_{i=1}^{22} \beta_i \bigl(L_{A,i} + L_{B,i}\bigr) + \sum_{i<j} \beta_{ij} \bigl(L_{A,i} L_{A,j} + L_{B,i} L_{B,j}\bigr) + \sum_{i \le j} \beta'_{ij} \bigl(L_{A,i} L_{B,j} + L_{B,i} L_{A,j}\bigr),
\tag{3}
$$

where $\beta_0$ represents the default PPI probability and $\beta_i$ the PPI tendency of a single protein with PSL $i$; $\beta_{ij}$ and $\beta'_{ij}$ represent the PPI tendency of a single protein with PSLs $i$ and $j$ and of two proteins with PSLs $i$ and $j$, respectively; and $\beta'_{ij}$ with $i = j$ describes the PPI tendency of two proteins with common PSL $i$. The number of model parameters is $1 + 2C_{22}^{1} + 2C_{22}^{2} = 507$. For efficiency we consider the reduced model $\beta_0 + \sum_{i \le j} \beta'_{ij} (L_{A,i} L_{B,j} + L_{B,i} L_{A,j})$, which incorporates second-order PSL effects between two proteins. The yeast interactome and proteome are inherent libraries and not subject to arbitrary experimental design, which indicates a retrospective (case-control) study. On the other hand, we have $\sim 18 \times 10^{6}$ total protein pairs and only $\sim 2 \times 10^{3}$ PPIs in our data. To overcome computer memory limitations and achieve reasonable sample sizes for both the case (PPI) and control (non-PPI) groups, we need to select a subsample in a statistically justified way. For a logistic model with responses $y_i$ and predictors $x_i$, we let $Z_i$ indicate whether subject $i$ is selected and assume $\rho_1 = \Pr(Z_i = 1 \mid y_i = 1)$ and $\rho_0 = \Pr(Z_i = 1 \mid y_i = 0)$, both of which are free of $x_i$. If the logistic model based on all subjects has $\operatorname{logit}(\Pr(y_i = 1 \mid x_i)) = \alpha + \beta x_i$, then the retrospective logistic regression (RLR) after selection probability adjustment is $\operatorname{logit}(\Pr(y_i = 1 \mid x_i, Z_i = 1)) = \alpha + \log(\rho_1 / \rho_0) + \beta x_i$ (Chapter 4.3.3, McCullagh and Nelder [4]).
We apply case selection probability 1 and control selection probability $2 \times 10^{-3}$ (3282 PPIs, 38 338 entries, and 254 parameters) and identify around 60 significant effects. The resulting prospective PPI probabilities are then adjusted according to the foregoing theory.
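Two ingredients of this section can be sketched as follows: the between-protein features of the reduced model, and the removal of the sampling offset $\log(\rho_1 / \rho_0)$ from a retrospective logit. The function names and the example logit are hypothetical; the selection probabilities $\rho_1 = 1$ and $\rho_0 = 2 \times 10^{-3}$ are the ones stated above.

```python
from math import exp, log


def cross_features(l_a, l_b):
    """Reduced-model features L_A,i * L_B,j + L_B,i * L_A,j for all i <= j."""
    n = len(l_a)
    return [l_a[i] * l_b[j] + l_b[i] * l_a[j]
            for i in range(n) for j in range(i, n)]


def prospective_probability(retro_logit, rho1, rho0):
    """Subtract the offset log(rho1 / rho0) and map the logit to a probability."""
    logit = retro_logit - log(rho1 / rho0)
    return 1.0 / (1.0 + exp(-logit))


# Hypothetical PSL vectors over 22 compartments.
l_a = [0] * 22
l_b = [0] * 22
l_a[0] = 1             # protein A present in compartment 1
l_b[0] = l_b[5] = 1    # protein B present in compartments 1 and 6

features = cross_features(l_a, l_b)
print(len(features) + 1)  # features plus the intercept beta_0

# A retrospective probability of 0.5 (logit 0) under the stated sampling scheme.
p = prospective_probability(0.0, rho1=1.0, rho0=2e-3)
print(p)
```

The features are symmetric in $A$ and $B$, so the exchanged vector $L_{BA}$ yields the same linear predictor as $L_{AB}$. In the example, a fitted retrospective probability of 0.5 corresponds to prospective odds of $1 : 500$.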

1.4. PPI Prediction from PSL

After fitting the preceding model, we apply a threshold $\tau$ in the simple classification rule

$$
\Pr(\mathrm{PPI} \mid L_{AB} \text{ or } L_{BA}) > \tau \Rightarrow \text{PPI}, \qquad
\Pr(\mathrm{PPI} \mid L_{AB} \text{ or } L_{BA}) \le \tau \Rightarrow \text{non-PPI}.
\tag{4}
$$

We randomly divide the whole dataset for the retrospective study into 10 disjoint portions. Each portion (which includes PPIs and non-PPIs in proportion) acts as one testing set, and the other nine portions are combined into one training set for 10-fold cross-validation. We classify each protein-protein pair in the testing set as PPI or non-PPI by comparing the calculated probabilities (from the trained model parameters) with a threshold $\tau$. We find that the PPI probability median of the non-PPI subset in the training set is always equal to that of the PPI subset in the training set, and the PPI probability median ($1.88 \times 10^{-4}$) for the PPI subset also equals that of the non-PPI subset for the whole dataset in the retrospective study. For the retrospective study with the PPI probability median as threshold, we have specificity around 98% and sensitivity around 15%; RandomForest (Breiman [5]) in R reaches specificity around 99% and sensitivity around 20%; and the support vector machine (SVM) in R reaches specificity around 50% and sensitivity around 90%. The PPI probabilities from the retrospective study dataset and 10-fold cross-validation are plotted in Figure 3. The logistic model-based classification results are found to be sensitive to the threshold. If we instead use "$\Pr(\mathrm{PPI} \mid L_{AB} \text{ or } L_{BA}) \ge \tau \Rightarrow$ PPI" and "$\Pr(\mathrm{PPI} \mid L_{AB} \text{ or } L_{BA}) < \tau \Rightarrow$ non-PPI", where $\tau$ equals the PPI probability median, then we obtain very different classification results. After prospective PPI probability adjustment, the threshold-based classification (4) is applied to the complete PPI and PSL data (sets 1, 2, 3, and 4 in Section 1.2), and the resulting ROC curve is given in Figure 4, with area under the curve (AUC) less than 0.5.
Since we may simply invert this classifier to make the AUC greater than 0.5, Figure 4 indicates that the proposed logistic regression model ((3) in Section 1.3) may not be adequate even though it was carefully chosen. We also observe the following: the selection procedure in the retrospective study may involve some bias; the joined PSL patterns (from two proteins) are finite, with uncertain overlap between the PPI set and the non-PPI set; and false positives and false negatives may exist in both the PPI and the PSL data, among other issues. From a statistical point of view, the interprotein PSL pattern may not independently determine PPI tendency, and a threshold-based PPI prediction rule may not discriminate PPI from non-PPI either. The former conclusion is also a major concern of biologists, who consider the PPI mechanism to extend far beyond PSL information alone.
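The sensitivity of rule (4) to the choice between $>$ and $\ge$ at a median threshold, and the effect of inverting a classifier on the AUC, can be sketched on toy data. The probabilities and labels below are hypothetical and chosen so that many values tie at the median; the AUC is computed with the Mann-Whitney statistic, which is one standard way to obtain it and not necessarily the method behind Figure 4.

```python
from statistics import median


def classify(probs, tau, strict=True):
    """Rule (4) when strict is True (p > tau); the alternative rule uses p >= tau."""
    return [(p > tau) if strict else (p >= tau) for p in probs]


def sensitivity_specificity(pred, truth):
    """Return (sensitivity, specificity) for boolean predictions and labels."""
    tp = sum(p and t for p, t in zip(pred, truth))
    tn = sum((not p) and (not t) for p, t in zip(pred, truth))
    n_pos = sum(truth)
    n_neg = len(truth) - n_pos
    return tp / n_pos, tn / n_neg


def auc(scores, truth):
    """Mann-Whitney AUC: chance that a positive outscores a negative (ties count 1/2)."""
    pos = [s for s, t in zip(scores, truth) if t]
    neg = [s for s, t in zip(scores, truth) if not t]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical fitted probabilities with heavy ties, and true PPI labels.
probs = [0.2, 0.2, 0.2, 0.5, 0.2, 0.2, 0.1, 0.2]
truth = [True, True, True, True, False, False, False, False]

tau = median(probs)
strict = sensitivity_specificity(classify(probs, tau, strict=True), truth)
loose = sensitivity_specificity(classify(probs, tau, strict=False), truth)
print(strict, loose)

a = auc(probs, truth)
a_inverted = auc([-p for p in probs], truth)
print(a, a_inverted)
```

When many fitted probabilities coincide with the threshold, moving the tied pairs from one class to the other swaps high specificity for high sensitivity. Negating the scores maps an AUC of $a$ to $1 - a$, which is why an AUC below 0.5 can always be turned into one above 0.5.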

2. Discussion

In this article, we proposed a statistical analysis of the association between PPI and PSL, with the possibility of offering clues for further specific biological experiments. The aforementioned model is only one of many possible approaches. It is likely that a totally different approach based on PSL information would lead to disparate results. As an alternative, if we could describe the distribution of the 44-dimensional joined binary PSL vectors given PPI or non-PPI, $\Pr(L_{AB} \mid \mathrm{PPI})$ and $\Pr(L_{AB} \mid \text{non-PPI})$, then, armed with some prior PPI probability, say $\Pr(\mathrm{PPI}) = 3 : (1.8 \times 10^{4})$, we could predict the PPI probability for the joined PSL pattern $L_{AB}$ by Bayes rule

$$
\Pr(\mathrm{PPI} \mid L_{AB}) = \frac{\Pr(L_{AB} \mid \mathrm{PPI}) \Pr(\mathrm{PPI})}{Q},
\tag{5}
$$

where $Q = \Pr(L_{AB} \mid \mathrm{PPI}) \Pr(\mathrm{PPI}) + \Pr(L_{AB} \mid \text{non-PPI}) \Pr(\text{non-PPI})$. Section 1.2 is essentially an attempt to study the PSL pattern within either the PPI or the non-PPI set without considering its counterpart, which amounts to exploring $\Pr(L_{AB} \mid \mathrm{PPI})$ or $\Pr(L_{AB} \mid \text{non-PPI})$ separately. However, the explicit probability distribution of a high-dimensional binary vector is difficult to construct. Empirical approaches (Sections 1.1 and 1.2, Huh et al. [3]) offer informative results from different perspectives. On the other hand, Liu et al. [6] modeled PPI based on domain-domain interaction information, and computational PSL prediction from other sources is also feasible; readers are referred to Lu et al. [7], Szafron et al. [8], Höglund et al. [9], Horton et al. [10], Guda [11], Yu et al. [12], and Zhang et al. [13, 14], among many others.
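Bayes rule (5) can be illustrated numerically as below. The two likelihood values are hypothetical, since the text notes that these distributions are hard to construct, and reading the prior $3 : (1.8 \times 10^{4})$ as the probability $3 / (1.8 \times 10^{4})$ is an assumption of this sketch.

```python
def posterior_ppi(lik_ppi, lik_non, prior_ppi):
    """Pr(PPI | L_AB) from Pr(L_AB | PPI), Pr(L_AB | non-PPI), and Pr(PPI)."""
    q = lik_ppi * prior_ppi + lik_non * (1.0 - prior_ppi)
    return lik_ppi * prior_ppi / q


# Assumption: the prior "3 : (1.8 x 10^4)" is read as the probability 3 / 18000.
prior = 3 / 1.8e4

# Hypothetical likelihoods of one joined PSL pattern under PPI and non-PPI.
lik_ppi, lik_non = 0.05, 0.001

post = posterior_ppi(lik_ppi, lik_non, prior)
print(prior, post)
```

In odds form, the posterior odds equal the prior odds multiplied by the likelihood ratio, so with a prior this small even a pattern that is 50 times more likely under PPI still leaves a posterior PPI probability far below one half.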

Acknowledgments

The authors are very grateful to the Associate Editor and two referees for the constructive comments which led to great improvement of their presentation. This research was partially supported by NIH/COBRE Grant NOT-RR-06-001 and NIH Grant GM59507 (J. Liu) and NIH/NCI Grant CA-072720-11 (J. Liu and W. J. Shih).