Is Subcellular Localization Informative for Modeling Protein-Protein Interaction Signal?

Liu, Junfeng; Zhao, Hongyu; Tan, Jun; Luo, Dajie; Yu, Weichuan; Harner, E. James; Shih, Weichung Joe

doi:https://doi.org/10.1155/2008/365152

Journal of Electrical and Computer Engineering


Research Letter | Open Access

Volume 2008 | Article ID 365152 | https://doi.org/10.1155/2008/365152


Junfeng Liu,¹,² Hongyu Zhao,³ Jun Tan,⁴ Dajie Luo,⁴ Weichuan Yu,⁵ E. James Harner,⁴ and Weichung Joe Shih¹,²

Academic Editor: Jar-Ferr Kevin Yang

Received 21 Aug 2007

Accepted 02 Jan 2008

Published 28 Feb 2008

Abstract

Statistical methods have been intensively applied in genomic signal processing (Dougherty et al. 2005). For the budding yeast Saccharomyces cerevisiae, with around 6000 proteins, genome-wide protein-protein interaction (PPI) data (Fromont-Racine et al. 2000, Ito et al. 2001, Newman et al. 2000, and Uetz et al. 2000, among others) and protein subcellular localization (PSL) data (Huh et al. 2003) have recently become available; for the latter, the presence of 4152 proteins was experimentally tested in each of 22 subcellular compartments. Recent work shows that multiple biological sources are helpful for both PSL and PPI prediction. This paper studies the statistical feasibility of modeling PPI from PSL, since PSLs may play different marginal or joint roles in the complex regulatory network. Our results indicate, however, that PSL may be controversial for this purpose as an independent source.

1. Statistical Methods

1.1. Two-Way PPI Count Contingency Table

We extracted 2712 PPIs from MIPS [1], which were available at http://hto-b.usc.edu/~msms/AssessInteraction/MIPSMatchYPD.txt as of 2005 and were used by Lin and Zhao [2] in a study of PPI network robustness. We use the 1641 PPIs with complete PSL information in Huh et al. [3]. For example, protein $A$ has a 22-dimensional PSL vector $(A_1, \ldots, A_{22})$, where $A_i = 1$ represents presence of protein $A$ at PSL $i$ and $A_i = 0$ represents absence of protein $A$ at PSL $i$. For proteins $A$ and $B$, we create the 44-dimensional PSL vector $(A, B)$ along with an exchanged counterpart $(B, A)$ for naive balance. Since a log-linear model with a large number ($2^{44}$) of cross-classified cells may lack power when the total PPI count is relatively small (10 000), we instead explore an alternative two-way ($22 \times 22$) contingency table whose rows (compartments: $i = 1, \ldots, 22$) and columns (compartments: $j = 1, \ldots, 22$) jointly assign each PPI to cell $(i, j)$, with one protein in compartment $i$ and the other in compartment $j$ (Figure 1). Note that one PPI may be counted redundantly because a protein can occupy multiple PSLs. Cytoplasm and nucleus likely play crucial roles, since these two compartments hold most PPI entries and other compartment pairs have far fewer entries. A negative binomial model accommodates overdispersion and suggests that ER to Golgi, lipid particle, and nucleus may have significant effects in this two-way contingency table.
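The construction of the two-way count table can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `ppi_count_table`, the input format, and the toy data (3 compartments instead of 22) are hypothetical, and counting the exchanged counterpart of each pair is an assumption made here so that the table is symmetric.

```python
from itertools import product


def ppi_count_table(ppis, psl, n_compartments):
    """Count PPIs by compartment pair (i, j).

    ppis: list of (protein_a, protein_b) name pairs (hypothetical input format).
    psl:  dict mapping protein name -> list of 0/1 flags, one per compartment.
    """
    table = [[0] * n_compartments for _ in range(n_compartments)]
    for a, b in ppis:
        # Assumption: count the pair and its exchanged counterpart,
        # which makes the resulting table symmetric.
        for first, second in ((a, b), (b, a)):
            for i, j in product(range(n_compartments), repeat=2):
                if psl[first][i] and psl[second][j]:
                    # A protein present in several compartments contributes
                    # to several cells, so one PPI can be counted repeatedly.
                    table[i][j] += 1
    return table


# Toy data with 3 compartments (illustrative values only).
toy_psl = {"P1": [1, 0, 0], "P2": [1, 1, 0], "P3": [0, 0, 1]}
toy_ppis = [("P1", "P2"), ("P2", "P3")]
toy_table = ppi_count_table(toy_ppis, toy_psl, 3)
for row in toy_table:
    print(row)
```

In this toy example the pair (P1, P2) falls into several cells because P2 occupies two compartments, which is the redundancy noted in the text.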

Figure 1: Two-way PPI count contingency table, panels (a) and (b).

1.2. PSL Correlation Pattern

For the 44-dimensional joined PSL vectors we calculate all pairwise Pearson correlation coefficients between coordinates,

$$r_{kl} = \frac{\sum_{n} (x_{nk} - \bar{x}_k)(x_{nl} - \bar{x}_l)}{\sqrt{\sum_{n} (x_{nk} - \bar{x}_k)^2 \, \sum_{n} (x_{nl} - \bar{x}_l)^2}}, \quad k, l = 1, \ldots, 44, \tag{1}$$

where $x_{nk}$ denotes coordinate $k$ of the $n$th joined vector in a set and $\bar{x}_k$ is the mean of coordinate $k$ over that set. This calculation was carried out on the following four disjoint sets of protein pairs: (1) interacting protein pairs from PPIs (set [1]), (2) non-PPI protein pairs from those proteins with PPI (set [2]), (3) non-PPI protein pairs from those proteins without PPI (set [3]), and (4) non-PPI protein pairs combining a protein without PPI and a protein with PPI (set [4]). As in Section 1.1, by selecting those PPIs (MIPS) with PSL information (Huh et al. [3]), we obtained 2883 proteins without PPI, 3282 exchanging PSL vectors for set [1], 1 591 338 exchanging PSL vectors for set [2], 8 308 806 exchanging PSL vectors for set [3], and 7 317 054 exchanging PSL vectors for set [4]. Colocalization occurs in 30% of non-PPI protein pairs and 69% of PPI protein pairs; thus the remaining 31% of PPIs may be transient. The pseudoimages and contour plots for the PSL correlations (1) are given in Figure 2, where the upper-left quadrant in panel 1 shows a significant between-protein colocalization pattern for set [1], and no clear colocalization pattern occurs in the between-protein PSLs for sets [2], [3], and [4]. These observations motivate us to study whether the between-protein PSL pattern could discriminate between protein pairs with PPI and those without PPI.
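The correlation and colocalization summaries described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the input format, and the toy data (2 compartments, so joined vectors have 4 coordinates instead of 44) are hypothetical, and "colocalization" is taken here to mean that the two proteins share at least one compartment.

```python
import math


def joined_vectors(pairs, psl):
    """Return joined PSL vectors (A, B) and their exchanged counterparts (B, A)."""
    rows = []
    for a, b in pairs:
        rows.append(psl[a] + psl[b])
        rows.append(psl[b] + psl[a])
    return rows


def pearson_matrix(rows):
    """Pairwise Pearson correlations between the columns of `rows`.

    A column with zero variance has no defined correlation; NaN is stored.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    ss = [sum(c[k] ** 2 for c in centered) for k in range(d)]
    corr = [[float("nan")] * d for _ in range(d)]
    for k in range(d):
        for l in range(d):
            if ss[k] > 0 and ss[l] > 0:
                cov = sum(c[k] * c[l] for c in centered)
                corr[k][l] = cov / math.sqrt(ss[k] * ss[l])
    return corr


def colocalization_fraction(pairs, psl):
    """Fraction of pairs whose two proteins share at least one compartment."""
    shared = sum(
        1 for a, b in pairs if any(x and y for x, y in zip(psl[a], psl[b]))
    )
    return shared / len(pairs)


# Toy data with 2 compartments (illustrative values only).
toy_psl = {"P1": [1, 0], "P2": [1, 0], "P3": [0, 1], "P4": [0, 1]}
toy_pairs = [("P1", "P2"), ("P3", "P4")]
toy_corr = pearson_matrix(joined_vectors(toy_pairs, toy_psl))
print(toy_corr[0][2], colocalization_fraction(toy_pairs, toy_psl))
```

In the toy data every pair is colocalized, so coordinate 1 of the first protein and coordinate 1 of the second protein are perfectly correlated; applying the same functions to a non-interacting set would give the comparison pattern.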

Figure 2: Pseudoimages and contour plots of the PSL correlations for sets [1] to [4], panels (a) to (h).

1.3. Retrospective Logistic Regression

We propose a realistic model for quantifying PPI tendency from the fused PSLs of proteins $A$ and $B$ (with exchanging). The PSL and PPI information for a protein pair is expressed as the joined PSL vector together with $Y$, the binary PPI indicator (response). The proposed logistic regression model for logit $P(Y = 1)$ contains an intercept, which implies the default PPI probability; main effects, which describe the PPI tendency of a single protein with PSL $i$; second-order effects, which represent the PPI tendency of a single protein with PSLs $i$ and $j$ and of two proteins with PSLs $i$ and $j$, respectively; and colocalization effects, which describe the PPI tendency of two proteins with a common PSL $i$. The number of parameters in this full model is large. For efficiency we consider a reduced model which incorporates second-order PSL effects between two proteins.

The yeast interactome and proteome are inherent libraries and are not subject to arbitrary experimental design, which indicates a retrospective (case-control) study. On the other hand, the total number of protein pairs in our data is very large, while only a small number of them are PPIs. To overcome computer memory limitations and achieve reasonable sample sizes for both the case (PPI) and control (non-PPI) groups, we need to select a sample subset under statistical justification. For a logistic model with responses $y_k$ and predictors $x_k$, we let $z_k$ indicate whether subject $k$ is selected and assume $P(z_k = 1 \mid y_k = 1, x_k) = \pi_1$ and $P(z_k = 1 \mid y_k = 0, x_k) = \pi_0$, both of which are free of $x_k$. If the logistic model based on all subjects has logit $P(y_k = 1 \mid x_k) = \alpha + \beta^{T} x_k$, then the retrospective logistic regression (RLR) after selection probability adjustment is logit $P(y_k = 1 \mid z_k = 1, x_k) = \alpha + \log(\pi_1 / \pi_0) + \beta^{T} x_k$ (Chapter 4.3.3, McCullagh and Nelder [4]). We apply case selection probability 1 and a much smaller control selection probability (3282 PPIs, 38 338 entries, and 254 parameters) and identify around 60 significant effects. The resultant prospective PPI probabilities are to be adjusted based on the foregoing theory.
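The selection-probability adjustment can be made concrete with a short Python sketch. It relies on the standard case-control result that sampling cases with probability $\pi_1$ and controls with probability $\pi_0$ shifts the intercept of a logistic model by $\log(\pi_1/\pi_0)$ and leaves the slopes unchanged. The function names are hypothetical, and the control selection probability of 0.002 in the usage lines is an illustrative value, not the one used in the study.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def expit(eta):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-eta))


def prospective_probability(p_retro, pi_case, pi_control):
    """Convert a probability fitted on the selected sample to the full population.

    Subtracts the selection offset log(pi_case / pi_control) on the logit scale.
    """
    offset = math.log(pi_case / pi_control)
    return expit(logit(p_retro) - offset)


# Illustrative values only: all cases kept, controls kept with probability 0.002.
for p_retro in (0.1, 0.5, 0.9):
    print(p_retro, prospective_probability(p_retro, 1.0, 0.002))
```

Because the offset is a constant on the logit scale, the adjustment preserves the ranking of protein pairs; only the probability scale, and hence the meaning of any fixed threshold, changes.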

1.4. PPI Prediction from PSL

After fitting the preceding model, we apply a threshold $c$ in a simple classification rule (4): a protein pair is classified as PPI or non-PPI by comparing its fitted PPI probability with $c$. We randomly divide the whole dataset for the retrospective study into 10 disjoint portions. Each portion (which includes PPIs and non-PPIs in proportion) acts as one testing set, and the other nine portions are combined into one training set for 10-fold cross-validation. We classify each protein pair in the testing set as PPI or non-PPI by comparing the calculated probability (from the trained model parameters) with the threshold $c$. We find that the PPI probability median of the non-PPI subset in the training set always equals that of the PPI subset in the training set, and that the PPI probability median for the PPI subset also equals that of the non-PPI subset for the whole dataset in the retrospective study. For the retrospective study with the PPI probability median as threshold, we obtain specificity around 98% and sensitivity around 15%; random forest (Breiman [5]) in R reaches specificity around 99% and sensitivity around 20%, and a support vector machine (SVM) in R reaches specificity around 50% and sensitivity around 90%. The PPI probabilities from the retrospective study dataset and from 10-fold cross-validation are plotted in Figure 3. The logistic model-based classification results are found to be sensitive to the threshold. If we use two different comparison rules against the threshold $c$, where $c$ equals the PPI probability median, we obtain very different classification results. After prospective PPI probability adjustment, the threshold-based classification (4) is applied to the complete PPI and PSL data (sets [1], [2], [3], and [4], Section 1.2), and the resultant ROC curve is given in Figure 4, with area under the curve (AUC) less than 0.5. Since we may simply invert this classifier to make the AUC greater than 0.5, Figure 4 indicates that the proposed logistic regression model ((3) in Section 1.3) may not be sufficient even though it was carefully chosen.

We also observe the following: the selection procedure in the retrospective study may involve some bias; the joined PSL patterns (from two proteins) are finite in number, with uncertain overlap between the PPI set and the non-PPI set; and false positives and false negatives may exist in both the PPI and PSL data, among other issues. From a statistical point of view, the interprotein PSL pattern may not independently determine PPI tendency, and a threshold-based PPI prediction rule may not discriminate PPI from non-PPI either. The former conclusion is also a major concern of biologists, who consider the PPI mechanism to involve far more than PSL information.
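The threshold sensitivity and the AUC inversion discussed in this section can be illustrated with a small Python sketch. It is an illustration under assumptions rather than the analysis actually performed: the function names and the toy scores are hypothetical, and the two comparison rules are taken to be a strict and a non-strict inequality, which is one way tied probabilities at the threshold can change the classification.

```python
def classify(probabilities, threshold, strict=True):
    """Threshold rule: predict PPI when the probability exceeds the threshold.

    strict=True uses p > threshold, strict=False uses p >= threshold.
    """
    if strict:
        return [p > threshold for p in probabilities]
    return [p >= threshold for p in probabilities]


def sensitivity_specificity(labels, predictions):
    """Return (sensitivity, specificity) for 0/1 labels and boolean predictions."""
    tp = sum(1 for y, yhat in zip(labels, predictions) if y == 1 and yhat)
    tn = sum(1 for y, yhat in zip(labels, predictions) if y == 0 and not yhat)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg


def auc(labels, scores):
    """Probability that a random positive scores above a random negative.

    Ties between a positive and a negative count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


# Toy scores with many ties at 0.3 (illustrative values only).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.3, 0.3, 0.6, 0.3, 0.3, 0.1]
print(sensitivity_specificity(labels, classify(scores, 0.3, strict=True)))
print(sensitivity_specificity(labels, classify(scores, 0.3, strict=False)))
print(auc(labels, scores), auc(labels, [-s for s in scores]))
```

With many fitted probabilities tied at the threshold, the two rules trade sensitivity against specificity in opposite directions. Negating the scores turns an AUC of $a$ into $1 - a$, which is why an AUC below 0.5 can always be inverted.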


2. Discussion

In this article, we proposed a statistical analysis of the association between PPI and PSL, with the possibility of offering clues for further specific biological experiments. The aforementioned model is only one possible approach among many. It is likely that a totally different approach based on PSL information may lead to disparate results. As an alternative, if we could describe the distribution of the 44-dimensional joined binary PSL vector $x$ given PPI or non-PPI, $P(x \mid \text{PPI})$ and $P(x \mid \text{non-PPI})$, then, armed with some prior PPI probability, say $\rho$, we could predict the PPI probability for a joined PSL pattern $x$ by Bayes rule:

$$P(\text{PPI} \mid x) = \frac{\rho \, P(x \mid \text{PPI})}{\rho \, P(x \mid \text{PPI}) + (1 - \rho) \, P(x \mid \text{non-PPI})}.$$

Section 1.2 is essentially an attempt to work on either the PPI or the non-PPI set to study the PSL pattern without considering the non-PPI or PPI counterpart, which amounts to exploring $P(x \mid \text{PPI})$ or $P(x \mid \text{non-PPI})$ separately. However, the explicit probability distribution of a high-dimensional binary vector is difficult to construct. Empirical approaches (Sections 1.1 and 1.2, Huh et al. [3]) offer informative results from different perspectives. On the other hand, Liu et al. [6] modeled PPI based on domain-domain interaction information. Computational PSL prediction from other sources is also feasible; readers are referred to Lu et al. [7], Szafron et al. [8], Höglund et al. [9], Horton et al. [10], Guda [11], Yu et al. [12], and Zhang et al. [13, 14], among many others.
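The Bayes-rule alternative mentioned above can be sketched in Python using plain empirical pattern frequencies in place of the unknown class-conditional distributions. This is only an illustration: the function names, the prior value, and the toy 2-dimensional patterns are hypothetical, and raw frequencies would be very sparse for real 44-dimensional patterns, which is the difficulty the text points out.

```python
from collections import Counter


def pattern_frequencies(vectors):
    """Empirical probability of each distinct binary pattern in `vectors`."""
    counts = Counter(tuple(v) for v in vectors)
    total = len(vectors)
    return {pattern: c / total for pattern, c in counts.items()}


def posterior_ppi(pattern, freq_ppi, freq_non, prior):
    """Posterior PPI probability of a pattern by Bayes rule.

    Returns None when the pattern was never observed in either set,
    because the empirical posterior is then undefined.
    """
    f1 = freq_ppi.get(tuple(pattern), 0.0)
    f0 = freq_non.get(tuple(pattern), 0.0)
    denom = prior * f1 + (1.0 - prior) * f0
    if denom == 0.0:
        return None
    return prior * f1 / denom


# Toy 2-dimensional patterns (illustrative values only).
ppi_vectors = [(1, 1), (1, 1), (1, 0), (0, 0)]
non_ppi_vectors = [(1, 1), (0, 0), (0, 0), (1, 0)]
freq_ppi = pattern_frequencies(ppi_vectors)
freq_non = pattern_frequencies(non_ppi_vectors)
print(posterior_ppi((1, 1), freq_ppi, freq_non, prior=0.5))
print(posterior_ppi((0, 1), freq_ppi, freq_non, prior=0.5))
```

The second call returns None because the pattern (0, 1) occurs in neither toy set, a small-scale version of the sparsity that makes the explicit distribution hard to construct.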

Acknowledgments

The authors are very grateful to the Associate Editor and two referees for the constructive comments which led to great improvement of their presentation. This research was partially supported by NIH/COBRE Grant NOT-RR-06-001 and NIH Grant GM59507 (J. Liu) and NIH/NCI Grant CA-072720-11 (J. Liu and W. J. Shih).

References

[1] H. W. Mewes, D. Frishman, C. Gruber et al., “MIPS: a database for genomes and protein sequences,” Nucleic Acids Research, vol. 28, no. 1, pp. 37–40, 2000.
[2] N. Lin and H. Zhao, “Are scale-free networks robust to measurement errors?” BMC Bioinformatics, vol. 6, no. 1, p. 119, 2005.
[3] W.-K. Huh, J. V. Falvo, L. C. Gerke et al., “Global analysis of protein localization in budding yeast,” Nature, vol. 425, pp. 686–691, 2003.
[4] P. McCullagh and J. A. Nelder, Generalized Linear Models, Chapman & Hall, London, UK, 2nd edition, 1989.
[5] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[6] Y. Liu, N. Liu, and H. Zhao, “Inferring protein-protein interactions through high-throughput interaction data from diverse organisms,” Bioinformatics, vol. 21, no. 15, pp. 3279–3285, 2005.
[7] Z. Lu, D. Szafron, R. Greiner et al., “Predicting subcellular localization of proteins using machine-learned classifiers,” Bioinformatics, vol. 20, no. 4, pp. 547–556, 2004.
[8] D. Szafron, P. Lu, R. Greiner et al., “Proteome analyst: custom predictions with explanations in a web-based tool for high-throughput proteome annotations,” Nucleic Acids Research, vol. 32, Web Server issue, pp. W365–W371, 2004.
[9] A. Höglund, P. Dönnes, T. Blum, H.-W. Adolph, and O. Kohlbacher, “MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition,” Bioinformatics, vol. 22, no. 10, pp. 1158–1165, 2006.
[10] P. Horton, K.-J. Park, T. Obayashi, and K. Nakai, “Protein subcellular localization prediction with WoLF PSORT,” in Proceedings of the 4th Asia-Pacific Bioinformatics Conference (APBC '06), pp. 39–48, Taipei, Taiwan, February 2006.
[11] C. Guda, “pTARGET: a web server for predicting protein subcellular localization,” Nucleic Acids Research, vol. 34, Web Server issue, pp. W210–W213, 2006.
[12] C.-S. Yu, Y.-C. Chen, C.-H. Lu, and J.-K. Hwang, “Prediction of protein subcellular localization,” Proteins, vol. 64, no. 3, pp. 643–651, 2006.
[13] T. Zhang, Y. Ding, and S. Shao, “Protein subcellular location prediction based on pseudo amino acid composition and immune genetic algorithm,” in Proceedings of the International Conference on Intelligent Computing (ICIC '06), vol. 4115, part 3, pp. 534–542, Kunming, China, August 2006.
[14] T. Zhang, Y. Ding, and K.-C. Chou, “Prediction of protein subcellular location using hydrophobic patterns of amino acid sequence,” Computational Biology and Chemistry, vol. 30, no. 5, pp. 367–371, 2006.

Copyright

Copyright © 2008 Junfeng Liu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
