ISRN Computational Biology
Volume 2014, Article ID 960902, 7 pages
Research Article

In Silico Proteome Cleavage Reveals Iterative Digestion Strategy for High Sequence Coverage

Department of Chemistry and Biochemistry, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0378, USA

Received 4 February 2014; Accepted 17 March 2014; Published 22 April 2014

Academic Editors: Y. Cai and J. Ruan

Copyright © 2014 Jesse G. Meyer. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


In the postgenome era, biologists have sought to measure the complete complement of proteins, an endeavor termed proteomics. Currently, the most effective method to measure the proteome is shotgun, or bottom-up, proteomics, in which the proteome is digested into peptides that are identified, followed by protein inference. Despite continuous improvements to all steps of the shotgun proteomics workflow, observed proteome coverage is often low; some proteins are identified by a single peptide sequence. Complete proteome sequence coverage would allow comprehensive characterization of RNA splicing variants and all posttranslational modifications, which would drastically improve the accuracy of biological models. There are many reasons for the sequence coverage deficit, but ultimately peptide length determines sequence observability. Peptides that are too short are lost because they match many protein sequences and their true origin is ambiguous. The maximum observable peptide length is determined by several analytical challenges. This paper explores computationally how peptide lengths produced from several common proteome digestion methods limit observable proteome coverage. Iterative proteome cleavage strategies are also explored. These simulations reveal that proteome coverage can be maximized by an iterative digestion protocol involving multiple proteases and chemical cleavages, which theoretically allows 92.9% proteome coverage.

1. Introduction

In the postgenome era, biologists have sought system-wide measurements of RNA, proteins, and metabolites, termed transcriptomics, proteomics, and metabolomics, respectively. Shotgun, or bottom-up, proteomics has become the most comprehensive method for proteome identification and quantification [1]. However, observed protein sequence coverage is often low. The ability to cover 100% of protein sequences in a biological system was likened to surrealism in a recent review by Meyer et al. [2]. Multiple steps in the traditional shotgun proteomics workflow contribute to the deficit in observed sequence coverage, including proteome isolation, proteome digestion, peptide separation, peptide MS/MS, and identification by peptide-spectrum matching. Proteome isolation has been extensively evaluated [3, 4]. Several types of peptide separation have been explored [5–7]. Mass spectrometers are becoming more sensitive and versatile [8–10]. Peptide-spectrum matching algorithms are adapting to new data types [11] and becoming more sensitive [12, 13]. Proteome fragmentation into sequenceable peptides is one step with significant room for improvement. DNA sequencing relies on sequence fragmentation into readable pieces by mechanical force [14], which produces a nearly uniform distribution of fragment lengths. In comparison, proteome fragmentation is generally accomplished by targeting one or more amino acid residues for cleavage, and, therefore, protein cleavage can be likened to a Poisson process that produces an exponential distribution of peptide lengths.
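The Poisson analogy can be made concrete with a simple idealization, which is an assumption added here for illustration rather than part of the simulations: suppose each residue is a cleavage target independently with probability $p$ (the symbols $p$, $L$, and $k$ are introduced only for this sketch). The peptide length $L$ then follows a geometric distribution, the discrete counterpart of the exponential distribution:

```latex
P(L = k) = p\,(1-p)^{k-1}, \qquad k = 1, 2, \dots
\qquad\qquad
\mathbb{E}[L] = \frac{1}{p}
\qquad\qquad
P(L \ge k) = (1-p)^{k-1} \approx e^{-p\,(k-1)} \quad \text{for small } p
```

Under this idealization, a target residue making up 10% of the proteome would give a mean peptide length of 10 residues. Real proteomes are expected to deviate from it because residues are not independently distributed along protein sequences.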

Numerous papers have described the application of new digestion strategies for proteome analysis [15–18]; however, no single strategy has emerged as optimal. The greatest observed proteome coverage has plateaued around 25%. For example, 24.6% of the human proteome was recently observed [19], but this required over 1,000 MS/MS data files that allowed identification of over 260,000 peptide sequences using a new high performance data analysis package. Similarly, multiple protease digests of yeast resulted in 25.2% coverage [20]. Therefore, improved strategies for proteome digestion are needed to allow observation of a complete proteome.

An innovative example demonstrating the application of multiple enzyme digestion (MED) was recently published by Wiśniewski and Mann [21], which demonstrated the utility of multienzyme digestion coupled to filter-aided sample preparation [22] (MED-FASP, Figure 1). This work extends a previous work that described size exclusion to isolate long tryptic peptides for additional digestion [18]. Wiśniewski and Mann compared gains afforded by iterative digestion using various proteases (i.e., GluC, ArgC, LysC, or AspN) followed by trypsin. Their work concluded that iterative digestion with LysC followed by trypsin allowed 31% more protein identifications and a 2-fold gain in observed phosphopeptides for a particular protein. Their work led me to optimize iterative digestion in silico with the hope of identifying a testable digestion strategy that can theoretically achieve complete proteome coverage.

Figure 1: Cartoon describing the multiple-enzyme digestion, filter-aided sample preparation strategy (MED-FASP) from Wiśniewski and Mann. A proteome is digested on top of a size-based filter device and peptides are then spun through the filter. Undigested sequences are retained above the filter because of their length. The process is repeated with various cleavage agents and several peptide pools are collected separately. The peptides are then analyzed by nLC-MS/MS separately and the resulting data are then combined either before or after the database search.

2. Methods

The S. cerevisiae proteome file in FASTA format was downloaded from UniProt on June 20, 2012. Proteome digestion simulations were accomplished using scripts written in [R] [23]. Considered protease specificities include cleavage C-terminal of R/K (trypsin), L (LeuC, a theoretical cleavage agent), E (GluC), and K (LysC). Additionally, simulations utilized chemical digestion agents [24], including cyanogen bromide (CNBr) [25, 26] for cleavage C-terminal of M, 3-bromo-3-methyl-2-(2-nitrophenylthio)-3H-indole (BNPS-skatole) for cleavage C-terminal of W [27], and 2-nitro-5-thiocyanobenzoic acid (NTCB) for cleavage N-terminal of C [28, 29]. Peptide populations were filtered using both length and molecular weight constraints. Since the filtration thresholds affect the proteome coverage prediction, multiple cutoff values are compared. The [R] code is available at

3. Results and Discussion

3.1. Minimum Unique Peptide Length

The probability of a sequence being unique can be calculated assuming a random distribution of sequences in the library. The number of possible sequences of length n is $20^n$. Because $20^5 = 3{,}200{,}000$, any given sequence of length five is expected to occur about once in a library of 3,200,000 random amino acids (roughly the number of amino acids in the S. cerevisiae proteome). As the number of amino acids in the database grows, a peptide sequence must be longer to expect uniqueness. The human proteome contains 11,323,900 amino acids (not including isoforms, downloaded from UniProt on October 22, 2013), which exceeds $20^5$ but not $20^6 = 64{,}000{,}000$; therefore, a sequence must be at least six residues long to be expected to be unique. Of course, due to common sequence motifs there are fewer unique peptide sequences in a proteome than would be found in a random library.
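The arithmetic above can be checked with a short script. This is a minimal sketch in Python rather than the [R] code used for the simulations; the function name `min_unique_length` is hypothetical, and it assumes the same random-library model with 20 equally likely amino acids.

```python
def min_unique_length(database_residues: int, alphabet_size: int = 20) -> int:
    """Smallest n such that alphabet_size**n >= database_residues.

    Under a random-library model, a sequence of this length is expected
    to occur about once (or less) in a database of the given size.
    """
    if database_residues < 1:
        raise ValueError("database_residues must be a positive integer")
    n = 0
    combinations = 1  # number of distinct sequences of length n
    while combinations < database_residues:
        combinations *= alphabet_size
        n += 1
    return n


print(min_unique_length(3_200_000))    # S. cerevisiae-sized library -> 5
print(min_unique_length(11_323_900))   # human proteome size quoted above -> 6
```

Because real proteomes contain repeated motifs, these values are best read as lower bounds on the length needed for uniqueness.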

3.2. Peptide Length Distributions from Various Cleavages

Initial in silico digestions using single cleavage agents were used to compare the resulting peptide lengths (Figure 2). Many peptide sequences are too short to uniquely match a protein. For all digestion agents, the most frequent peptide length produced is one. A single amino acid is released when two target residues are adjacent in the protein sequence. Notably, over 25% of theoretical peptides from trypsin digestion, which cleaves after 11.7% of all residues, are of length one. Not surprisingly, the abundance of the residue targeted for cleavage correlates with the resulting average peptide length (Figure 3); more common cleavage targets produce shorter average peptide lengths. Additionally, the residue-level coverage was found to depend on digestion. Proteome cleavage after more common residues results in depletion of the target residues (Figure 4), which is expected to result from production of peptides that are too short to uniquely match a protein sequence. However, cleavage after rare residues results in enriched coverage of the target residue. This result was also observed by amino acid analysis of proteome digestions in recent work [30].
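A single-agent digestion of this kind can be sketched as follows. This is an illustrative Python sketch, not the author's [R] implementation; the names `CLEAVAGE_RULES`, `digest`, and `length_distribution` are hypothetical. It assumes complete cleavage at every target residue using the specificities listed in the Methods, with no exceptions such as a proline rule.

```python
from collections import Counter

# (target residues, side of the target at which the bond is cut)
# "C" = cut C-terminal of the target, "N" = cut N-terminal of the target.
CLEAVAGE_RULES = {
    "trypsin": ("KR", "C"),
    "LysC": ("K", "C"),
    "GluC": ("E", "C"),
    "LeuC": ("L", "C"),          # theoretical cleavage agent
    "CNBr": ("M", "C"),
    "BNPS-skatole": ("W", "C"),
    "NTCB": ("C", "N"),
}


def digest(sequence, agent):
    """Return the peptides from complete cleavage of one protein sequence."""
    targets, side = CLEAVAGE_RULES[agent]
    offset = 1 if side == "C" else 0
    cuts = [i + offset for i, aa in enumerate(sequence) if aa in targets]
    # Cuts at the very start or end of the chain would yield empty peptides.
    cuts = [c for c in cuts if 0 < c < len(sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def length_distribution(proteins, agent):
    """Count how many peptides of each length a digestion produces."""
    return Counter(len(p) for seq in proteins for p in digest(seq, agent))


print(digest("MKRAAKCGGR", "trypsin"))  # ['MK', 'R', 'AAK', 'CGGR']
print(digest("MKRAAKCGGR", "NTCB"))     # ['MKRAAK', 'CGGR']
```

In the toy sequence, the adjacent K and R produce a length-one peptide, which is the situation described above for neighboring target residues.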

Figure 2: Theoretical peptide length distributions produced from various cleavage agents. (a) Size frequency distributions (density) of peptides from proteome digestion by five real cleavage agents (i.e., trypsin, LysC, GluC, CNBr, and NTCB) and one theoretical cleavage agent (LeuC). The vertical black lines at 7 and 35 indicate general peptide identification size limits. (b) The same distribution focused on the region from 1 to 10 amino acids. (c) The view focused on the region between 30 and 40 amino acids.
Figure 3: Correlation between abundance of the residue targeted for cleavage and the resulting average peptide length. Proteome cleavage targeting abundant residues results in lower average peptide lengths; proteome cleavage targeting rare residues results in higher average peptide length. The line shows the data fit to an exponential equation.
Figure 4: Residue-level coverage observed for various cleavage agents. Proteome cleavage of more common amino acids, such as with (a) trypsin or the theoretical cleavage after (b) leucine, results in residue-specific depletion of the target residues. However, cleavage of rare amino acids, such as (c) methionine or (d) cysteine, results in residue-specific enrichment of the target residues.
3.3. Comparison of Peptide Filtration Parameters

The theoretical distribution of peptides passing through a molecular weight cutoff (MWCO) ultrafilter certainly does not match the actual distribution. Denatured peptides and proteins are effectively larger than folded proteins, and, in fact, it was found that even 30 kDa or 50 kDa cutoff ultrafilters perform better for peptide yield than 10 kDa cutoff ultrafilters [31], despite the inability to identify such large peptide sequences by bottom-up proteomics. Therefore, multiple length constraints were compared for their influence on the predicted proteome coverage. Figure 5 shows how various minimum peptide length values affect residue-level depletion and theoretical proteome coverage. As the minimum length increases, total coverage decreases and depletion of R/K increases. Figure 6 shows how different upper length thresholds change theoretical coverage. Intuitively, raising the upper length limit of identifiable peptides increases total predicted proteome coverage. Interestingly, although total predicted coverage increases, the coverage of R/K stays around 60%. Since peptide MW also determines identifiable peptides and peptides above 5 kDa are unlikely to be identified with current MS/MS technology, an upper limit of 5 kDa was used for subsequent digest simulations. A lower length limit of 7 amino acids was used because this length is more likely to be relevant to actual proteomics experiments.
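The effect of the length window on total and residue-level coverage can be illustrated with a small calculation. The sketch below is in Python and is not the author's [R] code; `tryptic_peptides` and `residue_coverage` are hypothetical names. It assumes complete tryptic cleavage after every K or R and counts a residue as covered when its peptide falls inside the length window, without checking whether the peptide is unique in the proteome.

```python
from collections import Counter


def tryptic_peptides(sequence):
    """Cut after every K or R (complete digestion, no proline rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def residue_coverage(proteins, min_len=7, max_len=35):
    """Return (overall coverage, per-residue coverage) for a length window."""
    total, covered = Counter(), Counter()
    for protein in proteins:
        total.update(protein)
        for peptide in tryptic_peptides(protein):
            if min_len <= len(peptide) <= max_len:
                covered.update(peptide)
    n_total = sum(total.values())
    overall = sum(covered.values()) / n_total if n_total else 0.0
    per_residue = {aa: covered[aa] / total[aa] for aa in total}
    return overall, per_residue


# Toy example: only the 8-residue peptide AAAAAAAK survives a 7-35 window.
overall, per_residue = residue_coverage(["MKRAAAAAAAKGGR"], min_len=7, max_len=35)
print(round(overall, 3))   # 0.571 (8 of 14 residues)
print(per_residue["K"])    # 0.5
print(per_residue["R"])    # 0.0
```

Even in this toy case the cleaved residues K and R are covered less well than the rest of the sequence, which mirrors the depletion trend shown in Figure 5.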

Figure 5: Effect of minimum peptide length on proteome coverage and residue-level depletion. Residue-level coverage predicted after trypsin digestion keeping all peptides with lengths between (a) 1 and 35, (b) 5 and 35, (c) 7 and 35, and (d) 10 and 35.
Figure 6: Effect of upper length limit on predicted proteome coverage. Theoretical residue-level proteome coverage keeping peptides with lengths (a) 5–20, (b) 5–30, (c) 5–40, and (d) 5–100. As the maximum length of identifiable peptides increases, the total theoretical proteome coverage increases, but the depletion of K and R remains.
3.4. Comparison of Digestion Iterations

Several combinations of cleavage agents were simulated to compute theoretical proteome coverage resulting from the iterative MED-FASP (iMED-FASP) strategy. Simulations confirm that iMED-FASP offers theoretically greater coverage of the proteome when the sequence of digestions starts with the cleavage agent targeting the rarest residue (Table 1). As expected, reversal of the optimal digestion sequence results in a negligible improvement to proteome coverage as compared to the limit from using trypsin digestion alone.
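The iterative scheme can be sketched in a few lines. The following Python sketch is an illustration under stated assumptions, not the author's [R] code: after each digest, peptides heavier than the MW cutoff are retained for the next cleavage agent, lighter peptides flow through, and flowthrough peptides with at least 7 residues count as covered. The names `cleave`, `peptide_mass`, and `iterative_coverage` are hypothetical, masses use standard average residue masses, and unknown residue codes are given a placeholder mass of 110 Da.

```python
# Standard average residue masses in Da; one water is added per peptide.
AVG_RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02

# (target residues, offset of the cut relative to the target index)
RULES = {"NTCB": ("C", 0), "CNBr": ("M", 1), "LysC": ("K", 1), "trypsin": ("KR", 1)}


def cleave(sequence, agent):
    """Complete cleavage of one chain by one agent."""
    targets, offset = RULES[agent]
    cuts = [i + offset for i, aa in enumerate(sequence) if aa in targets]
    bounds = [0] + [c for c in cuts if 0 < c < len(sequence)] + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def peptide_mass(peptide):
    # Assumption: unknown residue codes get a placeholder mass of 110 Da.
    return sum(AVG_RESIDUE_MASS.get(aa, 110.0) for aa in peptide) + WATER


def iterative_coverage(proteins, agents, mw_cutoff=5000.0, min_len=7):
    """Fraction of residues in flowthrough peptides with >= min_len residues."""
    total = sum(len(p) for p in proteins)
    retained, covered = list(proteins), 0
    for agent in agents:
        still_retained = []
        for chain in retained:
            for peptide in cleave(chain, agent):
                if peptide_mass(peptide) > mw_cutoff:
                    still_retained.append(peptide)   # stays above the filter
                elif len(peptide) >= min_len:
                    covered += len(peptide)          # identifiable flowthrough
        retained = still_retained
    return covered / total if total else 0.0


toy = ["AKAKAKAKAKM" + "AKAKAKAKAK"]
print(iterative_coverage(toy, ["CNBr", "trypsin"]))  # 1.0
print(iterative_coverage(toy, ["trypsin", "CNBr"]))  # 0.0
```

In the toy sequence, cleaving at the rare methionine first releases two peptides long enough to identify, whereas trypsin first reduces the chain to fragments shorter than 7 residues. This is the ordering effect summarized in Table 1, shown here only on an artificial example.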

Table 1: Theoretical upper limits of coverage upon digestion with various cleavage agents using the iMED-FASP strategy. Iterative cleavage of the proteome starting with the rarest amino acids first results in the greatest theoretical proteome coverage of 92.9%. The reversed sequence of cleavage provides a minimal improvement to theoretical proteome coverage. Peptides were filtered after each digest keeping those with MW > 5 kDa for additional digestion. The final “flowthrough” peptides were filtered keeping only sequences with at least 7 residues.
3.5. Proposed Iterative Digestion Strategy and Challenges Therein

An ideal iterative cleavage strategy must limit sample processing steps and must take place under conditions that are compatible with the ultrafiltration device. Further, because tryptophan fluorescence can be used to quantify peptide yield from each digestion, chemical cleavage after tryptophan should initially be omitted since it destroys the fluorophore that can be used to monitor peptide yield. Therefore, a testable, ultrafilter-compatible strategy, with a balance between sample processing and predicted gains in coverage, is the sequence: NTCB, CNBr, LysC, and trypsin.

Implementation of this method introduces several technical hurdles that must be addressed. First, the buffer conditions required for each separate digestion need to be planned. The requisite use of an ultrafiltration device fortunately allows easy buffer/denaturant exchange to accommodate the different conditions. However, researchers should carefully consider which conditions are best for each step and use controls to ensure efficient digestion at each step. Limitations of the ultrafilter must also be accounted for. For example, cleavage after methionine by CNBr is usually carried out at a formic acid concentration that would degrade the ultrafilter membrane. Instead, HCl could be substituted to enable use of CNBr with the iterative digestion MED-FASP strategy. Another key consideration is the choice of peptide fragmentation. Nontryptic peptides are less efficiently fragmented by commonly used peptide dissociation methods (e.g., collision-induced dissociation). Therefore, I recommend that any attempt to assess this theory should use electron-transfer dissociation (ETD) [32], which produces more complete fragment ion series that depend less on peptide sequence. Database searching also presents a challenge because the peptide pools will lack defined termini, which therefore requires that the database search be carried out with “no enzyme” specificity. A fast and effective choice for database searching with “no enzyme” specificity is MSGFDB [13], which can learn scoring parameters from a set of annotated peptide-spectrum matches in order to improve the sensitivity of peptide identification. Finally, it should be noted that missed cleavages, which are unavoidable in real digestions, will result in deviations from these simulations. A feature that allows user-defined missed cleavage propensities has been implemented in the code, and an example of the effects is shown in the supplemental figure in the Supplementary Material available online. The missed cleavages result in noisy length distributions. Missed cleavages help limit the proportion of short peptides, suggesting that optimization of partial digestions might further improve proteome coverage.
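One simple way to model missed cleavages is to skip each potential cleavage site with a fixed probability. The Python sketch below illustrates that idea only; it is not the author's [R] implementation, whose handling of missed cleavage propensities may differ. The function name `digest_with_missed_cleavages` and the single per-site parameter `miss_prob` are assumptions, and cleavage is taken to be C-terminal of the target residues.

```python
import random


def digest_with_missed_cleavages(sequence, targets, miss_prob, rng):
    """Cleave C-terminal of each target residue, skipping sites at random.

    miss_prob is the assumed probability that any one site is left uncut;
    rng is a random.Random instance so that runs can be reproduced.
    """
    if not 0.0 <= miss_prob <= 1.0:
        raise ValueError("miss_prob must be between 0 and 1")
    if not sequence:
        return []
    peptides, start = [], 0
    # The final residue is excluded: cutting after it would add nothing.
    for i, aa in enumerate(sequence[:-1]):
        if aa in targets and rng.random() >= miss_prob:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])
    return peptides


rng = random.Random(0)  # fixed seed for reproducibility
print(digest_with_missed_cleavages("AKGGRKAAK", "KR", 0.0, rng))  # ['AK', 'GGR', 'K', 'AAK']
print(digest_with_missed_cleavages("AKGGRKAAK", "KR", 1.0, rng))  # ['AKGGRKAAK']
```

With intermediate values of `miss_prob`, neighboring fragments merge at random, so very short peptides become less frequent, consistent with the observation above that missed cleavages limit the proportion of short peptides.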

4. Conclusions

This work provides a publicly accessible computational framework for simulation of iterative proteome digestion that can be used with any input protein sequence database to optimize proteome coverage. Further, this work demonstrates how the choice of proteome digestion agent affects the predicted proteome coverage due to the distribution of peptide lengths that are produced. This work also shows how various digestion agents affect proteome coverage at the residue level. Proteome cleavage targeting common residues results in depletion of the cleaved residue, but proteome cleavage after rare residues results in enrichment of the target residue. Finally, this paper finds that the best theoretical proteome coverage is achieved by an iterative digestion strategy that limits production of short peptides by cleaving the rarest residues first.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.


Jesse G. Meyer was supported by the NIH interfaces training Grant (no. T32EB009380) and funding from the NSF (MCB1244506).


  1. Y. Zhang, B. R. Fonslow, B. Shan, M.-C. Baek, and J. R. Yates, “Protein analysis by shotgun/bottom-up proteomics,” Chemical Reviews, vol. 113, no. 4, pp. 2343–2394, 2013.
  2. B. Meyer, D. G. Papasotiriou, and M. Karas, “100% protein sequence coverage: a modern form of surrealism in proteomics,” Amino Acids, vol. 41, no. 2, pp. 291–310, 2011.
  3. J. M. Gilmore and M. P. Washburn, “Advances in shotgun proteomics and the analysis of membrane proteomes,” Journal of Proteomics, vol. 73, no. 11, pp. 2078–2091, 2010.
  4. M. Rey, H. Mrázek, P. Pompach et al., “Effective removal of nonionic detergents in protein mass spectrometry, hydrogen/deuterium exchange, and proteomics,” Analytical Chemistry, vol. 82, no. 12, pp. 5107–5116, 2010.
  5. A. Motoyama and J. R. Yates III, “Multidimensional LC separations in shotgun proteomics,” Analytical Chemistry, vol. 80, no. 19, pp. 7187–7193, 2008.
  6. Y. Wang, F. Yang, M. A. Gritsenko et al., “Reversed-phase chromatography with multiple fraction concatenation strategy for proteome profiling of human MCF10A cells,” Proteomics, vol. 11, no. 10, pp. 2019–2026, 2011.
  7. L. H. Betancourt, P.-J. de Bock, A. Staes et al., “SCX charge state selective separation of tryptic peptides combined with 2D-RP-HPLC allows for detailed proteome mapping,” Journal of Proteomics, vol. 91, pp. 164–171, 2013.
  8. A. Michalski, E. Damoc, J.-P. Hauschild et al., “Mass spectrometry-based proteomics using Q exactive, a high-performance benchtop quadrupole orbitrap mass spectrometer,” Molecular & Cellular Proteomics, vol. 10, no. 9, 2011.
  9. J. V. Olsen, J. C. Schwartz, J. Griep-Raming et al., “A dual pressure linear ion trap orbitrap instrument with very high sequencing speed,” Molecular & Cellular Proteomics, vol. 8, no. 12, pp. 2759–2769, 2009.
  10. C. K. Frese, A. F. M. Altelaar, M. L. Hennrich et al., “Improved peptide identification by targeted fragmentation using CID, HCD and ETD on an LTQ-Orbitrap velos,” Journal of Proteome Research, vol. 10, no. 5, pp. 2377–2388, 2011.
  11. R. J. Chalkley, P. R. Baker, K. F. Medzihardszky, A. J. Lynn, and A. L. Burlingame, “In-depth analysis of tandem mass spectrometry data from disparate instrument types,” Molecular & Cellular Proteomics, vol. 7, no. 12, pp. 2386–2398, 2008.
  12. Y. Shen, N. Tolić, S. O. Purvine, and R. D. Smith, “Improving collision induced dissociation (CID), high energy collision dissociation (HCD), and electron transfer dissociation (ETD) fourier transform MS/MS degradome-peptidome identifications using high accuracy mass information,” Journal of Proteome Research, vol. 11, no. 2, pp. 668–677, 2012.
  13. S. Kim, N. Mischerikow, N. Bandeira et al., “The generating function of CID, ETD, and CID/ETD pairs of tandem mass spectra: applications to database search,” Molecular & Cellular Proteomics, vol. 9, no. 12, pp. 2840–2852, 2010.
  14. S. Linnarsson, “Recent advances in DNA sequencing methods—general principles of sample preparation,” Experimental Cell Research, vol. 316, no. 8, pp. 1339–1343, 2010.
  15. B. Rietschel, T. N. Arrey, B. Meyer et al., “Elastase digests: new ammunition for shotgun membrane proteomics,” Molecular & Cellular Proteomics, vol. 8, no. 5, pp. 1029–1043, 2009.
  16. G. Choudhary, S.-L. Wu, P. Shieh, and W. S. Hancock, “Multiple enzymatic digestion for enhanced sequence coverage of proteins in complex proteomic mixtures using capillary LC with ion trap MS/MS,” Journal of Proteome Research, vol. 2, no. 1, pp. 59–67, 2003.
  17. H. Moura, R. R. Terilli, A. R. Woolfitt et al., “Proteomic analysis and label-free quantification of the large Clostridium difficile toxins,” International Journal of Proteomics, vol. 2013, Article ID 293782, 10 pages, 2013.
  18. B. Q. Tran, C. Hernandez, P. Waridel et al., “Addressing trypsin bias in large scale (Phospho)proteome analysis by size exclusion chromatography and secondary digestion of large post-trypsin peptides,” Journal of Proteome Research, vol. 10, no. 2, pp. 800–811, 2011.
  19. N. Neuhauser, N. Nagaraj, P. McHardy et al., “High performance computational analysis of large-scale proteome data sets to assess incremental contribution to coverage of the human genome,” Journal of Proteome Research, vol. 12, no. 6, pp. 2858–2868, 2013.
  20. D. L. Swaney, C. D. Wenger, and J. J. Coon, “Value of using multiple proteases for large-scale mass spectrometry-based proteomics,” Journal of Proteome Research, vol. 9, no. 3, pp. 1323–1329, 2010.
  21. J. R. Wiśniewski and M. Mann, “Consecutive proteolytic digestion in an enzyme reactor increases depth of proteomic and phosphoproteomic analysis,” Analytical Chemistry, vol. 84, no. 6, pp. 2631–2637, 2012.
  22. J. R. Wiśniewski, A. Zougman, N. Nagaraj, and M. Mann, “Universal sample preparation method for proteome analysis,” Nature Methods, vol. 6, no. 5, pp. 359–362, 2009.
  23. R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2008.
  24. D. L. Crimmins, S. M. Mische, and N. D. Denslow, “Chemical cleavage of proteins in solution,” in Current Protocols in Protein Science, John Wiley & Sons, New York, NY, USA, 2001.
  25. R. Kaiser and L. Metzka, “Enhancement of cyanogen bromide cleavage yields for methionyl-serine and methionyl-threonine peptide bonds,” Analytical Biochemistry, vol. 266, no. 1, pp. 1–8, 1999.
  26. Y. A. Andreev, S. A. Kozlov, A. A. Vassilevski, and E. V. Grishin, “Cyanogen bromide cleavage of proteins in salt and buffer solutions,” Analytical Biochemistry, vol. 407, no. 1, pp. 144–146, 2010.
  27. M. M. Vestling, M. A. Kelly, and C. Fenselau, “Optimization by mass spectrometry of a tryptophan-specific protein cleavage reaction,” Rapid Communications in Mass Spectrometry, vol. 8, no. 9, pp. 786–790, 1994.
  28. G. R. Jacobson, M. H. Schaffer, G. R. Stark, and T. C. Vanaman, “Specific chemical cleavage in high yield at the amino peptide bonds of cysteine and cystine residues,” The Journal of Biological Chemistry, vol. 248, no. 19, pp. 6583–6591, 1973.
  29. M. Iwasaki, T. Masuda, M. Tomita, and Y. Ishihama, “Chemical cleavage-assisted tryptic digestion for membrane proteome analysis,” Journal of Proteome Research, vol. 8, no. 6, pp. 3169–3175, 2009.
  30. J. G. Meyer, S. Kim, D. Maltby, M. Ghassemian, N. Bandeira, and E. A. Komives, “Expanding proteome coverage with orthogonal-specificity alpha-lytic proteases,” Molecular & Cellular Proteomics, vol. 13, no. 3, pp. 823–835, 2014.
  31. J. R. Wiśniewski, D. F. Zielinska, and M. Mann, “Comparison of ultrafiltration units for proteomic and N-glycoproteomic analysis by the filter-aided sample preparation method,” Analytical Biochemistry, vol. 410, no. 2, pp. 307–309, 2011.
  32. J. E. P. Syka, J. J. Coon, M. J. Schroeder, J. Shabanowitz, and D. F. Hunt, “Peptide and protein sequence analysis by electron transfer dissociation mass spectrometry,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 26, pp. 9528–9533, 2004.