Abstract

In the postgenome era, biologists have sought to measure the complete complement of proteins, the proteome; this endeavor is termed proteomics. Currently, the most effective method to measure the proteome is shotgun, or bottom-up, proteomics, in which the proteome is digested into peptides; the peptides are identified by mass spectrometry, and the proteins present are then inferred. Despite continuous improvements to every step of the shotgun proteomics workflow, observed proteome coverage is often low, and some proteins are identified by only a single peptide sequence. Complete proteome sequence coverage would allow comprehensive characterization of RNA splicing variants and of all posttranslational modifications, which would drastically improve the accuracy of biological models. There are many reasons for the sequence coverage deficit, but ultimately peptide length determines sequence observability: peptides that are too short are lost because they match many protein sequences and their true origin is ambiguous, while the maximum observable peptide length is set by several analytical challenges. This paper explores computationally how the peptide lengths produced by several common proteome digestion methods limit observable proteome coverage. Iterative proteome cleavage strategies are also explored. These simulations reveal that proteome coverage can be maximized by an iterative digestion protocol involving multiple proteases and chemical cleavages, which theoretically allows 92.9% proteome coverage.

1. Introduction

In the postgenome era, biologists have sought system-wide measurements of RNA, proteins, and metabolites, termed transcriptomics, proteomics, and metabolomics, respectively. Shotgun, or bottom-up, proteomics has become the most comprehensive method for proteome identification and quantification [1]. However, observed protein sequence coverage is often low. The ability to cover 100% of protein sequences in a biological system was likened to surrealism in a recent review by Meyer et al. [2]. Multiple steps in the traditional shotgun proteomics workflow contribute to the deficit in observed sequence coverage, including proteome isolation, proteome digestion, peptide separation, peptide MS/MS, and identification by peptide-spectrum matching. Proteome isolation has been extensively evaluated [3, 4]. Several types of peptide separation have been explored [5–7]. Mass spectrometers are becoming more sensitive and versatile [8–10]. Peptide-spectrum matching algorithms are adapting to new data types [11] and becoming more sensitive [12, 13]. Proteome fragmentation into sequenceable peptides is one step with significant room for improvement. DNA sequencing relies on fragmentation of the sequence into readable pieces by mechanical force [14], which produces a nearly uniform distribution of fragment lengths. In comparison, proteome fragmentation is generally accomplished by targeting one or more amino acid residues for cleavage; protein cleavage can therefore be likened to a Poisson process, which produces an exponential distribution of peptide lengths.

Numerous papers have described the application of new digestion strategies for proteome analysis [15–18]; however, no single strategy has emerged as optimal, and the greatest observed proteome coverage has plateaued around 25%. For example, 24.6% of the human proteome was recently observed [19], but this required over 1,000 MS/MS data files and a new high-performance data analysis package that together allowed identification of over 260,000 peptide sequences. Similarly, multiple protease digests of yeast resulted in 25.2% coverage [20]. Therefore, improved strategies for proteome digestion are needed to allow observation of a complete proteome.

An innovative example demonstrating the application of multiple enzyme digestion (MED) was recently published by Wiśniewski and Mann [21], who demonstrated the utility of multienzyme digestion coupled to filter-aided sample preparation [22] (MED-FASP, Figure 1). This work extends an earlier study that used size exclusion to isolate long tryptic peptides for additional digestion [18]. Wiśniewski and Mann compared the gains afforded by iterative digestion with various proteases (i.e., GluC, ArgC, LysC, or AspN) followed by trypsin. They concluded that iterative digestion with LysC followed by trypsin allowed 31% more protein identifications and a 2-fold gain in observed phosphopeptides for a particular protein. Their work led me to optimize iterative digestion in silico with the hope of identifying a testable digestion strategy that can theoretically achieve complete proteome coverage.

2. Methods

The S. cerevisiae proteome file in FASTA format was downloaded from UniProt on June 20, 2012. Proteome digestion simulations were accomplished using scripts written in R [23]. Considered protease specificities include cleavage C-terminal of R/K (trypsin), L (LeuC, a theoretical cleavage agent), E (GluC), and K (LysC). Additionally, simulations utilized chemical digestion agents [24], including cyanogen bromide (CNBr) [25, 26] for cleavage C-terminal of M, 3-bromo-3-methyl-2-(2-nitrophenylthio)-3H-indole (BNPS-skatole) for cleavage C-terminal of W [27], and 2-nitro-5-thiocyanobenzoic acid (NTCB) for cleavage N-terminal of C [28, 29]. Peptide populations were filtered using both length and molecular weight constraints. Since the filtration thresholds affect the proteome coverage prediction, multiple cutoff values are compared. The R code is available at https://www.github.com/jgmeyerucsd/ProteomeDigestSim.
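For illustration, a minimal R sketch of a single-agent digestion is shown below. It is not the published script (available at the URL above) but captures the core operation: splitting each protein sequence after every target residue.

    # Minimal sketch of single-agent in silico digestion; not the published
    # script. Cleaves C-terminal of each target residue, no missed cleavages.
    digest <- function(sequence, targets = c("R", "K")) {
      residues <- strsplit(sequence, "")[[1]]
      sites <- which(residues %in% targets)   # candidate cleavage sites
      starts <- c(1, sites + 1)
      ends <- c(sites, length(residues))
      keep <- starts <= ends                  # drop empty C-terminal fragment
      mapply(function(s, e) paste(residues[s:e], collapse = ""),
             starts[keep], ends[keep])
    }

    digest("MKTAYIAKQR")   # returns "MK" "TAYIAK" "QR"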

3. Results and Discussion

3.1. Minimum Unique Peptide Length

The probability of a sequence being unique can be calculated assuming a random distribution of sequences in the library. The number of possible sequences of length n is 20^n. Therefore, any given sequence of length five is expected to occur once in a random library of 3,200,000 amino acids (roughly the number of amino acids in the S. cerevisiae proteome). As the number of amino acids in the database grows, a peptide sequence must be longer for uniqueness to be expected. The human proteome contains 11,323,900 amino acids (not including isoforms; downloaded from UniProt on October 22, 2013), and, therefore, a sequence must be at least six residues long to be expected to be unique. Of course, due to common sequence motifs, there are fewer unique peptide sequences in a proteome than would be found in a random library.
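This reasoning amounts to finding the smallest length n for which the number of possible sequences, 20^n, is at least the size of the residue library. A short R sketch (the function name is my own choice, not from the published code):

    # Smallest peptide length n such that 20^n >= N residues in the library;
    # at this length a given sequence is expected to occur at most once.
    min_unique_length <- function(N) {
      n <- 1
      while (20^n < N) n <- n + 1
      n
    }
    min_unique_length(3200000)    # 5, for a yeast-sized library
    min_unique_length(11323900)   # 6, for the human proteome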

3.2. Peptide Length Distributions from Various Cleavages

Initial in silico digestions using single cleavage agents were used to compare the resulting peptide lengths (Figure 2). Many peptide sequences are too short to uniquely match a protein. For all digestion agents, the most frequent peptide length produced is one; single amino acids arise wherever two target residues are adjacent in a protein. Notably, over 25% of theoretical peptides from trypsin digestion, which cleaves after 11.7% of all residues, are of length one. Not surprisingly, the proportion of residues targeted for cleavage correlates with the resulting average peptide length (Figure 3); more common cleavage targets produce shorter average peptide lengths. Additionally, the residue-level coverage was found to depend on the digestion agent. Proteome cleavage after more common residues results in depletion of the target residues (Figure 4), which is expected to result from the production of peptides that are too short to uniquely match a protein sequence. However, cleavage after rare residues results in enriched coverage of the target residue. This result was also observed by amino acid analysis of proteome digestions in recent work [30].
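Given the digest() sketch above, a peptide length distribution can be tabulated directly; the two sequences below are arbitrary toy examples, not drawn from the yeast FASTA file:

    # Peptide length distribution from a single-agent digest of a toy proteome
    proteome <- c("MKTAYIAKQR", "MEEPQSDPSVEPPLSQK")
    peps <- unlist(lapply(proteome, digest))
    table(nchar(peps))   # counts of peptides at each length
    mean(nchar(peps))    # average peptide length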

3.3. Comparison of Peptide Filtration Parameters

The theoretical distribution of peptides passing through a molecular weight cutoff (MWCO) ultrafilter certainly does not match the actual distribution. Denatured peptides and proteins are effectively larger than folded proteins, and, in fact, even 30 kDa or 50 kDa cutoff ultrafilters were found to give better peptide yield than 10 kDa cutoff ultrafilters [31], despite the inability to identify such large peptide sequences by bottom-up proteomics. Therefore, multiple length constraints were compared for their influence on the predicted proteome coverage. Figure 5 shows how various minimum peptide length values affect residue-level depletion and theoretical proteome coverage. As the minimum length increases, total coverage decreases and depletion of R/K increases. Figure 6 shows how different upper length thresholds change theoretical coverage. Intuitively, raising the upper length limit of identifiable peptides increases total predicted proteome coverage. Interestingly, although total predicted coverage increases, the coverage of R/K stays around 60%. Since peptide MW also determines which peptides are identifiable, and peptides above 5 kDa are unlikely to be identified with current MS/MS technology, an upper limit of 5 kDa was used for subsequent digest simulations. A lower length limit of 7 amino acids was used because this length is more likely to be relevant to actual proteomics experiments.
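These constraints amount to a simple filtering step. A sketch is shown below; it approximates molecular weight from an average residue mass, which is an assumption of the sketch (the published code may compute masses exactly):

    # Filter peptides by minimum length and maximum molecular weight.
    # MW is approximated as ~110 Da per residue plus 18 Da for water;
    # this average-mass shortcut is a simplification for illustration.
    filter_peptides <- function(peps, min_len = 7, max_mw = 5000) {
      approx_mw <- nchar(peps) * 110 + 18
      peps[nchar(peps) >= min_len & approx_mw <= max_mw]
    }
    filter_peptides(peps)   # peptides from the digest above passing both cuts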

3.4. Comparison of Digestion Iterations

Several combinations of cleavage agents were simulated to compute the theoretical proteome coverage resulting from an iterative MED-FASP (iMED-FASP) strategy. The simulations confirm that iMED-FASP offers theoretically greater coverage of the proteome when the sequence of digestions starts with the protease targeting the rarest residue (Table 1). As expected, reversing the optimal digestion order yields a negligible improvement in proteome coverage over the limit obtained with trypsin digestion alone.
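The iterative scheme can be sketched by reusing digest() from above: after each round, peptides passing the length filter are collected, while overlong peptides are retained on the filter and carried into the next digestion. For simplicity, this sketch treats every agent as cleaving C-terminal of its target and uses a length-only filter (45 residues roughly approximating the 5 kDa ceiling); both are simplifying assumptions.

    # Sketch of iterative digestion with retention of overlong peptides.
    # Each element of 'agents' is the residue set targeted in that round.
    iterative_digest <- function(proteome, agents, min_len = 7, max_len = 45) {
      retained <- proteome
      collected <- character(0)
      for (targets in agents) {
        peps <- unlist(lapply(retained, digest, targets = targets))
        pass <- nchar(peps) >= min_len & nchar(peps) <= max_len
        collected <- c(collected, peps[pass])    # observable peptides
        retained <- peps[nchar(peps) > max_len]  # re-digest next round
      }
      collected
    }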

3.5. Proposed Iterative Digestion Strategy and Challenges Therein

An ideal iterative cleavage strategy must limit the number of sample processing steps and must take place under conditions compatible with the ultrafiltration device. Further, because tryptophan fluorescence can be used to quantify the peptide yield from each digestion, chemical cleavage after tryptophan should initially be omitted, since it destroys the fluorophore. Therefore, a testable, ultrafilter-compatible strategy that balances sample processing against predicted gains in coverage is the sequence NTCB, CNBr, LysC, and trypsin.
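Using the iterative_digest() sketch above, the proposed order can be written directly; note that the sketch again treats NTCB as a simple single-residue C-terminal cleavage, although NTCB actually cleaves N-terminal of C:

    # Proposed order, rarest target first:
    # NTCB (C), CNBr (M), LysC (K), trypsin (R/K)
    proposed <- list("C", "M", "K", c("R", "K"))
    iterative_digest(proteome, proposed)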

Implementation of this method introduces several technical hurdles that must be addressed. First, the buffer conditions required for each separate digestion need to be planned. Fortunately, the requisite use of an ultrafiltration device allows easy buffer/denaturant exchange to accommodate the different conditions. However, researchers should carefully consider which conditions are best for each step and use controls to ensure efficient digestion at each step. Limitations of the ultrafilter must also be accounted for. For example, cleavage after methionine by CNBr is usually carried out at a formic acid concentration that would degrade the ultrafilter membrane; HCl could be substituted instead to enable use of CNBr with the iterative MED-FASP strategy. Another key consideration is the choice of peptide fragmentation. Nontryptic peptides are fragmented less efficiently by commonly used peptide dissociation methods (e.g., collision-induced dissociation). Therefore, I recommend that any attempt to test this theory use electron-transfer dissociation (ETD) [32], which produces more complete fragment ion series that depend less on peptide sequence. Database searching also presents a challenge because the peptide pools will lack defined termini, which requires that the database search be carried out with “no enzyme” specificity. A fast and effective choice for database searching with “no enzyme” specificity is MSGFDB [13], which can learn scoring parameters from a set of annotated peptide-spectrum matches to improve the sensitivity of peptide identification. Finally, it should be noted that missed cleavages, which occur in real digests, will result in deviations from these simulations. A feature allowing user-defined missed cleavage propensities has been implemented in the code, and an example of the effects is shown in the supplemental figure in the Supplementary Material available online at http://dx.doi.org/10.1155/2014/960902. Missed cleavages produce noisy length distributions, and they limit the proportion of short peptides, suggesting that optimization of partial digestions might further improve proteome coverage.
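A stochastic missed cleavage can be sketched by skipping each candidate site with a given probability; the parameter name below is illustrative and not necessarily that used in the published code:

    # Digestion with stochastic missed cleavages: each site is cleaved with
    # probability 1 - p_miss, so higher p_miss yields longer peptides.
    digest_missed <- function(sequence, targets = c("R", "K"), p_miss = 0.1) {
      residues <- strsplit(sequence, "")[[1]]
      sites <- which(residues %in% targets)
      sites <- sites[runif(length(sites)) > p_miss]  # randomly skip sites
      starts <- c(1, sites + 1)
      ends <- c(sites, length(residues))
      keep <- starts <= ends
      mapply(function(s, e) paste(residues[s:e], collapse = ""),
             starts[keep], ends[keep])
    }

    set.seed(1)
    digest_missed("MKTAYIAKQR", p_miss = 0.5)  # longer than a complete digest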

4. Conclusions

This work provides a publicly accessible computational framework for the simulation of iterative proteome digestion that can be used with any input protein sequence database to optimize proteome coverage. Further, this work demonstrates how the choice of proteome digestion agent affects the predicted proteome coverage through the distribution of peptide lengths produced. This work also shows how various digestion agents affect proteome coverage at the residue level: cleavage targeting common residues depletes coverage of the cleaved residue, whereas cleavage after rare residues enriches coverage of the target residue. Finally, this paper finds that the best theoretical proteome coverage is achieved by an iterative digestion strategy that limits the production of short peptides by cleaving the rarest residues first.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

Jesse G. Meyer was supported by the NIH Interfaces Training Grant (no. T32EB009380) and funding from the NSF (MCB1244506).

Supplementary Materials

Supplemental figure legend 1: Effect of missed cleavages on peptide length distributions produced by in silico digestion with AspN. Ten simulations each at 1%, 10%, and 50% missed cleavage propensity were run, and each distribution was plotted. As the missed cleavage propensity increases, the peptides become longer and the variance increases.
