Abstract

The predictive capability of a retention time prediction model based on quantitative structure-retention relationships (QSRR) was tested. The QSRR model was derived using a set of peptides identified with the highest scores and originating from 8 known proteins annotated as model ones. The predictive ability of the QSRR model was verified with a Bacillus subtilis proteome digest after separation and identification of the peptides by LC-ESI-MS/MS. That ability was tested with three sets of testing peptides assigned to proteins identified with different levels of confidence. First, the set of peptides identified with the highest scores achieved in the search was considered; here, proteins identified on the basis of more than one peptide were taken into account. Furthermore, proteins identified on the basis of just one peptide were also considered and, depending on whether their scores were above or below the assumed threshold, were analyzed in two separate sets. The QSRR approach was applied as an additional constraint in proteomic research, verifying the results of the MS/MS ion search and confirming the correctness of the peptide identifications, along with indicating potential false positives.

1. Introduction

Liquid chromatography (LC) combined with tandem mass spectrometry (MS/MS) plays an essential role in the field of protein research. In this technique, proteins and peptides are separated by liquid chromatography and then identified by tandem mass spectrometry. Thanks to the high resolution, accuracy, and sensitivity of LC-MS/MS systems equipped with sophisticated fragmentation techniques, not only can simple proteins be investigated directly, but research at the level of whole proteomes has also become possible [1]. However, the identification of proteins and peptides from biological matrices is still an analytical challenge because of the great complexity of the samples, the enormous concentration ranges of the occurring proteins, and the lack of proper standards. Together, these factors limit exact and precise peptide or protein identification and, consequently, proteome coverage [2].

Proteomic research also requires higher throughput of protein identification in LC-MS/MS. Peptide identification in MS/MS is based on matching the parent ion and the values of the daughter ions. This procedure allows an identification confidence to be assigned to each particular peptide, which contributes independently to the overall confidence of the protein identification. One of the most commonly applied methods for protein identification in complex samples relies on the correlation algorithm Sequest proposed by Yates and coworkers [3–6]. This algorithm matches the tandem mass spectrometry data of the investigated peptide with the corresponding data from a protein database. To increase the reliability of the identification, several statistical parameters have been considered. First, the difference between the normalized cross-correlation functions for the first and second ranked results (ΔCn) is applied to indicate a correctly selected peptide sequence. The other criteria are the cross-correlation score between the observed peptide fragment mass spectrum and the theoretically predicted one (Xcorr), the preliminary score based on the number of ions in the MS/MS spectrum that match the experimental data (Sp), the rank of the particular match during the preliminary scoring (RSp), and the ions value (I) describing how many of the observed ions match the theoretical ions for the listed peptide. Currently, the criteria most often applied in protein studies are Xcorr and ΔCn. Washburn et al. [7] applied the following criteria for the correctness of peptide identification: Xcorr above 1.9 for singly charged fully tryptic peptides, over 2.2 and 3.75 for fully or partially tryptic doubly and triply charged peptides, respectively, and ΔCn values higher than 0.08. On the other hand, in the studies performed by Peng et al. [8] the peptides were classified as properly identified when Xcorr was, in the case of fully tryptic peptides, higher than 2.0, 1.5, or 3.3 for the charge states of 1+, 2+, and 3+, respectively, and over 3.0 (2+ charged) or 4.0 (3+ charged) for partially tryptic peptides, when the ΔCn score was above 0.08. The relationship between the application of different filtering criteria and the degree of false positive identifications has also been demonstrated by Qian et al. [9]. There it was shown that all previously applied filtering criteria were derived using either relatively simple proteomes (e.g., the yeast proteome) or standard proteins. When these criteria are extended to considerably more complex mammalian proteomes, especially the human proteome, the degree of false positive identifications remains problematic and calls for improved strategies to distinguish correct from incorrect identifications. Therefore, to decrease the probability of a random match, which grows with the size of the protein database, two new sets of filtering criteria were independently developed for human cell line and human plasma samples [9]. In both sets, separate Xcorr thresholds were defined for fully tryptic and for partially tryptic peptides at each of the 1+, 2+, and 3+ charge states, together with a ΔCn threshold that was the same for all the criteria.

Nevertheless, considering the variety and dynamic range of the proteins occurring in different organisms, there is still a possibility of false positive or false negative identification. Growing concerns about the quality of MS data have led to various ideas for strengthening protein identification by using bioinformatic methods, for example, decoy search strategies [10], or additional information obtained during the analysis, for example, peptide pI or retention time [11]. The retention time is a very practical parameter in proteomics, as it is easy to obtain from LC-MS data and does not require much instrumental effort [2, 12]. Comparing the experimental and predicted retention times of the peptides can be used to examine the correctness of the identification and then to exclude the incorrectly identified ones. However, to predict peptide retention properly, highly accurate models should be developed. Recently, some models have been proposed which quantitatively characterize the structure of a peptide and predict its gradient RP-LC retention at given separation conditions [13, 14].

Liquid chromatography (LC) is an analytical technique which can provide a great amount of quantitative, comparable, and reproducible (retention) data for large sets of structurally diversified compounds (analytes). On the other hand, chromatographic retention time can be considered a chemical structure dependent parameter, which is constant for given separation conditions (mobile phase composition, stationary phase, temperature, pH). Because of that, quantitative structure (chromatographic) retention relationships (QSRR) have been considered as a model approach to establishing a strategy of retention prediction. Such predictions of peptide retention require highly accurate models [15–17]. In particular, in proteomics, the structural descriptors obtained from QSRR studies can contribute to better predictions of retention times and therefore strengthen peptide identification.

Several previous reports [18–21] showed that the retention of peptides in reversed-phase liquid chromatography (RP-LC) depends on their amino acid composition. There, regression analysis was used to derive regression coefficients representing the contribution of each amino acid in the peptide's sequence to its retention. This approach was applied in proteomic analysis to predict the retention times of peptides from tryptic digests [22]. It was then also employed to increase the reliability of peptide identification and to check the predictive capability of artificial neural networks (ANNs) by Petritis et al. [23] and by Shinoda et al. [24], where the created ANN was applied to predict the retention times of peptides from the Escherichia coli proteome. The correlation between amino acid composition and peptide retention time was also used to support the identity information given by tandem mass spectrometry for the peptides from the Drosophila melanogaster proteome, in order to exclude false positive identifications [25].

Recently, a QSRR model based on multiple linear regression has been proposed [26] to quantitatively characterize the structure of a peptide and to predict its gradient RP-LC retention at established separation conditions. The logarithm of the sum of gradient retention times of the amino acids composing the individual peptide, log SumAA, the logarithm of the peptide van der Waals volume, log VDWVol, and the logarithm of its calculated n-octanol-water partition coefficient, clog P, were employed [26–29].

The aim of the study was to derive a retention time prediction model based on quantitative structure-retention relationships (QSRRs) and to check its predictive capability. The newly modified QSRR model was derived using a set of peptides identified with the highest scores and originating from eight model proteins [13, 24, 30–32]. Therefore, no synthesized peptides with known amino acid sequences were used to derive and check the model [14, 31]. Moreover, the descriptors applied in the new QSRR model were obtained in a new manner that is more convenient from a practical point of view. Finally, its predictive ability was supported by further investigation with a Bacillus subtilis proteome digest (rather than, as previously, only with synthesized peptides with known amino acid sequences). To demonstrate that ability, three sets of testing peptides derived from proteins identified with different levels of confidence were used. In addition, further attempts were made to demonstrate the utility of the QSRR approach as an additional constraint confirming the correctness of peptide identifications.

2. Material and Methods

2.1. Standards

The standard amino acid solutions were prepared by dissolving seven of the twenty naturally occurring amino acids (isoleucine, leucine, methionine, phenylalanine, tryptophan, tyrosine, and valine, all from Fluka BioChemika, Buchs, Switzerland) in a 0.1% aqueous solution of trifluoroacetic acid (TFA). Water was deionized by passing it through a Direct-Q system (Millipore, Bedford, MA, USA). The concentrations of the samples were approximately 0.6 mg/mL.

The solutions of standard proteins annotated as eight model proteins (about 3 mg/mL) were as follows: bovine serum albumin (BSA), chicken egg ovalbumin (CEO), bovine milk lactoglobulin (BML), bovine milk β-casein (BMC), bovine myoglobin (BM), human serum albumin (HSA) and ribonuclease B (RibB) from Sigma-Aldrich (Steinheim, Germany), and insulin-like growth factor-binding protein 1 (IGFBP-1), which was purified from human amniotic fluid following a previously reported procedure [33]. They were obtained by dissolving the lyophilized standard proteins in deionized water and then treated as shown below in digestion protocol.

2.2. Bacillus subtilis Sample Preparation

2.2.1. Growth Conditions

Bacillus subtilis strains were grown with shaking at 37°C in nutrient broth (NB) supplemented with 0.2% KCl and 0.05% MgSO4 (final concentrations) and, if appropriate, antibiotics.

2.2.2. Spore Purification

As described before [33], forty-eight-hour cultures in nutrient broth were pelleted (10000 g, 10 minutes) and washed three times with 1/4 volume of cold water. The pellet was resuspended in 1/5 of the initial volume of cold MQ water and incubated overnight at 4°C. On subsequent days the suspension was centrifuged (20000 g, 20 min, 4°C) and the pellet was resuspended in fresh cold MQ water. This procedure was repeated for 5 to 10 days. Purified spores were kept in water suspension at 4°C in the dark. Once per week the spores were centrifuged and suspended in fresh water to avoid spontaneous germination.

2.2.3. Protein Extraction

The spore pellet (approximately 20 mg spores) was resuspended in 1 mL of extraction buffer (50 mM Tris-HCl, pH = 7.8; 2% SDS; 10% glycerol; 0.2 M DTT), boiled for 5 min, and vortexed for 30 seconds. These steps were repeated twice. Unlysed spores and spore debris were removed by centrifugation at 12,000 g for 5 min at 4°C. The supernatant was precipitated with an acidified acetone/methanol mixture: to one volume of protein solution four volumes of cold precipitation reagent were added and kept on at °C. The precipitate was spun down at 15,000 g at 4°C, the supernatant was discarded, and the samples were drained, then resuspended in water, and stored at °C. The protein concentration was determined with a Bradford assay kit (Bio-Rad Laboratories) and equalled 1.2–1.5 mg/mL.

2.3. Digestion Protocol

To 1 mL of each protein (BSA, CEO, BML, BMC, BM, HSA, RibB, and IGFBP-1) sample (about 3 mg/mL), 300 µL of DTT (dithiothreitol) (Sigma-Aldrich, Steinheim, Germany) (100 mM, freshly prepared in 100 mM ammonium bicarbonate buffer, pH 8.5) were added. The samples were kept at 60°C for 30 min to allow reduction of the disulfide bridges. Then 50 µg of trypsin was added (ratio 1 : 50 E/S) to each sample. Samples were digested for 12 hours (overnight digestion) at 37°C. After that, 0.1 mL of TFA was added to each sample to stop the digestion. The concentrations of the obtained standard solutions were about 50 pmol/µL.

To 1 mL of Bacillus subtilis spore lysates (1.2–1.5 mg/mL), 150 µL of DTT (Sigma-Aldrich, Steinheim, Germany) (100 mM, freshly prepared in 100 mM ammonium bicarbonate buffer, pH 8.5) were added. The samples were kept at 60°C for 30 min to allow reduction of the disulfide bridges. Then 25 µg of trypsin was added (ratio 1 : 50 E/S) to each sample. Samples were digested for 12 hours (overnight digestion) at 37°C. After that, 0.05 mL of TFA was added to each sample to stop the digestion. The concentrations of the obtained solutions were about 50 pmol/µL.

Tryptic digests were stored at °C (if frozen in this reaction mixture the disulfide bonds would not reoxidize). The LC-ESI-MS/MS analyses were performed within three weeks at the latest (the shelf life of such a frozen solution is a couple of months) (http://www.thermo.com/).

2.4. LC Conditions

The chromatographic analysis was performed on a C-18 analytical column: XTerra MS C18 3.5 µm ( mm) column (Waters, Milford, MA, USA).

The mobile phase consisted of two solvents (A and B) mixed on-line. Solvent A was a 0.1% aqueous (water was MS-grade) solution of trifluoroacetic acid (TFA) (Sigma-Aldrich, Steinheim, Germany) and solvent B was acetonitrile (ACN) (MS-grade, Sigma-Aldrich, Steinheim, Germany) containing 0.1% TFA. The applied linear gradient time was 90 min, from 0% B to 60% B. The flow rate was 200 µL/min. The injection volume was 10 µL. The LC-MS apparatus was equipped with a thermostated column oven and a Surveyor autosampler controlled at 20°C (Thermo Finnigan, San Jose, CA, USA), a quaternary gradient Surveyor MS pump (Thermo Finnigan, San Jose, CA, USA) with a diode array detection (DAD) system, and an LTQ linear ion trap MS system with an ESI ion source controlled by Xcalibur software 1.4 (Thermo Finnigan, San Jose, CA, USA).
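The programmed gradient described above can be written as a simple linear function of time. The following is a minimal sketch only; it assumes an ideal linear ramp with no dwell-volume or re-equilibration effects, and the function name and parameters are illustrative rather than taken from the instrument software.

```python
def percent_b(t_min, start_b=0.0, end_b=60.0, gradient_min=90.0):
    """Programmed %B at time t_min for an ideal linear gradient.

    Defaults follow the conditions stated in the text (0% B to 60% B
    in 90 min). Dwell volume and column re-equilibration are ignored,
    which is an assumption of this sketch.
    """
    if t_min <= 0:
        return start_b
    if t_min >= gradient_min:
        return end_b
    return start_b + (end_b - start_b) * t_min / gradient_min


# Halfway through the 90 min gradient the programmed composition is 30% B.
print(percent_b(45.0))
```

Under these assumptions the gradient steepness is 60/90, that is, about 0.67% B per minute.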

2.5. MS Conditions

The MS/MS analysis was performed on a Finnigan LTQ instrument (Thermo Finnigan, San Jose, CA, USA). Mass spectra were generated in positive ion mode under constant instrumental conditions: source voltage 4.62 kV, capillary voltage 40.97 V, sheath gas flow rate 39.99 (arbitrary units), auxiliary gas flow 10 (arbitrary units), sweep gas flow 0.95 (arbitrary units), capillary temperature 219.96°C, and tube lens voltage 250.43 V. MS/MS spectra were obtained by CID (collision-induced dissociation) in the linear ion trap with an isolation width of 3 Da; the activation amplitude was 35% of the ejection RF amplitude, which corresponds to 1.58 V.

2.6. Protein Identification

The experimental retention times of the peptides were determined at the peak intensity maximum. The values measured manually for the most intense peaks in the acquired MS/MS spectra were automatically searched against the protein database (*.fasta) using the Sequest algorithm, incorporated into Bioworks 3.0 (Thermo Finnigan, San Jose, CA, USA). The *.fasta file for each protein was downloaded from Expasy (http://www.expasy.org/sprot/). During the interpretation of the results of the correlation analysis of the experimental and predicted retention times of the peptides, the exemplary filtering criteria applied were those discussed previously, proposed by Washburn et al. [7]. The spectra for singly charged peptides with a cross-correlation score to a tryptic peptide (Xcorr) greater than 1.9, the spectra for doubly charged tryptic peptides with Xcorr of at least 2.2, and the spectra for triply charged tryptic peptides with Xcorr above 3.75 were accepted as correctly identified according to the Sequest software. For all the spectra analyzed, ΔCn values were above 0.08.
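The charge-dependent acceptance criteria stated above can be expressed as a small filter. This is a hedged sketch, not the Bioworks implementation: the function and constant names are hypothetical, and treating the 2+ threshold as inclusive ("at least 2.2") while the others are strict is simply a literal reading of the wording in the text.

```python
# Xcorr thresholds by charge state as stated in the text:
# (threshold, inclusive) -- "greater than 1.9", "at least 2.2", "above 3.75".
XCORR_RULES = {1: (1.9, False), 2: (2.2, True), 3: (3.75, False)}
DELTA_CN_MIN = 0.08  # DeltaCn had to be above this value for all spectra


def passes_filter(charge, xcorr, delta_cn):
    """Return True if a peptide-spectrum match meets the stated criteria."""
    rule = XCORR_RULES.get(charge)
    if rule is None:
        # Charge states other than 1+, 2+, 3+ are not covered by the criteria.
        return False
    threshold, inclusive = rule
    xcorr_ok = xcorr >= threshold if inclusive else xcorr > threshold
    return xcorr_ok and delta_cn > DELTA_CN_MIN


# Illustrative (made-up) matches: (charge, Xcorr, DeltaCn)
for match in [(2, 2.5, 0.12), (3, 3.5, 0.20), (1, 2.1, 0.05)]:
    print(match, passes_filter(*match))
```

Only the first of the three illustrative matches passes: the second fails the 3+ Xcorr threshold and the third fails the ΔCn threshold.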

2.7. QSRR Analysis

Multiple regression equations for model set of peptides based on the experimental retention times were derived by employing Microsoft Excel software (Microsoft Co., Redmond, WA, USA) and Statistica (StatSoft, Tulsa, OK, USA) run on a personal computer. Regression coefficients (±standard deviations), multiple correlation coefficients, R, standard errors of estimate, s, significance levels of each term and of the whole equations, p, and values of the F-test of significance, F, were calculated.
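For readers who wish to reproduce this kind of analysis without the commercial packages, the quantities listed above (regression coefficients, R, s, and F) can be obtained from an ordinary least-squares fit. The sketch below is an assumption-laden illustration in plain Python: it is not the Excel or Statistica procedure used in the study, the function names are hypothetical, and the data are synthetic.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_mlr(rows, y):
    """Ordinary least squares with an intercept.

    rows: one list of descriptor values per peptide; y: retention times.
    Returns (coefficients, R, s, F); coefficients[0] is the intercept.
    """
    n, p = len(y), len(rows[0])
    x = [[1.0] + list(r) for r in rows]
    # Normal equations: (X'X) coef = X'y
    xtx = [[sum(xi[a] * xi[b] for xi in x) for b in range(p + 1)]
           for a in range(p + 1)]
    xty = [sum(xi[a] * yi for xi, yi in zip(x, y)) for a in range(p + 1)]
    coef = solve_linear(xtx, xty)
    pred = [sum(c * v for c, v in zip(coef, xi)) for xi in x]
    mean_y = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r = math.sqrt(max(0.0, 1.0 - ss_res / ss_tot))   # multiple correlation R
    s = math.sqrt(ss_res / (n - p - 1))              # standard error of estimate
    f = ((ss_tot - ss_res) / p) / (ss_res / (n - p - 1))  # F-test value
    return coef, r, s, f


# Synthetic example with two descriptors (values are made up).
rows = [[1.0, 0.5], [1.2, -0.3], [0.8, 1.1], [1.5, 0.2], [1.1, -1.0], [0.9, 0.7]]
noise = [0.01, -0.02, 0.015, -0.005, 0.01, -0.01]
y = [2.0 + 10.0 * a + 1.5 * b + e for (a, b), e in zip(rows, noise)]
coef, r, s, f = fit_mlr(rows, y)
print(coef, r, s, f)
```

The normal-equations route is adequate for a handful of descriptors; for ill-conditioned descriptor sets a QR-based solver from a numerical library would be the safer choice.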

The structural descriptors of the analyzed standard amino acids and of the peptides from the investigated standard proteins and Bacillus subtilis cells were calculated. First of all, in contrast to the previous models [26–29], where log SumAA was calculated by simple addition of the retention times of the component amino acids (taking into account all 20 naturally occurring amino acids), a novel QSRR peptide descriptor based on amino acid retention factors was used. The retention factor, k, was introduced because it is more similar across related systems than the retention time, as it compensates for some physical differences between columns. The descriptor was calculated by applying retention data for only the 7 most retained amino acids (isoleucine, leucine, methionine, phenylalanine, tryptophan, tyrosine, and valine). The other 13 amino acids are hardly retained; therefore their presence in a peptide's sequence does not significantly influence its retention. For these 13 amino acids a fixed value was ascribed (k = 0), and one was added to avoid zero in the calculation of the logarithm, according to the procedure elaborated and evaluated elsewhere [34]. On the other hand, in searching for the most accurate values of the logarithm of the calculated n-octanol-water partition coefficient, clog P, different calculation methods were tested (data not shown). Briefly, to obtain clog P values, HyperChem 7.5 professional software for personal computers (HyperCube, Waterloo, Canada) with the ChemPlus extension, Dragon professional 5.0 software (Milano Chemometrics and QSAR Research Group, Talete, Milano, Italy), and the on-line available ALOGPS 2.1 software (http://www.vcclab.org/) were used. Finally, to derive the appropriate QSRR model, the average log P module of the ALOGPS 2.1 software was used to determine the clog P descriptor.
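The retention-based descriptor described above can be illustrated with a short calculation. This sketch rests on several assumptions that are not confirmed by the text: that one is added to each amino acid's k value before summation (one possible reading of "one was added to avoid zero"), that the logarithm is decadic, and that the k values shown are placeholders rather than measured data. The function and variable names are hypothetical.

```python
import math

STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

# Placeholder retention factors for the 7 most retained amino acids
# (I, L, M, F, W, Y, V). These numbers are NOT from the study.
EXAMPLE_K = {"I": 2.0, "L": 2.2, "M": 1.0, "F": 4.0, "W": 6.0, "Y": 1.5, "V": 0.8}


def log_sum_k_plus_one(sequence, k_values):
    """log10 of the sum over residues of (k + 1).

    Residues absent from k_values are treated as hardly retained (k = 0),
    so each still contributes 1 to the sum and the logarithm is defined.
    """
    total = 0.0
    for residue in sequence:
        if residue not in STANDARD_RESIDUES:
            raise ValueError(f"unexpected residue: {residue!r}")
        total += k_values.get(residue, 0.0) + 1.0
    return math.log10(total)


# A peptide without any of the 7 retained residues still gets a finite value.
print(log_sum_k_plus_one("GASK", EXAMPLE_K))
print(log_sum_k_plus_one("LAAGISTI", EXAMPLE_K))
```

If the intended procedure instead adds one to the total sum rather than to each residue, only the line that accumulates `total` would change.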

The general QSRR equation has the following form:

where tR is the gradient HPLC retention time and k1, k2, … are regression coefficients.

3. Results and Discussion

3.1. Derivation and Validation of QSRR Model

The QSRR model was derived from peptides obtained from the digestion of 8 model proteins. The amino acid sequences of these peptides were confirmed by MS/MS analysis and identified by the Sequest software (Bioworks 3.0 package, Thermo Fisher Scientific Inc., Waltham, MA, USA). Only peptides with the highest scores were included in the model set of peptides used to derive the QSRR model. Peptides were considered true positives when their cross-correlation score to a tryptic peptide (Xcorr) was over 2.0 for 1+ and 2+ charged peptides and over 4.5 for 3+ charged peptides. Peptides with lower values of Xcorr were excluded from the model set because of the higher possibility of their false positive identification. Hence, the peptides included in the study were divided into five groups: one set of model peptides (Table 1) and four testing sets of peptides (Tables 2–5). The 50 model peptides used to derive the QSRR model and collected in Table 1 originated from the 8 model proteins. The 21 peptides reported in Table 2 were used to check the general validity of the proposed QSRR model. In view of the main objective of this work, three other sets of testing peptides originating from B. subtilis proteome digestion were used. One set includes 54 peptides belonging to proteins identified on the basis of more than one peptide with Xcorr above 1.5 (Table 3). A second set comprises 41 peptides belonging to proteins identified again with Xcorr above 1.5, but on the basis of just one peptide (Table 4). The third set comprises 40 peptides belonging to proteins identified on the basis of just one peptide, but with Xcorr below 1.5 (Table 5).

The model set consisting of 50 peptides with the highest values of Xcorr was used to create a model for predicting the retention times of the peptides from the proteome of Bacillus subtilis cells. Within this group, the differences between experimental and predicted retention times ranged from 0.01 to 2.81 min. For 42% of the peptides (21 peptides) the difference between experimental and predicted retention times was lower than 1 min, and for the remaining 58% (29 peptides) it ranged from 1 to 3 min (Table 1). Taking into account the retention times and the values of the descriptors for those 50 model peptides, the following specific equation was derived:

The description of tR by (2) was good, as documented by the following criteria of statistical quality. All the regression coefficients were highly statistically significant, as was the whole equation. The multiple correlation coefficient, R, the standard error of estimate, s, and the value of the F-test of significance, F, were all also satisfactory.

Equation (2) provides a predictive model based on an experimentally obtained descriptor (derived from amino acid retention) and improved by the implementation of a molecular-modeling-based descriptor (clog P). The experimentally obtained descriptor appeared to contribute significantly to peptide retention. However, clog P has little in common with the n-octanol/water partition coefficient, neither for individual amino acids nor for the peptide. The considered analytes were highly ionizable, and only a minute fraction of the molecules can exist in nonionized form in solution. Only for that fraction does clog P properly reflect the ability to partition between the aqueous and hydrophobic phases. Therefore, the clog P parameter was not considered to mimic the actual partition coefficient; rather, it reflects differences in peptide polarities. clog P was thus an auxiliary peptide structure descriptor: a correction for the experimentally obtained descriptor.

In order to check the correctness of the model, the set of 21 peptides (Table 2), derived from the 8 model proteins, was used as the validation set. The predicted retention times, calculated from (2), were compared to the experimental retention times, and the differences between them were calculated. The differences varied from 0.09 to 3.08 minutes (mean value 1.29 min, Table 2). For 9 peptides (42.86%) the differences between experimental and predicted retention times ranged from 0.09 to 0.46 min; for 11 peptides (52.38%) the range was 1.07–2.99 min; for 1 peptide (4.76%) the difference was over 3 min. The correlation between experimental and predicted retention times additionally confirmed the validity of the model (Figure 1), indicating that similar values of predicted and experimental retention times of the analyzed peptides also correlate with a higher probability of correct identification using the Sequest algorithm (Figure 5).
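The validation summary above (absolute differences, their distribution over ranges, and the correlation between experimental and predicted values) can be computed as sketched below. The helper names are hypothetical and the retention times in the example are invented for illustration; only the bin edges of 1 and 3 min follow the ranges used in the text.

```python
import math


def abs_differences(experimental, predicted):
    """Absolute difference between experimental and predicted retention times."""
    return [abs(e - p) for e, p in zip(experimental, predicted)]


def bin_counts(diffs, edges=(1.0, 3.0)):
    """Count differences in [0, e1), [e1, e2), ..., [e_last, inf)."""
    counts = [0] * (len(edges) + 1)
    for d in diffs:
        counts[sum(1 for e in edges if d >= e)] += 1
    return counts


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented retention times (min) for three peptides.
experimental = [20.0, 31.5, 44.2]
predicted = [20.4, 30.1, 47.5]
diffs = abs_differences(experimental, predicted)
print(bin_counts(diffs))            # one peptide in each range
print(pearson_r(experimental, predicted))

# Percentages quoted in the text for the 21-peptide validation set.
print([round(100 * n / 21, 2) for n in (9, 11, 1)])
```

The last line reproduces the 42.86%, 52.38%, and 4.76% shares reported for 9, 11, and 1 of the 21 validation peptides.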

3.2. QSRR-Based Analysis of Peptides from Bacillus subtilis Proteome

Using (2), the predicted retention times for the peptides identified in the proteome of Bacillus subtilis cells were further calculated (Tables 3–5). The experimental retention times for these peptides were obtained by LC-MS/MS analysis and compared to the calculated ones. Here, special attention was paid to peptides with low Xcorr (around 1.5) in order to check the applicability of the proposed model and to indicate potential false positives. In this case, the most important aim was to provide a QSRR-based tool to distinguish true from false positively identified peptides.

The derived model, whose accuracy was confirmed in Figure 1, was also applied to calculate the retention times of peptides from the real proteome sample of Bacillus subtilis cells. Its correctness was first examined by calculating the predicted retention times of peptides belonging to proteins identified on the basis of more than one peptide with Xcorr above 1.5, that is, those assumed to be the most confident true positives. The correlation plot in Figure 2 shows that the predicted and experimental retention times do not vary significantly, and so it can be concluded that those peptides, and the proteins to which they are assigned, are correctly identified and really present in the analyzed sample. The detailed accuracy of the peptide identification can be further examined in Table 3. In the set of 54 peptides obtained from digestion of the Bacillus subtilis proteome and belonging to proteins identified on the basis of more than one peptide with Xcorr above 1.5, the differences between experimental and predicted retention times varied from 0.08 to 18.07 min (mean value 5.13 min). For 8 peptides, being 14.81% of the set, the difference between experimental and predicted retention times was lower than 1 min. There were 6 peptides (11.11%) for which the retention time differences ranged between 1 and 3 min. In most cases, the differences between experimental and predicted retention times were from 3 to 5 min or from 5 to 10 min, for 18 (33.33%) and 16 (29.63%) peptides, respectively. Four peptides (7.41%) were characterized by differences between experimental and predicted retention times ranging from 10 to 15 min. There were also 2 cases for which these values varied between 15 and 20 min. The correlation between experimental and predicted retention times can be considered good, with a correlation coefficient of 0.936 (Figure 2). However, some peptides in this set could probably be considered false positives (e.g., ESIAQVAAISAADEEVGSLIAEAMER or MSGWLAHILEQYDNNRLIRPR). Overall, at this point it was shown that it is possible to predict the retention times of unknown peptides of the Bacillus subtilis proteome based on retention data obtained experimentally for only a limited number of known model peptides originating from 8 known model proteins.

Among the 41 Bacillus subtilis peptides belonging to proteins identified on the basis of just one peptide with Xcorr above 1.5 (Table 4), the difference between experimental and predicted retention times varied from 0.35 to 11.7 min and the mean value was 4.92 min. The predicted retention times of 5 peptides differed from the experimental ones by less than 1 min, which corresponds to 12.20% of the investigated set. For another 8 peptides (19.51%) the difference between experimental and predicted retention times was higher than 1 min but lower than 3 min. A difference in retention time from 3 to 5 min was characteristic for 11 peptides, constituting 26.83% of the studied set. The highest number of peptides (13, or 31.71%) was characterized by a 5 to 10 min difference in retention times. On the other hand, the highest values of the difference between predicted and experimental retention times, over 10 min, were characteristic for 4 peptides (9.76%), and the largest difference was 11.7 min (Table 4). The correlation between experimental and predicted retention times is still reasonably good, with a correlation coefficient of 0.8405 (Figure 3). Some peptides in this set also seem to be false positives (e.g., DQDISGEKATADQLLKDVK or IQNGDPIAGLFDEFTQTVQR), even though they fulfill the established Xcorr criterion for proper peptide identification. The differences between predicted and experimental retention times (here 11.49 and 11.70 minutes, resp.) suggest that these peptides, and the proteins from which they originate, may not really be present in the analyzed sample.

Finally, in the group of 40 Bacillus subtilis peptides belonging to proteins identified again on the basis of just one peptide, but with Xcorr below 1.5 (Table 5), the differences between experimental and predicted retention times ranged from 1.27 to 78.80 min (mean value 29.41 min). There were only 4 peptides (10%) whose predicted and experimental retention times differed by less than 3 min. In the next 5 cases (12.5%) this difference was over 3 but lower than 5 min. There were 3 peptides (7.5%) with a difference between predicted and experimental retention times in the range between 10 and 15 min. For another 5 peptides (12.5%), the difference was from 15 to 20 min. The next 4 peptides (10%) in this group were characterized by a 20 to 30 min difference between predicted and experimental retention times. There was 1 case (2.5%) where the difference in retention times ranged between 30 and 50 min. For the last 13 peptides (32.5%) in this set the experimental and predicted retention times differed by over 50 min: there were 4 cases (10%) where these values differed by between 50 and 60 min, 3 peptides (7.5%) in the 60 to 70 min range, and 6 (15%) differing by more than 70 min (Table 5). It must be stated that for peptides belonging to proteins identified on the basis of one peptide with Xcorr below 1.5, no correlation between experimental and predicted retention times can be observed (Figure 4). Therefore it may be concluded that a large number of peptides in this set should be classified as false positives, especially those with extremely high differences between experimental and predicted retention times (e.g., HGGSLSAPAIH, DGITDVL, IDFPTNITMD, or LAAGISTI, where these differences are 78.80, 77.54, 73.73, and 73.26 minutes, resp.).

Generally, it can be noticed that lower values of Xcorr correlate with a higher percentage of peptides characterized by larger differences between experimental and predicted retention times (Figure 5). This is seen in particular when comparing the percentage of cases in which the difference between predicted and experimental retention times is higher than 15 min across the groups of Bacillus subtilis peptides belonging to proteins identified on the basis of the following: one peptide with Xcorr below 1.5 (Table 5), one peptide with Xcorr over 1.5 (Table 4), and more than one peptide with Xcorr over 1.5 (Table 3). The percentages of peptides characterized by a difference higher than 15 min between experimental and predicted retention times in these groups are 57.5%, 0%, and 3.7%, respectively. On the other hand, in the model and testing sets of peptides obtained from the model proteins all differences between predicted and experimental retention times were lower than 15 min (Tables 1 and 2). It is noticeable that a high percentage of the peptides with low values of Xcorr was characterized by differences between predicted and experimental retention times larger than 15 min, which provides an additional indication that they could be considered potential false positives and were in fact not identified in the analyzed sample. Therefore, a QSRR equation predicting peptide retention times might be a useful tool to increase the throughput of protein identification in LC-MS/MS.
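The comparison above can be turned into a simple screening step. The sketch below flags peptides whose retention time difference exceeds a chosen limit and recomputes the group percentages quoted in the text. The 15 min limit is the comparison value used in the discussion, not a validated decision rule, and the function names are hypothetical. The per-peptide differences are the examples quoted in the text; the counts of 23 and 2 peptides are derived from the ranges reported for Tables 5 and 3.

```python
def flag_potential_false_positives(differences, threshold_min=15.0):
    """Return peptides whose |experimental - predicted| exceeds the threshold."""
    return [pep for pep, d in differences.items() if d > threshold_min]


def percent_above(n_above, n_total):
    """Share of a peptide set lying above the threshold, in percent."""
    return round(100.0 * n_above / n_total, 1)


# Retention time differences (min) quoted in the text for example peptides.
differences = {
    "HGGSLSAPAIH": 78.80,
    "DGITDVL": 77.54,
    "IDFPTNITMD": 73.73,
    "LAAGISTI": 73.26,
    "DQDISGEKATADQLLKDVK": 11.49,
    "IQNGDPIAGLFDEFTQTVQR": 11.70,
}
flagged = flag_potential_false_positives(differences)
print(flagged)

# Group sizes and counts above 15 min, as derived from the reported ranges:
# Table 5: 23 of 40, Table 4: 0 of 41, Table 3: 2 of 54.
print(percent_above(23, 40), percent_above(0, 41), percent_above(2, 54))
```

With a 15 min limit only the four Table 5 examples are flagged; the two Table 4 examples (11.49 and 11.70 min) would require a stricter limit, which illustrates that the choice of threshold is a judgment call.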

4. Conclusions

A quantitative structure-retention relationships (QSRR) model, derived using a set of peptides identified with the highest scores and originating from 8 known proteins, was tested with regard to its capability to predict retention times. A Bacillus subtilis proteome digest was used to check the predictive ability of the novel QSRR model proposed in the study. It was found that the QSRR approach can be applied as an additional constraint in proteomic research, verifying the results of the MS/MS ion search and confirming the correctness of the peptide identifications, along with indicating potential false positives. The results suggest that, thanks to the QSRR used for the prediction of peptide retention, the liquid chromatography separation stage of proteomic research could be useful in the final identification of peptides, especially for the most uncertain protein identifications, which are based on findings for just one peptide.

Acknowledgments

The work was supported by the Polish State Committee for Scientific Research Projects N N405 1040 33 and by Polish-Italy bilateral scientific and technological cooperation project 2007–2009.