Table of Contents Author Guidelines Submit a Manuscript
Comparative and Functional Genomics
Volume 2011 (2011), Article ID 780973, 16 pages
http://dx.doi.org/10.1155/2011/780973
Research Article

Prediction and Characterization of Missing Proteomic Data in Desulfovibrio vulgaris

1Division of Biometrics II, Office of Biometrics/OTS/CDER/FDA, Silver Spring, MD 20993-0002, USA
2Division of Biometrics IV, Office of Biometrics/OTS/CDER/FDA, Silver Spring, MD 20993-0002, USA
3Department of Biological Sciences, University of Maryland at Baltimore County, Baltimore, MD 21250, USA
4School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China

Received 25 October 2010; Revised 17 December 2010; Accepted 1 March 2011

Academic Editor: E. Hovig

Copyright © 2011 Feng Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Proteomic datasets are often incomplete due to identification range and sensitivity issues. It becomes important to develop methodologies to estimate missing proteomic data, allowing better interpretation of proteomic datasets and metabolic mechanisms underlying complex biological systems. In this study, we applied an artificial neural network to approximate the relationships between cognate transcriptomic and proteomic datasets of Desulfovibrio vulgaris, and to predict protein abundance for the proteins not experimentally detected, based on several relevant predictors, such as mRNA abundance, cellular role and triple codon counts. The results showed that the coefficients of determination for the trained neural network models ranged from 0.47 to 0.68, providing better modeling than several previous regression models. The validity of the trained neural network model was evaluated using biological information (i.e. operons). To seek understanding of mechanisms causing missing proteomic data, we used a multivariate logistic regression analysis and the result suggested that some key factors, such as protein instability index, aliphatic index, mRNA abundance, effective number of codons (𝑁𝑐) and codon adaptation index (CAI) values may be ascribed to whether a given expressed protein can be detected. In addition, we demonstrated that biological interpretation can be improved by use of imputed proteomic datasets.

1. Introduction

Application of various experimental “omics” tools has enhanced our understanding of complex biological systems in the past decade [18]. However, due to technical limitations of these high throughput technologies and constraints on experimental design, most of these high-throughput “omics” datasets still suffer from significant missing values. The incomplete data available has impeded scientists from extracting correct information regarding cell metabolism. To address the issue, attempts to apply computational tools to impute missing values in various high throughput “omics” datasets have been made [911]. The most successful examples of such efforts were for missing transcriptomic data [12].

Compared with transcriptomic dataset, proteomic datasets suffer even more serious missing data since its identification range and sensitivity are still not fully comparable with transcriptomic measurements [13]. In particular, when partial proteomics dataset was used in various integrated “omics” studies, the undetected proteins were simply assigned a “zero’’ value and were excluded from relationship modeling, which could bias any conclusion resulted from the integrated studies [14]. To overcome this problem, several methods have been adapted from the estimation of missing values in transcriptomic data to estimate the missing proteomics values by using the available measurements from other proteins, such as the 𝑘 nearest neighbor and Bayesian Principal Component Analysis (BPCA) methods for imputing missing proteomic values in gel-based proteomics dataset [15, 16], and by integrating the GO (Gene Ontology) information into the proteomic data imputation [17]. Based on the assumption that there exists meaningful correlation between two types of datasets [14], in recent years we have developed the Zero-inflated Poisson (ZIP) linear regression model [18] and a stochastic Gradient Boosted Trees (GBT) nonlinear model [19] to uncover possible relationships between transcriptomic and proteomic data and to predict protein abundance for the proteins not experimentally detected.

As part of a continuous effort to develop statistical tools for missing value imputation, we in the study applied an artificial neural network (ANN) approach to predict abundance for experimentally undetected proteins for several cognate transcriptomic and proteomic datasets of sulfate reducing bacterium Desulfovibrio vulgaris [2023]. The advantages of artificial neural networks include ability to tackle complex relationships and to deal with noisy data with missing values, and its capability of generalization [24, 25]. In the approach, a multilayer perceptron (MLP) with sigmoid activation function in the single hidden layer was used to capture the nonlinear relationship between transcriptomic data and protein abundance. The top ten relevant predictors for protein abundance identified in a previous study [19] were used as inputs to the MLP for training the output of the MLP to approximate the corresponding protein abundance of each gene. The trained MLP was validated using biological information. In addition, by using a multiple logistic regression analysis, we quantified the contribution of some sequence-based factors, such as mRNA abundance, protein instability index, gene length, “effective number of codons’’ (𝑁𝑐), and CAI (Codon Adaptation Index) values to missing proteomic values. We also demonstrated that biological interpretation can be improved by using the imputed proteomic datasets.

2. Materials and Methods

2.1. Datasets

Two datasets from Desulfovibrio vulgaris were analyzed in this study. The datasets covers the same species, although the experimental conditions are different and they were obtained by two independent studies [1923]. The raw intensity values from both datasets were normalized using a quantile normalization in an R package (caret) available through the 𝑅 project (http://www.r-project.org/). Absolute fluorescence intensity and peptide hits were used as abundance for transcriptomic and proteomic data, respectively [26]. These datasets, labeled Dataset 1 and Dataset 2, will be used throughout the article.

2.2. Genome Information

Gene annotated attributes, such as sequence length, protein length, molecular weight, GC content, triple codon counts, and cellular functional categories of all genes in the D. vulgaris genome were downloaded from the comprehensive microbial resource (CMR) of TIGR (http://cmr.tigr.org/) [22]. To define operons, we used the distance-only approach, a relatively low threshold, in this study to cover more of the possible operons. The genes that are transcribed in the same direction and are with intergenic region less than 15 base pairs were defined as one operon [19].

2.3. Construction of Relationship between Proteomics and mRNA through Artificial Neural Network

We trained an artificial neural network to learn a particular function by adjusting the values of the weights between neurons. It has been shown that a multilayer perceptron with a single hidden layer can approximate any continuous function to arbitrary precision by increasing the number of neurons in the hidden layer [2729]. The weights connecting the input signals as well as the number of neurons can be determined by minimizing the sum of the squared difference between the network outputs and target outputs for each set of input signals. Nonlinear optimization methods in combination with a backpropagation algorithm [30] were implemented to find the optimal weights. The inputs to the network and the targets were usually rescaled to [−1, 1]. The outputs of the network were then rescaled back to approximate the targets in original scale. In this work, we trained a multilayer perceptron to learn the nonlinear relationship between transcriptomic data and proteomic data. The inputs were the transcriptomic data and the target was the protein abundance.

A feed-forward network with a hidden layer can approximate any continuous function with arbitrary precision, which provides ANN the power to deal with nonlinear relationship, also open a door to overfitting if too many neurons are used in the hidden layer. To address the possible overfitting, we implemented a cross-validation to prevent overfitting. Through this method the data was partitioned in 𝐾 equal parts where 𝐾1 sets were used to train the model and the other set was used to calculate prediction errors [31]. This was repeated 𝐾 times, hence 𝐾 prediction errors values were computed one at every fold. An average and standard deviation can be extracted to select the most representative model for future prediction. Cross-validation was generally effective to prevent overfitting. Beside cross-validation to prevent overfitting, we also used some statistical tools to diagnose the selected model. Once the best model has been selected, based on cross-validation, the model was evaluated based on its coefficient of determination (𝑅2). This coefficient represents the variation explained by the model. Furthermore, to alternatively assess the goodness of the model we studied the predictions of small sets of genes which are grouped based on operon information. In order to describe the variation within a data set, such as “molar abundance” of proteins within one operon, we computed the coefficient of variation (CV) for each set of proteins. The CV is defined as the ratio of the standard deviation and the mean of the “molar abundance” for a set of proteins [18, 19, 32] where the calculation of CV score is independent of the sample size. These coefficients of variations were computed for all operons and compared to a distribution of permuted CVs where permutation of genes was performed.

2.4. Analysis of the Factors Ascribed to Missing Protein Measurements

For DNA and protein sequence features, the BioPerl module was used to batch query and parse results from the ExPASy ProtParam program, returning “AliphaticIndex” and “InstabilityIndex”. CodonW was used to perform correspondence analysis of both codon and amino acid usages, returning the first four axes representing the major trends (AA_axis1–4 for amino acid usage, CR_axis1–4 for Relative Synonymous Codon Usage). The frequency of each base at the third codon position (synonymous site, designated as A3s, T3s, C3s, and G3s), as well as two additional measures on protein property such as Grand Average of Hydropathy (GRAVY) and AROMO, were also computed with CodonW. The Codon Adaptation Index (CAI) was computed with an in-house perl script using D. vulgaris ribosomal protein genes as reference for highly expressed genes. The calculation and analysis of these features were described in several of our previous works [3335].

A logistic regression model was used to assess how each of the features attributes to whether a given protein will be detected [36, 37]. Assume the features for a gene 𝑖 are 𝑥1𝑖,,𝑥𝑘𝑖. In logistic regression, the logit, natural log of the odds, of the unknown probability, 𝑝𝑖, that the protein is not detected is modeled as a linear function of 𝑥1𝑖,,𝑥𝑘𝑖:𝑝logit𝑖𝑝=ln𝑖1𝑝𝑖=𝛽0+𝛽1𝑥1𝑖++𝛽𝑘𝑥𝑘𝑖.(1) The model is equivalent to 𝑝𝑖=𝛽exp0+𝛽1𝑥1𝑖++𝛽𝑘𝑥𝑘𝑖𝛽exp0+𝛽1𝑥1𝑖++𝛽𝑘𝑥𝑘𝑖+1.(2) The unknown coefficients are usually estimated by maximum likelihood using a method common to all generalized linear models [35]. In the model, the parameter 𝛽𝑗 is the additive effect on the log of the odds for a unit change in the 𝑗th explanatory feature variable. That is, one unit change of feature 𝑗, that is, 𝑥𝑗𝑖 indicates the odds changes 𝑒𝛽𝑗 times. Positive 𝛽𝑗 indicated that the probability that the protein will not be detected is increased with increasing the 𝑗th feature. By using a logistic regression analysis, the effects of relevant features on whether a protein will be detected were quantified and can be compared. And the fitted model can also be used to predict the chance that any other proteins will not be detected.

3. Results and Discussion

3.1. Determine the Variable Importance and Partial Variable Dependence

In a previous study, we used the gradient boosted trees (GBT) model to measure the contribution of each of the 70 variables input variables to the RNA-protein correlation, and the 10 top-ranked variables were identified to be mRNA expression level, cellular role, and several triple codon usages for the D. vulgaris datasets we used (See Table  1 of Supplmentary Material available online at doi: 10.1155/2011/780973) [19]. In this study, we used the same top ten predictors as inputs to the artificial neural network approach. The datasets used are consistent with previous work for comparison. The results showed that for all datasets, we observe a smooth relationship between protein abundance and the abundance of corresponding mRNA controlling the other predictors at their mean values for each growth condition. The partial dependence between protein and mRNA abundance tends to be smoother when the abundance of mRNAs are in the high range (Figure 1). Figure 2 shows the scatter plots of predicated protein abundance versus mRNA for proteins detected and undetected for datasets 1 and dataset 2, respectively. For both datasets, under each growth condition, there are much fewer genes with very high mRNA or protein abundance. Those proteins undetected are mostly with very low predicted protein abundance and observed mRNA abundance as shown in Figures 2(b) and 2(d) for dataset 1 and dataset 2, respectively. The proteins undetected have much smaller variance in predicted values especially for dataset 2 as shown in Figure 2(d). In comparison, those detected proteins have a larger range of protein abundance and are of very high variability.

fig1
Figure 1: Partial dependency plots. Predicted protein abundance for given values of mRNA controlling the other predictors at their mean values. (a) Dataset 1; (b) dataset 2.
fig2
Figure 2: Prediction plot for proteins. Scatter plots of predicted protein abundance versus mRNA for proteins detected and undetected. (a) Plot for genes with detected proteins for dataset 1; (b) plot for genes with undetected proteins for dataset 1; (c) plot for genes with detected proteins for dataset 2; (d) plot for genes with undetected proteins for dataset 2.

In the previous work of using GBT model to uncover the nonlinear relationship between transcriptomic and proteomic datasets, an interesting pattern, a plateau effect, was observed in the partial dependence plot of mRNA abundance versus protein abundance for high values of mRNA. This is potentially due to the fact that protein abundance data is sparse, with high variance, where the tree models do not generate splits among the predictors [19]. In the neural network model, the output of an MLP is an inherent smooth function of the network inputs, hence a relatively smooth prediction is always guaranteed. However, we can still see the similar pattern that the curve becomes flat when the variance for the mRNA abundance is high. For example, in the region of high mRNA values there are a small number of genes/peptides whose protein values ranges from (0, 40) for Dataset 1 and (0, 500) for Dataset 2. This could reflect problems with the sensitivity of the current proteomic technologies for high abundant proteins.

3.2. Construction of the Nonlinear Artificial Neural Network Model

A feed-forward network with a single hidden layer was trained for each condition for both datasets with mRNA abundance and other predictors as inputs and protein abundance as target. The neural networks were trained using the functions in the neural network toolbox provided in standard software Matlab (7.0.1). The activation function for the neurons in the hidden layer was the hyperbolic tangent function, which is given by 𝑓(𝑣)=1exp(𝑣)1+exp(𝑣).(3) The activation function of the output neuron is a pure linear function, which is the weighted sum of the outputs from the neurons of the hidden layer. The optimal number of neurons in the hidden layer and the optimal weights were determined using 10-fold cross-validation for training a neural network. For each dataset under each experimental condition, the networks with the same inputs but with different number of neurons (from 5 to 12) in the hidden layer were each trained using cross-validation to learn the relationship between the mRNA features and protein abundance. The optimal number of neurons in the layer was determined based on the one generating the minimum averaged validation error.

Both transcriptomic and proteomic data were log-transformed. For dataset 1, the genes with no missing value for three out of four replicates were selected for model training. The numbers of genes used for training were 354, 312, and 308 for the FL, LL, and LS conditions, respectively. For each growth condition, the inputs to the multilayer perceptron were the top ten predictors identified previously [19] (See Table  1 at Supplementary Material). The optimal network had 6, 6, and 9 neurons in the hidden layer for FL, LL, and LS conditions, respectively. Coefficients of determinations 𝑅2 were 0.47, 0.52, and 0.51 for FL, LL, LS conditions, respectively. For dataset 2, a total of 986, 1001, and 986 genes with no missing protein measurements were used for model training for CT0, CT120, and ST120 conditions, respectively. For each of the conditions, the inputs to the multilayer perceptron were the top ten predictors identified previously [19] (See Table  1 at Supplementary material). The optimal network had 8, 7, and 6 neurons in the hidden layer for CT0, CT120, and ST120 conditions, respectively. Coefficients of determination 𝑅2 were 0.65, 0.62, and 0.68 for CT0, CT120, and ST120 conditions, respectively.

3.3. Validation of the Prediction through the Nonlinear Artificial Neural Network Model

The validation of the model prediction was conducted through the calculation of coefficient of variation (CV) across conditions for every operon of D. vulgaris as described before [18, 19]. The genes in the operons are considered to have less dispersion than any random set of genes in terms of their gene expression and/or protein abundance, because of their intrinsic biological relationship. With relatively low threshold for operon identification as described in Section 2, a total of 609 operons were identified in D. vulgaris, which ranged from 2 to 13 genes per operon. To compare these CV values, we have used a permutation test in the following manner: a CV is computed from the predicted protein values for a permuted number of genes. This step is repeated 2000 times through resampling of genes without replacement. The number of genes randomly selected is equal to the size of the gene group to be compared to. For dataset 1, 82% to 89% of the operon groups had smaller CV values than the permuted groups, and for dataset 2, 79% to 89% of the operon groups had smaller CV values than the permuted, demonstrating in general good performance of the model prediction. The validation results are summarized in Table 1.

tab1
Table 1: Model validation: correlated expressions of proteins in selective operons.

To further demonstrate how the current measurement of dispersion (CV) compares with the distribution of the permuted coefficients of variation, we also calculated the percentile score for all coefficients of variations, based on the biological knowledge that genes from the same operon should be less dispersed than permuted sets of genes and the percentile scores are thus expected to be low. The results showed that even though the calculated CV for most groups was less than the mean CV value for permuted sets of genes, the percentile scores shows that only a small collection of these groups fall within a percentile less than 0.20. The percentage of groups with percentile scores less than 0.2 is 43% to 57%, and 57% to 61% for dataset 1 and dataset 2, respectively.

3.4. Comparison of Artificial Neural Network to Previous Methods (GBT and ZIP)

Similar to the GBT method, the artificial neural network method aims to identify an unknown and potentially nonlinear relationship between proteomics and mRNA abundance data, which distinguish them from the ZIP method, which carries zero-inflated linear assumptions [18]. Another key difference is that GBT and neural network methods utilized additional sequence based information, such as cellular role, gene length, amino acid usage, and codon usage known to be important for translation efficiency to model the relationship between transcriptomic and proteomic dataset, while the ZIP method establishes the relationship between proteomic and mRNA solely based on experimental transcriptomic and proteomic dataset.

Some improvements were observed in terms of prediction validation by the neural network model in comparison to the GBT model. The percentage of operon groups with CV smaller than permuted CV was 82% to 89% for dataset 1 in comparison to 75% to 79% for the GBT model. The percentage of operon groups with CV smaller than permuted CV was 79% to 89% for dataset 2 by the neural network model in comparison to 75% to 82% by the GBT model. For dataset 1, the percentage of groups with percentile scores less than 0.2 is 43% to 57% in the neural network model in comparison to 36% to 50% in the GBT model. For dataset 2, the percentage of groups with percentile scores less than 0.2 is 57% to 61% in the neural network model in comparison to 32% to 54% in the GBT model. The coefficient of determination was 0.47 to 0.58 for dataset 1 and 0.62 to 0.68 for dataset 2 in the neural network model in comparison to 0.39 to 0.58 for both datasets in the GBT model, demonstrating the artificial neural network model could be a good alternative in analyzing the missing proteomic values.

3.5. Analysis of Factors Ascribed to Missing Protein Measurements

Through the neural network model, the majority of the proteins that were experimentally undetected were predicted to have protein abundance with greater than 1.0 peptide hit. This thus tempted for us to look into the possible factors which may be ascribed to the missing protein values. To do so, we collected some of the protein or mRNA/DNA sequence-based features which may affect protein detectability for use along with experimental transcriptomic data in a logistic regression procedure to determine their relative contribution to detectability of proteins in the D. vulagris genome. The features/measurements we used were from the following several categories: (i) features related to gene expressivity: experimental mRNA abundance data, 𝑁𝑐 and CAI values [33, 34], codon usage such as CR_axis1-CR_axis4, T3s, C3s, A3s, G3s, GC3s, and GC content [35]; (ii) features related to protein and RNA stabilities: protein instability index and AliphaticIndex, and minimal free energy (MFE) of RNA [35]; (iii) features related to protein solubility: protein solubility indices such as Grand Average of Hydropathy (GRAVY) [33]; (iv) amino acid usage such as AA_axis1-AA_axis4, and others such as gene length [35]. The logistic regression procedure in the SAS software package was used to model the probability of whether a protein will be experimentally detected given the features and measurements for the gene/protein. The stepwise selection method was used to select those most influential factors for indicating missing proteomic value. Those highly correlated features were removed automatically. The 0.05 significance threshold was used for the stepwise selection. The mRNA abundance and gene length values were both log2 transformed before being used in the procedure. The protein stability index, aliphaticIndex, and 𝑁𝑐 values were divided by 10 and the CAI, N_MFE, GRAVY, AA_axis1-AA_axis4, and CR_axis1-CR_axis4 were all multiplied by 10 so that they scaled to 1. This standardization will not change the results, but make illustration of the results easier.

Tables 2 and 3 present the odds ratio of each selected factor for dataset 1 and dataset 2, respectively. The results showed that some of the features are highly correlated with probability of missing proteomic values. For dataset 1, seven factors were indentified consistently by the regression model. Among the features that are positively correlated with probability of missing proteomic values (i.e., when the features/measurement increase), protein instability index which provides an estimate of the stability of a given protein in a test tube [38] was found as the most significant factor for all three growth conditions, and aliphaticIndex which is regarded as a positive factor for the increase of thermostability of globular proteins [39], was found significant for growth conditions LL and FL, but not for condition LS. In addition, AA_axis2 and CR_axis3 correlated most strongly with GRAVY and G3s, respectively, and they were also found significant in all three growth conditions, while the N_MFE was only significant for LS condition (Table 2). For dataset 2, the features that are kept in the regression model are more the same for the three growth conditions in dataset 1, suggesting that these factors could be universal across experimental platforms. Six features that are significant for dataset 1 are also found to be critical for all the conditions and affect the probability in the same direction for dataset 2 include mRNA abundance, protein instability index, gene length, 𝑁𝑐, CAI, and AA_axis2 corresponding to GRAVY (Table 3). In general, the analysis suggested that those proteins with smaller mRNA expression values, smaller 𝑁𝑐 value, higher protein instability index, or high value of aliphaticIndex, have a greater probability to have missing proteomic values in the D. vulgaris datasets. Based on this preliminary analysis, it may be possible to apply a logistic regression model to predict whether a given protein will or will not be detected by choosing a proper threshold for the predicted probability.

tab2
Table 2: Estimates of odds ratios of missing proteomic values for dataset 1.
tab3
Table 3: Estimates of odds ratio of missing proteomic values for dataset 2.
3.6. Data Interpretation Using Imputed Datasets

A total of 308–354 proteins were experimentally identified for dataset 1, and 986–1001 proteins were experimentally identified for dataset 2, which are slightly over 10–30% of the whole proteome in D. vulgaris. Through the ANN modeling and imputation by integrating transcriptomic and proteomic data, we were able to assign abundance values to approximately 33–3400 proteins (dependent on growth conditions), approximately 96–98% of the whole proteomes in D. vulgaris. Using the imputed datasets, we first analyzed the proteins involved in energy metabolism when D. vulgaris was grown in three cultivation conditions (LL: Lactate-based growth in Exponential phase; LS: Lactate-based growth at Stationary phase; FL: Formate-based growth at Exponential phase). As shown in Table 4, even for proteins involved in energy metabolism such as ATP synthase which are typically expressed at relatively high levels, significant missing proteomic data was still obvious in the experimental datasets, only two out of nine putative ATP synthase proteins were experimentally detected under three growth conditions. By analysis of the imputed protein abundance for all putative ATP synthase proteins, a clear trend of downregulation was observed for LS and FL condition when compared with LL condition, which is consistent with the previous data suggesting lactate is a favorable electron donor for D. vulgaris [22]. When comparing lactate-based growth with formate-based at the exponential phase, we found that enzymes involved for formate utilization were predicted to be upregulated during growth on formate (Table 4). It has been reported that c-type cytochromes are highly expressed in sulphate-reducing bacteria [20], however, only low amounts of c-type cytochrome proteins have been identified from D. vulgaris by the proteomic (Table 4), probably due to the fact that c-type cytochromes undergo a complex posttranslational maturation process involving covalent attachment of heme groups [40]. Through the ANN model, significant expression of various cytochrome c proteins was predicted in D. vulgaris grown under three conditions. In general, cytochrome c proteins were predicted to have higher expression at the exponential phase in both lactate- and formate-based media than the stationary phase, consistent with their important roles in electron transfer during fast growth [20]. Two cytoplasmic hydrogenases (Ech and Coo) were previously assigned the putative function of generating hydrogen in the cytoplasm [20]. In agreement with previous prediction from the ZIP model [18], the ANN model predicted that all subunits of Ech hydrogenase (EchACDE) were upregulated, while all subunits of Coo hydrogenase (CooHLUX) were down-regulated on formate-based growth, suggesting their differential roles in H2 generation when grown with different carbon sources. We also used the imputed data to analysis the stress responses when the D. vulgaris cells were treated under salt stress. The analysis allowed identification of some proteins which were not experimentally detected, however, were predicted to be upregulated under the salt treatment condition (ST120) as compared with control (CT120) (Table 5). Among them were transporters and various multidrug resistance proteins and some signal transduction proteins. This list of proteins can serve as putative candidates for further experimental validation. In general, our analysis indicates that biological interpretation may benefit from imputing missing data using computational methods by integrating temporal transcriptomic and proteomic data. The full list of all protein-abundance predictions for all proteins has been provided in the Table  2 of the supplementary material.

tab4
Table 4: Regulation of proteins involved in energy metabolism for Dataset 1*.
tab5
Table 5: Up-regulated proteins under stress condition for Dataset 2*.

In conclusion, missing values in large-scale proteomic analysis is a frequent problem and in some cases it has created difficulty for accurate interpretation of proteomic data for complex biological systems. However, in spite of its obvious importance, the methodologies to deal with missing proteomic data are still lagging well behind the methods for transcriptomic analysis. In this study, we applied an artificial neural network approach to approximate the relationships between cognate transcriptomic and proteomic datasets of D. vulgaris, and to predict protein abundance for the proteins not experimentally detected based on relevant predictors, such as mRNA abundance, cellular role, and triple codon counts. The results showed that the coefficients of determination for the trained neural network models ranged from 0.47 to 0.68, providing better description than several previous models [18, 19]. In an attempt to identify possible causes for missing proteomic values, a logistic regression analysis of various experimental measurements and sequence-based features was performed for the experimentally undetected proteins but with high predicted protein abundance values. Although the results are in general consistent with previous knowledge that proteins with low possible expressivity and high instability will have higher chance to be missed, in the study, we for the first time quantified the odds ratio of each of those possible factors in the D. vulgaris datasets. In addition, in a preliminary analysis we demonstrated that biological interpretation could be improved by using the imputed proteomic datasets. Finally, although the model was well validated statistically, caution needs to be exercised when interpreting experimental data based on predicted protein expression values because the predicted abundance values are constrained by the quality of experimental proteomic data used as input, and a dearth of large-scale quantitative predictors which can be included in the model. Nevertheless, the initial success in applying the ANN method is encouraging, and the model could serve as a basis for developing more sophisticated models to further improve the biology interpretation.

Acknowledgment

The research work of J. QIAO and W. ZHANG was supported by National Basic Research Program of China (2011CBA00803). This paper reflects the views of F. Li and L. Nie should not be construed to represent FDA’s view or policies.

References

  1. D. Medini, D. Serruto, J. Parkhill et al., “Microbiology in the post-genomic era,” Nature Reviews Microbiology, vol. 6, no. 6, pp. 419–430, 2008. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  2. N. C. Kyrpides, “Fifteen years of microbial genomics: meeting the challenges and fulfilling the dream,” Nature Biotechnology, vol. 27, no. 7, pp. 627–632, 2009. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  3. O. Fiehn, “Combining genomics, metabolome analysis, and biochemical modelling to understand metabolic networks,” Comparative and Functional Genomics, vol. 2, no. 3, pp. 155–168, 2001. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  4. O. V. Singh and N. S. Nagaraj, “Transcriptomics, proteomics and interactomics: unique approaches to track the insights of bioremediation,” Briefings in Functional Genomics and Proteomics, vol. 4, no. 4, pp. 355–362, 2006. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  5. J. Lin and J. Qian, “Systems biology approach to integrative comparative genomics,” Expert Review of Proteomics, vol. 4, no. 1, pp. 107–119, 2007. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  6. R. P. Kandpal, B. Saviola, and J. Felton, “The era of 'omics unlimited,” Biotechniques, vol. 46, no. 5, pp. 351–355, 2009. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  7. N. Ishii and M. Tomita, “Multi-omics data-driven systems biology of E. coli,” in Systems Biology and Biotechnology of Escherichia coli, Y. S. Lee, Ed., pp. 41–57, Springer, Dordrecht, Netherlands, 2009. View at Google Scholar
  8. Y. J. Tang, H. G. Martin, S. Myers, S. Rodriguez, E. E. K. Baidoo, and J. D. Keasling, “Advances in analysis of microbial Metabolic fluxes via 13C isotopic labeling,” Mass Spectrometry Reviews, vol. 28, no. 2, pp. 362–375, 2009. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  9. J. L. DeRisi, V. R. Iyer, and P. O. Brown, “Exploring the metabolic and genetic control of gene expression on a genomic scale,” Science, vol. 278, no. 5338, pp. 680–686, 1997. View at Publisher · View at Google Scholar · View at Scopus
  10. P. T. Spellman, G. Sherlock, M. Q. Zhang et al., “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Molecular Biology of the Cell, vol. 9, no. 12, pp. 3273–3297, 1998. View at Google Scholar · View at Scopus
  11. A. A. Alizadeh, M. B. Elsen, R. E. Davis et al., “Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling,” Nature, vol. 403, no. 6769, pp. 503–511, 2000. View at Publisher · View at Google Scholar · View at PubMed
  12. T. Aittokallio, “Dealing with missing values in large-scale studies: microarray data imputation and beyond,” Briefings in Bioinformatics, vol. 11, no. 2, pp. 253–264, 2009. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  13. M. R. Wilkins, R. D. Appel, J. E. Van Eyk et al., “Guidelines for the next 10 years of proteomics,” Proteomics, vol. 6, no. 1, pp. 4–8, 2006. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  14. L. Nie, G. Wu, D. E. Culley, J. C. M. Scholten, and W. Zhang, “Integrative analysis of transcriptomic and proteomic data: challenges, solutions and applications,” Critical Reviews in Biotechnology, vol. 27, no. 2, pp. 63–75, 2007. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  15. K. Jung, A. Gannoun, B. Sitek, H. E. Meyer, K. Stäuhler, and W. Urfer, “Analysis of dynamic protein expression data,” RevStat: Statistical Journal, vol. 3, pp. 99–111, 2005. View at Google Scholar
  16. R. Pedreschi, M. L. A. T. M. Hertog, S. C. Carpentier et al., “Treatment of missing values for multivariate statistical analysis of gel-based proteomics data,” Proteomics, vol. 8, no. 7, pp. 1371–1383, 2008. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  17. J. Tuikkala, L. L. Elo, O. S. Nevalainen, and T. Aittokallio, “Missing value imputation improves clustering and interpretation of gene expression microarray data,” BMC Bioinformatics, vol. 9, article 202, 2008. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  18. L. Nie, G. Wu, F. J. Brockman, and W. Zhang, “Integrated analysis of transcriptomic and proteomic data of Desulfovibrio vulgaris: zero-inflated poisson regression models to predict abundance of undetected proteins,” Bioinformatics, vol. 22, no. 13, pp. 1641–1647, 2006. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  19. W. Torres-García, W. Zhang, G. C. Runger, R. H. Johnson, and D. R. Meldrum, “Integrative analysis of transcriptomic and proteomic data of Desulfovibrio vulgaris: a non-linear model to predict abundance of undetected proteins,” Bioinformatics, vol. 25, no. 15, pp. 1905–1914, 2009. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  20. J. F. Heidelberg, R. Seshadri, S. A. Haveman et al., “The genome sequence of the anaerobic, sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough,” Nature Biotechnology, vol. 22, no. 5, pp. 554–559, 2004. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  21. W. Zhang, M. A. Gritsenko, R. J. Moore et al., “A proteomic view of Desulfovibrio vulgaris metabolism as determined by liquid chromatography coupled with tandem mass spectrometry,” Proteomics, vol. 6, no. 15, pp. 4286–4299, 2006. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  22. W. Zhang, D. E. Culley, J. C. M. Scholten, M. Hogan, L. Vitiritti, and F. J. Brockman, “Global transcriptomic analysis of Desulfovibrio vulgaris on different electron donors,” Antonie van Leeuwenhoek, International Journal of General and Molecular Microbiology, vol. 89, no. 2, pp. 221–237, 2006. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  23. A. Mukhopadhyay, Z. He, E. J. Alm et al., “Salt stress in Desulfovibrio vulgaris hildenborough: an integrated genomics approach,” Journal of Bacteriology, vol. 188, no. 11, pp. 4068–4078, 2006. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  24. L. J. Lancashire, C. Lemetre, and G. R. Ball, “An introduction to artificial neural networks in bioinformatics—application to complex microarray and mass spectrometry datasets in cancer studies,” Briefings in Bioinformatics, vol. 10, no. 3, pp. 315–329, 2009. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  25. W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” The Bulletin of Mathematical Biophysics, vol. 5, no. 4, pp. 115–133, 1943. View at Publisher · View at Google Scholar · View at Scopus
  26. W. J. Qian, T. Liu, M. E. Monroe et al., “Probability-based evaluation of peptide and protein identifications from tandem mass spectrometry and SEQUEST analysis: the human proteome,” Journal of Proteome Research, vol. 4, no. 1, pp. 53–62, 2005. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  27. K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989. View at Google Scholar · View at Scopus
  28. K. I. Funahashi, “On the approximate realization of continuous mappings by neural networks,” Neural Networks, vol. 2, no. 3, pp. 183–192, 1989. View at Google Scholar · View at Scopus
  29. G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals, and Systems, vol. 2, no. 4, pp. 303–314, 1989. View at Publisher · View at Google Scholar · View at Zentralblatt MATH · View at MathSciNet · View at Scopus
  30. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986. View at Google Scholar · View at Scopus
  31. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning—Data Mining, Inference, Prediction, Springer, New York, NY, USA, 2nd edition, 2009.
  32. R. A. Johnson, Miller and Freund's Probability and Statistics for Engineers, Prentice Hall, Englewood Cliffs, NJ, USA, 2005.
  33. G. Wu, L. Nie, and W. Zhang, “Relation between mRNA expression and sequence information in Desulfovibrio vulgaris: combinatorial contributions of upstream regulatory motifs and coding sequence features to variations in mRNA abundance,” Biochemical and Biophysical Research Communications, vol. 344, no. 1, pp. 114–121, 2006. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  34. G. Wu, D. E. Culley, and W. Zhang, “Predicted highly expressed genes in the genomes of Streptomyces coelicolor and Streptomyces avermitilis and the implications for their metabolism,” Microbiology, vol. 151, no. 7, pp. 2175–2187, 2005. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  35. L. Nie, G. Wu, and W. Zhang, “Correlation of mRNA expression and protein abundance affected by multiple sequence features related to translational efficiency in Desulfovibrio vulgaris: a quantitative analysis,” Genetics, vol. 174, no. 4, pp. 2229–2243, 2006. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus
  36. A. Agresti, Categorical Data Analysis, Wiley-Interscience, New York, NY, USA, 2002.
  37. D. W. Hosmer and S. Lemeshow, Applied Logistic Regression, Wiley, Chichester, UK, 2nd edition, 2000.
  38. K. Guruprasad, B. V. B. Reddy, and M. W. Pandit, “Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence,” Protein Engineering, vol. 4, no. 2, pp. 155–161, 1990. View at Google Scholar · View at Scopus
  39. A. Ikai, “Thermostability and aliphatic index of globular proteins,” Journal of Biochemistry, vol. 88, no. 6, pp. 1895–1898, 1980. View at Google Scholar · View at Scopus
  40. C. Aubert, G. Leroy, P. Bianco, E. Forest, M. Bruschi, and A. Dolla, “Characterization of the cytochromes C from desulfovibrio desulfuricans G201,” Biochemical and Biophysical Research Communications, vol. 242, no. 1, pp. 213–218, 1998. View at Publisher · View at Google Scholar · View at PubMed · View at Scopus