Abstract

The expressions of reference genes used in gene expression studies are assumed to be stable under most circumstances. However, a number of studies had demonstrated that such genes were found to vary under experimental conditions. In addition, genes that are stably expressed in an organ may not be stably expressed in other organs or other organisms, suggesting the need to identify reference genes for each organ and organism. This study aims at identifying stably expressed genes in Escherichia coli. Microarray datasets from E. coli substrain MG1655 and 1 dataset from W3110 were analysed. Coefficient of variance (COV) of was calculated and 10% of the lowest COV from 4631 genes common in the 3 MG1655 sets were analysed using NormFinder. Glucan biosynthesis protein G (mdoG), which is involved in cell wall synthesis, displayed the lowest weighted COV and weighted NormFinder Stability Index for the MG1655 datasets, while also showing to be the most stable in the dataset for substrain W3110, suggesting that mdoG is a suitable reference gene for E. coli K-12. Gene ontology over-representation analysis on the 39 genes suggested an over-representation of cell division, carbohydrate metabolism, and protein synthesis which supports the short generation time of E. coli.

1. Introduction

Gene expression analysis is examining the variations in gene expression as a result of changes in environmental conditions by measuring DNA expression levels over time. Quantitative real-time polymerase chain reaction (qRT-PCR) is a commonly used technique to quantify gene expressions [1]. However, several parameters need to be controlled in this process in order to obtain accurate and reliable results. These include variations in the amounts of starting material between samples, RNA extraction efficiency, RNA integrity/quality, efficiency of cDNA synthesis, and differences in the overall transcriptional activity of the cells analyzed. Of which, only the differences in transcriptional activity is of interest. A possible method for accounting other effects is relative normalization, which is the correction of the raw expression values with a reference gene. The reference gene acts as an invariant endogenous control which implies that reference genes should be stably expressed under a wide variety of conditions [2].

However, several studies had suggested that it is not easy to find universal reference genes [35]. This corroborates several studies demonstrating that several genes originally considered invariable in terms of expression may vary under different experimental conditions [68]. For an accurate comparison of DNA expression in different samples, it is necessary to use verified reference genes, such as GAPDH (glyceraldehyde-3-phosphate dehydrogenase) [9] or UBQ (ubiquinone) [9], for normalisation or determine new ones for each experimental system with varying external stimuli [3, 10]. However, some studies had also demonstrated that the expression of GAPDH [11] and UBQ [12] is varying in some conditions. Other studies had also identified references genes, such as recA (recombinase A), proC (pyrroline-5-carboxylase reductase) and gyrA (DNA gyrase) in Pectobacterium atrosepticum [8], and map (methionine aminopeptidase), rpoC (RNA polymerase, beta prime subunit), and alaS (alanyl-tRNA synthetase) in Acidithiobacillus ferrooxidans [13]. This suggests that established reference genes for a particular organism may not be suitable for other organisms.

Escherichia coli, a Gram-negative bacterium commonly found in the gastrointestinal tract, was selected as it has a genome of approximately 4,000 genes. In addition, the genetic material in its plasmids is easily manipulated. Furthermore, E. coli is easily cultured and is a commonly studied prokaryotic model [14, 15]. As it is easily cultured in the laboratory environment and is of low pathogenicity [1618].

Candidate reference genes, which are commonly believed to be invariant, can be identified using algorithms such as geNorm [19], NormFinder [20], and BestKeeper [21]. These methods require a wide range of accessible gene expression data, normally obtained through DNA profiling such as quantitative PCR. However, microarrays, which usually contain thousands of probes, present a good source of data for identifying reference genes [22]. A recent study had successfully identified MARK3 as a suitable reference gene in mouse liver using microarray analysis [23].

Currently, there are numerous studies being conducted to validate known reference genes and possibly identify new ones [3, 8, 9, 13, 24]. In this study, we identify and evaluate a set of invariant genes in E. coli K-12 substrains MG1655 and W3110. Our results suggest that glucan biosynthesis protein G (mdoG) is a suitable reference gene for both MG1655 and W3110 strains of E. coli.

2. Materials and Methods

2.1. Microarray Data

Four datasets were obtained from publicly available microarray databases, Gene Expression Omnibus, National Centre for Biotechnology Information, of which 3 were from E. coli K-12 substrain MG1655 and 1 from substrain W3110. Briefly, the studies conducted with the datasets are as follows: GDS680: MG1655 grown in either aerobic or anaerobic conditions, deleted for transcriptional regulators in oxygen response, and used to validate a computational model of transcriptional and metabolic networks. GDS1099: aerobically grown MG1655 cells in several media with varied carbon sources including glucose, glycerol, succinate, L-alanine, acetate, and L-proline. GDS1494: analysis of derivatives of strain 1655: wild type, fur mutant, and wild type with added FeSO4, induced to overexpress RyhB, a noncoding RNA regulated by the fur repressor protein. GDS1827: W3110 cells grown aerobically and exposed to low, neutral, or high pH to study acid and base response.

2.2. Finding Invariant Genes

The coefficient of variation (COV) of every gene was calculated as the quotient of standard deviation and arithmetic mean. From 4631 genes, the top 10% with the lowest COV from each dataset were listed. The intersection between the 3 MG1655 data sets (GDS680, GDS1099, and GDS1494) was identified and analysed using NormFinder version 0.953 [20] to rank the stability of these genes. A weighted stability index for each gene was then calculated from the NormFinder’s stability index, and an average of the NormFinder stability indexes multiplied by number of samples was taken.

2.3. Gene Ontology Overrepresentation Analysis

The list of genes from the intersection of the top 10% with the lowest COV from the 3 MG1655 data sets were analysed for gene ontology overrepresentation using the Gene ontology gene annotation file for E. coli dated July 8, 2011. Chi-square test was carried out to identify the overrepresented gene ontology terms in the list of genes using the overall 𝑃 value of 0.01, corrected for multiple testing using Holm-Bonferroni method [25].

2.4. Comparing NormFinder and COV

Spearman’s correlation was used to determine the correlation between stability index generated by NormFinder and COV values using the equation 𝑑 𝑟 = 1 [ 6 2 𝑖 / ( 𝑛 ( 𝑛 2 1 ) ) ] , where 𝑟 is the Spearman’s correlation, 𝑑 is the difference in the rank of two parameters, and n is the sample size. The 𝑡 -statistic was calculated by equation 𝑡 = 𝑟 ( 𝑛 2 ) / ( 1 𝑟 2 ) , which was used to test for the null hypothesis of no correlation with ( 𝑛 2 ) degrees of freedom.

3. Results and Discussion

A threshold of less than 10% COV was used to select stably expressed genes across the three datasets GDS 680, 1099, and 1494 (MG1655). A total of 39 genes of consistent low variance were found (Table 1) with the weighted COV values ranging from 0.099 to 0.138. Glucan biosynthesis protein G (mdoG) was found to be most stable with both the lowest weighted COV value and weighted NormFinder Stability Index for MG1655. In GDS 1827 (W3110), mdoG was the most stable in the dataset, with a COV of 0.088 and NormFinder Stability Index of 0.078. The highest COV in GDS 1827 is 0.791 for hslV (peptidase component of the HslUV protease). Our results suggest that mdoG may be a suitable reference gene across both E. coli strains W3110 and MG1655. This may imply that mdoG may be suitable for use as reference genes in other strains of E. coli K-12.

Gene ontology overrepresentation is a commonly used mechanistic analysis method to provide biological insights into a list of genes [2628]. The analysis of the 39 genes with consistently low variance for gene ontology overrepresentation showed that 3 primary functions were found to be overrepresented (Table 2). They were cell division, carbohydrate metabolic process, and protein synthesis. As E. coli is generally accepted as a rapidly dividing prokaryote [29], it is plausible to expect genes responsible for cell division to be constantly expressed. As the cells grow, new cellular structures, such as cell wall and other enzymes, need to be synthesized. Hence, it is plausible to expect protein synthesis to be stable throughout the cell cycle. The role of glutathione [30, 31] and tetrapyrrole [32] had been implicated in protein synthesis while diaminopimelate had been shown to have a role in the maintenance of cell wall [33]. At the same time, cell division involves the replication and segregation of genetic material [34]. Carbohydrate is both a primary source of energy for E. coli [35] as well as the primary component of bacterial cell wall [36]. The gene mdoG has been shown to be involved in the formation of the β-1,6 glucose linkage [37] and in the periplasmic release of newly synthesized osmoregulated periplasmic glucans [38, 39], which is needed for bacterial cell wall. Thus, it is plausible that the expression of mdoG is needed during binary fission. As E. coli divides rapidly, constant synthesis of cell wall is needed. Therefore, it is likely that mdoG is constantly needed, which may be a reason to its constant expression in E. coli. Hence, both gene ontology overrepresentation and the function of the most stably expressed gene, mdoG, support the short generation time of E. coli.

Our results showed that none of the 7 housekeeping genes consistently appeared in the lowest 10% COV subset of each dataset (Table 3), while GAPDH [9], gyrA [8], and alaS [13] were found to be in the lowest 10% COV subset, in one dataset each. Our results illustrated that recA [8] has the highest weighted COV of 0.5378 and gyrA [8] has the lowest weighted COV of 0.1607, which is higher than that of mdoG (COV of 0.099). This suggests that commonly used housekeeping genes such as GAPDH [9] and recA [8] are not suitable for the expression profiling of E. coli. Hence, our results support our earlier hypothesis that common housekeeping genes found to be stable in one organism cannot be assumed to be stable in all organisms. This suggests the need to identify suitable reference genes for each organism of interest.

The advantage of COV is its capability to analyse as large number of samples as required [23] as the number of calculations increases proportionally to the sample size, resulting in linear complexity. NormFinder uses residual analysis between sample subgroup variation and the overall variation of the expression dataset to evaluate the variation contributed by each gene in the entire dataset [20]. Thus, the computational complexity of NormFinder increases exponentially as the number of samples increases; hence, it is only able to work with a small number of genes within reasonable time and computational resources. Therefore, we used Spearman’s rank correlation coefficient to determine the correlation of stability index by NormFinder and COV values which showed that the sum of 𝑑 2 was 5664 and the 𝑃 value was 0.006748. Since the 𝑃 value was less than 0.01, the null hypothesis is rejected, indicating that there is correlation between the stability index from NormFinder and COV values but the strength of this correlation is difficult to establish as the significance in 𝑃 value did not indicate the correlation strength. However, our results do not suggest that COV is a suitable replacement for NormFinder. As NormFinder [20] takes account of the overall variability in the entire dataset, it is likely to be statistically stronger than COV which is a normalized standard deviation. Given the advantageous ability of COV to process large amounts of data such as those derived from microarrays, it is plausible that COV can be used as a weaker filter for a broad category of genes with low expression variation, followed by stronger statistical analysis by NormFinder [20] to identify suitable reference genes.

Acknowledgment

The authors wish to thank H. P. Too (Biochemistry, National University of Singapore) for his discussion and input into this study.