Table of Contents Author Guidelines Submit a Manuscript
International Journal of Genomics
Volume 2015, Article ID 269127, 13 pages
http://dx.doi.org/10.1155/2015/269127
Research Article

Comparisons between Arabidopsis thaliana and Drosophila melanogaster in relation to Coding and Noncoding Sequence Length and Gene Expression

1School of Biological Sciences, University of Wollongong, Northfields Avenue, Keiraville, Wollongong, NSW 2522, Australia
2National Institute for Applied Statistics Research Australia (NIASRA), School of Mathematics and Applied Statistics, University of Wollongong, Northfields Avenue, Keiraville, Wollongong, NSW 2522, Australia

Received 18 March 2015; Accepted 11 May 2015

Academic Editor: Sylvia Hagemann

Copyright © 2015 Rachel Caldwell et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

There is a continuing interest in the analysis of gene architecture and gene expression to determine the relationship that may exist. Advances in high-quality sequencing technologies and large-scale resource datasets have increased the understanding of relationships and cross-referencing of expression data to the large genome data. Although a negative correlation between expression level and gene (especially transcript) length has been generally accepted, there have been some conflicting results arising from the literature concerning the impacts of different regions of genes, and the underlying reason is not well understood. The research aims to apply quantile regression techniques for statistical analysis of coding and noncoding sequence length and gene expression data in the plant, Arabidopsis thaliana, and fruit fly, Drosophila melanogaster, to determine if a relationship exists and if there is any variation or similarities between these species. The quantile regression analysis found that the coding sequence length and gene expression correlations varied, and similarities emerged for the noncoding sequence length (5′ and 3′ UTRs) between animal and plant species. In conclusion, the information described in this study provides the basis for further exploration into gene regulation with regard to coding and noncoding sequence length.

1. Introduction

Advances in high-quality sequencing technologies [1, 2] and large-scale resource datasets [3, 4] have enhanced genomics research. Conducting large-scale sequence comparisons has the advantage of identifying the genetic variation and speciation among organisms [5]. Whole-genome expression experiments have also expanded a new era in bioinformatics analyses [69]. Understanding relationships and cross-referencing of expression data to large genome data can now be attained and facilitates a greater insight of organismal complexity and the tightly regulated process of gene expression.

There is a continuing interest in the analysis of gene architecture and gene expression to determine the relationship that may exist [10]. Current investigations on the similarities and differences between plant and animal genome structure have led to a greater understanding in biochemical pathways, genetic mechanisms, sequence structures, and functions [11], and comparative studies are more powerful than studying the sequence of a single genome [5]. Furthermore, control of gene expression has been used as a measurement of variation and is often well conserved between species in the coding sequences. In unicellular organisms such as the yeast Saccharomyces cerevisiae, research has found that highly expressed genes tend to have smaller compact protein sizes [12]. Other animal genome studies have found that highly expressed genes have fewer and shorter introns and shorter coding sequences and protein lengths and favour more compactness in highly expressed genes [13, 14]. Previous research, however, is divided in opinion, with highly expressed genes not always being compact in plants. There is evidence that suggests that, in higher plant genomes, highly expressed genes comprise longer introns and primary transcripts [15] in contrast with other research on Arabidopsis and rice, finding that highly expressed genes are more compact [16], specifically the lengths of the coding sequence (CDS) [17]. Negative correlation between protein length and gene expression breadth in the plant species Populus tremula was also observed [18]. Taken together, these observations suggest that the difference in length in relation to gene expression is not merely due to adaptive evolution but rather has specific biological significance [19].

Significance of noncoding regions is less understood across species compared to the coding regions. A range of genomic studies over the last decade has supported the opinion that there are tightly regulated processes and levels of control in the regulation of gene expression. This has included the untranslated gene regions, notably the 5′ and 3′ untranslated regions (UTRs), which may play the most important role in the regulation of gene expression [20]. A study by Lin and Li revealed a strong negative correlation between the 5′ UTR length and expression correlation with cytosolic ribosomal protein patterns in S. cerevisiae and C. albicans [21], with highly expressed eukaryotic genes tending to have more compact 5′ UTR regions [22]. A plant study on both Arabidopsis and rice also reported negative correlation between expression levels and noncoding sequences (both 5′ and 3′ UTRs) [16].

Statistical approaches, such as quantile regression, are a practical statistical method utilized by many biologists in a range of ecological [23] and bioinformatics [24, 25] studies to investigate the relationships between variables. The advantage of using such a model includes the robustness against outliners and helps obtain a more comprehensive analysis of the relationship between variables by using different measures of central tendency and statistical dispersion. When dealing with sequence length and gene expression data, modelling techniques often have difficulty with this data, due to the data values ranging over several orders of magnitude. It is general practice to log-transform the data, particularly when parametric statistical tests, such as -test, ANOVA, or linear regression, are used. The log function tends to squeeze together the larger values and stretches out the smaller values allowing a better view of the data.

The aim of this study was to apply a quantile regression model to reexamine the correlation of gene region lengths and expression levels of Arabidopsis using a different and larger set of gene expression data. The research also extended to another species, Drosophila melanogaster, so this study not only expanded objects but also conducted a comparison between plant and animal species.

2. Methods

2.1. Datasets

Sequence and gene expression data were collected from a selection of publicly accessible databases and websites for each of the plant and animal species.

The Arabidopsis thaliana sequence data were downloaded from TAIR website (ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/TAIR10_blastsets/). The sequence data used were generated from TAIR10 (December 2010) release. Gene expression data were downloaded from the NCBI GEO Datasets database (series GSE31488) [26] including the annotation file which contained only one gene model for each gene. The downloaded expression data were already normalized by Bioconductor (http://www.bioconductor.org/) R software. The final sample size for analysis was 18,445 genes, excluding two genes from the coding sequence that only had 1 bp which was classified as an intron. The accession string and ID reference from the arrays were used to link the data together to create a master database of length and gene expression data for analysis.

The Drosophila melanogaster sequence data were downloaded from FlyBase website: http://www.flybase.org/. The raw CEL gene expression data files were downloaded from the NCBI GEO Datasets database (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE42255) under series GSE42255 [27]. Affymetrix microarrays were used to analyse the adult Drosophila and the raw CEL files were normalized using the Bioconductor (http://www.bioconductor.org/) affy package in the R software environment. The annotation file was included and the Entrez UniGene name (GC numbers) and the ID from the platform data table were used to link the data together to create a master database of length and gene expression. The final sample size of unique genes was 3,290 for analysis.

The downloaded text files for each organism were cleaned using Visual Basic scripts and imported into MS Excel; all length data for both coding and noncoding sequences exclude introns. For each organism the gene expression experiments included multiple replicates of the control as well as abiotic stress conditions. For this study we have only reported on the control condition expression from the GEO Datasets for both organisms, to simplify the analysis reporting. Abiotic stress conditions will be investigated at a later stage.

2.2. Statistical Analysis

Pearson’s correlation was used to test the gene expression data to determine the reliability of the control replicates. The value was found never below 0.95, demonstrating the accuracy and reproducibility of the raw data. Therefore, the mean of the results of the control biological replicas was used in the statistical analysis reporting. The gene expression measurements are represented by gene expression signal intensity.

In this study we are interested in whether the length of the coding and noncoding sequences has a significant impact on the probability distribution of the gene expression under control conditions. Quantiles are statistics that describe the subdivisions of a ranked set of data values into equal proportions. Divisions can be made in four parts corresponding to 25%, 50%, and 75% of the data. Firstly, to examine how the data behaves between the sequence length of each region and gene expression, the length data for each region (5′ UTR, CDS, and 3′ UTR) were split into 4 quartiles (groups 1, 2, 3, and 4).

Strong skewness was identified in all the length datasets for each gene region. For example, the distribution of the 5′ UTR length without introns in Arabidopsis thaliana was positively skewed (skewness = 2.511) (Figure 1). Consequently, the Kruskal-Wallis nonparametric analysis method using SPSS version 19 (SPSS IBM, New York, USA) was applied to the data to determine whether there are differences between the quartile groups, in relation to gene expression and the length of the coding and noncoding regions. This test makes no assumptions about the distribution of the data:where is the number of observations in group and is the rank (among all observations) of observation from group . is the total number of observations across all groups. and is the average of all .

Figure 1: 18,445 genes in Arabidopsis thaliana for the 5′ untranslated region (UTR) length, excluding introns. The distribution of this data is positively skewed (skewness = 2.511).
2.3. Quantile Regression Analysis

The purpose of regression analysis is to expose the relationship between the independent variable () and dependent variables (). Conditional quantile regression is useful in modelling the quantile value of the dependent variable on the independent variable. In this study, the dependent variable is represented by the log of gene expression, under control conditions, and the independent variable is represented by the log of the sequence length. The lengths considered include the coding and noncoding sequences (5′ UTR, CDS, and 3′ UTR). The model considered was linear and is represented byThe quantile subsets used ranged from 0.1 to 0.9 in 0.1 increments. The log of the data was used to expand the data points for an enhanced view of the quantile regions. Regression analysis was performed in R.

3. Results

3.1. Length Subset Analysis

To understand the relationship of the length of the coding and noncoding sequences and gene expression, the data of the lengths for each type of coding and noncoding regions were grouped into four quartile subsets, respectively. For each quartile subset (1, 2, 3, and 4), the gene expression data in each of these quartiles were averaged. Through the nonparametric analysis method, the mean of the gene expression conditional on the four quartile groups for each length region, respectively, was significantly different ( value < 0.000) (Figures 2(a) and 3(a)).

Figure 2: Relationship of gene expression in Arabidopsis thaliana within the coding and noncoding sequence regions. The gene expression intensity from GEO Datasets, GSE31488 (a) and GDS3933 (b), is plotted versus the quartile score for coding sequence, transcript, 5′ UTR and 3′ UTR regions. Each data point represents the mean for the samples in each quartile. Error bars represent standard error.
Figure 3: Relationship of gene expression in Drosophila melanogaster within coding and noncoding regions. The gene expression intensity from GEO Datasets, GSE42255 (a) and GSE36507 (b), is plotted versus the quartile score for coding sequence, transcript, 5′ UTR and 3′ UTR regions. Each data point represents the mean for the samples in each quartile. Error bars represent standard error.

For Arabidopsis the coding sequences show a linear negative relationship between the four quartiles (groups 1–4) and their average gene expression intensity, indicating that as the length increases, the gene expression intensity decreases. This pattern is also seen in Drosophila (Figure 3(a)). The same pattern is also seen in the full transcript length, which follows the same negative relationship, in both the animal and plant species.

However, the noncoding sequences show dissimilar trends to the coding sequence. The relationship between the length of the 5′ UTR and gene expression intensity for Arabidopsis indicates a quadratic form, with an increase in length until the average gene expression intensity peaks for those genes in the 3rd group determined by the 3rd quartile and then starts to decrease (Figure 2(a)).

The pattern seen in the 3′ UTR length data was more positively correlated in relation to the average gene expression intensity, in contrast to the CDS and 5′ UTR sequence length. This pattern implies that as the length of the 3′ UTR increases (from 1 to 3318 base pairs) the gene expression intensity increases (Figure 2(a)).

Furthermore, in Drosophila, the noncoding sequences in relation to the average gene expression intensity varied considerably from Arabidopsis. The patterns showed a reversal in the 3′ and 5′ UTR sequence length in relation to the average gene expression. The 3′ UTR gene expression intensity increased until the 2nd quartile and then decreased at the 4th quartile, again showing signs of a nonlinear relationship. The pattern in the 5′ UTR for Drosophila was very distinctive, displaying a cubic polynomial pattern with one turning point (Figure 3(a)).

In summary, the findings based on the 4 quartile subsets show some variability between the coding and noncoding sequences as well as between animal and plant species. The quartile analysis indicates that the coding sequence is negatively correlated to the average gene expression intensity for both the animal and plant species. The full transcript sequence, which includes the flanking 5′ and 3′ UTRs, also shows negative correlation to the average gene expression intensity again in both species. However, when the gene is divided into coding and noncoding regions, differing patterns emerge from each of these gene regions in the plant and animal species. It is important to note that these gene region lengths do not include introns, the gene expression values are measured under control conditions, and the gene length and gene expression data for this analysis have not been transformed.

To determine the validity of the findings in the first set of gene expression experiments, a second set of gene expression data was downloaded from the GEO Datasets website. The raw CEL gene expression files were downloaded GDS3933 [28], Arabidopsis, and GSE36507, Drosophila, and normalised by MAS5 using R. The label and hybridization protocols for Arabidopsis varied between each experimental sample, the first sample using Agilent Low RNA Input Linear Amplification Kit and the second sample using GeneChip 3′ IVT Express Kit. In both samples, total RNA was extracted.

For Drosophila both gene expression samples used 7–9-day-old adults, with total RNA extraction. The labels used were biotin; however the protocols for labelling varied between the gene expression samples. Hybridization protocols followed similar methods. Length data and the master databases containing the length and gene expression data were generated with the same method as outlined in Methods above.

The quartile results show similar results to the first set of gene expression analyses. The noncoding sequences (5′ and 3′ UTRs) in both the animal and plant species displayed an increase in the first two quartiles and then decreased. However, for the coding sequence, there was not such a dramatic decline in gene expression from each quartile (Figures 2(b) and 3(b)).

To test the distribution of gene expression across the four quartile groups, nonparametric analysis was applied to the new gene expression samples. As seen in the previous example, the mean of the gene expression conditional on the four quartile groups for each length region, respectively, was significantly different ( value < 0.0000 at significance level 0.05) (Figures 2(b) and 3(b)).

For the experimental analyses with the quartile length subsets, it is difficult to achieve a general opinion on patterns observed in the coding and noncoding sequences in relation to gene expression. The data in the four subsets do not have sufficient resolution to determine accurately identifiable patterns in both the animal and plant species. However, based on the nonparametric analysis, both samples’ results were unanimous in showing significant differences between the gene expression and the four quartile length groups. The results reported in the length subset analysis of this paper and the results on the relationship between gene expression intensity, and length in general, published in the literature, have directed us to employ a different analytical method to examine more precisely this relationship.

3.2. Quantile Regression Analysis

The log function was used to transform the data for an improved view of the quantile regions, a method not applied in the analysis above. Distinct patterns in the quantile regression for both the animal and plant species are evident in the analysis. Firstly, the length of the 5′ UTR and the gene expression in both Arabidopsis (Table 1/Figure 4) and Drosophila (Table 4/Figure 7) show a positive correlation in the majority of quantiles, indicating that as the length of the 5′ UTR increases gene expression increases. However, in the Drosophila at the 9th quantile, the pattern changes and shows a negative correlation, indicating that, in this quantile for the Drosophila, the 5′ UTR length increases as the gene expression decreases.

Table 1: Quantile regression analysis results on Arabidopsis thaliana between the log of 5′ UTR sequence length and the log of gene expression (GSE31488 gene expression experiment data).
Figure 4: Quantile regression plot for Arabidopsis thaliana with quantiles range from 0.1 to 0.9 in increments of 0.1, respectively.

For the CDS length, each species shows a different pattern among the quantiles. For Arabidopsis (Table 2/Figure 5), the pattern shows a positive correlation for the first six (6) quantiles, and then from the 7th quantile there appears to be negative correlation. This would indicate that within the first six quantiles as the CDS length increases, the gene expression increases, and this is reversed past the 7th quantile. The Drosophila result (Table 5/Figure 8) in all quantiles shows negative correlation, indicating that as the CDS length increases, gene expression decreases. This shows two very distinctive patterns between the animal and plant species when the CDS is examined.

Table 2: Quantile regression analysis results on Arabidopsis thaliana between the log of CDS sequence length and the log of gene expression (GSE31488 gene expression experiment data).
Figure 5: Quantile regression plot for Arabidopsis thaliana with quantiles range from 0.1 to 0.9 in increments of 0.1, respectively.

Finally, for the 3′ UTR length, the interesting result for both Arabidopsis (Table 3/Figure 6) and Drosophila (Table 6/Figure 9) was that all quantiles showed positive correlation between the 3′ UTR length and gene expression. This suggests that as the 3′ UTR length increases, gene expression increases.

Table 3: Quantile regression analysis results on Arabidopsis thaliana between the log of 3′ UTR sequence length and the log of gene expression (GSE31488 gene expression experiment data).
Table 4: Quantile regression analysis results on Drosophila melanogaster between the log of 5′ UTR sequence length and the log of gene expression (GSE42255 gene expression experiment data).
Table 5: Quantile regression analysis results on Drosophila melanogaster between the log of CDS sequence length and the log of gene expression (GSE42255 gene expression experiment data).
Table 6: Quantile regression analysis results on Drosophila melanogaster between the log of 3′ UTR sequence length and the log of gene expression (GSE42255 gene expression experiment data).
Figure 6: Quantile regression plot for Arabidopsis thaliana with quantiles range from 0.1 to 0.9 in increments of 0.1, respectively.
Figure 7: Quantile regression plot for Drosophila melanogaster with quantiles range from 0.1 to 0.9 in increments of 0.1, respectively.
Figure 8: Quantile regression plot for Drosophila melanogaster with quantiles range from 0.1 to 0.9 in increments of 0.1, respectively.
Figure 9: Quantile regression plot for Drosophila melanogaster with quantiles range from 0.1 to 0.9 in increments of 0.1, respectively.

Overall, the CDS length and gene expression appeared dissimilar between the animal and plant species, with different patterns observed. However, when comparing the 5′ UTR and 3′ UTR lengths (noncoding regions of the gene) with gene expression data, similarities emerged.

The quantile regression statistical analyses were again applied to the second set of gene expression data to substantiate this method under different gene expression experiments. The results show very similar patterns to the previous gene expression experiment, indicating the model is robust in studying the relationship between gene expression and the length of coding and noncoding regions in different species (Tables 712/Figures 1015). Both gene expression datasets showed statistical significance across all quantile groups, indicating a relationship between the coding and noncoding length and gene expression in animal and plant species.

Table 7: Quantile regression analysis results on Arabidopsis thaliana between the log of 5′ UTR sequence length and the log of gene expression (GDS3933 gene expression experiment data).
Table 8: Quantile regression analysis results on Arabidopsis thaliana between the log of CDS sequence length and the log of gene expression (GDS3933 gene expression experiment data).
Table 9: Quantile regression analysis results on Arabidopsis thaliana between the log of 3′ UTR sequence length and the log of gene expression (GDS3933 gene expression experiment data).
Table 10: Quantile regression analysis results on Drosophila melanogaster between the log of 5′ UTR sequence length and the log of gene expression (GSE36507 gene expression experiment data).
Table 11: Quantile regression analysis results on Drosophila melanogaster between the log of CDS sequence length and the log of gene expression (GSE36507 gene expression experiment data).
Table 12: Quantile regression analysis results on Drosophila melanogaster between the log of 3′ UTR sequence length and the log of gene expression.
Figure 10: Quantile regression plot for Arabidopsis thaliana with quantiles range from 0.1 to 0.9 in increments of 0.1, respectively.
Figure 11: Quantile regression plot for Arabidopsis thaliana with quantiles range from 0.1 to 0.9 in increments of 0.1, respectively.
Figure 12: Quantile regression plot for Arabidopsis thaliana with quantiles range from 0.1 to 0.9 in increments of 0.1, respectively.
Figure 13: Quantile regression plot for Drosophila melanogaster with quantiles range from 0.1 to 0.9 in increments of 0.1, respectively.
Figure 14: Quantile regression plot for Drosophila melanogaster with quantiles range from 0.1 to 0.9 in increments of 0.1, respectively.
Figure 15: Quantile regression plot for Drosophila melanogaster with quantiles range from 0.1 to 0.9 in increments of 0.1, respectively.

The observed expression trends in both experimental datasets suggest that there are differences between animal and plant species when considering CDS length and that the noncoding regions show similar patterns of positive correlation to gene expression.

4. Discussion

We aimed to develop an understanding of the relationship between the coding and noncoding sequence length in association with gene expression between animal and plant species. In brief the findings from the quantile regression analysis suggest that (i) the patterns seen between the CDS length and gene expression intensity in Arabidopsis and Drosophila are different, the plant species showing both positive and negative correlations dependent on the quantile whilst the animal species showing a consistent negative correlation among all quantiles; (ii) in both the animal and plant species the 3′ UTR length and gene expression exhibit positive correlation.

The current research has confirmed our previous findings with the Arabidopsis [17] and is also consistent with previous research, where it was found that highly expressed genes have larger primary transcripts [15]. Extensive studies with Arabidopsis have inferred that multistimuli response genes (genes that are differentially expressed in response to a large number of different external stimuli) have significantly longer upstream intergenic regions and are generally shorter [29]. A more recent study investigating the translational efficiency in Arabidopsis has proposed that the sequence context immediately upstream from the AUG initiation codon in plant genes is critical in determining translational efficiency [30]. Other studies investigating the role of the 5′ UTR in translational regulation found that nucleotide composition, length, potential secondary structure, and the presence of uAUGs have a considerable effect on ribosome loading in Arabidopsis [31]. Furthermore, additional studies have focused on the GC content showing large variability among species, ~20 to 60% variation in eukaryotes [32]. Based on the findings from Duret and Stoletzki, GC3-rich genes tend to be shorter than GC3-poor genes [33, 34]. To investigate the hypothesis of synonymous codon usage (SCU), which is described as highly expressed genes undergoing stronger translational selection, for example, higher GC content, in seeded plants, Serres-Giardi et al. tested GC3-rich and GC-poor genes against expression. It was found that, in 154 plant species tested, expression was significantly and positively correlated with GC3 [35]. The results from these studies are interesting with respect to our results and may support and extend the understanding of gene architecture and gene expression in plants.

In addition, the patterns found in the coding sequences for Drosophila are consistent with previous research with animals. A study on Gallus gallus (chicken) found that the coding sequence length is negatively correlated with expression level [13] as shown in the Drosophila in this study. In other animal investigations, the research also reported that in highly expressed genes the length of the coding sequence and protein lengths were small [14, 36]. A popular bioinformatics technique used to detect subtle variations in sequences was used to identify differences between the 3′ UTR and protein coding sequences in the Drosophila. Interestingly, the study found greater number of segments in the 3′ UTR, suggesting greater functional complexity in the 3′ UTRs than in the coding sequence [37]. This could explain the differences in the CDS and 3′ UTR patterns found in this study. Genome size is also another important aspect in determining variability between organisms. A Drosophila melanogaster study has shown that genomes are subjected to constant change not only in their size but also in their composition [38].

Identification of similarities and differences in genomes, particularly between animals and plants that might result in speciation, has had a great deal of interest, with gene families, gene loss, and gene amplification being the focus of these studies [39]. The genomes of Arabidopsis and Drosophila are of similar size; however the number of genes identified varies, ~26,000 for Arabidopsis and ~14,000 for Drosophila. Differences start to emerge when gene families are examined; Arabidopsis appear to have 11,000 gene families, which have more than five members, in contrast to Drosophila which encode fewer genes [40]. Understanding the genome structure of these organisms before examining the finer details of the genome itself is an important strategy.

When the coding sequence is examined in association with gene expression there seems to be divergence in Arabidopsis and Drosophila, although we cannot yet conclude and refer in general to the difference between animal and plant genomes. Differences seen in the animal and plants species may be described by differences in life strategies [11]. Plant genomes appear much more dynamic [10], due to the sessile nature and response to adverse conditions through biochemical complexity and developmental plasticity [41]. In contrast, animal genomes are more conserved and stable, attributable to the ability to avoid adverse conditions [10]. There has been overwhelming evidence that natural selection appears to support the compactness of highly expressed genes in both animal and plant species [13, 16, 4244]. These results may elucidate to the theory on reduction costs of energy with shorter proteins and sequences, contributing to minimizing the cost of synthesis [45]. However it is important to highlight that the length of the coding region is only one of several factors that contribute to the complex nature of natural selection, species complexity, and gene regulation.

Furthermore, the noncoding untranslated sequences have been identified as important components in the regulation of transcription and translation, influencing translation initiation, stability, elongation, and the termination of the mRNA translation [46]. Modification to the lengths of the 5′ UTR and 3′ UTR sequences may contribute to the selective constraints between animal and plants species and may be influenced by environmental conditions [47]. For the 3′ UTR regions, the results of this study have shown similarities in the patterns between Arabidopsis and Drosophila, that is, positive correlation between length and gene expression. This is in agreement with our previous research for Arabidopsis [17].

The regulation of many genes has been known to be controlled primarily by 3′ UTRs, particularly those involved in development [48]. Other research has found that there was positive correlation with transposon and simple sequence repeats (SSRs), with these elements affecting the length and variation of both the 5′ and 3′ UTRs [49]. Differing lengths of the untranslated regions could also be affected by either selection or genetic drift [47]. These results may enforce the concept that these untranslated regions are prone to a higher level of environmental and evolutionary constraints compared to the coding sequences and it is plausible that selection shapes these lengths. However, Chen et al. looked at over 15 species and found that the elongation of 5′ UTR alone cannot lead to the emergence of organismal complexity [47], indicating that the untranslated regions may not be a true indication of organism evolution, thus supporting the similarities found in this research in the untranslated regions.

Furthermore, recent experimental studies have shed light on the complex ceRNA network dynamics in prostate cancer using the alternative cleavage and polyadenylation (APA). This study concluded that long 3′ UTRs tend to harbour more microRNA response elements (MREs) which in turn would influence biological process when the 3′ UTR length is modified. The understanding of 3′ UTR shortening has great potential in creating prognostic markers for oncogene expression [50]. Other research in mammalian brain development proposes that lengthening of 3′ UTRs offers considerable versatility in biological processes [51]. The findings in this study have amplified the importance of the noncoding 5′ and 3′ UTR regions and have shown differences in these regions compared to the coding sequence.

At a global scale, the picture emerging is that animal and plant species show similarities and divergences when comparisons are made with gene expression and the length distributions of the coding and noncoding regions. However, studying the association between expression levels and length can be intricate to interpret, including sample size variation between organisms, statistical methodology, and data transformation. It was our intention to take advantage of available genomic data to identify general responses and relations. Using the available technologies and data our results have shown some interesting correlation between gene expression and the basic gene architecture, length, especially in the 3′ UTR region. Further research is required to explore more details in the gene length distribution variations of different genes and different organisms, including known highly expressed genes such as heat shock protein genes (HSPs).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

  1. L. T. C. França, E. Carrilho, and T. B. L. Kist, “A review of DNA sequencing techniques,” Quarterly Reviews of Biophysics, vol. 35, no. 2, pp. 169–200, 2002. View at Publisher · View at Google Scholar · View at Scopus
  2. E. Shapiro, T. Biezuner, and S. Linnarsson, “Single-cell sequencing-based technologies will revolutionize whole-organism science,” Nature Reviews Genetics, vol. 14, no. 9, pp. 618–630, 2013. View at Publisher · View at Google Scholar · View at Scopus
  3. S. J. Marygold, P. C. Leyland, R. L. Seal et al., “FlyBase: improvements to the bibliography,” Nucleic Acids Research, vol. 41, no. 1, pp. D751–D757, 2013. View at Publisher · View at Google Scholar · View at Scopus
  4. S. Y. Rhee, W. Beavis, T. Z. Berardini et al., “The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community,” Nucleic Acids Research, vol. 31, no. 1, pp. 224–228, 2003. View at Publisher · View at Google Scholar · View at Scopus
  5. C. A. Ball and J. M. Cherry, “Genome comparisons highlight similarity and diversity within the eukaryotic kingdoms,” Current Opinion in Chemical Biology, vol. 5, no. 1, pp. 86–89, 2001. View at Publisher · View at Google Scholar · View at Scopus
  6. J. Kilian, D. Whitehead, J. Horak et al., “The AtGenExpress global stress expression data set: protocols, evaluation and model data analysis of UV-B light, drought and cold stress responses,” The Plant Journal, vol. 50, no. 2, pp. 347–363, 2007. View at Publisher · View at Google Scholar · View at Scopus
  7. C. L. Richards, U. Rosas, J. Banta, N. Bhambhra, and M. D. Purugganan, “Genome-wide patterns of Arabidopsis gene expression in nature,” PLoS Genetics, vol. 8, no. 4, Article ID e1002662, 2012. View at Publisher · View at Google Scholar · View at Scopus
  8. S. W. Robinson, P. Herzyk, J. A. T. Dow, and D. P. Leader, “FlyAtlas: database of gene expression in the tissues of Drosophila melanogaster,” Nucleic Acids Research, vol. 41, no. 1, pp. D744–D750, 2013. View at Publisher · View at Google Scholar · View at Scopus
  9. J. G. Sørensen, M. M. Nielsen, M. Kruhøffer, J. Justesen, and V. Loeschcke, “Full genome gene expression analysis of the heat stress response in Drosophila melanogaster,” Cell Stress and Chaperones, vol. 10, no. 4, pp. 312–328, 2005. View at Publisher · View at Google Scholar · View at Scopus
  10. F. Murat, Y. Van De Peer, and J. Salse, “Decoding plant and animal genome plasticity from differential paleo-evolutionary patterns and processes,” Genome Biology and Evolution, vol. 4, no. 9, pp. 917–928, 2012. View at Publisher · View at Google Scholar · View at Scopus
  11. E. Kejnovsky, I. J. Leitch, and A. R. Leitch, “Contrasting evolutionary dynamics between angiosperm and mammalian genomes,” Trends in Ecology and Evolution, vol. 24, no. 10, pp. 572–582, 2009. View at Publisher · View at Google Scholar · View at Scopus
  12. J. Warringer and A. Blomberg, “Evolutionary constraints on yeast protein size,” BMC Evolutionary Biology, vol. 6, article 61, 2006. View at Publisher · View at Google Scholar · View at Scopus
  13. Y. S. Rao, Z. F. Wang, X. W. Chai et al., “Selection for the compactness of highly expressed genes in Gallus gallus,” Biology Direct, vol. 5, article 35, 2010. View at Publisher · View at Google Scholar · View at Scopus
  14. S. Subramanian and S. Kumar, “Gene expression intensity shapes evolutionary rates of the proteins encoded by the vertebrate genome,” Genetics, vol. 168, no. 1, pp. 373–381, 2004. View at Publisher · View at Google Scholar · View at Scopus
  15. X.-Y. Ren, O. Vorst, M. W. E. J. Fiers, W. J. Stiekema, and J.-P. Nap, “In plants, highly expressed genes are the least compact,” Trends in Genetics, vol. 22, no. 10, pp. 528–532, 2006. View at Publisher · View at Google Scholar · View at Scopus
  16. H. Yang, “In plants, expression breadth and expression level distinctly and non-linearly correlate with gene structure,” Biology Direct, vol. 4, article 45, 2009. View at Publisher · View at Google Scholar · View at Scopus
  17. R. Caldwell, J. Kongcharoen, Y.-X. Lin, and R. Zhang, “The length distributions of non-coding and coding sequences in relation to gene expression: a case study on Arabidopsis thaliana,” in Proceedings of the International Conference Bioinformatics and Computational Biology (BIOCOMOP '10), CSREA Press, Las Vegas, Nev, USA, 2010.
  18. P. K. Ingvarsson, “Gene expression and protein length influence codon usage and rates of sequence evolution in Populus tremula,” Molecular Biology and Evolution, vol. 24, no. 3, pp. 836–844, 2007. View at Publisher · View at Google Scholar · View at Scopus
  19. N. G. C. Smith and A. Eyre-Walker, “Adaptive protein evolution in Drosophila,” Nature, vol. 415, no. 6875, pp. 1022–1024, 2002. View at Publisher · View at Google Scholar · View at Scopus
  20. P. Andolfatto, “Adaptive evolution of non-coding DNA in Drosophila,” Nature, vol. 437, no. 7062, pp. 1149–1152, 2005. View at Publisher · View at Google Scholar · View at Scopus
  21. Z. Lin and W.-H. Li, “Evolution of 5′ untranslated region length and gene expression reprogramming in yeasts,” Molecular Biology and Evolution, vol. 29, no. 1, pp. 81–89, 2012. View at Publisher · View at Google Scholar · View at Scopus
  22. C. J. Grisdale and N. M. Fast, “Patterns of 5′ untranslated region length distribution in encephalitozoon cuniculi: implications for gene regulation and potential links between transcription and splicing,” Journal of Eukaryotic Microbiology, vol. 58, no. 1, pp. 68–74, 2011. View at Publisher · View at Google Scholar · View at Scopus
  23. B. S. Cade and B. R. Noon, “A gentle introduction to quantile regression for ecologists,” Frontiers in Ecology and the Environment, vol. 1, no. 8, pp. 412–420, 2003. View at Publisher · View at Google Scholar
  24. L. Huang, W. Zhu, C. P. Saunders et al., “A novel application of quantile regression for identification of biomarkers exemplified by equine cartilage microarray data,” BMC Bioinformatics, vol. 9, article 300, 2008. View at Publisher · View at Google Scholar · View at Scopus
  25. H. Wang and X. He, “An enhanced quantile approach for assessing differential gene expressions,” Biometrics, vol. 64, no. 2, pp. 449–457, 2008. View at Publisher · View at Google Scholar · View at MathSciNet · View at Scopus
  26. K. Hanada, M. Higuchi-Takeuchi, M. Okamoto et al., “Small open reading frames associated with morphogenesis are hidden in plant genomes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 110, no. 6, pp. 2395–2400, 2013. View at Publisher · View at Google Scholar · View at Scopus
  27. G. Landis, J. Shen, and J. Tower, “Gene expression changes in response to aging compared to heat stress, oxidative stress and ionizing radiation in Drosophila melanogaster,” Aging, vol. 4, no. 11, pp. 768–789, 2012. View at Google Scholar · View at Scopus
  28. S. González-Pérez, J. Gutiérrez, F. García-García et al., “Early transcriptional defense responses in arabidopsis cell suspension culture under high-light conditions,” Plant Physiology, vol. 156, no. 3, pp. 1439–1456, 2011. View at Publisher · View at Google Scholar · View at Scopus
  29. D. Walther, R. Brunnemann, and J. Selbig, “The regulatory code for transcriptional response diversity and its relation to genome structural properties in A. thaliana,” PLOS Genetics, vol. 3, no. 2, e11, 2007. View at Google Scholar
  30. Y. Kim, G. Lee, E. Jeon et al., “The immediate upstream region of the 5′-UTR from the AUG start codon has a pronounced effect on the translational efficiency in Arabidopsis thaliana,” Nucleic Acids Research, vol. 42, no. 1, pp. 485–498, 2014. View at Publisher · View at Google Scholar · View at Scopus
  31. R. Kawaguchi and J. Bailey-Serres, “mRNA sequence features that contribute to translational regulation in Arabidopsis,” Nucleic Acids Research, vol. 33, no. 3, pp. 955–965, 2005. View at Publisher · View at Google Scholar · View at Scopus
  32. M. Lynch, The Origins of Genome Architecture, Sinauer Associates, Inc., 2007.
  33. L. Duret and D. Mouchiroud, “Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 8, pp. 4482–4487, 1999. View at Publisher · View at Google Scholar · View at Scopus
  34. N. Stoletzki, “The surprising negative correlation of gene length and optimal codon use—disentangling translational selection from GC-biased gene conversion in yeast,” BMC Evolutionary Biology, vol. 11, no. 1, article 93, 2011. View at Publisher · View at Google Scholar · View at Scopus
  35. L. Serres-Giardi, K. Belkhir, J. David, and S. Glémin, “Patterns and evolution of nucleotide landscapes in seed plants,” The Plant Cell, vol. 24, no. 4, pp. 1379–1397, 2012. View at Publisher · View at Google Scholar · View at Scopus
  36. G. P. S. Raghava and J. H. Han, “Correlation and prediction of gene expression level from amino acid and dipeptide composition of its protein,” BMC Bioinformatics, vol. 6, article 59, 2005. View at Publisher · View at Google Scholar · View at Scopus
  37. M. Algama, C. Oldmeadow, E. Tasker, K. Mengersen, and J. M. Keith, “Drosophila3 UTRs are more complex than protein-coding sequences,” PLoS ONE, vol. 9, no. 5, Article ID e97336, 2014. View at Publisher · View at Google Scholar · View at Scopus
  38. M. Boulesteix, M. Weiss, and C. Biémont, “Differences in genome size between closely related species: the Drosophila melanogaster species subgroup,” Molecular Biology and Evolution, vol. 23, no. 1, pp. 162–167, 2006. View at Publisher · View at Google Scholar · View at Scopus
  39. C. A. Ball and J. M. Cherry, “Genome comparisons highlight similarity and diversity within the eukaryotic kingdoms,” Current Opinion in Chemical Biology, vol. 5, no. 1, pp. 86–89, 2001. View at Publisher · View at Google Scholar · View at Scopus
  40. Arabidopsis Genome Initiative, “Analysis of the genome sequence of the flowering plant Arabidopsis thaliana,” Nature, vol. 408, no. 6814, pp. 796–815, 2000. View at Publisher · View at Google Scholar
  41. A. M. Wilczek, J. L. Roe, M. C. Knapp et al., “Effects of genetic perturbation on seasonal life history plasticity,” Science, vol. 323, no. 5916, pp. 930–934, 2009. View at Publisher · View at Google Scholar · View at Scopus
  42. C. I. Castillo-Davis, S. L. Mekhedov, D. L. Hartl, E. V. Koonin, and F. A. Kondrashov, “Selection for short introns in highly expressed genes,” Nature Genetics, vol. 31, no. 4, pp. 415–418, 2002. View at Publisher · View at Google Scholar · View at Scopus
  43. E. Eisenberg and E. Y. Levanon, “Human housekeeping genes are compact,” Trends in Genetics, vol. 19, no. 7, pp. 362–365, 2003. View at Publisher · View at Google Scholar · View at Scopus
  44. H. K. Stenøien, “Compact genes are highly expressed in the moss Physcomitrella patens,” Journal of Evolutionary Biology, vol. 20, no. 3, pp. 1223–1229, 2007. View at Publisher · View at Google Scholar · View at Scopus
  45. E. Vilaprinyo, R. Alves, and A. Sorribas, “Minimization of biosynthetic costs in adaptive gene expression responses of yeast to environmental changes,” PLoS Computational Biology, vol. 6, no. 2, Article ID e1000674, 15 pages, 2010. View at Publisher · View at Google Scholar · View at Scopus
  46. L. W. Barrett, S. Fletcher, and S. D. Wilton, “Regulation of eukaryotic gene expression by the untranslated gene regions and other non-coding elements,” Cellular and Molecular Life Sciences, vol. 69, no. 21, pp. 3613–3634, 2012. View at Publisher · View at Google Scholar · View at Scopus
  47. C.-H. Chen, H.-Y. Lin, C.-L. Pan, and F.-C. Chen, “The genomic features that affect the lengths of 5′ untranslated regions in multicellular eukaryotes,” BMC Bioinformatics, vol. 12, no. 9, article S3, 2011. View at Publisher · View at Google Scholar · View at Scopus
  48. C. Merritt, D. Rasoloson, D. Ko, and G. Seydoux, “3′ UTRs are the primary regulators of gene expression in the C. elegens germline,” Current Biology, vol. 18, no. 19, pp. 1476–1482, 2008. View at Publisher · View at Google Scholar · View at Scopus
  49. H. Liu, J. Yin, M. Xiao et al., “Characterization and evolution of 5 and 3 untranslated regions in eukaryotes,” Gene, vol. 507, no. 2, pp. 106–111, 2012. View at Publisher · View at Google Scholar · View at Scopus
  50. L. Li, D. Wang, M. Xue, X. Mi, Y. Liang, and P. Wang, “3′UTR shortening identifies high-risk cancers with targeted dysregulation of the ceRNA network,” Scientific Reports, vol. 4, 2014. View at Publisher · View at Google Scholar
  51. P. Miura, S. Shenker, C. Andreu-Agullo, J. O. Westholm, and E. C. Lai, “Widespread and extensive lengthening of 3′ UTRs in the mammalian brain,” Genome Research, vol. 23, no. 5, pp. 812–825, 2013. View at Publisher · View at Google Scholar · View at Scopus