Abstract

RNA-Seq is increasingly being used for gene expression profiling. In this approach, next-generation sequencing (NGS) platforms are used for sequencing. Due to highly parallel nature, millions of reads are generated in a short time and at low cost. Therefore analysis of the data is a major challenge and development of statistical and computational methods is essential for drawing meaningful conclusions from this huge data. In here, we assessed three different types of normalization (transcript parts per million, trimmed mean of M values, quantile normalization) and evaluated if normalized data reduces technical variability across replicates. In addition, we also proposed two novel methods for detecting differentially expressed genes between two biological conditions: (i) likelihood ratio method, and (ii) Bayesian method. Our proposed methods for finding differentially expressed genes were tested on three real datasets. Our methods performed at least as well as, and often better than, the existing methods for analysis of differential expression.

1. Introduction

One of the recent methods for gene expression profiling is RNA-Seq. An advantage of RNA-Seq over other gene expression profiling technologies is that it allows a comprehensive assay that does not require probes for targets to be specified in advance. It has particularly been used for de novo detection of splice junctions and allows genome wide expression profiling of organisms with unknown genome sequence [1].

By obtaining millions of short reads from the population of interest and by mapping these reads to the reference genome, RNA-Seq produces read count data. With enough reads from a sample, it has the potential to detect and quantify biologically significant RNAs with low and moderate abundances. Before detecting biologically significant RNAs, systematic technical variations due to experimental variability need to be removed retaining effects resulting from the biological process of interest. This process is also known as normalization. Various procedures for normalization of RNA-Seq have been proposed in literature, such as transcripts parts per million [2], trimmed mean of M values [3], and quantile normalization [4]. Though these methods have been frequently used, no comparative analysis has been presented so far.

Previous methods for identification of differential expressed genes include Bloom et al. [5] who identified differential expression by taking log ratio of the transcript counts; Hoen et al. [6] used a Student's t-test and alternatively also applied a Bayesian model of Vêncio et al. [7]. Marioni et al. [8] and Bullard et al. [4] suggested to use Poisson model (and Fisher's exact test, or a likelihood ratio test as an approximation to it) to test for differential expression. Recently published methods, EdgeR [9] and DESeq [10] use a Negative Binomial distribution to test for differential expression as it allows for over dispersion. We also propose two statistical methods for inferring differential expression for RNA-Seq data. They are likelihood ratio method and Bayesian method. The methods are generic and can be applied to data with or without replication.

Methods for normalization, differential expression, along with the details of the dataset used to test the performance of our methods are detailed in the next section. Results along with a systematic comparison are presented on three real datasets and we conclude with a brief discussion.

2. Material and Methods

2.1. Data

Datasets used to test the performance of our methods.

Dataset 1. Marioni et al. [8] conducted RNA-Seq experiment with liver and kidney of a single human male using Illumina Genome Analyzer sequencing platform. Each tissue was sequenced in seven lanes, split across two runs of the machine and two different cDNA concentrations (1.5 pM, 3 pM). For this work, we only use data sequenced at 3 pM concentration (five lanes for each sample) and 17708 Ensembl transcripts that mapped with the array probes.

Dataset 2. Vaz et al. [11] profiled miRNA expression from the normal peripheral blood mononuclear cells from two different individuals and cancer cells of myeloid lineage, K562 (chronic myelocytic leukemia) and HL60 (acute promyelocytic leukemia) using Solexa technology.

Dataset 3. Mastrokolias et al. [12] analyzed 6 globin reduced with 6 nonreduced human whole blood RNA samples using a tag sequencing method on the Illumina high-throughput sequencing platform.

3. Normalization

Normalization is a procedure to remove nonbiological influence on biological data and to make data comparable across experiments, runs, and lanes. Various normalization procedures have been proposed in literature for RNA-Seq and here we evaluate three different normalization methods: transcripts parts per million, trimmed mean of M values, quantile normalization. At present, Transcripts parts per million (TPM) is a standard procedure to normalize RNA-Seq data. Using this method, number of reads of a transcript/sequence are divided by the total clone count of the sample and multiplied by . Resulting normalized data is reported as reads (or transcripts) per million for each sample. One of the major problems with RNA-Seq data is that while the total number of reads for a sample is known, the composition of the RNA population is unknown. Thus, TPM normalization method has its limitations for datasets with marked different RNA composition. Trimmed mean of M values (TMM) normalization has been suggested to remove RNA compositional bias as TMM equates the overall expression levels of genes between samples by estimation of relative RNA production levels or scale factors. Another method in use is quantile normalization which has previously been applied for microarrays. In quantile normalization, the distribution of read counts in each lane is matched to a reference distribution defined in terms of median counts across sorted lanes.

4. Differential Expression

We propose two methods for inferring differential expression across two biological conditions with technical replicates, each of which yields one test statistics per gene: (i) likelihood ratio method (LRM) (Casella and Berger [13]), (ii) bayesian method (BM), an extension of technique due to Audic and Claverie [14] for more than 2 replicates within a condition. Let denotes the observed number of reads mapped to a gene in replicate ) under condition-1 and let denotes the observed number of reads mapped to a gene in replicate ) for condition-2. Since the number of reads mapped to a gene represents a small (less than 5%) fraction of the total number of reads obtained after sequencing, we assume and to follow independent Poisson distribution with different parameters. Methods are detailed for a gene and the same need to be applied for all genes.

4.1. Likelihood Ratio Method

For condition-1, follows Poisson distribution with parameters , with probability mass function as where denotes the true expression level of gene in replicate . As 's occur independently, the likelihood function of is given by To identify genes with similar read count across replicates, we test the null hypothesis = (say, ) against the alternative for some . Under , the maximum likelihood estimate (MLE) of is given by where and under , the MLE of is given by The likelihood ratio for testing = (say, ) for condition-1 is given by Similarly, for condition-2, follows Poisson distribution with parameters , . As derived above, the likelihood ratio for testing = (say, ) for condition-2 is given by where . For identifying differentially expressed genes across the two conditions, for a gene, define and to be independent Poisson random variables with parameters and , respectively, and test if . The joint likelihood of the two conditions is given as and the unconditional MLE's of and are given by and , respectively, MLE of under the hypothesis is . The likelihood ratio for testing is given by We reject the null hypothesis for the small values of the statistic, .

4.2. Bayesian Method

Back in 1997, the method of Audic and Claverie was used to establish the probability distribution governing the occurrence of the same rare event in repeated experiments and was applied for the analysis of digital gene expression profiles. It was then described for only 2 replicates which we have attempted to extend to 3 or more replicates and apply to RNA-Seq data. As defined before, represents the number of reads mapped to a gene in replicate 1 of the condition-1 and follows Poisson distribution where denotes the actual number of reads mapped to the gene. Let represents the number of reads mapped to a gene in replicate 2 of the condition-1. Then, where in above equation is the posterior probability of given occurrences of a gene in an experiment and is the probability of drawing observations from Poisson distribution with parameter . Using Bayes Theorem, Vêncio et al. [7] showed that, where the prior distribution is taken as uniform distribution over the interval . We extended the above results when the condition is replicated thrice and From Bayes Theorem, Again, using uniform prior for , we get which is a gamma random variable with scale parameter . This gives Therefore, Similarly, if the condition is replicated times, we consider the following probability. In order to find genes with similar read counts within a condition, we find two numbers such that Equation (18) implies that if the observation of the th replicate lies in the interval then we conclude with probability that there are no systematic differences between the replicates. Similarly, the results can be derived for replicates of a gene in condition-2 (i.e., , ). For Identifying differential expression across two conditions, define , to be independent Poisson random variables with parameters and , respectively, and use (11). Under the Bayesian method, we can only identify genes that are different across two conditions if the number of replicates for the two conditions are the same (i.e., ).

5. Results

5.1. Assessing Technical Variability Using Likelihood Ratio Method

We assessed the variability within technical replicates using Dataset 1 which comprises of liver and kidney tissue, each with five technical replicates and 17708 ENSEMBL transcripts. Boxplots of unnormalized data from both liver and kidney samples are shown in Figure 1(a). Variability within replicates and also across the two tissues can be clearly seen. Kidney being more variable was considered for further analysis.

We evaluate this variability statistically using a likelihood ratio method detailed in the previous section. The analysis was performed at 1, 2.5, 5, and 10 levels while considering two, three, four, and five replicates on the un-normalized data from kidney. As shown in Table 1, there is a decrease in the percentage of genes with similar counts as the number of replicates increases, which is expected; however, the decreases is only marginal. The percentage of genes with similar counts also decrease with the increase in the levels. Thus, Dataset 1 is highly reproducible with few systematic differences among the replicates.

5.2. Assessing the Impact of Normalization Using Likelihood Ratio Method

We assess the impact of all three normalization methods using the likelihood ratio method at 1, 2.5, 5, and 10 levels. We used data from liver tissue with five replicates without normalization, with TMM, Quantile, and TPM normalization. It can be seen from Table 2 that the percentage of genes with similar counts increased after TMM and Quantile normalization and, thus, reduction in variability after normalization. A gain of 2 is achieved after TMM or Quantile normalization while the performance of TPM normalization was found to be poor. Similar results were obtained on other two datasets. Figures 1, 2, and 3 represents boxplots of un-normalized, normalized after TPM, TMM and Quantile for Datasets 1, 2, and 3, respectively.

5.3. Comparison of Differential Expression Statistics

We compared the two proposed methods for inferring differentially expressed (DE) genes: Likelihood ratio method and Bayesian method on Datasets 2 and 3. We used the quantile normalized data from these datasets.

For comparison between any two biological conditions, the read count values from the conditions can be categorized under three categories. When both conditions have zero count. In this situation, nothing can be said about differential expression between the two conditions. When one sample has zero or low counts and a reasonable count in the other. This is an interesting biological phenomena where a gene is not expressed in one of the conditions. When both the conditions have reasonable count. We shall evaluate the performance of our methods based on second and third category.

For the quantile normalized Normal versus HL60 data (Dataset 2), 19 miRNAs are absent in either of the two samples and present with a reasonable count for the other and 155 miRNAs were present with read count of at least 5 in both the samples. Using the likelihood ratio method at 1 level of significance, all 19 miRNAs absent in either of the two conditions were identified as DE and out of the 155 miRNAs, 57 were identified as DE. Using the Bayesian method at 1 level of significance, miRNAs absent in either of the two conditions were also identified as DE and out of the 155 miRNAs, 58 were identified as DE. Nearly same miRNAs, except one, were identified as DE using both the methods. We also analyzed this dataset using DESeq and EdgeR and they did not identify miRNAs absent in one of the two conditions. Of the 155 miRNAs, DESeq identified 3 miRNAs as DE with value 0.01 and EdgeR identified 4 miRNAs as DE with value 0.01. Similar analysis was performed for Normal versus K562 and globin reduced versus nonreduced samples. See Additional file 1, 2, and 3 in supplementary material available online at http://dx.doi.org/10.5402/2012/817508 for detailed analysis and Table 3 for a systematic comparison between methods for all three datasets.

From Additional file 1 in supplementary material, it is clear that likelihood ratio method and Bayesian method give very similar results for Normal versus HL60 and Normal versus K562 datasets (Dataset 2). Both methods identified all miRNAs previously identified as differentially expressed in Vaz et al. [11]. However, DESeq and EdgeR could not identify most of the DE miRNAs reported in Vaz et al. [11]. Few miRNAs experimentally verified using RNase protection assay (RPA) and real-time RT-PCR in Vaz et al. [11] (i.e., miR-16, 22, 27a, 192, and let-7g) were identified with high fold in our analysis. In addition, we also identified differential expression of miR-181a family of HL60, previously reported in [15].

For globin reduced versus non-reduced data (Dataset 3), likelihood method reports 2513 significant genes at 1 level of significance, Bayesian method reports 2344 at 1 level of significance, DESeq reports 1505 with value 0.01 and EdgeR reports 2987 genes with value 0.01. From these numbers alone, it is difficult to comment on the performance of any method. Figure 4 demonstrates the distribution of expression strength of the significant gene list obtained from likelihood ratio method, DESeq, EdgeR, and all genes. One would expect the distribution of the significant gene lists to roughly follow the expression strength distribution for all genes. For likelihood ratio method and DESeq, this is true but not for EdgeR. EdgeR seems to be identifying genes from all expression strengths and thus not reflection biolog but the rigidity of its error models. Few genes experimentally verified in Mastrokolias et al. [12] using qPCR (i.e., CXorf25, HBA1, HBA2, HBD, HBB) were obtained with high fold values in our analysis. See additional file 3 in supplementary material for analysis.

Table 4 shows how a confidence interval was evaluated in Bayesian method for quantile normalized Normal-HL60 data. Hsa-let-7g has a read count of 15117 in Normal (condition-1) and 6236 in HL60 (condition-2). Using (18), for one replicate, we estimated the lower and upper bound of the confidence interval around Normal as 1386 and 1644. Read count of 6236 for hsa-let-7g in HL60 lies well outside the estimated confidence interval (1386, 1644). Thus, the read count in Normal and HL60 are significantly different and reported in Table 3 as T(i.e. true). Similar deductions can be made for others.

6. Discussions

We assessed three different types of normalizations and showed that though Illumina data is highly replicable before normalization, normalization further reduces the technical variability, likelihood ratio method was used to statistically evaluate variation across replicates. We also presented two methods for finding differentially expressed genes for RNA-Seq data with or without replicates, likelihood ratio method is a general method that does not impose any restriction on the equality of the number of replicates across the two conditions. Bayesian method on the other hand can only be applied if there is equality on the number of replicates for the two conditions being compared. The performance of both the methods was compared to DESeq, EdgeR. For small RNA dataset, likelihood ratio method and Bayesian method perform similarly but better than EdgeR and DESeq. For Dataset 3, the distribution of the significant gene lists from likelihood ratio method and DESeq roughly follows the expression strength distribution for all genes. However, this was not true for EdgeR.

For both likelihood ratio method and Bayesian method, we assume that the underlying distribution for observed number of reads to be Poisson. Poisson distribution is intuitively appealing and mathematically easy to handle but with a limitation that the mean and variance of Poisson random variable are the same. To avoid this, authors generally assume negative binomial distribution instead of Poisson. However, the efficiency of the proposed methods in identifying differentially expressed genes, their mathematical convenience, and simplicity should make these methods extremely useful.

Authors’ Contribution

R. Gupta processed the data, implemented the methods, conducted statistical analysis, and drafted the manuscript. I. Dewan was responsible for method development and method writing. R. Bharti did the comparison with existing methods. A. Bhattacharya provided valuable insights and helped in improving the paper writing. All authors read and approved the final paper.

Acknowledgment

R. Gupta has been supported by DBT-CoE.

Supplementary Materials

Supplementary Table 1 presents the differential expression analysis for Normal vs HL60 data using Likelihood ratio method, Bayesian method, DESeq and EdgeR.

Supplementary Table 2 presents the differential expression analysis for Normal vs K562 data using Likelihood ratio method, Bayesian method, DESeq and EdgeR.

Supplementary Table 3 presents the differential expression analysis for six globin reduced vs six non reduced samples using Likelihood ratio method, Bayesian method, DESeq and EdgeR.

  1. Supplementary Table 1
  2. Supplementary Table 2
  3. Supplementary Table 3