Statistical Analysis of High-Dimensional Genetic Data in Complex TraitsView this Special Issue
Dynamic Model for RNA-seq Data Analysis
By measuring messenger RNA levels for all genes in a sample, RNA-seq provides an attractive option to characterize the global changes in transcription. RNA-seq is becoming the widely used platform for gene expression profiling. However, real transcription signals in the RNA-seq data are confounded with measurement and sequencing errors and other random biological/technical variation. To extract biologically useful transcription process from the RNA-seq data, we propose to use the second ODE for modeling the RNA-seq data. We use differential principal analysis to develop statistical methods for estimation of location-varying coefficients of the ODE. We validate the accuracy of the ODE model to fit the RNA-seq data by prediction analysis and 5-fold cross validation. To further evaluate the performance of the ODE model for RNA-seq data analysis, we used the location-varying coefficients of the second ODE as features to classify the normal and tumor cells. We demonstrate that even using the ODE model for single gene we can achieve high classification accuracy. We also conduct response analysis to investigate how the transcription process responds to the perturbation of the external signals and identify dozens of genes that are related to cancer.
Next-generation sequencing (NGS) technologies have revolutionized advances in the study of the transcriptome. The newly developed deep-sequencing technologies make it possible to acquire both quantitative and qualitative information regarding transcript biology. By measuring messenger RNA levels for all genes in a sample, RNA-seq provides an attractive option to characterize the global changes in transcription.
To generate RNA-seq data, the complete set of mRNAs are first extracted from an RNA sample and then shattered and reverse-transcribed into a library of cDNA fragments with adaptors attached. These short pieces of cDNA are amplified by polymerase chain reaction and sequenced by machine, producing millions of short reads. These reads are then mapped to a reference genome or reference transcript. The number of reads within a region of interest is used as a measure of abundance. The reads can also be assembled de novo without the genomic sequence to create a transcription map.
Compared to microarray which provides limited gene regulation information, RNA-seq offers a comprehensive picture of the transcriptome. RNA-seq has made a number of significant qualitative and quantitative improvements on gene expression analysis and provides multiple layers of resolutions and transcriptome complexity: the expression at exon, SNP, and positional level; splicing; posttranscriptional RNA editing across the entire gene; isoform and allele-specific expression .
Many advantages include strong concordance between platforms, higher sensitivity and dynamic range, lower technical variation and background signal, and high level of technical and biological reproducibility, and so on [2–5]. However, some limitations are inherent to next-generation sequencing technology. For example, the read coverage may not be homogeneous along the genome, and different samples may be sequenced at different levels of depth in the experiment. Also, although some genes may have a similar level of expression, longer genes are more likely to have more reads than short ones. Therefore, RNA-seq data must be normalized before any comparison of the counts can be made. Another consideration is that, in production of cDNA libraries, larger RNA must be fragmented into smaller pieces to be sequenced and different fragmentations may create bias towards different outcomes. Some other informatics challenges like the storage, transfer, and retrieval of large size data may bring additional errors [6, 7].
Expression variability measured by RNA-seq arises from three primary sources: (i) real biological differences in different experimental groups or conditions, (ii) measurement errors, and (iii) random biological and/or technical variation [1, 8]. The first type of variability is of real biological interest but is confounded with measurement and sequencing errors and other random biological/technical variation. How to appropriately take the latter two types of variability into account is essential issue in the RNA-seq data analysis.
The purpose of this paper is to borrow dynamic theory from engineering and use ordinary differential equation (ODE) for modelling the observed number of reads across the gene and unravelling the features of gene transcription . To achieve this goal, we considered the number of reads or expression level at each position as a function of the genomic position and viewed the transcription process as a stochastic process of transcription along the genome. Instead of taking the derivative of expression level with respect to time, we calculated the expression level derivative with respect to genomic position. Specifically, we proposed a dynamic model for the variation of the transcription process along the genome. For each gene, we use a second order ODE with location-dependent coefficient to model that gene’s transcription process. We develop statistical methods for estimation of the coefficient functions in the ODE based on principal differential analysis. Compared to the ODE model with constant coefficient to capture the stochastic variation feature of transcription process, the location-dependent coefficients are essential to account for the complicated stochastic process of gene regulation.
To examine the precision of the ODE for modeling the RNA-seq data, we split the samples into five groups and use 5-fold cross validation to evaluate the accuracy of predicting gene expression level across the gene using the ODE model.
To capture stochastic feature of gene regulation, we conduct the response analysis. The response analysis of transcriptional processes for each gene using its fitted differential equation can provide important aspects of transcription, including alternative splicing, alternative start and end of transcription, and alternative isoforms. To differentiate feature of gene regulation between normal and cancer tissue samples, we develop statistics to test for significant difference in the response of the gene regulation between the normal and cancer samples under the perturbation of external signals and perform genome-wide response analysis of gene regulation. Using the ODE model, we identified the genes that have a significantly different transcriptional process (both different magnitude and different patterns) and identified genes that showed significantly differential stochastic behaviors in response to environmental perturbations between normal and cancer samples.
To further explore application of the ODE for RNA-seq data analysis, we take the location-varying coefficients of the ODE as features and use FPCA as a tool for extraction of these features. The FPCA scores are used as features and the Lasso logistic regression is used as feature selection tool and classifier for distinguishing the cancer and normal samples.
The data suggest that the dynamic features of gene transcription captured by the coefficient functions can retrieve the original process information. Therefore, they naturally served as good candidate’s features for clustering genes with similar transcription process. These groups of genes could share common biological function, chromosomal location, pathway, or regulation. The ODE for modeling the RNA-seq data has the potential to provide valuable information for understanding the mechanism of gene regulation and unraveling disease processes.
2. Materials and Methods
2.1. ODE Model with Varying Coefficients for RNA-seq Data
Assume that the expression of a gene is measured by the number of sequence reads mapped to this gene in the region . Let denote a genomic position, let be observed gene expression level that was measured by the number of reads mapped to the genomic position, and let be the hidden state that determined the gene expression level at the genomic position . To model transcription process, the second order ordinary differential equation (ODE) with location-varying coefficients can be specified as follows:where and are weighting coefficients or parameters in the ODE. Its observations often have measurement errors:where is measurement error at the position .
2.2. Estimation of Coefficient Functions in the ODE
Estimation of coefficient functions in the ODE consists of two steps. At the first step we estimate the states from the observed number of reads assuming that coefficient functions in the ODE are given. At the second step, we estimate the coefficient functions in the ODE, assuming that states have been estimated.
Step 1. To estimate , we first expand the function in terms of basis functions and then estimate its expansion coefficients. Let be the state variable at the genomic position of the th sample and let be its observation (). Then, can be expanded aswhere and .
Similarly, the parameters and can be expanded asSubstituting their expansions into (1), we obtainwhere , , and . To smooth the estimated function , we impose the following penalty term:where .
We estimate the state function from the observation data by minimizing the following objective function which consists of the sum of the squared errors between the observations and the estimated states and the penalty terms:
LetProblem (7) can be rewritten in a matrix form:The least square estimators of the expansion coefficients are then given by
Step 2. Next we estimate the coefficient functions in the ODE. The coefficient functions in the ODE can be estimated by minimizing the following least squares objective function:where .
Since , the can be expressed in terms of the estimated expansion coefficients aswhere the matrix is defined as
Therefore, problem (11) can be reduced aswhere the matrix is estimated and hence fixed in minimization problem (14). Setting the partial derivative of to be zero,Solving (15) for , we obtainIn summary, we iteratively determine the expansion coefficients of the state function for fixed parameters in the ODE by (10) and estimate the coefficient functions in the ODE for fixed expansion coefficients by (16).
2.3. ODE for Classification
To illustrate that the ODE can be used as a useful tool for modeling the profile of the RNA-seq expression we will show that the ODE can capture all variation of gene expression across the gene and that the coefficient functions of the ODE are useful feature extraction of the RNA-seq data. The ODE can be used for classifying tumor and normal samples.
Since dimensions of the coefficient functions of the ODE are extremely high, the functional principal component analysis (FPCA) is used to reduce the dimensions of the coefficient functions of the ODE.
The FPCA tries to find the dominant direction of variation around an overall trend function [10, 11]. Each principal component is specified by the weight function , and the principal component scores of the individuals in the sample are defined as the inner product of weight function and functional curves ():where, for convenience, we use to denote either or , that is, the coordinate value of functional curves at the direction of with highest variability. By projecting the functional curves onto set of eigenfunctions, we can reduce the dimension to finite number, functional principal component scores.
Suppose that for the th individual sample we obtain the functional principal component score:where and are the coefficient functions of the ODE for the th individual sample and , , are a set of eigenfunctions (or principal component functions). The original functional curves can be reduced to a finite feature matrix:where the is the number of principal components selected to explain the total variability.
To improve classification accuracy we use the Lasso logistic regression as a classifier. In simple logistic regression, we use the logit link to relate the mean of response with the covariates of interest. Let be the vector of observed covariates for th observation, and is the corresponding response outcome. For simplicity, we consider binary cases where or 0. The model is specified as the following posterior probability for th observation :where is the covariate vector of interest and is the intercept term. And the joint log-likelihood of the subjects is defined aswhich can be written asTo estimate the parameter, we set its derivatives to zero and get the score equationsSince (23) is nonlinear equation in , we usually use some iterative methods like Newton-Raphson algorithm to get the solution of .
By adding an penalty to the joint log-likelihood in (23) we have the following constrained maximization equation:where is the constrained log-likelihood and is tuning parameter to adjust the tradeoff between log-likelihood function and the size of penalty. Please note that, in Lasso, we usually do not penalize the intercept term and it is practically meaningful to standardize the covariates before optimization.
The penalty is not differentiable and also is not linear solution of response . It is not trivial to get the score functions but we can still have a solution using nonlinear programming method . The score functions for variables with nonzero coefficients have the formCoordinate descent method is one efficient method to compute the Lasso solution. It fixes the penalty parameters and optimize over each parameter successively, while holding the others fixed at current values. R package glmnet  can efficiently fit the Lasso logistic regression with large and .
2.4. Numerical Solution to the ODE with Bounded Values
We use collocation Runge-Kutta method for the solution of boundary value problem of the ODE. The basic idea is to find a set of polynomials of degree which satisfies the problem over the interval  for a set of pointsNote that, , they are distinct real numbers. Also the polynomial functions are set to satisfyThe numerical approximation at is given byR package bvpSolve implements the method for boundary value problem .
2.5. Response Analysis under Perturbation of External Signal
Gene regulatory properties are encoded in the parameter curves of the ODE modeling gene expressions. Testing significant difference in the parameter curves between two conditions can be used as a powerful tool to assess differential changing behaviors of the gene expression across the gene region between two conditions. Response analysis attempts to extract inherent features of the systems that capture and describe the behaviors of the system over genomic positions under different operating conditions and perturbation of external signals.
Let denote a genomic region within the gene of interests and let be the number of reads mapped to the genomic region. And the ODE model used to describe the expression profile is given as follows:Suppose the and are estimated from the data. The response of a regulatory system depends on the input signals. Different signal will cause different responses. For simplicity, we consider unit-step signal forced on the system and then solve the responses of the original system between different groups using estimated parameters and :To solve the solution of the estimated ODE with unit-step force function , we have to use some numerical methods to approximate the solution . We solved ODE numerically by considering two-point boundary value problems where boundary conditions are specified at both ends of the range of integration. We estimated two initial values at both ends by evaluating the estimated smoothing expression curves at start and end positions.
Suppose is a vector-valued function to represent response functional for all subjects in the normal group and is response functional for subjects in cancer group. Therefore, we can construct a Hotelling . Suppose that the response functions were expanded in terms of eigenfunctions :where and and and are uncorrelated random variables with zero mean and variances with . Define the averages and of the principal component scores and in the normal and cancer group. Then we denote the average vector of scores in normal and cancer group bywhere and , .
The pooled covariance matrix iswhere and .
Let ; then the Hotelling statistics can be written asUnder null of no difference in the response of the gene regulation between two groups, the statistics follows distribution where is the number of principle component scores.
We apply the proposed model to kidney renal clear cell carcinoma (KIRC) RNA-seq data, which is available from The Cancer Genome Atlas (TCGA) project (https://tcga-data.nci.nih.gov/tcga/). The RNA-seq data is available for 72 matched pairs of KIRC and normal samples. The maximum number of genomic positions where the expressions were measured by the number of reads passing quality control is 382,239,893 in the raw BAM file. And the total number of genes is 19,717.
Samtools and bedtools were applied to count number of reads for each base of the gene. Affected mapping reads were taken as the scale factor to normalize the reads for each individual. Hg19 human genome was taken as the reference.
Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq, and known genes from the UCSC genome browser, which was downloaded on August 19, 2010, August 8, 2010, and August 19, 2010, respectively. Reads mapped to junction regions were then repositioned back to the genome and were marked with “ZJ:Z” tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment. Finally, reads failing the Illumina chastity filter were flagged with a custom script, and duplicated reads were flagged with Picard’s MarkDuplicates.
In order to make the data comparable, we applied log transformation on the observed expression profiles. Some genomic position has zero counts and we intentionally add 1 to it and then it returns to be zero after log transformation. After that expression counts for most of genes are of the same scale. We also mapped the genes onto the interval .
3.2. Evaluation of the ODE for Modeling RNA-seq Data
To evaluate the precision of the ODE for modeling the RNA-seq data, we first used the ODE to fit the RNA-seq data where the coefficient functions were estimated. Then, we used numerical collocation Runge-Kutta method to solve the fitted ODE. The solutions of the fitted ODE as a function of genomic position were then compared with the observed RNA-seq curves.
We estimated the varying-coefficient functions using the proposed model. The expression function for gene was first estimated by spline smoothing with some initial penalty. We then update the penalty using the proposed second order ODE with varying-coefficient functions. We iterated between curve smoothing and ODE estimation until convergence was achieved. The smoothing parameters were chosen by cross validation process. By selecting the value of , we trade off basis expansion fitting error and ODE solution filtering error. Larger value of put more emphasis on the ODE penalty and the solution to ODE with estimated parameters is more likely to approximate the original data.
To validate the estimates of coefficients functions in the model, it is essential to compare the observed gene expression curve to the ODE solution with estimated coefficient functions. We solved ODE numerically by considering two-point boundary value problems where boundary conditions are specified at both ends of the range of integration. We estimated two initial values at both ends by evaluating the estimated smoothing expression curves at start and end positions.
Figures 1(a) and 1(b) are fitted results of normal sample and cancer sample, respectively, for gene CD74. In these figures, the circles represent observed RNA-seq expression signal (green: normal; red: cancer); the blues lines are Fourier basis expansion to approximate the observed signal using weighted least square methods. The numbers of basis are chosen based on the length of genes and experimental adjustment to capture the important characteristics of gene expression curves. The dashed lines are estimated ODE solution using boundary value problem solver (R package bvpSolve). The ODE solutions approximate well the observed expression level of gene CD74, which show that estimated coefficient functions carry essential information of the original data. Once we have them, we can retrieve the original data very well.
To further evaluate the precision of the ODE for modeling RNA-seq data, we perform 5-fold cross validation prediction for gene RPL29. This method uses part of the available data to fit the model and estimate the parameters and uses the remaining data to test the model validity and estimate accuracy. We randomly split normal and cancer samples into five groups. From the estimation of parameters in the training samples, we solved the ODE with estimated coefficient functions to predict the expression curves of test samples. To be consistent, we estimated two initial values at both ends by evaluating the estimated smoothing expression curves at start and end positions in test samples. We also calculated the root mean square prediction error (RMSPE) for each folder to evaluate the performance of the prediction which is defined bywhere and are observed and predicted expression level; is the number of genomic positions where the RNA-seq is observed for the gene and is the number of subjects in the folder .
Table 1 lists RMSPE in each folder for normal and cancer groups. The normal group has slightly better performance in terms of prediction on the test samples. But both prediction errors are relatively small.
Figures 2(a) and 2(b) are prediction results for selected samples in test set for gene RPL29 in normal and cancer group, respectively. The gray dot is observed expression profile, and the solid red lines are Fourier basis expansion approximations to the observed expression data. The dashed green lines indicate the predicted gene expression profile in the test set by solving the estimated ODE in the training examples. We can observe that all of the prediction can capture the overall shape and fluctuation in the data. Secondly, they can also predict the magnitude of expression value with decent accuracy. These are predicted very close to the observed expression profiles.
3.3. Classification Analysis
These data suggest that the estimated coefficient functions capture important features of expression curves. From the solution to estimated ODE, we can see the exceptional retrieval of original data. From the prediction performance in the test set, we can also get well-predicted curves by just proving two initial boundary data points. It is natural to consider them as features to classify phenotype categories.
We obtain two coefficient functions from one expression function. We can use FPCA to help us to reduce the dimension of features and to ease the computational effort. We first applied FPCA technique on two coefficient functions and separately; then we combined two groups of the selected functional principal component scores as aggregated features before we provided them to classifier. In the end, we applied Lasso logistic regression to help us select features and make prediction on the groups.
Table 2 lists top 12 genes to differentiate normal and KIRC group using 5-fold cross validations. We can see that using a single gene it can reach as high as 90% classification accuracy. These data strongly indicate that the ODE model effectively captured the inherent features of RNA-seq expression profile. We also evaluated the performance of classification result using sensitivity, specificity, and accuracy. Sensitivity is defined as the percentage of cancer tissues correctly classified as cancer. Specificity is defined as the percentage of the normal tissues correctly classified as normal. The classification accuracy is defined as the percentage of the correctly classified normal and cancer tissues. The classification results can reach as high as 99% if we use all these 12 genes together as predictors.
3.4. Genome-Scale Clustering Analysis
In this section, we continue to use the estimated coefficient functions as features to cluster genes expression data to study the genome-wide transcriptome. By grouping genes with similar patterns of expression profiles, cluster analysis can provide insight into gene functions and biological process. It also gives a simple way of determining the functions of many genes for which information is not available, as genes with the same functions may share expression profiles. We assume the coefficient functions in ODE model help to define these patterns in the dynamic regulation process and give us clues to functional discovery and pattern grouping.
After we derive the feature matrix for all the genes from dimension reduction using FPCA, we merely need to adopt a metric definition which is used as a measure of similarity in the behavior of two genes. To calculate the distance matrix we used Euclidean distance and correlation matrix. This method computes a dendrogram that combines all genes in a single tree.
A total of 19717 genes were clustered into 9 groups according to the cluster analysis (Figures 3(a) and 3(b)). The functional principal component scores from coefficient functions in ODE model were used as significant features to define these patterns in the dynamic regulation process. The function annotation for each cluster was as follows.
The principle functions of the genes in the first group are mainly associated with oxidoreductase activity, ligase activity, dehydrogenase (NAD) activity, and related metabolic process. The detailed functions include aldehyde dehydrogenase (NAD) activity, translational initiation, mediator complex, MHC protein complex, mitochondrial membrane part, ion transmembrane transport, respiratory chain complex I, proton-transporting ATP synthase complex, proton-transporting two-sector ATPase complex, proton-transporting domain, NADH dehydrogenase (quinone) activity, oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor, positive regulation of protein ubiquitination, response to unfolded protein, heterocycle metabolic process, protein modification process, mitochondrial ATP synthesis coupled proton transport, glycerolipid metabolic process, macromolecule modification, proton-transporting two-sector ATPase complex, catalytic domain, regulation of translational initiation, oxidoreductase activity, acting on the aldehyde or oxo group of donors, RNA polymerase II transcription mediator activity, heme binding, positive regulation of ligase activity, negative regulation of ligase activity, negative regulation of ubiquitin-protein ligase activity involved in mitotic cell cycle, positive regulation of ubiquitin-protein ligase activity involved in mitotic cell cycle, regulation of ubiquitin-protein ligase activity involved in mitotic cell cycle, positive regulation of ubiquitin-protein ligase activity, negative regulation of ubiquitin-protein ligase activity, transferase activity, glycerolipid biosynthetic process, amine transport, phosphoinositide metabolic process, carboxylic acid transport, hormone binding, eukaryotic translation initiation factor 3 complex, glycerophospholipid metabolic process, helicase activity, response to protein stimulus, lipid biosynthetic process, phosphoinositide biosynthetic process, aldehyde dehydrogenase [NAD(P)+] activity, proton-transporting ATP synthase complex, coupling factor F(o), cytosolic part, nucleobase, nucleoside and nucleotide metabolic process, proton-transporting ATPase activity, rotational mechanism, phospholipid metabolic process, phosphorus metabolic process, phosphate metabolic process, hydrogen-exporting ATPase activity, phosphorylative mechanism, proton-transporting V-type ATPase complex, MHC class II protein complex, collagen, positive regulation of protein modification process, and posttranslational protein modification.
The principle functions of the genes in the second group are mainly associated with hydratase activity, cation transmembrane transporter activity and hydrolase activity, and related metabolic process. The detailed functions include NAD or NADH binding, peroxisomal membrane, microbody membrane, aconitate hydratase activity, 4 iron, 4 sulfur cluster binding, regulation of vesicle-mediated transport, lactate dehydrogenase activity, L-lactate dehydrogenase activity, long-chain fatty acid-CoA ligase activity, fatty acid ligase activity, homophilic cell adhesion, tight junction, cation transmembrane transporter activity, occluding junction, kinesin complex, microbody part, peroxisomal part, actin filament binding, hydrolase activity, hydrolyzing O-glycosyl compounds, hydrolase activity, and acting on glycosyl bonds.
The principle functions of the genes in the third group are mainly associated with monooxygenase activity, receptor activity, electron carrier activity, sodium ion transmembrane transporter activity, and related metabolic process The detailed functions include nucleoside binding, purine nucleoside binding, monooxygenase activity, receptor activity, protein binding, ATP-binding, DNA packaging, chromatin assembly or disassembly, electron carrier activity, sodium ion transmembrane transporter activity, actin cytoskeleton, RNA metabolic process, adenyl nucleotide binding, GTPase regulator activity, regulation of lipid transport, negative regulation of lipid transport, adenyl ribonucleotide binding, protein-DNA complex, very-low-density lipoprotein particle, triglyceride-rich lipoprotein particle, cellular nitrogen compound metabolic process, cellular macromolecule biosynthetic process, nucleosome organization, chylomicron, organelle, intracellular organelle, cellular biosynthetic process, keratin filament, regulation of transcription, regulation of biological process, regulation of cellular process, regulation of nitrogen compound metabolic process, regulation of RNA metabolic process, regulation of macromolecule metabolic process, nucleoside-triphosphatase regulator activity, biological regulation, DNA conformation change, and regulation of primary metabolic process.
The principle functions of the genes in the fourth group are mainly associated with acyl-CoA thioesterase activity, oxidoreductase activity, phosphatase activity, and related metabolic process. The detailed functions include organellar small ribosomal subunit, organellar large ribosomal subunit, phospholipid-translocating ATPase activity, glutathione transferase activity, receptor signaling protein serine/threonine kinase activity, transmembrane receptor activity, inward rectifier potassium channel activity, organic acid transmembrane transporter activity, mitochondrial matrix, mitochondrial large ribosomal subunit, mitochondrial small ribosomal subunit, cytosol, translation, translational elongation, cell surface receptor linked signaling pathway, large ribosomal subunit, small ribosomal subunit, integral to membrane, acyl-CoA thioesterase activity, oxidoreductase activity, acting on NADH or NADPH, phosphatase activity, cytosolic ribosome, signaling process, signal transmission, intrinsic to membrane, negative regulation of protein ubiquitination, cullin-RING ubiquitin ligase complex, and mitochondrial lumen.
The principle functions of the genes in the fifth group are mainly associated with cell projection part, microtubule associated complex, motor activity, microtubule, axoneme, microtubule-based process, microtubule-based movement, microtubule cytoskeleton, dynein complex, cytoskeletal part, cilium, macromolecular complex, cilium axoneme, cell projection, protein complex, cilium part, pyrophosphatase activity, hydrolase activity, acting on acid anhydrides, hydrolase activity, acting on acid anhydrides, phosphorus-containing anhydrides, and nucleoside-triphosphatase activity.
The principle functions of the genes in the sixth group are mainly associated with intracellular signal transduction, cholesterol efflux, UDP-galactosyltransferase activity, and histone demethylase activity.
The principle functions of the genes in the seventh group are mainly associated with ATP-binding cassette (ABC) transporter complex, JNK cascade, ATP-dependent peptidase activity.
The principle functions of the genes in the eighth group are mainly associated with glutamate receptor activity, ATPase activity, cytoskeletal protein binding, and myosin filament.
The principle functions of the genes in the ninth group are mainly associated with adrenoceptor activity, inhibition of adenylate cyclase activity by G-protein signaling pathway, adenosine deaminase activity, hydrolase activity, acting on carbon-nitrogen (but not peptide) bonds, cyclic amidines, deaminase activity, adenylate cyclase activity, activation of protein kinase A activity, alpha-adrenergic receptor activity, adrenergic receptor binding, epinephrine binding, regulation of norepinephrine secretion, norepinephrine transport, positive regulation of blood pressure, norepinephrine secretion, oxidoreductase activity, acting on CH-OH group of donors, oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor, and delayed rectifier potassium channel activity.
Since the ODE model use the dynamic information of gene expression profiles and we consider genes with similar expression profiles will share common biological functions. Based on the cluster analysis, the genes grouped together have similar pattern of expression share common biological function. It directs us a way to find the functions of many genes for which information is unknown by looking the genes in the same group.
3.5. Response Analysis of Gene Regulation
The expression level of a gene measured by sequencing can be viewed as a curve or function of genomic position. The gene expression will vary across the gene region. If we treat time and space position as the same argument, all theories and methods of dynamic system can be applied to RNA-seq data analysis. The dynamic behavior of a system is encoded in the temporal evolution of its states or in the genomic location evolution of the gene expression in our problem. Therefore, borrowing dynamic theory, we can study the location-dependent variation of gene expression under the perturbation of the external signals. The transient response of the dynamic systems is an important property of the system itself. It can be used to quantify the space domain characteristics of the gene regulation system responding to the disturbance of environments. Our goal is to investigate how the gene expression level at each genomic position varies in response to the external perturbation and whether this will affect the function of cell.
We conducted response analysis of 19,717 genes under unit-step signal perturbation. We used the Hoteling statistic that was described in Section 2.5 to identify 31 genes that showed significant difference in the response property. The names of 31 genes with significant difference in response property were summarized in Table 3. In a few cases, the matrix may be singular; we can use penalized method or generalized inverse to estimate . However, this will inflate the false positive rates.
We present Figures 4(a)–4(d) showing the average expression curves, unit-step response curves, and the coefficient curves of the ODE of gene CD74, respectively. We observed that gene CD74 not only showed significant difference in gene expression and coefficient curves of the ODE but also demonstrated strong difference in the unit-step response. The changing point of gene expression curve and unit-step response curve occurred between 11b and 12a where a splicing site is located. It was reported that CD74 played critical role in cancer cell tumorigenesis  and downregulation of CD74 inhibits growth and invasion in clear cell renal cell carcinoma .
Transient response is one of dynamic properties. As shown in Figures S1A–D in Supplementary Material available online at http://dx.doi.org/10.1155/2015/916352, gene ABHD10 that did not show significant difference in gene expression and coefficient curves of the ODE demonstrated strong difference in the unit-step response.
Figures S2A–D plotted the average expression curves, unit-step response curves, and the coefficient curves of the ODE of gene BTS2, respectively. Gene BTS2 was differentially expressed but did not show significant difference in coefficient of the ODE between tumor and normal samples. Gene BTS2 was identified to have significant difference in the unit-step response. The pattern of difference in the unit-step response may mainly due to rapid changes of gene expression in the region close to genomic position 20. From the literature we found that BTS2 was associated with a number of cancers [18, 19].
Dominant methods in literature for RNA-seq data analysis use a single valued summary statistic to represent expression level of a gene. However, a single number oversimplifies complex expression variation pattern across a gene and ignores information on alternative splicing and isoform and expression level variation at the genomic position level. To extract biologically useful expression variation signals across gene from RNA-seq data is a challenge, but important task. To meet this challenge, we have proposed using the ODE for modeling the RNA-seq data and addressed several essential issues for application of the ODE model to RNA-seq data analysis.
The first issue is how to use the ODE for modeling the RNA-seq data. We considered the number of reads or expression level at each position as a function of the genomic position and viewed the transcription process as a stochastic process of transcription along the gene. Borrowing dynamic theory from engineering, we have used the second order ODE to model the expression function of the gene measured by RNA-seq. We have employed differential principal analysis to develop statistical methods for estimation of location-varying coefficients of the ODE. We observed that the second order ODE almost has as good accuracy to predict gene expression as the third order ODE. But the third order ODE requires one more degree to describe the model. Therefore, the second order ODE is good enough to model gene expression.
The second issue is the precision of the ODE to model the RNA-seq data. We randomly split normal and cancer samples into five groups. From the estimation of parameters in the training samples, we solved the ODE with estimated coefficient functions to predict the expression curves of test samples. We have showed that the accuracy of the prediction by the second order ODE was very high and the root mean square prediction errors were quite small.
The third issue is how to extract useful regulatory signals from the RNA-seq data confound with measurement errors and sequencing technology variation. Since the second order ODE can model RNA-seq data very well, the location-coefficient functions of the ODE may well characterize the features of the regulatory process and measure the impact of the gene expression on the function of the cells and tissues. We have demonstrated that using location-coefficient functions of the second order ODE as features we have accurately classified the tumor and normal samples.
The fourth issue is to explore the applications of the ODE for RNA-seq data analysis. We have showed that the ODE can be used as a powerful tool to study the response of the gene transcription to the perturbation of environments. We have identified a number of cancer associated genes which showed significant difference in the response of the gene transcription between tumor and normal tissues but were not differentially expressed.
To our knowledge, this is the first time to use the ODE for modeling the RNA-seq data and investigation of gene transcription process. Our results were very preliminary. The samples were used to validate the accuracy of the ODE model to fit the real RNA-seq data. Large-scale validation and experiments for evaluating the model precision are urgently needed. Although the response analysis of dynamic model for the transcription process can help us to study how the external signals affect the gene expression variation across the gene, the mechanism of the gene transcription variation under the perturbation of external signals is largely unknown. The experiments for validation of the results of the response analysis of the dynamic models need to be performed. We lack consensus methods for RNA-seq data analysis. We are facing great challenges in developing innovative approaches and general framework for RNA-seq data analysis.
In conclusion, this study proposes the second order ODE for modeling RNA-seq data. We have demonstrated that the estimated ODE can accurately predict the gene expression level across the gene. We have showed that the location-dependent coefficients of the ODE effectively extract regulatory signals from the RNA-seq confounded with the measurement errors and sequencing technology variation and capture the inherent features of the transcription process. The results have showed that using coefficients of the ODE as features we can reach very high accuracy for classifying tumor and normal samples. Finally, we have demonstrated that using transient response analysis of dynamic system we identified 31 genes with significant differential response behavior between tumor and normal samples related to cancer.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
The project described was supported by Grants 1R01AR057120-01 and 1R01HL106034-01 from the National Institutes of Health and NHLBI. The authors wish to thank TCGA working group for providing RNA-seq data. The authors wish to acknowledge the contributions of the research institutions, study investigators, field staff, and study participants in creating the TCGA datasets for biomedical research.
The transient response of the dynamic systems is an important property of the system itself. It can be used to quantify the space domain characteristics of the gene regulation system responding to the disturbance of environments. Our goal is to investigate how the gene expression level at each genomic position varies in response to the external perturbation and whether this will affect the function of cell. Figures S1A-D and Figures S2A-D plotted the average expression curves, unit-step response curves, the coefficient curves of the ODE of genes gene ABHD10 and BTS2, respectively.
H. Xiong, J. B. Brown, N. Boley, P. J. Bickel, and H. Huang, “DE-FPCA: testing gene differential expression and exon usage through functional principal component analysis,” in Statistical Analysis of Next Generation Sequencing Data, pp. 129–143, Springer, New York, NY, USA, 2014.View at: Google Scholar
K. Ogata, System Dynamics, Prentice Hall, 3rd edition, 1997.
J. O. Ramsay, Functional Data Analysis, Wiley Online Library, 2006.
J. Friedman, T. Hastie, and R. Tibshirani, “Regularization paths for generalized linear models via coordinate descent,” Journal of Statistical Software, vol. 33, no. 1, pp. 1–22, 2010.View at: Google Scholar
Y.-H. Liu, C.-Y. Lin, W.-C. Lin, S.-W. Tang, M.-K. Lai, and J.-Y. Lin, “Up-regulation of vascular endothelial growth factor-D expression in clear cell renal cell carcinoma by CD74: a critical role in cancer cell tumorigenesis,” The Journal of Immunology, vol. 181, no. 9, pp. 6584–6594, 2008.View at: Publisher Site | Google Scholar
S.-Q. Ji, X.-L. Su, W.-L. Cheng, H.-J. Zhang, Y.-Q. Zhao, and Z.-X. Han, “Down-regulation of CD74 inhibits growth and invasion in clear cell renal cell carcinoma through HIF-1α pathway,” Urologic Oncology: Seminars and Original Investigations, vol. 32, no. 2, pp. 153–161, 2014.View at: Publisher Site | Google Scholar
A. Sayeed, G. Luciani-Torres, Z. Meng, J. L. Bennington, D. H. Moore, and S. H. Dairkee, “Aberrant regulation of the BST2 (Tetherin) promoter enhances cell proliferation and apoptosis evasion in high grade breast cancer cells,” PLoS ONE, vol. 8, no. 6, Article ID e67191, 2013.View at: Publisher Site | Google Scholar