BioMed Research International

Volume 2016 (2016), Article ID 3017475, 14 pages

http://dx.doi.org/10.1155/2016/3017475

## Finding Clocks in Genes: A Bayesian Approach to Estimate Periodicity

^{1}Department of Environmental Health, University of Cincinnati, Cincinnati, OH 45267-0056, USA^{2}Department of Molecular and Cellular Physiology, University of Cincinnati, Cincinnati, OH 45267-0576, USA^{3}Department of Mathematical Sciences, University of Cincinnati, Cincinnati, OH 45221-0025, USA

Received 3 November 2015; Accepted 28 April 2016

Academic Editor: Sher Afzal Khan

Copyright © 2016 Yan Ren et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Identification of rhythmic gene expression from metabolic cycles to circadian rhythms is crucial for understanding the gene regulatory networks and functions of these biological processes. Recently, two algorithms, JTK_CYCLE and ARSER, have been developed to estimate periodicity of rhythmic gene expression. JTK_CYCLE performs well for long or less noisy time series, while ARSER performs well for detecting a single rhythmic category. However, observing gene expression at high temporal resolution is not always feasible, and many scientists are interested in exploring both ultradian and circadian rhythmic categories simultaneously. In this paper, a new algorithm, named autoregressive Bayesian spectral regression (ABSR), is proposed. It estimates the period of time-course experimental data and classifies gene expression profiles into multiple rhythmic categories simultaneously. Through the simulation studies, it is shown that ABSR substantially improves the accuracy of periodicity estimation and clustering of rhythmic categories as compared to JTK_CYCLE and ARSER for the data with low temporal resolution. Moreover, ABSR is insensitive to rhythmic patterns. This new scheme is applied to existing time-course mouse liver data to estimate period of rhythms and classify the genes into ultradian, circadian, and arrhythmic categories. It is observed that 49.2% of the circadian profiles detected by JTK_CYCLE with 1-hour resolution are also detected by ABSR with only 4-hour resolution.

#### 1. Introduction

Organisms from cyanobacteria to humans have robust time-keeping mechanisms called biological clocks [1, 2]. In mammals, for example, the suprachiasmatic nucleus (SCN) located in the hypothalamus controls circadian rhythms and coordinates timing information with peripheral clocks. Collectively, these clocks regulate rhythmic physiological behaviors such as body temperature, cardiac repolarization, sleep/wake cycle, and metabolism [3–6]. Autonomous oscillations arise from the interplay of core clock components that form transcriptional-translational feedback loops [7]. As protein levels of clock-transcription factors oscillate, their downstream targets also oscillate. Different clocks (e.g., metabolism and cell cycle) may have different patterns of oscillation and target different output genes. Circadian rhythms cycle with a period of about 24 hours, whereas ultradian rhythms cycle with a period of less than 24 hours, and infradian rhythms cycle with a period greater than 24 hours. It has been shown that circadian clocks regulate both circadian and ultradian rhythms [8].

Circadian rhythms coordinate temporal regulation of other cellular processes. For example, the circadian clock regulates transcriptional activation of Wee1, a critical component in the cell cycle that coordinates timing of cell division [9, 10]. Thus, the study of rhythmic gene expression may reveal individual genes (nodes) or even parts of regulatory networks shared by different cellular processes. Finding and characterizing periodic gene expression are a prerequisite for determining these links amongst different oscillatory processes, such as circadian clock, cell cycle, and metabolic cycles.

A series of gene expression levels observed at a set of different time points is called a gene expression profile, and a rhythmic gene produces a rhythmic profile. In general, it is assumed that a rhythmic gene expression profile is correlated with rhythmic periodicity and hence each gene expression takes the form of a series of cosine curves: where is the observed gene expression at time , is the number of component cosine curves, and , , and are the amplitude, frequency, and phase of the th component cosine curve, respectively. Several methods have been developed to estimate periods as well as amplitudes and phases (mathematically) of gene expression profiles. Classical approaches such as Fisher’s -test [11] and fast Fourier transform (FFT) [12] perform well in estimating periods for long time series, but those approaches are less effective for short time series. Microarrays are commonly used to investigate changes of gene expressions over a time-course, and 4-hour resolution within a 48-hour time interval is a typical experimental design for circadian studies. In other words, microarray data provide short time series (e.g., with 12 time points) for each gene, which results in likely biased outcomes using either Fisher’s -test or FFT algorithms. Another widespread approach, COSOPT [13], effectively provides period estimate only with approximately sinusoidal data [14]. Recently, Hughes et al. [15] introduced the Jonckheere-Terpstra-Kendall (JTK_CYCLE) algorithm that applies the Jonckheere-Terpstra (JT) test to the null distribution of Kendall’s tau correlations. JTK_CYCLE is an efficient algorithm to estimate periodicity for long or less noisy time series; however, it is less reliable (as are all other methods) when it is applied to noisy short time series [16]. Yang and Su [17] developed an algorithm of autoregressive spectral estimation regression (ARSER) and showed that ARSER is more effective than Fisher’s -test and COSOPT in detecting oscillations in a variety of profile patterns, especially, for the microarray data in short time series. ARSER is useful to detect oscillations of a single category, for example, the circadian rhythms, but it is not efficient to detect multiple periods simultaneously.

In this paper, a new algorithm called the autoregressive Bayesian spectral regression (ABSR) is proposed. Built on ARSER, this ABSR algorithm significantly improves true discovery rate (TDR) and reduces FDR for noisy short time series as compared to JTK_CYCLE and ARSER. One of the features of ABSR comes from the use of posterior probabilities for model selection rather than the Akaike Information Criterion (AIC). In situations where the number of model parameters is large relative to the number of observations (e.g., the number of parameters is about one-half of the number of observations), AIC may fail to select the optimal model [18]. In addition, because AIC depends on the estimates of parameters, model selection by AIC may fail to select the most appropriate model if the parameter estimations are biased [19]. Using posterior probabilities for model selection overcomes the shortcomings of AIC by averaging over the uncertainty in the parameter estimates and leads to a more parsimonious model. Another feature of ABSR is that all possible frequencies in the harmonic models are considered and only the unique dominant frequency is extracted for the period estimate. Hence ABSR is able to classify rhythmic genes by different periods.

In Section 2, we present the model to obtain periodic information from time-course data using ABSR algorithm. In Section 3, simulated data and information theory are used to assess the performance of ABSR, ARSER, and JTK_CYCLE, and these algorithms are applied to existing experimental time-course data from mouse liver. Brief conclusions are discussed in Section 4.

#### 2. Methods

##### 2.1. Overview

The proposed algorithm, the autoregressive Bayesian spectral regression (ABSR), is developed to identify rhythmic patterns in gene expression profiles. The procedure to obtain periodic information from time-course gene expression data is described below.

Suppose genes are observed in an experiment at time points with the same lag, and the observed profiles are considered as time series. Let the observed time series of the th gene be . The raw profile is then standardized, denoted by , as follows: where is the average value of the components of and is the standard error of the components of . The standardization is needed to unify the variances of the time series, and the unified variances can be led to comparable spectrum densities across profiles. Note that the standardization does not change the behavior of the time series. Significant linear trends in experimental data are observed broadly. They are not biologically meaningful but may affect the periodicity estimate. So a linear regression model is then fitted to to remove the linear trend from the time series, and the detrended time series is denoted by . The Savitzky-Golay (S-G) smoothing filter [20] with order 4 is then applied to in order to reduce the noise level without much biasing the data, and the resulting new time series is represented by . In an autoregressive model with order of , denoted by , the current state of a time series is assumed to depend on the previous states only. Since the longest period of interest in this study is 24 hours and the method is designed for 4-hour temporal resolution data, it is reasonable to consider an model, in which the gene expression levels within the previous 24 hours are considered. Both and are modeled via an process of order 6 and model parameters are estimated by each of the following three methods: Yule-Walker method [21, 22], Burg method [23], and maximum likelihood estimation (MLE) [24]. Thus six AR models, , for each manipulated gene expression profile are obtained. For each AR model, spectral analysis is then applied to obtain one set of frequencies along with their spectral densities. Unlike ARSER, all frequencies and their corresponding spectral densities are considered to estimate the period and classify the genes according to their periods into three categories: arrhythmic, ultradian, and circadian.

Next, let the six sets of frequencies obtained by fitting the AR models of the th gene expression profile be . For the th set of frequencies, a harmonic model is considered as follows: where is the detrended profile of the th gene at time , is the constant term of the th harmonic model for , are the elements of the frequency set provided that there are elements in that set, are unknown linear parameters of the trigonometric terms, and is the error term for the th gene th harmonic model at time . The posterior probabilities of the six harmonic models are estimated and the model with the largest posterior probability is selected as the optimal model. A period is defined as the dominant period if it corresponds to the highest peak of the frequency spectrum of the optimal model.

Lastly, each gene is classified according to the criteria described in Section 2.3. Figure 1 shows a flowchart describing the ABSR algorithm.