Computational and Statistical Approaches for Modeling of Proteomic and Genomic NetworksView this Special Issue
Spectral Analysis on Time-Course Expression Data: Detecting Periodic Genes Using a Real-Valued Iterative Adaptive Approach
Time-course expression profiles and methods for spectrum analysis have been applied for detecting transcriptional periodicities, which are valuable patterns to unravel genes associated with cell cycle and circadian rhythm regulation. However, most of the proposed methods suffer from restrictions and large false positives to a certain extent. Additionally, in some experiments, arbitrarily irregular sampling times as well as the presence of high noise and small sample sizes make accurate detection a challenging task. A novel scheme for detecting periodicities in time-course expression data is proposed, in which a real-valued iterative adaptive approach (RIAA), originally proposed for signal processing, is applied for periodogram estimation. The inferred spectrum is then analyzed using Fisher’s hypothesis test. With a proper -value threshold, periodic genes can be detected. A periodic signal, two nonperiodic signals, and four sampling strategies were considered in the simulations, including both bursts and drops. In addition, two yeast real datasets were applied for validation. The simulations and real data analysis reveal that RIAA can perform competitively with the existing algorithms. The advantage of RIAA is manifested when the expression data are highly irregularly sampled, and when the number of cycles covered by the sampling time points is very reduced.
Patterns of periodic gene expression have been found to be associated with essential biological processes such as cell cycle and circadian rhythm , and the detection of periodic genes is crucial to advance our understanding of gene function, disease pathways, and, ultimately, therapeutic solutions. Using high-throughput technologies such as microarrays, gene expression profiles at discrete time points can be derived and hundreds of cell cycle regulated genes have been reported in a variety of species. For example, Spellman et al. applied cell synchronization methods and conducted time-course gene expression experiments on Saccharomyces cerevisiae . The authors identified 800 cell cycle regulated genes using DNA microarrays. Also, Rustici et al. and Menges et al. identified 407 and about 500 cell cycle regulated genes in Schizosaccharomyces pombe and Arabidopsis, respectively [3, 4].
Signal processing in the frequency domain simplifies the analysis and an emerging number of studies have demonstrated the power of spectrum analysis in the detection of periodic genes. Considering the common issues of missing values and noise in microarray experiments, Ahdesmäki et al. proposed a robust detection method incorporating the fast Fourier transform (FFT) with a series of data preprocessing and hypothesis testing steps . Two years later, the authors further proposed a modified version for expression data with unevenly spaced time intervals . A Lomb-Scargle (LS) approach, originally used for finding periodicities in astrophysics, was developed for expression data with uneven sampling . Yang et al. further improved the performance using a detrended fluctuation analysis . It used harmonic regression in the time domain for significance evaluation. The method was termed “Lomb-Scargle periodogram and harmonic regression (LSPR).” Basically, these methods consists of two steps: transferring the signals into the frequency (spectral) domain and then applying a significance evaluation test for the resulting peak in the spectral density.
While numerous methods have been developed for detecting periodicities in gene expression, most of these methods suffer from false positive errors and working restrictions to a certain extent, particularly when the time-course data contain limited time points. In addition, no algorithm seems available to resolve all of these challenges. Microarray as well as other high-throughput experiments, due to high manufacturing and preparation costs, have common characteristics of small sample size , noisy measurements , and arbitrary sampling strategies , thereby making the detection of periodicities highly challenging. Since the number and functions of cell cycle regulated genes, or periodic genes, remain greatly uncertain, advances in detection algorithms are urgently needed.
Recently, Stoica et al. developed a novel nonparametric method, termed the “real-valued iterative adaptive approach (RIAA),” specifically for spectral analysis with nonuniformly sampled data . As stated by the authors, RIAA, an iteratively weighted least-squares periodogram, can provide robust spectral estimates and is most suitable for sinusoidal signals. These characteristics of RIAA inspired us to apply it to time-course gene expression data and conduct an examination on its performance. Herein, we incorporate RIAA with a Fisher's statistic to detect transcriptional periodicities. A rigorous comparison of RIAA with several aforementioned algorithms in terms of sensitivities and specificities is conducted through simulations and simulation results dealing with real data analysis are also provided.
In this study, we found that the RIAA algorithm can provide robust spectral estimates for the detection of periodic genes regardless of the sampling strategies adopted in the experiments or the nonperiodic nature of noise present in the measurement process. We show through simulations that the RIAA can outperform the existing algorithms particularly when the data are highly irregularly sampled, and when the number of cycles covered by the sampling time points is very few. These characteristics of RIAA fit perfectly the needs of time-course gene expression data analysis. This paper is organized as follows. In Section 2, we begin with an overview of RIAA. In Section 3, a scheme for detecting periodicities is proposed, and simulation models for performance evaluation and a real data analysis for validation purposes are presented. A complete investigation of the performance of RIAA and a rigorous comparison with other algorithms are provided in Section 4.
2. RIAA Algorithm
RIAA is an iterative algorithm developed for finding the least-squares periodogram with the utilization of a weighted function. The essential mathematics involved in RIAA is introduced in this section with the algorithm input being time-course expression data; for more details regarding RIAA, the readers are encouraged to check the original paper by Stoica et al. .
Suppose that the signals associated with the periodic gene expressions are composed of noise and sinusoidal components. Let , , denote the time-course expression ratios of gene at instances , respectively; are real numbers; . The least-squares periodogram is given by where is the solution to the following fitting problem:
Let , where and refer to the amplitude and phase of , respectively. The criterion in (2) can then be rewritten as
The second term in the above equation is data independent and can be omitted from the minimization operation. Hence, the criterion (2) is simplified to
We further apply and and derive an equivalent of (4) as follows:
The target of interest to the fitting problem now becomes and (instead of ), and the solution is well known to be where
After and are estimated, the least-squares periodogram can be derived.
2.2. Observation Interval and Resolution
Prior to implementation of RIAA for periodogram estimation, the observation interval and the resolution in terms of grid size have to be selected. To this end, the maximum frequency in the observation interval without aliasing errors for sampling instances ,, , can be evaluated by where is given by
The observation interval is hence chosen after is obtained.
To ensure that the smallest frequency separation in time-course expression data with regular or irregular sampling can be adequately detected, the grid size is chosen to be which, in fact, is the resolution limit of the least-squares periodogram. As a result, the frequency grids considered in periodogram are where the number of grids is given by
The following notations are introduced for the implementation of RIAA at a specific frequency : where and and denote variables and at frequency , respectively.
RIAA's salient feature is the addition of a weighted matrix to the least-squares fitting criterion. The weighted matrix can be viewed as a covariance matrix encapsulating the contributions of noise and other sinusoidal components in other than to the spectrum; it is defined as where and denotes the covariance matrix of noise in expression data , given by
Assuming that is invertible, in RIAA, a weighted least-squares fitting problem is formulated and considered for finding and (instead of using (5)), and it is written in the form of matrices using (13) as follows:
The iteration for estimating spectrum starts with initial estimates , in which the elements and are given by (6) with , . After initialization, the first iteration begins. First, the elements and of are applied to obtain using (16). Secondly, to get a good estimate of , the frequency at which the largest value- is located in the temporary periodogram , , derived using (20) with , is applied for obtaining a reversed engineered signal . The elements , in are given by
The phase of the cosine function is unknown; however, is estimable using where is the Euclidean norm. With estimates and , the estimates , = , in the first iteration are hence given by (15). After this, are inserted into the right-hand side of (19) and updated estimates , = , are derived. The algorithm consists of repeating these steps and updating and iteratively, where denotes the number of iterations, until a termination criterion is reached. If the process stops at the iteration, then the final RIAA periodogram is given by (20) using . The pseudocode in Algorithm 1 represents a concise description of the iterative RIAA process.
Figure 1 demonstrates our scheme for periodicity detection and algorithm comparison. The first step involves a periodogram estimation, which converts the time-course gene expression ratios into the frequency domain. Three methods are considered for comparison: RIAA, LS, and a detrend LS (termed DLS), which uses an additional detrend function (developed in LSPR) before regular LS periodogram estimation is applied. The derived spectra are then analyzed using hypothesis testing. This study is conducted using a Fisher's test, with the null hypothesis that there are no periodic signals in the time domain and hence no significantly large peak in the derived spectra. The algorithm performance is evaluated and compared via simulations and receiver operating characteristic (ROC) curves. In real microarray data analysis, three published benchmark sets are utilized as standards of cell cycle genes for performance comparison.
3.1. Fisher's Test
After the spectrum of time-course expression data is obtained via periodogram estimation, a Fisher's statistic for gene with the null hypothesis that the peak of the spectral density is insignificant against the alternative hypothesis that the peak of the spectral density is significant is applied as where refers to the periodogram derived using RIAA, LS, or DLS. The null hypothesis is rejected, and the gene is claimed as a periodic gene if its -value, denoted as , is less than or equal to a specific significance threshold. For simplicity, is approximated from the asymptotic null distribution of assuming Gaussian noise  as follows:
In real data analysis, deviation might be invoked for the estimation of when the time-course data is short. This issue was carefully addressed by Liew et al. , and, as suggested, alternative methods such as random permutation may provide less deviation and better performance. However, permutation also has limitations such as tending to be conservative . While finding the most robust method for the -value evaluation remains an open question, it gets beyond the scope of this study since the algorithm comparison via ROC curves is threshold independent , and the results are unaffected by the deviation.
Simulations are applied to evaluate the performance of RIAA. The simulation models and sampling strategies used for simulations are described in the following paragraphs.
3.2.1. Periodic and Nonperiodic Signals
Three models, one for periodic signals and two for nonperiodic signals, are considered as transcriptional signals. Since periodic genes are transcribed in an oscillatory manner, the expression levels embedded with periodicities are assumed to be where denotes the sinusoidal amplitude; refers to the signal frequency; are Gaussian noise independent and identically distributed (i.i.d.) with parameters and . For nonperiodic signals, the first model is simply composed of Gaussian noise, given by
Additionally, as visualized by Chubb et al., gene transcription can be nonperiodically activated with irregular intervals in a living eukaryotic cell, like pulses turning on and off rapidly and discontinuously . Based on this, the second nonperiodic model incorporates one additional transcriptional burst and one additional sudden drop into the Gaussian noise, which can be written as where and are indicator functions, equal to 1 at the location of the burst and the drop, respectively, and otherwise. The transcriptional burst assumes a positive pulse while the transcriptional drop assumes a negative pulse. Both of them may be located randomly among all time points and are assumed to last for two time points. In other words, the indicator functions are equal to 1 at two consecutive time points, say, at and . The burst and the drop have no overlap.
3.2.2. Sampling Strategies
As for the choices of sampling time points , = , four different sampling strategies, one with regular sampling and three with irregular sampling, are considered. First, regular sampling is applied in which all time intervals are set to be , where is a constant. Secondly, a bio-like sampling strategy is invoked. This strategy tends to have more time points at the beginning of time-course experiments and less time points after we set the first time intervals as and set the next time intervals as . Third, time intervals are randomly chosen between and . The last sampling strategy, in which all time intervals are exponentially distributed with parameter , is less realistic than the others but it is helpful for us to evaluate the performance of RIAA under pathological conditions.
ROC curves are applied for performance comparison. To this end, 10,000 periodic signals were generated using (25) and 10,000 nonperiodic signals were generated using either (26) or (27). Sensitivity measures the proportion of successful detection among the 10,000 periodic signals and specificity measures the proportion of correct claims on the 10,000 nonperiodic simulation datasets. Sampling time points are decided by one of the four sampling strategies and the number of time points is chosen arbitrarily. For all ROC curves in Section 4, and .
3.3. Real Data Analysis
Two yeast cell cycle experiments synchronized using an alpha-factor, one conducted by Spellman et al.  and one conducted by Pramila et al. , are considered for a real data analysis. The first time-course microarray data, termed dataset alpha and downloaded from the Yeast Cell Cycle Analysis Project website (http://genome-www.stanford.edu/cellcycle/), harbors 6,178 gene expression levels and 18 sampling time points with a 7-minute interval. The second time-course data, termed dataset alpha 38, is downloaded from the online portal for Fred Hutchinson Cancer Research Center's scientific laboratories (http://labs.fhcrc.org/breeden/cellcycle/). This dataset contains 4,774 gene expression levels and 25 sampling time points with a 5-minute interval. Three benchmark sets of genes that have been utilized in Lichtenberg et al.  and Liew et al.  as standards of cell cycle genes are also applied herein for performance comparison. These benchmark sets, involving 113, 352, and 518 genes, respectively, include candidates of cycle cell regulated genes in yeast proposed by Spellman et al. , Johansson et al. , Simon et al. , Lee et al. , and Mewes et al.  and are accessible in a laboratory website (http://www.cbs.dtu.dk/cellcycle/).
RIAA performed well in the conducted simulations. As shown in Figure 2(a), a periodic signal (solid line) with amplitude and frequency is sampled using the bio-like sampling strategy, which applies 16 time points in (0,8] and 8 more time points in (8,16]. Gaussian noise with parameters = 0 and = 0.5 is assumed during microarray experiments. The resulting time-course expression levels (dots), at a total of 24 time points and the sampling time information were treated as inputs to the RIAA algorithm. Figure 2(b) demonstrates the result of periodogram estimation. In this example, the grid size was chosen to be 0.065 and a total of 11 amplitudes corresponding to different frequencies were obtained and shown in the spectrum. Using Fisher's test, the peak at the third grid (frequency 0.195) was found to be significantly large (-value = 2.4 10), and hence a periodic gene was claimed.
ROC curves strongly illustrate the performance of RIAA. In Figures 3 and 4, subplots (a)-(b), (c)-(d), (e)-(f), and (g)-(h) refer to the simulations with regular, bio-like, binomially random, and exponentially random sampling strategies, respectively. Additionally, in the left-hand side subplots (a), (c), (e), and (g), nonperiodic signals were simply Gaussian noise with parameters = 0 and = 0.5, while in the right-hand side subplots (b), (d), (f), and (h), nonperiodic signals involve not only the Gaussian noise but also a transcriptional burst and a sudden drop (27). Periodic signals were generated using (25) with amplitude = 1, , and . The only difference in simulation settings between Figures 3 and 4 is the frequency of periodic signals; they are = and , respectively. As shown in these figures, LS and DLS can perform well as RIAA when the time-course data are regularly sampled, or mildly irregularly sampled; however, when data are highly irregularly sampled, RIAA outperforms the others. The superiority of RIAA over DLS is particularly clear when the signal frequency is small.
Figure 5 illustrates the results of the real data analysis when these three algorithms, namely, the RIAA, LS, and DLS, were applied. On the -axis, the numbers indicate the thresholds that we preserved and classified as periodicities among all yeast genes; on the y-axis, the numbers refer to the intersection of preserved genes and the proposed periodic candidates listed in the benchmark sets. Figures 5(a)–5(c) demonstrate the results derived from dataset alpha when the 113-gene benchmark set, 352-gene benchmark set, and 518-gene benchmark set were applied, respectively. Similarly, Figures 5(d)–5(f) demonstrate the results derived from dataset alpha 38. The RIAA does not result in significant differences in the numbers of intersections when compared to those corresponding to LS and DLS in most of these cases. However, RIAA shows slightly better coverage when the dataset alpha 38 and the 113-gene benchmark set was utilized (Figure 5(d)).
In this study, the rigorous simulations specifically designed to comfort with real experiments reveal that the RIAA can outperform the classical LS and modified DLS algorithms when the sampling time points are highly irregular, and when the number of cycles covered by sampling times is very limited. These characteristics, as also claimed in the original study by Stoica et al. , suggest that the RIAA can be generally applied to detect periodicities in time-course gene expression data with good potential to yield better results. A supplementary simulation further shows the superiority of RIAA over LS and DLS when multiple periodic signals are considered (see Supplementary Figure available online at http://dx.doi.org/10.1155/2013/171530). From the simulations, we also learned that the addition of a transcriptional burst and a sudden drop to nonperiodic signals (the negatives) does not affect the power of RIAA in terms of periodicity detection. Moreover, the detrend function in DLS, designed to improve LS by removing the linearity in time-course data, may fail to provide improved accuracy and makes the algorithm unable to detect periodicities when transcription oscillates with a very low frequency.
The intersection of detected candidates and proposed periodic genes in the real data analysis (Figure 5) does not reveal much differences among RIAA, LS, and DLS. One possible reason is that the sampling time points conducted in the yeast experiment are not highly irregular (not many missing values are included), since, as demonstrated in Figures 3(a)–3(d), the RIAA just performs equally well as the LS and DLS algorithms when the time-course data are regularly or mildly irregularly sampled. Also, the very limited time points contained in the dataset may deviate the estimation of -values  and thus hinder the RIAA from exhibiting its excellence. Besides, the number of true cell cycle genes included in the benchmark sets remains uncertain. We expect that the superiority of RIAA in real data analysis would be clearer in the future when more studies and more datasets become available.
Besides the comparison of these algorithms, it is interesting to note that the bio-like sampling strategy could lead to better detection of periodicities than the regular sampling strategy (as shown in Figures 3(c) and 3(d)). It might be beneficial to apply loose sampling time intervals at posterior periods to prolong the experimental time coverage when the number of time points is limited.
The authors would like to thank the members in the Genomic Signal Processing Laboratory, Texas A&M University, for the helpful discussions and valuable feedback. This work was supported by the National Science Foundation under Grant no. 0915444. The RIAA MATLAB code is available at http://gsp.tamu.edu/Publications/supplementary/agyepong12a/.
Supplementary Figure s1: demonstrates the simulation results considering multiple periodic signals. Two sinusoidal waves with different frequency settings are superimposed as the periodic signal, and RIAA, LS, and DLS are applied for comparison.
P. T. Spellman, G. Sherlock, M. Q. Zhang et al., “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Molecular Biology of the Cell, vol. 9, no. 12, pp. 3273–3297, 1998.View at: Google Scholar
J. Fan and Q. Yao, Nonlinear Time Series: Nonparametric and Parametric Methods, Springer, New York, NY, USA, 2003.
A. P. Bradley, “The use of the area under the ROC curve in the evaluation of machine learning algorithms,” Pattern Recognition, vol. 30, no. 7, pp. 1145–1159, 1997.View at: Google Scholar
T. Pramila, W. Wu, W. Noble, and L. Breeden, “Periodic genes of the yeast Saccharomyces cerevisiae: a combined analysis of five cell cycle data sets,” 2007.View at: Google Scholar