Abstract

We present an investigation of gene expression profiles by means of singular spectrum analysis (SSA). The biological problem under investigation is the decomposition of bicoid protein profiles of Drosophila melanogaster into the sum of a signal and noise, where the signal consists of an exponential-in-distance pattern and a close-to-constant nonspecific component, or “background.” The signal processing problems addressed are (i) trend extraction from a noisy signal, (ii) batch processing of a set of similar series, and (iii) analytical approximation of the signal components by a sum of exponential and constant-like functions. The proposed methods are evaluated on a test set of 17 series.

1. Introduction

Singular spectrum analysis is a method for decomposing a sequence of measurements (usually a time series) into a sum of interpretable components, such as a trend, cycles, and noise [1]. SSA is well established in the geosciences and is giving promising results in other areas (see http://www.math.uni-bremen.de/~theodore/ssawiki/). This work presents the use of SSA for signal extraction from one-dimensional spatial gene expression data. SSA was chosen as a compromise between parametric methods such as regression, which can produce wrong results if the assumed model is not valid, and frequency-domain methods such as filtering.

The study of the activity of diverse genes has become one of the key approaches in modern functional genomics and is crucial for our understanding of embryo development. The expression of genes is traced either in time or, as in our case, in space, along different tissues and organs or even a whole embryo. Research on gene expression is ultimately aimed at biomedical problems, but it is first systematically tested and developed on so-called model organisms. D. melanogaster (fruit fly) is one such organism, and the gene ensemble governing the early events of fly embryo segmentation is one of the best-studied genetic networks. This network of cross-regulating genes produces complicated patterns of its products, the segmentation factors, and these patterns direct the developmental processes of the embryo. Exponential-in-distance patterns are common at the very beginning of segmentation, and the primary morphogenetic gradient of the protein bicoid is the best known and most thoroughly studied of them [2, 3].

The biological problem under investigation is the extraction of a signal from the noisy bicoid protein profile. Following [2], we assume that bicoid generates an exponential pattern. The measured protein profile also contains a smooth residual referred to as “background” [4] and the measurement noise. In general, the form of the background is still an open question [2, 4]. Moreover, it includes an unknown additive function depending on the confocal microscopy settings used. In this paper, we extend the model of [2] by allowing the background to differ from a constant. The problem is complicated by the fact that (i) the data contain outliers and (ii) the data are very noisy and the noise has unknown structure. Although the noise appears to be multiplicative, our study showed that this holds only as a very rough approximation.

The SSA features demonstrated are as follows: (i) signal extraction with no parametric models of signal or noise specified; (ii) robustness to outliers; (iii) the ability to take a parametric model of the signal into account, if specified; (iv) interactive analysis with control over the quality of separation of signal and noise; (v) batch processing of a set of similar series; (vi) derivation of an analytical formula for the signal.

Section 2 introduces the data, SSA, and the related methods used. The results of data processing are presented in Section 3. Finally, the conclusions are provided.

2. Methods and Approaches

Biological Data
The expression level of the protein in wild-type fruit fly embryos (D. melanogaster, Oregon-R) was measured using fluorescently tagged antibodies. For each embryo, a 1024 × 1024 pixel image with 8 bits of fluorescence data was obtained. Image processing transforms each image into an ASCII table containing a series of records (fluorescence intensities), one for each nucleus. About 1100–1300 nuclei are obtained from each image. Each nucleus is characterized by a unique identification number, the anteroposterior (AP) and dorsoventral (DV) coordinates of its centroid, and the average fluorescence level of the gene product. Because the expression of segmentation genes is largely a function of position along the AP axis, it is natural to use the AP profiles of gene expression. We use straightened data from a rectangle of 50% of the DV height of the embryo, centred on the AP axis; for details, see [2]. This captures approximately 700–800 nuclei. As in [2], we investigate only the interval of the AP coordinate between 20 and 80% egg length (%EL). For examples of plots of nuclear intensity versus AP position, see Figures 1(a) and 1(c). The considered test set consists of 17 embryo images belonging to cleavage cycle 13 and thoroughly studied in [2]. The data and software used in this study are available at http://www.math.uni-bremen.de/~theodore/GENESSA/. For a description of the embryos, see [5], http://flyex.ams.sunysb.edu/FlyEx/, or http://urchin.spbcas.ru/flyex/.

SSA
Let us describe the basic SSA algorithm for extraction of a signal from a one-dimensional series $F = (f_0, \ldots, f_{N-1})$. For the given data, $f_n$ represents the intensity measured at the $n$th nucleus inside the considered interval of 20%EL–80%EL. The first step of SSA has a single parameter $L$, the window length, $1 < L < N$, and consists of the construction of the Hankel matrix of size $L \times K$, $K = N - L + 1$, with column vectors $X_j = (f_{j-1}, \ldots, f_{j+L-2})^{\mathrm{T}}$, $j = 1, \ldots, K$; this matrix is called the trajectory matrix. Note that there is a one-to-one relation between series of length $N$ and Hankel matrices of size $L \times K$: each secondary diagonal of a Hankel matrix has equal values and produces one term of the series. The trajectory matrix is then decomposed into the sum of ordered elementary matrices, $\mathbf{X} = \sum_{i=1}^{d} \mathbf{X}_i$, where $\mathbf{X}_i = \sqrt{\lambda_i}\, U_i V_i^{\mathrm{T}}$, the $\lambda_i$ are the nonzero eigenvalues of $\mathbf{X}\mathbf{X}^{\mathrm{T}}$ in decreasing order, the $U_i$ are the corresponding eigenvectors, and the $V_i$ are the factor vectors. This is the so-called singular value decomposition (SVD), and each SVD component generates an elementary reconstructed component (elementary RC) of the series $F$ as follows. The matrix $\mathbf{X}_i$ is hankelized by averaging the entries with indices satisfying $i + j = \mathrm{const}$, and the corresponding series of length $N$ is obtained through the above-mentioned one-to-one relation. Thus, we decompose $F$ into the sum of elementary RCs, $F = \widetilde{F}_1 + \cdots + \widetilde{F}_d$, where $d$ is the number of nonzero eigenvalues $\lambda_i$ (the so-called SSA rank of $F$). Then, we choose a group $\mathcal{J}$ of $r$ indices of the desirable components of $F$ (the signal in our case) and gather the reconstructed signal as $\widetilde{F} = \sum_{i \in \mathcal{J}} \widetilde{F}_i$.
The signal extraction problem is thus reduced to (i) the choice of the window length $L$ and (ii) the selection of the subgroup $\mathcal{J}$ of SVD components for reconstruction. These questions are briefly discussed in the next paragraph.
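For readers who wish to experiment with the procedure, a minimal sketch in Python is given below (this is our own illustration, not the software distributed with the paper; the function name ssa_reconstruct and the toy data are hypothetical). It implements the three steps just described: embedding into the trajectory matrix, SVD, and hankelization of a selected group of components.

```python
import numpy as np

def ssa_reconstruct(f, L, group):
    """Basic SSA: embed the series f with window length L, compute the SVD of
    the trajectory matrix, and reconstruct the series from the SVD components
    listed in `group` (0-based indices) by diagonal (Hankel) averaging."""
    f = np.asarray(f, dtype=float)
    N = len(f)
    K = N - L + 1
    # Trajectory (Hankel) matrix of size L x K
    X = np.column_stack([f[j:j + L] for j in range(K)])
    # SVD: X = sum_i sqrt(lambda_i) * U_i * V_i^T
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Sum of the selected elementary matrices
    Xg = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in group)
    # Hankelization: average entries with equal i + j to get a series of length N
    rec = np.zeros(N)
    counts = np.zeros(N)
    for i in range(L):
        rec[i:i + K] += Xg[i, :]
        counts[i:i + K] += 1
    return rec / counts

# Toy usage: a noisy exponential-plus-constant profile
n = np.arange(500)
series = 200 * 0.99 ** n + 20 + np.random.normal(scale=5, size=n.size)
trend = ssa_reconstruct(series, L=250, group=[0, 1])
```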

Trend Extraction in SSA
SSA needs no a priori specification of models for the series or the signal, neither deterministic nor stochastic. In this paper, we are interested in the extraction of a slowly varying signal, usually called the trend; hereinafter we therefore speak of the trend rather than the signal.
Generally, SSA is able to extract trends of different kinds. Note that any trend can be approximated by a finite-rank series, as the class of finite-rank series includes all sums of products of polynomials, exponentials, and sinusoids. Let us assume that the trend is (or is approximated by) a series of rank $r$. With large enough $N$ and $L$ ($L \approx N/2$), the trend is separable from the noise and is reconstructed from the $r$ leading SVD components. The subspace spanned by the $r$ corresponding eigenvectors contains information about the finite-rank structure and, in particular, allows one to derive an approximate analytical formula for the trend. If $N$ is not large enough (for strong noise or a high rank $r$) and separability is poor, the trend can still be extracted using a small $L$. In this case, the trend is captured by a few leading components, and SSA works like a smoothing adaptive linear filter. However, the subspace spanned by the corresponding eigenvectors does not reflect the finite-rank structure of the trend and is liable to be affected by noise and outliers.
Grouping of SVD components is based on the fact that a slowly varying component of the series generates slowly varying eigenvectors and factor vectors [1, 6] and is therefore composed of similar elementary RCs. Thus, identification of the components of a trend consists of the identification (visual or automatic) of slowly varying eigenvectors, factor vectors, or elementary RCs.

AutoSSA for Trend Extraction
Here we present a method of identification of slowly varying eigenvectors. This method is easy to use as it has only two parameters (if the window length is fixed).
Firstly, we introduce the periodogram $I_Y(\omega)$ of a vector $Y \in \mathbb{R}^M$, $Y = (y_0, \ldots, y_{M-1})^{\mathrm{T}}$: $I_Y(k/M) = (1/M)\,\bigl|\sum_{n=0}^{M-1} e^{-\mathrm{i}2\pi nk/M}\, y_n\bigr|^2$, $k = 0, \ldots, \lfloor M/2 \rfloor$, which can be interpreted as the contribution of the frequency $k/M$. The cumulative contribution is evaluated as $\pi_Y(\omega) = \sum_{k:\; 0 \le k/M \le \omega} I_Y(k/M)$, $\omega \in [0, 0.5]$. For $\omega_0 \in (0, 0.5)$, the contribution of low frequencies to $Y$ is defined as $\mathcal{C}(Y, \omega_0) = \pi_Y(\omega_0)/\pi_Y(0.5)$.
Let us consider the eigenvectors $U_i$. Then, given $\omega_0 \in (0, 0.5)$ and $\mathcal{C}_0 \in [0, 1]$, we select the SVD components whose eigenvectors satisfy $\mathcal{C}(U_i, \omega_0) \ge \mathcal{C}_0$. One may interpret this method as the selection of SVD components characterized mostly by low-frequency fluctuations.
The low-frequency boundary $\omega_0$ controls the scale of the extracted trend: the lower $\omega_0$ is, the more slowly the trend varies. The parameter $\mathcal{C}_0$ regulates the acceptable share of higher frequencies in the extracted component and is usually chosen close to 1. For more details on selecting $\omega_0$ and for a description of a fitting procedure for choosing $\mathcal{C}_0$, see [6].
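The criterion can be sketched in a few lines of Python (our own illustration; the helper names are hypothetical and not part of the AutoSSA software): compute the periodogram of an eigenvector and take the share of its energy falling into the low-frequency band $[0, \omega_0]$.

```python
import numpy as np

def low_freq_contribution(y, omega0):
    """C(Y, omega0): share of the periodogram values I_Y(k/M), k = 0..floor(M/2),
    that falls into the frequency band [0, omega0]."""
    y = np.asarray(y, dtype=float)
    M = len(y)
    spec = np.abs(np.fft.rfft(y)) ** 2 / M   # periodogram values I_Y(k/M)
    freqs = np.arange(len(spec)) / M         # frequencies k/M, 0 <= k/M <= 0.5
    return spec[freqs <= omega0].sum() / spec.sum()

def select_slow_components(U, omega0, C0):
    """Indices i of eigenvectors (columns of U) with C(U_i, omega0) >= C0."""
    return [i for i in range(U.shape[1])
            if low_freq_contribution(U[:, i], omega0) >= C0]
```

In Section 3, the values $\omega_0 = 5/L$ and $\mathcal{C}_0 = 0.85$ are used for this selection.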

Derivation of the Analytical Form of a Component
A series consisting of a sum of exponentials and sinusoids has an SSA rank $r$ and can be represented in complex-valued form as $g_n = \sum_{j=1}^{r} C_j \mu_j^n$. The subspace spanned by the $r$ leading eigenvectors of the trajectory matrix determines the values $\mu_j$. Methods that calculate the $\mu_j$ through this subspace are called subspace methods. Real-valued $\mu_j$ correspond to exponential components; complex-conjugate pairs of $\mu_j$ generate sinusoids. For a noisy signal, the subspace of the signal can be estimated using SSA as the space spanned by the eigenvectors $\{U_i\}_{i \in \mathcal{J}}$. Subspace methods for the calculation of the exponentials $\mu_j$ (mostly complex-valued) have been known for a long time. Among their advantages are (i) high resolution (estimation of close frequencies of summand sinusoids), (ii) robustness to outliers, and (iii) little required prior information (no signal model specified, only its rank $r$).
In this paper, we use the ESPRIT method [7], chosen to illustrate the application of subspace methods. Note that there exist modifications of ESPRIT yielding more precise results, for example, weighted or total least-squares ESPRIT. Once the $\mu_j$ are estimated, the coefficients $C_j$ can be computed by one of the least-squares methods. In terms of SSA, ESPRIT exploits the rotational invariance property of the signal subspace found by SSA.
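A compact sketch of least-squares ESPRIT is given below (our own illustration of the rotational invariance property; it is not the implementation used in the paper). The $\mu_j$ are obtained as the eigenvalues of the matrix mapping the signal eigenvectors with the last row removed onto the same eigenvectors with the first row removed, and the coefficients $C_j$ follow by ordinary least squares.

```python
import numpy as np

def esprit(f, L, r):
    """Estimate mu_1..mu_r and C_1..C_r of a signal g_n = sum_j C_j * mu_j**n
    observed with noise, using the r leading left singular vectors of the
    trajectory matrix as an estimate of the signal subspace."""
    f = np.asarray(f, dtype=float)
    N = len(f)
    K = N - L + 1
    X = np.column_stack([f[j:j + L] for j in range(K)])
    U = np.linalg.svd(X, full_matrices=False)[0][:, :r]    # signal subspace basis
    # Rotational invariance: U[:-1] @ Psi ~= U[1:]  (least-squares ESPRIT)
    Psi = np.linalg.lstsq(U[:-1], U[1:], rcond=None)[0]
    mu = np.linalg.eigvals(Psi)
    # Coefficients C_j by least squares against the columns mu_j**n, n = 0..N-1
    V = np.vander(mu, N, increasing=True).T                # shape (N, r)
    C = np.linalg.lstsq(V, f.astype(complex), rcond=None)[0]
    return mu, C
```

For the profiles considered in Section 3, the rank is taken as $r = 2$ and both roots $\mu_j$ turn out to be real, giving a sum of two real exponentials.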

3. Results and Discussion

3.1. Methodology

An Ideal Case (Trend Is Separable from Noise with $L = N/2$)
Let us consider the data ms19 (data from a single embryo) as an example (see Figure 1(a)). The only parameter of SSA is the window length $L$. SSA theory [1] implies that, for better separability between components of a given series, one should choose $L$ close to the half-length $N/2$. Having performed the SVD with $L = 246 \approx N/2$, we visually examine the elementary RCs produced (see Figure 1(b)). As only the two leading elementary RCs vary slowly, we reconstruct the trend from the two leading SVD components. The resulting trend is depicted in Figure 1(a). One can easily see that the result reasonably reconstructs the trend and at the same time is robust to the apparent outliers.

A Bad Case (Trend Is Not Separable from Noise with $L = N/2$)
However, in some cases the trend cannot be separated from noise or cyclic components. For the data ac2 (see Figure 1(c)) with $L = 285 = N/2$, the third and fourth elementary RCs contain both trend and noise (see Figure 1(d)). The contribution of these SVD components is small, and they can be omitted for a tentative trend reconstruction. But for background estimation, the loss of even 1–2 intensity units is sizeable. Let us consider two additional tricks which help to reconstruct the trend more precisely.

Use of Small Window Length
The trend and noise components can be mixed with each other because of a complicated trend shape or strong noise; the latter is what we observe in our study. The first strategy for coping with the mixing is to choose a small window length $L$. With a small $L$, SSA works like a smoothing adaptive linear filter. With $L = 35$ for ac2, we obtain a single leading SVD component corresponding to the trend. Owing to the small window length, this SVD component includes all terms of the trend. This method is suitable for different kinds of signals and overcomes the mixing of trend and noise. However, the resulting decomposition does not allow us to split the trend into the pattern and the background.

Improvement of Separability by the Addition of a Constant
Recall that we suppose the trend to be the sum of an exponential pattern and an almost constant background, where the latter is approximated by an exponential function with a small rate. In this particular case, another trend extraction strategy can therefore be exploited. Such a trend generates two SVD components [1]. In order to enlarge the contribution of the second SVD component and thereby to reduce its mixture with the rest, we add a constant to the given data, thus artificially increasing the background. After that, we use the theoretically best window length $L = N/2$ with no mixing effect. This strategy helps greatly: having added $A = 50$ to the given data, the results for ac2 (with $L = 285$) become visually the same as in the good case ms19 depicted in Figure 1(b). The value $A = 50$ was chosen to provide separability for all series from the considered test set and therefore to allow us to extract trends that split into patterns and backgrounds. As for ms19, smaller values of $A$ that are sufficient for separability can be used.
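The two tricks can be sketched as follows (our own illustration, reusing the hypothetical ssa_reconstruct helper from Section 2; the file name is a placeholder for the ac2 profile):

```python
import numpy as np

ac2 = np.loadtxt("ac2.txt")     # hypothetical file with the ac2 intensity profile
N = len(ac2)

# Trick 1: small window length. SSA acts as a smoothing adaptive filter and
# the whole trend is captured by the single leading component.
trend_smooth = ssa_reconstruct(ac2, L=35, group=[0])

# Trick 2: add a constant A to strengthen the background, use L = N/2,
# reconstruct from the two leading components, then subtract A.
A = 50
trend_split = ssa_reconstruct(ac2 + A, L=N // 2, group=[0, 1]) - A
```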

3.2. Batch Identification of Trend Components

Above we investigated the properties of the SSA representation of bicoid gene expression data, which contain an exponential pattern and a close-to-constant background. Let us apply AutoSSA to batch-process the whole dataset, taking into account the experience of the previous section. First, we set the parameter $\omega_0$. As mentioned above, $\omega_0$ defines the low-frequency interval $[0, \omega_0]$. Examining the eigenvector periodograms, we can guess $\omega_0$ as a value bounding the interval of large periodogram values next to the zero frequency. Taking the data ms19 as an example, we consider the periodograms of its eigenvectors ($L = N/2$). Exactly the two leading SVD components of ms19 are to be identified. The first six frequencies $0, 1/L, \ldots, 5/L$ of the periodograms of both eigenvectors contain the prevailing contribution; thus, we select $\omega_0 = 5/L$. For five randomly selected test series (ad36, as27, cb23, hx8, iz13), the procedure for the choice of $\mathcal{C}_0$ presented in [6] ($\mathcal{C}_{\min} = 0.5$, $\mathcal{C}_{\max} = 1$, $\Delta\mathcal{C} = 0.01$, $\Delta = 0.01$) produces the following values of $\mathcal{C}_0$: 0.85, 0.87, 0.88, 0.85, 0.88, of which the smallest, $\mathcal{C}_0 = 0.85$, is selected. AutoSSA with $\omega_0 = 5/L$ and $\mathcal{C}_0 = 0.85$ identifies the same SVD components as those identified visually above: the two leading SVD components for ms19 ($L = N/2$) and for ac2 increased by $A = 50$ ($L = N/2$), as well as only the leading component for ac2 with $L = 35$.

Having added $A = 50$ to all series from the given set, AutoSSA identifies exactly the two leading components in each of them. A visual check of the identified eigenvectors confirms their slowly varying shape; this substantiates the use of AutoSSA, especially for these data, where the addition of a constant has enhanced the separability. Moreover, this uniformity of results over the whole dataset allows us to derive the analytical approximation from these SVD components.
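A batch-processing loop in the spirit of this section might look as follows (a sketch under our assumptions, reusing the hypothetical helpers introduced above; series names and file names are placeholders):

```python
import numpy as np

A = 50        # constant added to enhance separability
C0 = 0.85     # threshold for the low-frequency contribution

trends = {}
for name in ["ms19", "ac2", "ad36", "as27", "cb23", "hx8", "iz13"]:  # etc.
    f = np.loadtxt(name + ".txt") + A
    N = len(f)
    L = N // 2
    omega0 = 5 / L
    # Eigenvectors of the trajectory matrix (left singular vectors)
    X = np.column_stack([f[j:j + L] for j in range(N - L + 1)])
    U = np.linalg.svd(X, full_matrices=False)[0]
    group = select_slow_components(U, omega0, C0)
    trends[name] = ssa_reconstruct(f, L, group) - A
```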

3.3. Analytical Trend Approximation: Exponential Pattern Plus Background

Let us consider a rank-two approximation of the trend. A rank-two series is either an exponentially modulated sinusoid or a sum of two real exponentials [1], and a constant function is a special case of an exponential. Note that we specify no approximation model (of the two given above) but only the rank. For the data ac2, the resulting formula for the trend is $\tilde{g}_n = 79.29 \cdot 0.994^n + (59.88 \cdot 0.9998^n - 50)$, where $n$ runs through the nucleus numbers in the considered AP interval and the subtraction of 50 removes the constant $A$ added before processing. After transformation from the nucleus number $n$ to the AP coordinate $x$, we obtain the approximation for the exponential pattern $p(x)$ and the background $b(x)$: $p(x) = 302.35\, e^{-0.062 x}$ and $b(x) = 11.7 - 0.0865\, x$ (see Figure 2).
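The transformation from the nucleus number $n$ to the AP coordinate $x$ can be made explicit under the simplifying assumption (ours) that the nuclei in the profile are approximately equally spaced along the AP axis, $x = x_0 + \Delta\, n$ with $x_0 = 20$%EL:

```latex
\[
  C\,\mu^{n} \;=\; C\,e^{n\ln\mu}
            \;=\; C\,e^{\frac{\ln\mu}{\Delta}(x-x_0)}
            \;=\; \underbrace{C\,e^{-x_0\ln\mu/\Delta}}_{C'}\,e^{kx},
  \qquad k=\frac{\ln\mu}{\Delta}.
\]
```

For $\mu = 0.994$ and a spacing $\Delta$ of roughly 0.1%EL per nucleus, this gives $k \approx -0.06$, of the same order as the rate $-0.062$ in the pattern above.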

3.4. Summary Results of the Analytical Approximation

The considered dataset contains 17 series, and we extracted exponential patterns using ESPRIT for all of them, after increasing each series by $A = 50$ in advance. The fact that the patterns tend to zero close to the posterior end is in agreement with the biological interpretation of the bicoid gradient. To generate reference results, we fitted the curve $g_n = C e^{Ln} + B$ to each original series using the least-squares (LS) method, as in [2]. The patterns produced with ESPRIT and with LS fitting are very similar, which confirms the potential of ESPRIT for extracting exponential patterns without fixing the model of a constant background. Both the MATLAB function fminsearch (v.7.4) and the nonlinear estimation module of STATISTICA (v.6.0) with randomly chosen initial values produce substantially incorrect results; thus, the initial values used are crucial. It turns out that with the ESPRIT estimates of $C$ and $L$ and with $B = 1$ as the initial values, the LS estimation procedure becomes stable and precise. This shows the usefulness of ESPRIT in combination with LS methods.
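The combination can be sketched as follows (our own illustration using SciPy rather than the MATLAB or STATISTICA tools mentioned above; the esprit helper from Section 2 is hypothetical): the nonlinear LS fit of $C e^{Ln} + B$ is started from the ESPRIT estimates of $C$ and $L$ and from $B = 1$.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_plus_const(n, C, L, B):
    """Reference model with a constant background: C * exp(L * n) + B."""
    return C * np.exp(L * n) + B

def fit_with_esprit_init(f, window, rank=2):
    """Nonlinear LS fit of a profile, initialized from ESPRIT estimates."""
    n = np.arange(len(f))
    mu, coef = esprit(f, L=window, r=rank)
    k = int(np.argmin(mu.real))                  # fastest-decaying (pattern) root
    C0, L0 = coef[k].real, np.log(mu[k].real)
    popt, _ = curve_fit(exp_plus_const, n, f, p0=[C0, L0, 1.0])
    return popt                                  # fitted C, L, B
```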

The SSA-based procedure presented here is more flexible than the usual bicoid profile modeling with a constant background. On the other hand, since the simultaneous estimation of the parameters of two exponentials is in general less stable, the variation of the resulting pattern parameters can potentially be larger than in the model with a constant background. However, even for modeling with a fixed shape of the background, SSA can be useful for setting the initial values of the corresponding fitting procedure.

4. Conclusions

First, we developed an SSA-based technique for signal extraction from one-dimensional spatial gene expression profiles containing exponential-in-distance patterns and constant-like backgrounds. The obtained results are consistent with the state-of-the-art results for the given data, even though the data contain strong noise and outliers and we do not assume the models of the profile, pattern, or background to be known a priori. Moreover, the feasibility of batch processing of the given data using AutoSSA is demonstrated.

Second, using the SSA-related method ESPRIT, we obtained an analytical representation of the signal as a sum of two exponential functions. The first is the well-known exponential pattern of the bicoid protein, and the second is the background, approximated in our case by an exponential or linear function. The employed method produces stable parameter estimates, even for noisy series. Moreover, these estimates can be used as initial values for nonlinear least-squares fitting procedures in which the model is assumed a priori.

Acknowledgments

The authors thank David Holloway for comments and Alexey Timofeyev for ESPRIT software. This work was supported by the NSF/NIGMS BioMath 1-R01-GM072022 and CRDF RUB1-1643-ST-06 Grants.