BioMed Research International

Volume 2015 (2015), Article ID 689745, 14 pages

http://dx.doi.org/10.1155/2015/689745

## Shaped Singular Spectrum Analysis for Quantifying Gene Expression, with Application to the Early *Drosophila* Embryo

^{1}Faculty of Mathematics and Mechanics, St. Petersburg State University, Universitetsky Pr. 28, Peterhof, St. Petersburg 198504, Russia^{2}Mathematics Department, British Columbia Institute of Technology, 3700 Willingdon Avenue, Burnaby, BC, Canada V5G 3H2^{3}Computer Science and CEWIT, SUNY Stony Brook, 1500 Stony Brook Road, Stony Brook, NY 11794, USA^{4}The Sechenov Institute of Evolutionary Physiology & Biochemistry, Torez Pr. 44, St. Petersburg 194223, Russia

Received 4 July 2014; Revised 10 September 2014; Accepted 10 September 2014

Academic Editor: Hongwei Wang

Copyright © 2015 Alex Shlemov et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

In recent years, with the development of automated microscopy technologies, the volume and complexity of image data on gene expression have increased tremendously. The only way to analyze quantitatively and comprehensively such biological data is by developing and applying new sophisticated mathematical approaches. Here, we present extensions of 2D singular spectrum analysis (2D-SSA) for application to 2D and 3D datasets of embryo images. These extensions, circular and shaped 2D-SSA, are applied to gene expression in the nuclear layer just under the surface of the *Drosophila* (fruit fly) embryo. We consider the commonly used cylindrical projection of the ellipsoidal *Drosophila* embryo. We demonstrate how circular and shaped versions of 2D-SSA help to decompose expression data into identifiable components (such as trend and noise), as well as separating signals from different genes. Detection and improvement of under- and overcorrection in multichannel imaging is addressed, as well as the extraction and analysis of 3D features in 3D gene expression patterns.

#### 1. Introduction

While the availability of genome sequences has drastically revolutionized biological and biomedical research, our understanding of how genes encode regulatory mechanisms is still limited. Embryonic development depends critically on such regulatory mechanisms in order for cells to differentiate in the correct positions and at the correct times. Global understanding of gene regulation in development requires determining at cellular resolution in vivo when and where each gene is expressed. New dynamic, cellular resolution atlases will address the question of how gene transcription factors influence expression patterning [1].

With the development of automated microscopy technologies in recent years the volume and complexity of image data have increased to the level that it is no longer feasible to extract information without using computational tools. Biologists increasingly rely on computer scientists to come up with new solutions and software [2]. Such computational tools have been essential for processing the images generated by high-throughput microscopy of large numbers and varieties of biological samples under a variety of conditions. Recent advances in labeling, imaging, and computational image analysis are allowing quantitative measurements to be made more readily and in much greater detail in a range of organisms (e.g.,* Arabidopsis*,* Ciona*,* Drosophila*,* C. elegans*, mice,* Platynereis*, and zebrafish) [1, 3–6]. In particular, imaging of single intact small organisms, like* Drosophila* and* C. elegans*, is now feasible with high resolution in two dimensions, three dimensions, and across time, resulting in massive image data sets available for comprehensive computational analysis.

These large-scale quantitative data sets provide new insights to address many fundamental questions in developmental biology. The initial inputs for deriving quantitative information of gene expression and embryonic morphology are usually raw image data of stained fluorescent markers in fixed material. These raw image sets are then analyzed by computational algorithms that extract features such as cell location, cell shape, and gene product concentration. Ultimately, the most powerful way to analyze 3D spatial data in biology is by developing and applying new sophisticated mathematical approaches, allowing for the rigorous comparison of multiple quantitative features [8, 9].

In this publication, we introduce new computational tools to analyze gene patterning for three spatial dimension datasets, applied to early* Drosophila* embryos. These tools are an extension of two-dimensional singular spectrum analysis (2D-SSA).

*Introduction to the Method*. Singular spectrum analysis [10–15] was originally suggested as a method for decomposition of time series into a sum of identifiable components such as trend (or pattern), oscillations, and noise. One advantage of this method is that it does not need a noise model to be given a priori. We decompose the data series into a set of elementary series, analyze them, choose appropriate components, and finally sum the identifiable components together in classes. As an example, selection of smooth components can produce adaptive smoothing. SSA is very useful for exploratory analysis since the method can deal with modulated noise, that is, noise that can depend on trend values (e.g., has a multiplicative nature).

Recently SSA was extended for analysis of two-dimensional objects (2D-SSA), for example, digital images [16, 17]. Decomposition of images is more complicated compared to time series analysis due to variability of 2D patterns. But methods which are easily controlled and adaptive, such as 2D-SSA, can have broad applicability.

2D-SSA has much in common with the 2D-ESPRIT method (see [18]), which is based on the parametric form of images and has many applications. 2D-SSA and related subspace-based methods are applied in texture analysis [19], seismology [20], spatial gene expression data [21], and medical imaging [22].

The paper [23] applied 2D-SSA to the analysis of digital terrains in geology and demonstrated that 2D-SSA is a useful tool for analyzing different levels of details in surface data. Later, based on the theory given in [17], 2D-SSA was applied to gene expression data to separate nuclear noise from expression trend [21].

The papers [24, 25] present extensions of 2D-SSA which increase the range of SSA applications. In the present paper, we demonstrate how these extensions can be applied to analyzing gene expression data.

This paper is structured as follows. Section 2 describes the data sets which were analyzed. Section 3 describes the new methodology, and Sections 4 and 5 demonstrate the approach on several examples.

The new approaches described here, circular and shaped 2D-SSA, are particulary applicable to cylindrical surfaces (as used for* Drosophila* embryos), to avoid edge effects and patterns of irregular shape. For example, the area of good quality data in an image (e.g., without oversaturation) can be nonrectangular and even have gaps. Also, since the planar projection of a* Drosophila* embryo is nearly elliptical, the ability to analyze nonrectangular shapes can be useful.

Section 4 deals with the problem of detection and improvement of under- and overcorrection in multichannel imaging, while Section 5 considers the problem of analysis of stripe shapes for the even skipped gene. Section 6 contains a short discussion and conclusions.

#### 2. Materials

Data are taken from the Berkeley Drosophila Transcription Network Project (BDTNP) [4], which contains three-dimensional (3D) measurements of relative mRNA concentration for 95 genes in early development (including* snail* (*sna*)) and the protein expression patterns for four genes (bicoid, giant, hunchback (*hb*), and Krüppel (*Kr*)) during nuclear cleavage cycles 13 (C13) and 14 (C14A). BDTNP Release 2 contains individual datasets (PointCloud files) for 2830 embryos (http://bdtnp.lbl.gov/Fly-Net/bioimaging.jsp). These data were registered to the coordinates of 6078 nuclei on the embryo cortex and presented as an integrated dataset (VirtualEmbryo file, with tools for visualization and analysis). Embryos were fixed and fluorescently stained to label the mRNA expression patterns of two genes plus nuclear DNA. One of the genes stained was either even skipped (*eve*) or* fushi tarazu* (*ftz*), which were used as fiduciary markers for subsequent spatial registration.

#### 3. Methods

##### 3.1. 2D Singular Spectrum Analysis

We will follow the common structure of 2D-SSA algorithms described in [24, 25]. This common structure consists of embedding, decomposition, grouping, and reconstruction steps. Input for a 2D-SSA algorithm consists of an image and the shape of a moving window (which is the main algorithm parameter). The output of a 2D-SSA algorithm is the decomposition of into identifiable components of the form .

*Common Scheme of SSA-Like Algorithms*

*(**1) Embedding Step*. Construction of the trajectory matrix , where is a space of structured Hankel-like matrices. The structure of the matrix (and the space ) depends on the algorithm modification and on the moving window. Generally speaking, the columns of the trajectory matrix consist of the windows moving along the image, transformed to vectors by a fixed order of window elements. In a sense, the window size reflects the resolution of the method; that is, larger windows lead to more detailed decompositions.

*(**2) Decomposition Step*. Singular value decomposition (SVD) of the trajectory matrix . Here are so-called eigentriples (abbreviated as ET) and consist of singular values, left and right singular vectors of . The eigenvectors can be transformed back to the window form. This means that we can consider eigenvectors as images and call them eigenimages.

*(**3) Grouping Step*. Partition and grouping of summands in the SVD decomposition to obtain a grouped matrix decomposition , where . The grouping with is called elementary. The aim of this step is to group the SVD components to obtain an interpretable decomposition of the initial object. This can be performed by means of analysis of eigentriples.

*(**4) Reconstruction Step*. Decomposition of the initial image , where ; is the operator of projection on the space (e.g., hankelization in the 1D case); holds.

Let us explain the sense of the embedding operator for the 1D case, since it is simpler and demonstrates the general methodology. For a one-dimensional series , we take moving 1D windows of length and construct the columns of the trajectory matrix in the forms , , and so on. From these lagged vectors we gather a Hankel matrix with equal numbers on antidiagonals called the trajectory matrix

It is well known that Hankel matrices are related to series which consist of sums of products of polynomials, exponentials, and sine waves and the problem is to separate this sum into addends. If we can separate exponential and polynomial approximations from the residual, then we can extract trends and patterns. If we are able to separate sine waves with different frequencies, then we can construct a decomposition on components with different frequency ranges.

The singular value decomposition (SVD) of the trajectory matrix constructs a sequence of elementary matrices, which provides the best approximations of the initial matrix and, in a sense, of the initial series: , , and so on. Thus, we obtain the optimal decomposition, which is adaptive to the initial series. Note that the maximal number of the decomposition elements is equal to . SSA theory explains why we can group the elementary components in the SVD expansion to solve such problems as, for example, smooth approximation and extraction of regular oscillations.

After a proper grouping, we obtain a matrix , which is close to a Hankel matrix, but not exactly Hankel. We can find the Hankel matrix closest to by hankelization, that is, by averaging values by antidiagonals. Thus, we obtain the series consisting of , , , and so on. The* m*th term is determined as , where .

The role of is as follows. Small provides a decomposition to a small number of components, which mostly differ by frequency, and where the leading components present slowly varying series like the trend. Larger leads to more detailed decomposition. This gives more chance to extract a component; however, some components can mix. Therefore, if the data series has a trend with a complex form or has periodicities with complex modulation, then window lengths should be moderate.

These generalities also hold for the case of 2D-SSA. In practice, the difference between 1D and 2D is in the construction of the trajectory matrices, which are quasi-Hankel, in particular Hankel-block-Hankel. The moving window is two-dimensional, for example, a rectangle. In this paper, we introduce circular SSA, for treating rectangles with periodic boundary conditions, for example, data sets on cylindrical geometries. Small window size corresponds to smoothing. We can take into consideration the structure of the image in different directions by choosing different sizes in different directions. The trajectory matrix is constructed from vectorized windows of arbitrary shape moving within the whole image (including circular domains, for periodic boundary conditions).

##### 3.2. Particular Cases

For a rectangular image, with a rectangular window which moves within the image boundaries, we obtain the standard 2D-SSA method. If the image and the window are of arbitrary shape, the shaped version of 2D-SSA is applied [25]. If the window can cross the boundary of the image, we obtain a circular version of 2D-SSA.

For example, let us take an image (a matrix in the mathematical sense) and the window of size . Then we have a set of 4 windows in the ordinary version, , , , and , and two additional windows, , , in the circular case. For the circular case, the trajectory matrix will have the form

One can see that the 2D trajectory matrix consists of trajectory matrices from each matrix’s row.

##### 3.3. Choice of Parameters, Separability, and Component Identification

Approach to the choice of window size for one-dimensional time series is thoroughly described in [13, 26]. Recommendations for 2D objects are more complicated. For extraction of so-called objects of finite rank (sums of products of polynomials, exponentials, and sinusoids), which satisfy linear recurrence relations (LRRs), windows should be large, up to half of the object size. However, real-world patterns usually have complex form and satisfy LRRs only approximately and locally. The window needs to agree with this local character. In particular, sine waves are exactly governed by an LRR. However, if a 2D-sine wave has a slowly changing location, then only its local parts satisfy an LRR. The window sizes need to be in accordance with the scale of this locality. Choice of window size is always a balance between the local and the global scales of the data.

Generally, SSA can separate smooth patterns from noise for a wide variety of patterns. For regular patterns, 2D-SSA can be applied whether the pattern varies smoothly or sharply. However, if the pattern is not regular, variation needs to be smooth in order to use 2D-SSA for signal separation. Irregular pattern with sharp variation is poorly separated by 2D-SSA. If, however, the sharp change occurs in narrow area, this can be cut out, and the remaining data analyzed by shaped SSA, which is a version of 2D-SSA with a nonrectangular shape of the image or the window.

Elementary components are grouped based on their similarity to the data components being extracted. For regular components like sine waves, the number of elementary components can be calculated from theory. Also, patterns usually have a limited frequency range (usually lacking high frequencies). In general, therefore, leading elementary components with the appropriate frequency characteristics are ascribed to pattern.

In this paper we show how 2D-SSA can be used to remove noise, to separate regular oscillations from slowly varying patterns (for correcting erroneous unmixing procedures), and to extract stripes for their further analysis. Shaped SSA allows for the analysis of complex patterns by splitting images into several parts.

*Drosophila* early gene expression (before the midblastula transition) produces smooth and simple patterns suitable for 2D-SSA processing. A number of web resources have such datasets (BDTNP BID [4], Fly-FISH http://fly-fish.ccbr.utoronto.ca [27], FlyEx http://urchin.spbcas.ru/flyex [28]; see also [29, 30]). Shaped SSA can also be useful for a common subset of this data, in which patterns fall sharply to zero. In these cases, subregions can be excised or analyzed separately from the whole image. The gene* sna* is a typical* Drosophila* example seen in the BDTNP BID; such compact patterns are also seen in other experimental organisms, such as the nine zebrafish genes [31]. We expect 2D-SSA and shaped SSA to therefore have broad applicability to image processing in developmental biology.

The problem of unmixing expression patterns from two different genes in one image [32] requires additional conditions. Specifically, information is needed on the unmixed expression of each gene (i.e., data from one gene in the absence of the other gene). If the two genes have slowly varying patterns, they cannot readily be separated by SSA. In such cases, SSA cannot be used to detect or correct errors in mixed images. However, SSA is an effective unmixing method for cases in which one gene has an approximately regular structure, and this differs from the structure of the other gene. In this paper, we apply SSA to signal unmixing and image correction for such cases from* Drosophila* data.

##### 3.4. Data Preprocessing

Initially, the data for 2D-SSA analysis should be measured on a regular grid. Data for gene expression are measured at nuclei, which are not regularly located on a 3D surface of embryo (which is roughly ellipsoidal in shape). The first step of preprocessing is a cylindrical projection of the data (centred on the major axis of the ellipsoid; the major axis of the embryo is found by principal component analysis). We then interpolate the data to a regular grid on this cylinder. We analyze a central region of the cylinder, in order to avoid corruptions near the poles from the ellipsoid to cylinder transformation. After 2D-SSA decomposition, we interpolated the data back onto the nuclear centers. This interpolation is performed for smooth components; residuals are calculated as the difference between the initial data and interpolated smooth components.

Interpolation involves Delaunay triangulation followed by linear interpolation of nuclear centers to the triangulation.

##### 3.5. Implementation

The algorithms are implemented in the Rssa and BioSSA packages in* R*. Rssa is a general-purpose package containing effective implementation of singular spectrum analysis and its 2D extensions. 2D-SSA algorithms are time- and memory-consuming and therefore it is very important to have an effective implementation. A description of Rssa with examples can be found in [24, 33]. The* R*-package BioSSA is an addition to Rssa for application to fly embryo gene expressions data and is briefly described at http://biossa.github.io/.

#### 4. Periodic Patterns Produced by Unmixing Algorithms

Different emission spectra for fluorescent probes allows for the simultaneous staining for 3-4 gene products in embryonic tissues. Quantitative imaging projects [4, 30] use the same gene in one of these channels in all embryos, for reliable quantitative comparisons, registration, and so forth. The gene used for this marking in* Drosophila* embryos is commonly one of the pair-rule genes (such as* eve* or* ftz*), which have a characteristic periodic 7-stripe expression pattern.

Multichannel imaging suffers from an inherent problem of overlapping emission spectra (when the fluorescent markers are simultaneously excited (e.g., [34])), where light from more than one fluorescent dye is collected by a given acquisition channel. To computationally reduce this “crosstalk,” an automated channel unmixing method was developed and applied to the BDTNP data [32].

The problem with this approach in large scale projects with automatic data processing is that the unmixing parameters can end up being too high or too low. If the parameters are overestimated, unmixing produces an overcorrection, which is manifest as a partial subtraction of the common, reference pattern from the pattern of the second gene (the gene under study for the embryo). With periodic reference patterns (*eve*,* ftz*), this produces periodic grooves in the “unmixed” pattern. Figure 1 shows the effects of such overcorrection in one of the BDTNP embryos.