Mining 3D Patterns from Gene Expression Temporal Data: A New Tricluster Evaluation Measure

Gutiérrez-Avilés, David; Rubio-Escudero, Cristina

doi:https://doi.org/10.1155/2014/624371

The Scientific World Journal

Special Issue: Emerging Trends in Soft Computing Models in Bioinformatics and Biomedicine

Research Article | Open Access

Volume 2014 | Article ID 624371 | https://doi.org/10.1155/2014/624371

David Gutiérrez-Avilés¹ and Cristina Rubio-Escudero¹

Academic Editor: V. Bhatnagar, S. Balochian, Y. Zhang

Received 28 Dec 2013

Accepted 26 Feb 2014

Published 31 Mar 2014

Abstract

Microarrays have revolutionized biotechnological research. The analysis of the new data generated represents a computational challenge due to the characteristics of these data. Clustering techniques are applied to create groups of genes that exhibit similar behavior. Biclustering emerges as a valuable tool for microarray data analysis since it relaxes the constraints for grouping, allowing genes to be evaluated only under a subset of the conditions. However, if a third dimension appears in the data, triclustering is the appropriate tool for the analysis. This occurs in longitudinal experiments in which the genes are evaluated under conditions at several time points. All clustering, biclustering, and triclustering techniques guide their search for solutions by a measure that evaluates the quality of clusters. We present an evaluation measure for triclusters called Mean Square Residue 3D. This measure is based on the classic biclustering measure Mean Square Residue. Mean Square Residue 3D has been applied to both synthetic and real data and has proved capable of extracting groups of genes with homogeneous patterns in subsets of conditions and times. These groups show a high correlation level and are also related to their functional annotations extracted from the Gene Ontology project.

1. Introduction

The use of high throughput processing techniques has revolutionized technological research and has exponentially increased the amount of data available [1]. In particular, microarrays have revolutionized biological research through their ability to monitor changes in RNA concentration in thousands of genes simultaneously [2].

A common practice when analyzing gene expression data is to apply clustering techniques, creating groups of genes that exhibit similar expression patterns [3]. These clusters are interesting because it is considered that genes with similar behavior patterns can be involved in similar regulatory processes [4]. Although in theory there is a big step from correlation to functional similarity of genes, several articles indicate that this relation exists [5].

Traditional clustering algorithms work on the whole space of data dimensions, examining each gene in the dataset under all conditions tested. However, the activity of genes might appear only under a particular set of experimental conditions, exhibiting local patterns. Discovering these local patterns can be the key to discovering gene pathways that could be hard to find in other ways. For this reason, the clustering paradigm must shift to methods that allow local pattern discovery in gene expression data [6].

Biclustering [7] addresses this problem by relaxing the conditions and by allowing assessment only under a subset of the conditions of the experiment, and it has proved to be successful in finding gene patterns [8]. However, if a time dimension is added to the dataset, clustering and biclustering are insufficient. There is a lot of interest in temporal experiments because they allow an in-depth analysis of molecular processes in which the time evolution is important, for example, cell cycles, development at the molecular level, or the evolution of diseases [9]. In this sense, triclustering appears as a valuable tool since it allows for the assessment of genes under a subset of the conditions of the experiment and under a subset of times.

All clustering, biclustering, and triclustering techniques guide their search for solutions by a measure that evaluates the quality of clusters [10]. In this work we propose an evaluation measure for triclusters called Mean Square Residue 3D (MSR_3D). This measure is based on a classic biclustering measure presented by Cheng and Church in [11] called Mean Square Residue (MSR). MSR measures the homogeneity of a bicluster by relating each value in the bicluster to the average over all genes, the average over all conditions, and the average over all genes and conditions in the bicluster. A perfect score would be zero, which represents a constant bicluster of elements of a single value.

Our proposal, MSR_3D, is an adaptation of MSR to the three-dimensional space, so that a third factor, in this case time, can be taken into account. MSR_3D measures the homogeneity of a tricluster by relating each value of the tricluster to the average over all genes, the average over all conditions, the average over all times, the average over all genes and conditions, the average over all genes and times, the average over all conditions and times, and the average over all genes, conditions, and times in the tricluster. As for MSR, a perfect score would be zero, which represents a constant tricluster of elements of a single value.

MSR_3D has been applied as an evaluation measure along with the TriGen (Triclustering-Genetic based) algorithm presented in [12]. TriGen is based on an evolutionary heuristic, genetic algorithms. Many heuristic approaches have been proposed for both biclustering and triclustering algorithms [13, 14], due to the NP-hard nature of the problem [15].

We show the results obtained from applying the TriGen algorithm along with the MSR_3D measure to a synthetic dataset and four real experimental datasets: the yeast cell cycle regulated genes [16], mouse degeneration of retinal cells [17], mouse ectopic bHLH transcription factor expression Mesogenin1 effect on embryoid bodies [17], and human transcription factor oncogene OTX2 silencing effect on D425 medulloblastoma cell line [17].

The results have been validated by analyzing the correlation among the genes, conditions, and times in each tricluster using two different correlation measures: Pearson and Filon [18] and Spearman [19]. Besides this, we have provided functional annotations for the genes extracted from the Gene Ontology project [20].

The rest of the paper is structured as follows. A review of the latest related works can be found in Section 2. Section 3 describes the methodology of the MSR and MSR_3D measures as well as a brief description of the TriGen algorithm. In Section 4 we show the results of applying TriGen to the synthetic and real datasets. Section 5 shows the conclusions.

2. State of the Art

This section provides a general overview of recent works in the field of gene expression temporal data. In particular, for those works related to the application of triclustering, we focus on the measures applied to evaluate the triclusters.

In 2005, Zhao and Zaki [21] introduced the triCluster algorithm to extract patterns in 3D gene expression data. They presented a measure to assess triclusters' quality based on the symmetry property. This allows very efficient cluster mining since clusters are searched over the dimensions with the least cardinality. The triclusters have to fulfill several requirements: they must be maximal, that is, no tricluster in the set of solutions is totally included in another tricluster in the set of solutions; the ratio of every pair of columns in the tricluster is bounded by a given threshold; the maximum volume of the tricluster is determined by the relation among per-dimension thresholds for the gene, condition, and time dimensions, respectively; and the minimum volume of the tricluster is also controlled. An extended and generalized version of this proposal, g-triCluster, was published one year later [22]. The authors claimed that the symmetry property is not suitable for all patterns present in biological data and proposed the Spearman rank correlation [19] as a more appropriate tricluster evaluation measure.

An evolutionary computation proposal was made in [23]. The fitness function defined is a multiobjective measure which tries to optimize three conflicting objectives: clusters size, homogeneity, and gene-dimension variance of the 3D cluster.

LagMiner was introduced in [24] to find time-lagged 3D clusters, which allows in turn finding regulatory relationships among genes. It is based on a novel 3D cluster model called Cluster. They evaluated their triclusters on homogeneity, regulation, minimum gene number, sample subspace size, and time periods length.

Wang et al. [25] proposed a new algorithm called ts-cluster basing their definition for coherent triclusters also on finding regulatory relationships among genes. For that purpose, time shifting is also considered among time points in the evaluated triclusters.

A new strategy to mine 3D clusters in real-valued data was introduced in [26]. The authors defined the Correlated 3D Subspace Clusters (CSCs) where the values in each cluster must have high cooccurrences and those cooccurrences are not by chance. They measure the clusters based on the correlation information measure, which takes into account both prerequisites. In particular, the authors were concerned about discovering subspaces with a significant number of items, one of the main problems typically found in tricluster-based approaches. At the same conference, another approach was presented focusing on the concept of Low-Variance 3-Cluster [27], which obeys the constraint of a low-variance distribution of cell values.

The work in [28] was focused on finding Temporal Dependency Association Rules, which relate patterns of behaviour among genes. The rules obtained are to represent regulated relations among genes.

Finally, a brief survey on triclustering applied to gene expression time series was published in 2011 [29].

3. Methodology

In this section we first describe what triclustering is in relation to biclustering. Second, we present the basis of our proposal, the two-dimensional MSR measure proposed by Cheng and Church [11] to assess the quality of biclusters grouping genes and conditions. Third, we give a detailed description of our proposal, the three-dimensional MSR measure (MSR_3D), which assesses the quality of triclusters grouping genes, conditions, and times. Finally, we describe TriGen, the genetic algorithm into which the MSR_3D measure has been integrated for testing.

3.1. Triclustering

Given a dataset containing information from gene expression data organized in rows/columns (genes as rows and conditions as columns), biclustering finds subgroups of genes and conditions where the genes exhibit highly correlated patterns of behavior for every condition [30].

A bicluster $BC$ can be defined as a subset from a dataset $D$ which contains information related to the behavior of some genes $G$ under certain conditions $C$. The bicluster is formally defined as $BC = G_{BC} \times C_{BC}$, where $G_{BC} \subseteq G$ and $C_{BC} \subseteq C$.

Triclustering appears as an evolution of biclustering due to its capacity to mine gene expression datasets involving time as a third dimension and to find subgroups of genes, conditions, and times which exhibit highly correlated patterns of expression [12]. Figure 1 shows the structure of a tricluster, with genes as rows, conditions as columns, and time as depth.

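A tricluster $TC$ is a subset from a dataset $D$ which contains information related to the behavior of some genes $G$ under conditions $C$ at times $T$. The tricluster is formally defined as $TC = G_{TC} \times C_{TC} \times T_{TC}$, where $G_{TC} \subseteq G$, $C_{TC} \subseteq C$, and $T_{TC} \subseteq T$.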

3.2. Two-Dimension MSR

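The Mean Squared Residue (MSR) was introduced by Cheng and Church in [11]. This measure was proposed to assess the quality of biclusters extracted from gene expression data based on the biclusters' homogeneity. The formal definition can be seen in (1), where the residue $r_{gc}$ is defined in (2):

$$\mathrm{MSR}(BC) = \frac{1}{|G_{BC}| \cdot |C_{BC}|} \sum_{g \in G_{BC},\, c \in C_{BC}} r_{gc}^{2} \qquad (1)$$

$$r_{gc} = BC(g,c) - M_{G}(c) - M_{C}(g) + M_{GC} \qquad (2)$$

Each of the terms of (1) and (2) is defined as follows:
(i) $BC$: bicluster being evaluated,
(ii) $G_{BC}$: subset of genes of $BC$,
(iii) $C_{BC}$: subset of conditions of $BC$,
(iv) $|G_{BC}|$: number of genes in $BC$,
(v) $|C_{BC}|$: number of conditions in $BC$,
(vi) $BC(g,c)$: expression level of gene $g$ under condition $c$ in $BC$,
(vii) $M_{G}(c)$: mean of the values of condition $c$ under all genes in $BC$,
(viii) $M_{C}(g)$: mean of the values of gene $g$ under all conditions in $BC$,
(ix) $M_{GC}$: mean value of all values in $BC$.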

A graphical representation of the values involved in (2) can be seen in Figure 2. We can say that MSR measures the homogeneity for a given bicluster based on the difference of each individual gene expression (see Figure 2(a)) with the average values of genes (see Figure 2(b)), conditions (see Figure 2(c)), and genes and conditions (see Figure 2(d)). The closer the value of MSR is to zero, the more homogeneous the bicluster is. This interpretation is the basis for the extension to three-dimension measure presented in the next section.

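Figure 2: Values involved in the MSR residue (2): (a) individual gene expression value; (b) average over genes; (c) average over conditions; (d) average over genes and conditions.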
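The two-dimensional measure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the bicluster is assumed to be a list of gene rows holding one value per condition, and the function name `msr` and the sample matrices are hypothetical.

```python
def msr(bicluster):
    """Mean Square Residue of a bicluster given as a list of gene rows."""
    n_genes = len(bicluster)
    n_conds = len(bicluster[0])
    # Mean of each gene over all conditions (row means).
    gene_mean = [sum(row) / n_conds for row in bicluster]
    # Mean of each condition over all genes (column means).
    cond_mean = [sum(row[c] for row in bicluster) / n_genes
                 for c in range(n_conds)]
    # Mean of every value in the bicluster.
    total_mean = sum(gene_mean) / n_genes

    total = 0.0
    for g in range(n_genes):
        for c in range(n_conds):
            residue = bicluster[g][c] - gene_mean[g] - cond_mean[c] + total_mean
            total += residue ** 2
    return total / (n_genes * n_conds)


# Hypothetical sample data, for illustration only.
constant = [[5.0, 5.0, 5.0], [5.0, 5.0, 5.0]]
shifted = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]]  # rows differ by a constant offset
noisy = [[1.0, 9.0], [7.0, 2.0]]

print(msr(constant), msr(shifted), msr(noisy))
```

A constant bicluster scores zero, and so does one whose rows differ only by a constant offset, because the residue removes additive row and column effects. Values that follow no such pattern yield a positive score.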

3.3. Three Dimensions MSR

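Our proposal is an adaptation to three dimensions of MSR that measures the homogeneity of triclusters which contain subgroups of genes, conditions, and time points. We call this measure MSR_3D. The formal definition can be seen in (3), where the residue $r_{gct}$ is defined in (4):

$$\mathrm{MSR}_{3D}(TC) = \frac{1}{|G_{TC}| \cdot |C_{TC}| \cdot |T_{TC}|} \sum_{g \in G_{TC},\, c \in C_{TC},\, t \in T_{TC}} r_{gct}^{2} \qquad (3)$$

$$r_{gct} = TC(g,c,t) + M_{CT}(g) + M_{GT}(c) + M_{GC}(t) - M_{G}(c,t) - M_{C}(g,t) - M_{T}(g,c) - M_{GCT} \qquad (4)$$

Each of the members of (3) and (4) is defined as follows:
(i) $TC$: tricluster being evaluated,
(ii) $G_{TC}$: subset of genes from $TC$,
(iii) $C_{TC}$: subset of conditions from $TC$,
(iv) $T_{TC}$: subset of times from $TC$,
(v) $|G_{TC}|$: number of genes in $TC$,
(vi) $|C_{TC}|$: number of conditions in $TC$,
(vii) $|T_{TC}|$: number of times in $TC$,
(viii) $TC(g,c,t)$: expression level of gene $g$ under condition $c$ at time $t$ in $TC$,
(ix) $M_{CT}(g)$: mean of all conditions at all times for a gene $g$ in $TC$,
(x) $M_{GT}(c)$: mean of all genes at all times for a condition $c$ in $TC$,
(xi) $M_{GC}(t)$: mean of all genes under all conditions at time $t$ in $TC$,
(xii) $M_{G}(c,t)$: mean of the values of a condition $c$ and a time $t$ under all genes in $TC$,
(xiii) $M_{C}(g,t)$: mean of the values of a gene $g$ and a time $t$ under all conditions in $TC$,
(xiv) $M_{T}(g,c)$: mean of the values of a gene $g$ and a condition $c$ under all times in $TC$,
(xv) $M_{GCT}$: mean value of all values in $TC$.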

A graphical representation of the values involved in (4) can be seen in Figure 3. MSR_3D measures the homogeneity of a given tricluster based on the difference between two groups of values. The first group is the individual gene expression value (see Figure 3(a)), the mean of all conditions at all times for a gene (see Figure 3(b)), the mean of all genes at all times for a condition (see Figure 3(c)), and the mean of all genes under all conditions at a time (see Figure 3(d)). The second group is the mean of a condition and a time under all genes (see Figure 3(e)), the mean of a gene and a time under all conditions (see Figure 3(f)), the mean of a gene and a condition under all times (see Figure 3(g)), and the mean of all values in the tricluster (see Figure 3(h)). The closer the value of MSR_3D is to zero, the more homogeneous the tricluster is. MSR_3D is capable of finding negatively correlated genes due to its formulation.

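Figure 3: Values involved in the MSR_3D residue (4): (a) individual gene expression value; (b) mean of all conditions at all times for a gene; (c) mean of all genes at all times for a condition; (d) mean of all genes under all conditions at a time; (e) mean of a condition and a time under all genes; (f) mean of a gene and a time under all conditions; (g) mean of a gene and a condition under all times; (h) mean of all values in the tricluster.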
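The three-dimensional measure can be sketched in the same way. The code below is an illustrative reading of the description above, not the TriGen source: the tricluster is assumed to be a nested list indexed as `tc[gene][condition][time]`, the four terms of Figure 3(a) to 3(d) are added, and the four terms of Figure 3(e) to 3(h) are subtracted. The function name `msr_3d` and the sample data are hypothetical.

```python
def msr_3d(tc):
    """MSR_3D of a tricluster stored as tc[gene][condition][time]."""
    n_g, n_c, n_t = len(tc), len(tc[0]), len(tc[0][0])

    def mean(values):
        values = list(values)
        return sum(values) / len(values)

    # Means that keep one index free (Figure 3(b), 3(c), 3(d)).
    per_gene = [mean(tc[g][c][t] for c in range(n_c) for t in range(n_t))
                for g in range(n_g)]
    per_cond = [mean(tc[g][c][t] for g in range(n_g) for t in range(n_t))
                for c in range(n_c)]
    per_time = [mean(tc[g][c][t] for g in range(n_g) for c in range(n_c))
                for t in range(n_t)]
    # Means that keep two indices free (Figure 3(e), 3(f), 3(g)).
    over_genes = [[mean(tc[g][c][t] for g in range(n_g)) for t in range(n_t)]
                  for c in range(n_c)]
    over_conds = [[mean(tc[g][c][t] for c in range(n_c)) for t in range(n_t)]
                  for g in range(n_g)]
    over_times = [[mean(tc[g][c][t] for t in range(n_t)) for c in range(n_c)]
                  for g in range(n_g)]
    # Mean of every value in the tricluster (Figure 3(h)).
    overall = mean(per_gene)

    total = 0.0
    for g in range(n_g):
        for c in range(n_c):
            for t in range(n_t):
                residue = (tc[g][c][t] + per_gene[g] + per_cond[c] + per_time[t]
                           - over_genes[c][t] - over_conds[g][t]
                           - over_times[g][c] - overall)
                total += residue ** 2
    return total / (n_g * n_c * n_t)


# Hypothetical sample data, for illustration only.
constant_tc = [[[3.0, 3.0], [3.0, 3.0]], [[3.0, 3.0], [3.0, 3.0]]]
spike_tc = [[[8.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]

print(msr_3d(constant_tc), msr_3d(spike_tc))
```

The constant tricluster scores zero, in line with the perfect score described in the text, while a single deviating value in an otherwise flat tricluster gives a positive score.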

3.4. TriGen Algorithm

To test the effectiveness of MSR_3D we have included it as part of the TriGen (Triclustering-Genetic based) algorithm [12]. TriGen extracts triclusters from gene expression datasets where the time is also a component taken into account in the experiment. TriGen applies a bioinspired paradigm of an evolutionary heuristic, genetic algorithms, which mimics the process of natural selection by creating an initial population of individuals representing solutions which are crossed and mutated for a number of generations and the best individuals in the populations are finally selected. MSR_3D has been applied along with TriGen as a fitness function to assess the quality of the triclusters or solutions in the population.

The flowchart of the TriGen algorithm can be seen in Figure 4. In the following subsections we present the principal aspects of the algorithm, including inputs, outputs, representation of individuals, and genetic operators.

3.4.1. TriGen’s Input

The TriGen algorithm takes two inputs:
(i) $D$: a dataset containing the gene expression values from an experiment with genes $G$, experimental conditions $C$, and times $T$. Each cell $D(g,c,t)$, where $g \in G$, $c \in C$, and $t \in T$, represents the expression level of gene $g$ under experimental condition $c$ at time $t$;
(ii) a set of parameters to execute the algorithm, as described in Table 1. These parameters control the number of solutions or triclusters to find, the number of generations to execute, the number of individuals in the population, and the randomness factor of the initial population, as well as the weights for the selection and mutation operators (Sel and Mut), the weights that control the size of the triclusters (one each for genes, conditions, and times), and the weights that control the overlap among solutions (likewise one per dimension).

3.4.2. TriGen’s Output

The TriGen algorithm's output is a set of triclusters. Each tricluster is composed of a subset of the genes, conditions, and times of the input dataset, and the triclusters returned are those with the best scores when evaluated under the MSR_3D measure.

3.4.3. Codification of Individuals

Each individual in the evolutionary process of the TriGen algorithm represents a tricluster, that is, a subset of genes, experimental conditions, and time points. All genetic operators are applied to each individual in the population, in each of these three subsets. The genetic material is structured as follows. An individual, as mentioned above, is composed of three sequences of structures: one for the sequence of genes from the input dataset , one for the sequence of conditions , and one sequence of time points . These sequences are set up based on the input dataset; that is, where is the number of genes listed in the input dataset, for all genes, and .

Analogously where is the number of conditions listed in the input dataset, for all conditions, and .

Finally, represents different time stamps or values of pairs gene condition at different times: where is the number of samples measured over time and .

The algorithm’s population is made up of several individuals, as depicted in Figure 5, where the individual codification has been represented.

3.4.4. Initial Population

The initial population is generated according to the randomness parameter. The percentage of individuals given by this parameter is created at random by two methods. Half of these individuals are purely random, that is, a random subset of genes, conditions, and times chosen from the input dataset. The other half is also created at random but with contiguous values for the genes, contiguous values for the conditions, and contiguous times. The rest of the individuals are created at random but taking into account the previously created individuals, in order to control the overlap among solutions.
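The two random generation methods can be illustrated as follows. This is a sketch under assumptions, not the TriGen code: an individual is represented as a tuple of three sorted index lists (genes, conditions, times), subset sizes are drawn uniformly, and the overlap-aware generation of the remaining individuals is left out because the text does not detail it. All function names are hypothetical.

```python
import random


def random_subset(size, rng):
    """Purely random subset of the indices 0..size-1 (at least one element)."""
    k = rng.randint(1, size)
    return sorted(rng.sample(range(size), k))


def contiguous_subset(size, rng):
    """Random run of consecutive indices within 0..size-1."""
    k = rng.randint(1, size)
    start = rng.randint(0, size - k)
    return list(range(start, start + k))


def make_individual(shape, subset_fn, rng):
    """Build (genes, conditions, times) using the given subset generator."""
    return tuple(subset_fn(size, rng) for size in shape)


rng = random.Random(0)
shape = (4000, 30, 20)  # genes, conditions, times, as in the synthetic dataset
purely_random = make_individual(shape, random_subset, rng)
contiguous = make_individual(shape, contiguous_subset, rng)
print([len(dim) for dim in purely_random], [len(dim) for dim in contiguous])
```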

3.4.5. Fitness Function

The proposed measure MSR_3D has been applied as part of the fitness function to evaluate the homogeneity of the triclusters in the population. MSR_3D has been combined with two other factors which measure the size of the triclusters and their overlap with previously found solutions.

Controlling the size of each of the dimensions of the triclusters can be a very important task, since gene expression datasets are unbalanced across the three dimensions, with the number of genes in the thousands and the number of conditions and times in the tens. Therefore, the weights for the number of genes, conditions, and times keep the dimensions of the triclusters balanced (e.g., if we increase the weight for genes, the algorithm considers solutions with a high number of genes to be better than those with a low number of genes).

We also control the overlap among the solutions found, with one weight each for the overlap among genes, conditions, and times (e.g., if we increase the overlap weight for genes, the algorithm considers solutions with a low level of overlap with the genes in previously found solutions to be better than those with a high level of overlap).

Therefore, the fitness function combines three components: the MSR_3D value of the tricluster, the weighted size term, and the weighted overlap term.

3.4.6. Selection Operator

This operator is implemented following the roulette wheel selection method [31]. The fitness level is used to associate a probability of selection with each individual of the population. This emulates the behavior of a roulette wheel in a casino: a proportion of the wheel is assigned to each of the possible selections based on its fitness value, and then a random selection is made, similar to spinning the wheel. Candidates with a better fitness are less likely to be eliminated, but there is still a chance that they are. Likewise, some weaker solutions may survive the selection process. This is an advantage, because even a weak solution may include some component that proves useful after the recombination process. The selection parameter (Sel) determines how many individuals pass to the next generation through this method. The remaining individuals needed to complete the next population are created with the crossover operator.
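A generic roulette wheel can be sketched as below. This is an illustration of the general technique and not the TriGen implementation: sampling is done with replacement, and the weights are assumed to be non-negative with larger values meaning a larger slice of the wheel. Since a lower MSR_3D is better, a score would first have to be turned into such a weight; the transformation `1 / (1 + score)` used in the example is an assumption made only for this sketch.

```python
import random


def roulette_select(population, weights, n_selected, rng):
    """Pick n_selected individuals with probability proportional to weight."""
    total = sum(weights)
    selected = []
    for _ in range(n_selected):
        spin = rng.uniform(0, total)
        cumulative = 0.0
        for individual, weight in zip(population, weights):
            cumulative += weight
            if spin < cumulative:
                selected.append(individual)
                break
        else:
            # Guard against floating point rounding at the end of the wheel.
            selected.append(population[-1])
    return selected


# Hypothetical scores (lower is better) turned into selection weights.
population = ["tc_a", "tc_b", "tc_c"]
scores = [0.2, 1.5, 4.0]
weights = [1.0 / (1.0 + s) for s in scores]
survivors = roulette_select(population, weights, 2, random.Random(1))
print(survivors)
```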

3.4.7. Crossover Operator

To complete the next generation, we create new individuals with this operator as follows: two individuals (the parents) are combined to create two new individuals (the offspring). The parents are randomly chosen. Their genetic material is combined by a random one-point cross in each of the gene, condition, and time sequences, mixing the resulting coordinates in both children. We can see this process in Figure 6.
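One possible reading of this operator is sketched below. The details are assumptions: each parent is a tuple of three index lists, a cut point is drawn independently for each parent and each dimension, duplicates are dropped, and an empty child sequence falls back to a copy of a parent's sequence. The function names are hypothetical.

```python
import random


def one_point_cross(seq_a, seq_b, rng):
    """Swap the tails of two index lists at random cut points."""
    cut_a = rng.randint(0, len(seq_a))
    cut_b = rng.randint(0, len(seq_b))
    # Fall back to a parent's sequence if a child would be empty (assumption).
    child_1 = sorted(set(seq_a[:cut_a] + seq_b[cut_b:])) or list(seq_a)
    child_2 = sorted(set(seq_b[:cut_b] + seq_a[cut_a:])) or list(seq_b)
    return child_1, child_2


def crossover(parent_a, parent_b, rng):
    """Apply the one-point cross to genes, conditions, and times in turn."""
    crossed = [one_point_cross(a, b, rng) for a, b in zip(parent_a, parent_b)]
    child_1 = tuple(pair[0] for pair in crossed)
    child_2 = tuple(pair[1] for pair in crossed)
    return child_1, child_2


# Hypothetical parents: (genes, conditions, times).
parent_a = ([0, 1, 2, 3], [0, 1], [0, 1, 2])
parent_b = ([10, 11, 12], [3, 4, 5], [5, 6])
children = crossover(parent_a, parent_b, random.Random(7))
print(children)
```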

3.4.8. Mutation

An individual can be mutated according to a probability of mutation, Mut. The mutation probability is checked for every individual and, if it is satisfied, one out of six possible actions is taken: add a new random gene, add a new random condition, add a new random time, remove a random gene, remove a random condition, or remove a random time. The choice among these actions is also random. When a new gene, condition, or time is added, the operator checks whether the new member is already in the individual.
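The six actions can be sketched by choosing a dimension and then choosing between adding and removing. This is a minimal illustration with assumptions: the six actions are taken as equally likely, an added member is drawn only from indices not yet in the individual, and a removal never empties a dimension. The function name `mutate` is hypothetical.

```python
import random


def mutate(individual, shape, mut_probability, rng):
    """Possibly add or remove one gene, condition, or time."""
    if rng.random() >= mut_probability:
        return individual  # mutation not triggered for this individual
    dimension = rng.randrange(3)   # 0 = genes, 1 = conditions, 2 = times
    add = rng.random() < 0.5       # add or remove, three dimensions each
    members = set(individual[dimension])
    if add:
        # Only members that are not already in the individual may be added.
        candidates = [i for i in range(shape[dimension]) if i not in members]
        if candidates:
            members.add(rng.choice(candidates))
    elif len(members) > 1:
        # Assumption: a dimension is never left empty.
        members.remove(rng.choice(sorted(members)))
    mutated = list(individual)
    mutated[dimension] = sorted(members)
    return tuple(mutated)


# Hypothetical individual in a dataset with 5 genes, 5 conditions, 5 times.
individual = ([0, 1], [0, 1], [0, 1])
print(mutate(individual, (5, 5, 5), 1.0, random.Random(3)))
```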

4. Results

We have applied the proposed measure MSR_3D as part of the TriGen algorithm to analyse several datasets: synthetically generated data, data from experiments with the yeast cell cycle (Saccharomyces cerevisiae) obtained from Stanford University [16], and three datasets retrieved from the Gene Expression Omnibus [17], a database repository of high throughput gene expression data. Two of these datasets are experiments on mouse (Mus musculus) [32, 33] and the third one is an experiment on humans (Homo sapiens) [34]. All experiments examine the behaviour of genes under conditions at certain times.

To examine the quality of the results in experiments with real datasets, we show for each experiment two types of validity measures: analysis of correlation among the genes, conditions, and times in each tricluster and analysis of genes and gene product annotations for the genes in each tricluster based on the Gene Ontology project [20].

Regarding the correlation analysis, we show a table with one row per tricluster, in which we calculate the Pearson and Filon [18] and Spearman [19] correlation coefficients between each pair of condition-time combinations; the value series being correlated are the expression levels of all genes in the corresponding condition-time combination. For example, for a tricluster with ten genes $g_1, \ldots, g_{10}$, three conditions $c_1$, $c_2$, and $c_3$, and two times $t_1$ and $t_2$, we provide Pearson's and Spearman's correlation coefficients for the values at the six possible combinations $(c_1,t_1)$, $(c_1,t_2)$, $(c_2,t_1)$, $(c_2,t_2)$, $(c_3,t_1)$, and $(c_3,t_2)$, each of which holds one value for each of the ten genes.
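The correlation check can be sketched as follows. This is an illustration under assumptions, not the authors' validation code: the tricluster is a nested list indexed as `tc[gene][condition][time]`, the coefficients are computed for every pair of condition-time series and then averaged, Spearman is computed as Pearson on average ranks, and series with zero variance are not handled. All function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long series (non-constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mean_pairwise_correlation(tc, corr):
    """Average corr over all pairs of condition-time series of gene values."""
    n_g, n_c, n_t = len(tc), len(tc[0]), len(tc[0][0])
    series = [[tc[g][c][t] for g in range(n_g)]
              for c in range(n_c) for t in range(n_t)]
    pairs = list(combinations(series, 2))
    return sum(corr(a, b) for a, b in pairs) / len(pairs)


# Hypothetical tricluster: 3 genes, 2 conditions, 1 time, opposite trends.
opposite = [[[1.0], [3.0]], [[2.0], [2.0]], [[3.0], [1.0]]]
print(mean_pairwise_correlation(opposite, pearson))
```

In this sketch, negatively correlated series pull the average towards minus one, and a mix of positive and negative pairs pulls it towards zero.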

In the biological analysis we provide a validation of the triclusters obtained based on the Gene Ontology project (GO) [20]. GO is a major bioinformatic initiative with the aim of standardizing the representation of gene and gene product attributes across species and databases. The project provides an ontology of terms for describing gene product characteristics and gene product annotation data. The ontology covers three domains: cellular component, the parts of a cell or its extracellular environment; molecular function, the elemental activities of a gene product at the molecular level such as binding or catalysis; and biological process, operations, or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms. For legibility reasons, we have presented for one solution of the experiment a GO analysis table in which we include the most representative terms extracted by the Ontologizer software [35].

We have also provided a graphical representation of the triclusters found. For legibility reasons we show graphs for one tricluster for each of the experiments. Each tricluster is represented through three graphical views in which we can see the pattern of behavior. In the first (sample curves), we show one graph for each time, with genes on the x-axis, the expression levels on the y-axis, and the conditions as the plotted lines. In the second (time curves), we show one graph for each experimental condition, with genes on the x-axis, the expression levels on the y-axis, and the times as the plotted lines. In the third (gene curves), we show one graph for each experimental condition, with times on the x-axis, the expression levels on the y-axis, and the genes as the plotted lines.

All experiments were executed on a multiprocessor machine with 64 Intel Xeon E7-4820 processors at 2.00 GHz and 8 GB of RAM. We have used Java for the TriGen algorithm implementation (and other ad hoc developments) and an R framework to create graphics and retrieve dataset resources from GEO [17].

We now analyse the results obtained in each of the five experiments.

4.1. Synthetic Datasets

Synthetic data has the advantage that the process that generated the data is well known and so one is able to judge the success or failure of the algorithm [36]. Synthetic datasets generation has been widely applied both in microarray related publications [37, 38] and in other general data mining applications [39].

We have used an application of our own design to generate the synthetic data used in this work. The data generated is a three-dimensional dataset Dsynt_3D with 4000 genes (rows), 30 conditions (columns), and 20 times (depth) of random numbers produced by a cryptographically secure generator from the Math3 library provided by Apache Commons [40]. Into this dataset we insert 10 triclusters, each with a 3D pattern of 150 genes (rows), 6 conditions (columns), and 4 times (depth), at random positions within Dsynt_3D.

To study the behavior of the MSR_3D measure applied along with TriGen, and to analyze the effect of the parameter values on the solutions, we have run executions varying the number of solutions and the other control parameters as follows.
(i) Number of generations: a greater number of generations increases the genetic recombination of individuals; an excessive increase may favour exploitation over exploration, and the algorithm may return solutions that fall into a local minimum.
(ii) Number of individuals: an increase in the number of individuals creates a larger search space for the solutions; an excessive increase can create a scatter search effect and therefore fail to return good quality solutions.
(iii) Rate of selection: a high selection rate creates individuals with a low level of genetic recombination, favouring exploitation over exploration; if the parameter is increased in excess, the algorithm may fall into a local minimum.
(iv) Probability of mutation: the opposite of the rate of selection. A high probability of mutation favours exploration over exploitation; if it is increased in excess, the solutions end up spread over many areas of the search space but with low quality levels.
(v) Randomness in the initial population: increasing this parameter increases the level of randomness in the initial population. This has to be combined with the overlap control to make sure that a wide area of the space of solutions is initially covered.
(vi) Weights for the number of genes, conditions, and times in the solution: these control the number of items in the solutions; increasing them favours solutions with more volume.
(vii) Overlap control weights for genes, conditions, and times: increasing these weights leads to solutions with little or no overlap; an excessive increase can cause interesting solutions to be lost.

The results obtained are shown in Table 2. We can see the high rate of coverage (90–96%) of the 10 different triclusters inserted at random positions in the dataset Dsynt_3D.

We can conclude that the MSR_3D measure applied along with the TriGen algorithm was successful in finding the solution triclusters.

4.2. Yeast Cell Cycle Dataset

We have applied the TriGen algorithm to the yeast (Saccharomyces cerevisiae) cell cycle problem [16]. The goal of the yeast cell cycle analysis project is to identify all genes whose mRNA levels are regulated by the cell cycle. The resources used are public and available at http://genome-www.stanford.edu/cellcycle/. This site provides gene expression values obtained from different microarray experiments. In particular, we have created a dataset Delu_3D from the elutriation experiment with 7744 genes, 13 experimental conditions, and 14 time points. Experimental conditions correspond to different statistical measures of the Cy3 and Cy5 channels, while time points represent the different moments at which measures were taken, from 0 to 390 minutes.

The parameter configuration used for this experiment is shown in Table 3.

With this configuration we wanted to find solutions with a considerable number of genes because it is the largest dimension on Delu_3D. With the overlap control values we seek a compromise between slightly overlapped solutions and not losing interesting triclusters. The rest of the parameters have been set to a default configuration.

The correlation analysis of the results is shown in Table 4. The correlation levels vary from very low up to almost perfect correlation. This is due to the fact that MSR_3D is capable of finding negatively correlated values, and some genes involved in the yeast cell cycle behave in an inversely correlated manner [41, 42], as can be seen in Figure 7(a). Therefore, when averaging correlations close to one with correlations close to minus one, we get values close to zero. Three triclusters stand out for having Pearson and Spearman correlation values close to one, indicating an almost perfect correlation.

Figure 7: Tricluster from the yeast cell cycle dataset: (a) sample curves; (b) time curves; (c) gene curves.

In Figure 7 we also show a graphical representation of the genes, conditions, and times selected by one tricluster, with 30 genes, 3 conditions, and 9 time points. In Figure 7(a) we see a representation of genes at each condition with a graph for each time. The negative correlation among genes is clearly shown. Figure 7(b) shows the genes at each time with one graph for each condition, and finally in Figure 7(c) we see the times at each gene with a graph for each condition.

In Table 5 we show an analysis of the biological annotations related to the genes selected in this tricluster.

In this type of study, p-values below 0.05 are considered relevant. We show the ten most significant terms, with p-values ranging in the [0.001970, 0.01039] interval. Furthermore, these terms are quite specific, which increases the quality of the tricluster obtained.

4.3. Mouse GDS4510 Dataset

This dataset was obtained from GEO [17] with accession code GDS4510 and under the title rd1 model of retinal degeneration: time course [32]. In this experiment the degeneration of retinal cells in different individuals of house mouse (Mus musculus) is analyzed at four time points shortly after birth, specifically on days 2, 4, 6, and 8. Our input dataset is composed of 22690 genes, 8 experimental conditions (one for each individual involved in the biological experiment), and 4 time points.

We have executed the TriGen algorithm with the parameters shown in Table 6. We have increased the number of generations and individuals to create a larger search space, as the input dataset has a considerably large volume. For the same reason we have increased the weight for the number of genes to favor individuals with a greater number of genes.

In Table 7 we see the correlation analysis for the 20 triclusters obtained. The correlation coefficients are very high and, in most cases, perfect with values close to one. This indicates almost perfect homogeneity between the genes, conditions, and times of the tricluster.

We show the graphs associated with solution with 78 genes, 6 conditions, and 3 time points in Figure 8. We see for the three views, Figures 8(a), 8(b), and 8(c), how all lines are totally aligned.

Figure 8: (a) sample curves; (b) time curves; (c) gene curves.

The biological validation of this solution is given in Table 8; it yields good results in terms of the terms listed and their high statistical significance (p-values below 0.05). The terms are again very specific, and some are related to the dataset under study, such as embryonic placenta development (GO:0001892) or cell differentiation involved in embryonic placenta development (GO:0060706).

4.4. Mouse GDS4442 Dataset

This time we accessed the GEO database [17] to retrieve the dataset with code GDS4442, titled ectopic bHLH transcription factor expression Mesogenin1 effect on embryoid bodies: time course [33]. This biological experiment examines the effect of doxycycline induction in mouse (Mus musculus) embryonic individuals at three stages of development: 12, 24, and 48 hours. Our input dataset is composed of 45101 genes, 6 experimental conditions (one for each individual involved in the biological experiment), and 3 time points.

Regarding the TriGen parameters, we again increased the number of individuals and the number of generations for the same reason as in the previous experiment, that is, to have more solutions in the evolutionary process and a larger number of generations, given the size of the input dataset (see Table 9).

The correlation analysis shows high correlation values; three solutions stand out with Pearson's correlation values close to 1 (see Table 10).

Figure 9 shows the graphical representation of the solution with 15 genes, 5 conditions, and 2 time points. We can see the great homogeneity among all genes, conditions, and times in Figures 9(a), 9(b), and 9(c).

Figure 9: (a) sample curves; (b) time curves; (c) gene curves.

The biological evaluation of this tricluster, shown in Table 11, yields annotated terms with high statistical significance, notably GO:0045127, GO:0009384, and GO:0019262, which are related to cell wall synthesis, which in turn is related to the action of doxycycline.

4.5. Human GDS4472 Dataset

This dataset was obtained from GEO [17] under code GDS4472, titled transcription factor oncogene OTX2 silencing effect on D425 medulloblastoma cell line: time course [34]. In this experiment we analyze the effect of doxycycline on medulloblastoma cancerous cells at six times after induction: 0, 8, 16, 24, 48, and 96 hours. Our input dataset is composed of 54675 genes, 4 conditions (one for each individual involved), and 6 time points (one for each sampling time).

Because of the volume of the dataset we again increase the number of individuals and the number of generations to expand the space of solutions. The full set of parameters can be seen in Table 12.

We can see in Table 13 the high levels of correlation obtained for the 15 solutions found.

Figure 10 shows the tricluster with 25 genes, 2 conditions, and 2 time points. We can see the great homogeneity among all genes, conditions, and times in Figures 10(a), 10(b), and 10(c).

Figure 10: (a) sample curves; (b) time curves; (c) gene curves.

The biological validation can be seen in Table 14, where we see annotated terms with high statistical significance.

5. Conclusions

In this work we have presented a new evaluation measure for triclusters, MSR_3D, which measures the homogeneity among genes, conditions, and times in a tricluster. This measure is inspired by the classic MSR measure proposed by Cheng and Church [11]. A detailed formulation of both MSR and MSR_3D has been provided.
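For readers who want a concrete picture of such a measure, the following is a minimal sketch of a three-dimensional mean squared residue. It assumes the residue of each cell is the three-way analogue of Cheng and Church's residue (the cell value, plus the three means taken over two dimensions, minus the three means taken over one dimension, minus the overall mean); the formulation given earlier in the paper is authoritative and may differ in detail. The function name `msr_3d` and the example data are hypothetical.

```python
from statistics import mean


def msr_3d(cube):
    """Mean squared three-way residue of cube[g][c][t] (assumed form)."""
    n_g, n_c, n_t = len(cube), len(cube[0]), len(cube[0][0])
    G, C, T = range(n_g), range(n_c), range(n_t)

    # Means taken over a single dimension.
    m_g = [[mean(cube[g][c][t] for g in G) for t in T] for c in C]  # [c][t]
    m_c = [[mean(cube[g][c][t] for c in C) for t in T] for g in G]  # [g][t]
    m_t = [[mean(cube[g][c][t] for t in T) for c in C] for g in G]  # [g][c]
    # Means taken over two dimensions.
    m_gc = [mean(cube[g][c][t] for g in G for c in C) for t in T]   # [t]
    m_gt = [mean(cube[g][c][t] for g in G for t in T) for c in C]   # [c]
    m_ct = [mean(cube[g][c][t] for c in C for t in T) for g in G]   # [g]
    # Mean of the whole tricluster.
    m_gct = mean(cube[g][c][t] for g in G for c in C for t in T)

    total = 0.0
    for g in G:
        for c in C:
            for t in T:
                residue = (cube[g][c][t]
                           + m_gc[t] + m_gt[c] + m_ct[g]
                           - m_g[c][t] - m_c[g][t] - m_t[g][c]
                           - m_gct)
                total += residue * residue
    return total / (n_g * n_c * n_t)


# Hypothetical tricluster built from gene, condition, and time effects only.
additive = [[[g + 2 * c + 3 * t for t in range(4)] for c in range(3)]
            for g in range(5)]
# The same tricluster with the pattern broken in a single cell.
perturbed = [[row[:] for row in plane] for plane in additive]
perturbed[0][0][0] += 10

print(msr_3d(additive), msr_3d(perturbed))
```

Under this assumed form, a perfectly coherent tricluster scores zero and any deviation from the shared pattern raises the score, so lower values indicate more homogeneous triclusters.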

In order to assess the quality of the measure, we have applied it along with the TriGen algorithm [12], an evolutionary heuristic to mine triclusters from microarray experiments involving time, to several datasets: synthetically generated data, data from experiments on the yeast cell cycle (Saccharomyces cerevisiae) obtained from Stanford University [16], and three datasets retrieved from Gene Expression Omnibus [17]. Two of the latter are experiments on mouse (Mus musculus) and the third is an experiment on humans (Homo sapiens). All experiments examine the behavior of genes under conditions at certain times.

The results obtained have been validated by analyzing the correlation among the genes, conditions, and times in each tricluster using two different correlation measures: Pearson [18] and Spearman [19]. In addition, we have provided functional annotations for the genes, extracted from the Gene Ontology project [20]. Regarding the synthetic data, MSR_3D combined with TriGen has been capable of extracting almost all of the 10 triclusters artificially inserted in the dataset, with a coverage of 90% to 96%. The results for the real datasets are also successful, with correlation values close to one, with the exception of the yeast dataset, where values are close to zero because MSR_3D finds triclusters containing negatively correlated genes.

The GO validation has given good results as well, with high levels of significance for the terms extracted (p-values smaller than 0.05 and very specific terms). Graphical representation of the triclusters has also been provided.

MSR_3D is a tricluster evaluation measure created to assess the quality of triclusters extracted from temporal experiments with microarrays, but it can be used in other biologically related fields. For instance, expression data can be combined with gene regulation information by substituting the time dimension with ChIP-chip data representing transcription factor-gene interactions, which can provide regulatory network information. This proposal can also be applied to mine RNA-seq data repositories. Triclustering can also be applied to fields unrelated to biology, for instance, the seismic regionalization of areas at risk of undergoing an earthquake [43]. In this case, the third component does not identify time points but features associated with every pair of geographical coordinates of the area under study.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank the Spanish Ministry of Science and Technology for financial support through project TIN2011-28956-C02-02 and the Junta de Andalucía through project TIC-7528.

References

[1] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, New York, NY, USA, 1998.
[2] P. O. Brown and D. Botstein, “Exploring the new world of the genome with DNA microarrays,” Nature Genetics, vol. 21, no. 1, pp. 33–37, 1999.
[3] C. Rubio-Escudero, F. Martínez-Álvarez, R. Romero-Zaliz, and I. Zwir, “Classification of gene expression profiles: comparison of K-means and expectation maximization algorithms,” in Proceedings of the 8th International Conference on Hybrid Intelligent Systems, HIS 2008, pp. 831–836, Barcelona, Spain, September 2008.
[4] M. P. Tan, E. N. Smith, J. R. Broach, and C. A. Floudas, “Microarray data mining: a novel optimization-based approach to uncover biologically coherent structures,” BMC Bioinformatics, vol. 9, article 268, 2008.
[5] P. D'Haeseleer, S. Liang, and R. Somogyi, “Genetic network inference: from co-expression clustering to reverse engineering,” Bioinformatics, vol. 16, no. 8, pp. 707–726, 2000.
[6] A. Ben-Dor, B. Chor, R. Karp, and Z. Yakhini, “Discovering local structure in gene expression data: the order-preserving submatrix problem,” in Proceedings of the 6th Annual International Conference on Computational Biology, pp. 49–57, April 2002.
[7] J. A. Hartigan, “Direct clustering of a data matrix,” Journal of the American Statistical Association, vol. 67, no. 337, pp. 123–129, 1972.
[8] S. C. Madeira and A. L. Oliveira, “Biclustering algorithms for biological data analysis: a survey,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 1, no. 1, pp. 24–45, 2004.
[9] Z. Bar-Joseph, “Analyzing time series gene expression data,” Bioinformatics, vol. 20, no. 16, pp. 2493–2503, 2004.
[10] F. Divina, B. Pontes, R. Giráldez, and J. S. Aguilar-Ruiz, “An effective measure for assessing the quality of biclusters,” Computers in Biology and Medicine, vol. 42, no. 2, pp. 245–256, 2012.
[11] Y. Cheng and G. M. Church, “Biclustering of expression data,” in Proceedings of the International Conference on Intelligent Systems for Molecular Biology (ISMB '00), pp. 93–103, 2000.
[12] D. Gutiérrez-Avilés, C. Rubio-Escudero, F. Martínez-Álvarez, and J. C. Riquelme, “Trigen: a genetic algorithm to mine triclusters in temporal gene expression data,” Neurocomputing, vol. 132, pp. 42–53, 2014.
[13] H. Banka and S. Mitra, “Evolutionary biclustering of gene expressions,” Ubiquity, vol. 2006, article 5, 2006.
[14] S. Mitra and H. Banka, “Multi-objective evolutionary biclustering of gene expression data,” Pattern Recognition, vol. 39, no. 12, pp. 2464–2477, 2006.
[15] A. Tanay, R. Sharan, and R. Shamir, “Discovering statistically significant biclusters in gene expression data,” Bioinformatics, vol. 18, supplement 1, pp. S136–S144, 2002.
[16] P. T. Spellman, G. Sherlock, M. Q. Zhang et al., “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Molecular Biology of the Cell, vol. 9, no. 12, pp. 3273–3297, 1998.
[17] T. Barrett, S. E. Wilhite, P. Ledoux et al., “NCBI GEO: archive for functional genomics data sets—update,” Nucleic Acids Research, vol. 41, pp. D991–D995, 2011.
[18] K. Pearson and L. N. G. Filon, “Mathematical contributions to the theory of evolution. IV. On the probable errors of frequency constants and on the influence of random selection on variation and correlation,” Philosophical Transactions of the Royal Society of London. Series A, vol. 191, pp. 229–311, 1898.
[19] C. Spearman, “Correlation calculated from faulty data,” The British Journal of Psychology, vol. 3, no. 3, pp. 271–295, 1910.
[20] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.
[21] L. Zhao and M. J. Zaki, “TRICLUSTER: an effective algorithm for mining coherent clusters in 3D microarray data,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '05), pp. 694–705, June 2005.
[22] H. Jiang, S. Zhou, J. Guan, and Y. Zheng, “gTRICLUSTER: a more general and effective 3D clustering algorithm for gene-sample-time microarray data,” in Data Mining for Biomedical Applications, vol. 3916 of Lecture Notes in Computer Science, pp. 48–59, Springer, New York, NY, USA, 2006.
[23] J. Liu, Z. Li, X. Hu, and Y. Chen, “Multi-objective evolutionary algorithm for mining 3D clusters in gene-sample-time microarray data,” in Proceedings of the IEEE International Conference on Granular Computing (GRC '08), pp. 442–447, Hangzhou, China, August 2008.
[24] X. Xu, Y. Lu, K. L. Tan, and A. K. H. Tung, “Finding time-lagged 3D clusters,” in Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE '09), pp. 445–456, Shanghai, China, April 2009.
[25] G. Wang, L. Yin, Y. Zhao, and K. Mao, “Efficiently mining time-delayed gene expression patterns,” IEEE Transactions on Systems, Man, and Cybernetics B: Cybernetics, vol. 40, no. 2, pp. 400–411, 2010.
[26] K. Sim, Z. Aung, and V. Gopalkrishnan, “Discovering correlated subspace clusters in 3D continuous-valued data,” in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM '10), pp. 471–480, Sydney, Australia, December 2010.
[27] Z. Hu and R. Bhatnagar, “Algorithm for discovering low-variance 3-clusters from real-valued datasets,” in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM '10), pp. 236–245, Sydney, Australia, December 2010.
[28] Y. C. Liu, C. H. Lee, W. C. Chen, J. W. Shin, H. H. Hsu, and V. S. Tseng, “A novel method for mining temporally dependent association rules in three-dimensional microarray datasets,” in Proceedings of the International Computer Symposium (ICS '10), pp. 759–764, Tainan City, Taiwan, December 2010.
[29] P. Mahanta, H. A. Ahmed, D. K. Bhattacharyya, and J. K. Kalita, “Triclustering in gene expression data analysis: a selected survey,” in Proceedings of the 2nd National Conference on Emerging Trends and Applications in Computer Science (NCETACS '11), pp. 1–6, Shillong, India, March 2011.
[30] S. Gremalschi and G. Altun, “Mean squared residue based biclustering algorithms,” in Bioinformatics Research and Applications, vol. 4983 of Lecture Notes in Computer Science, pp. 232–243, Springer, New York, NY, USA, 2008.
[31] M. Martínez-Ballesteros, F. Martínez-Álvarez, A. Troncoso, and J. C. Riquelme, “An evolutionary algorithm to discover quantitative association rules in multidimensional time series,” Soft Computing, vol. 15, no. 10, pp. 2065–2084, 2011.
[32] V. M. Dickison, A. M. Richmond, A. Abu Irqeba et al., “A role for prenylated rab acceptor 1 in vertebrate photoreceptor development,” BMC Neuroscience, vol. 13, article 152, 2012.
[33] R. B. Chalamalasetty, W. C. Dunty Jr., K. K. Biris et al., “The Wnt3a/β-catenin target gene Mesogenin1 controls the segmentation clock by activating a Notch signalling program,” Nature Communications, vol. 2, no. 1, article 390, 2011.
[34] J. Bunt, N. E. Hasselt, D. A. Zwijnenburg et al., “OTX2 directly activates cell cycle genes and inhibits differentiation in medulloblastoma cells,” International Journal of Cancer, vol. 7, no. 6, pp. E21–E32, 2011.
[35] S. Bauer, S. Grossmann, M. Vingron, and P. N. Robinson, “Ontologizer 2.0—a multifunctional tool for GO term enrichment analysis and data exploration,” Bioinformatics, vol. 24, no. 14, pp. 1650–1651, 2008.
[36] P. Mendes, “GEPASI: a software package for modelling the dynamics, steady states and control of biochemical and other systems,” Computer Applications in the Biosciences, vol. 9, no. 5, pp. 563–571, 1993.
[37] M. Barenco, J. Stark, D. Brewer, D. Tomescu, R. Callard, and M. Hubank, “Correction of scaling mismatches in oligonucleotide microarray data,” BMC Bioinformatics, vol. 7, article 251, 2006.
[38] K. Hakamada, M. Okamoto, and T. Hanai, “Novel technique for preprocessing high dimensional time-course data from DNA microarray: mathematical model-based clustering,” Bioinformatics, vol. 22, no. 7, pp. 843–848, 2006.
[39] R. P. Pargas, M. J. Harrold, and R. R. Peck, “Test-data generation using genetic algorithms,” Software Testing Verification and Reliability, vol. 9, no. 4, pp. 263–282, 1999.
[40] Apache Commons, “Commons-math: the apache commons mathematics library,” 2011.
[41] T. Zeng and J. Li, “Maximization of negative correlations in time-course gene expression data for enhancing understanding of molecular pathways,” Nucleic Acids Research, vol. 38, no. 1, article e1, 2009.
[42] M. J. Brauer, C. Huttenhower, E. M. Airoldi et al., “Coordination of growth rate, cell cycle, stress response, and metabolic activity in yeast,” Molecular Biology of the Cell, vol. 19, no. 1, pp. 352–367, 2008.
[43] J. Reyes and V. H. Cárdenas, “A Chilean seismic regionalization through a Kohonen neural network,” Neural Computing and Applications, vol. 19, no. 7, pp. 1081–1087, 2010.

Copyright

Copyright © 2014 David Gutiérrez-Avilés and Cristina Rubio-Escudero. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
