Abstract

The most commonly applied strategies for identifying genes with a common response profile are based on clustering algorithms. These methods have no explicit rules to define the appropriate number of groups of genes. Usually the number of clusters is decided on the basis of heuristic criteria or through the application of one of the different methods proposed to assess the number of clusters in a dataset. The purpose of this paper is to compare the performance of seven of these techniques, including traditional ones and some recently proposed. All of them produce underestimations of the true number of clusters. However, within this limitation, the gDGC algorithm appears to be the best. It is the only one that explicitly states a rule for cutting a dendrogram on the basis of a hypothesis-testing framework, allowing the user to calibrate its sensitivity by adjusting the significance level.

1. Introduction

One of the main purposes of microarray experiments is to discover genes having differential expression levels among a set of treatment conditions. Once the set of "candidate" genes is obtained, the problem of identifying those having a common response profile across experimental conditions remains open [13]. There are many strategies to proceed with. One of them is the exploration of gene ontology; the others, more commonly applied, are based on unsupervised classification algorithms (cluster analysis). The main purpose of clustering techniques is to arrange a number of instances into meaningful groups. Hierarchical clustering methods allow not only grouping genes but also tracing their relationships. The outcome of hierarchical methods is displayed as a binary tree called a dendrogram. A key point in interpreting a dendrogram is deciding where to cut it. This decision is equivalent to determining the number of clusters in the dataset. The problem is to recognize which instances belong to different groups and which only seem to be different as the result of sampling errors.

Several general-purpose methods have been proposed to estimate the optimal number of clusters in a dataset. The most popular are those introduced by Calinski and Harabasz [4], Hartigan [5], Sarle [6], and Kaufman and Rousseeuw [7].

Tibshirani et al. [8] proposed the Gap statistic as a method for assessing the number of clusters in a dataset. It compares the log of the within-cluster sum of squares against its expected value under a suitable null distribution. The authors exemplified its application to the discovery of groups in a hierarchical clustering of genes from a microarray experiment.

Another method, developed in the framework of large-scale gene-expression studies, is the algorithm called Hierarchical Ordered Partitioning and Collapsing Hybrid (HOPACH) [2]. It produces a hierarchical tree of clusters and was developed for the purpose of discovering patterns within a hierarchical structure.

A different approach to the problem of grouping is to view the data as a sample from a mixture of populations. The classification is the result of mixture-model estimation and selection that takes into account not only the number of populations present in the sample and their location parameters but also their dispersion and correlation parameters. A representative procedure of this kind is the MClust algorithm of Fraley and Raftery [9, 10], which is based on the modeling of a multivariate normal mixture.

Every method mentioned previously assumes that each instance is represented by a $p$-variate vector of attributes. In microarray experiments the genes are the instances and their expressions, observed under the contrasting experimental conditions, the $p$-variate vector of attributes. In this type of experiment there are usually several biological replicates for each experimental condition. None of the methods mentioned above makes explicit use of those replicates. Valdano and Di Rienzo [11] proposed a multivariate generalization (gDGC) of a univariate pairwise comparison procedure [12] which uses the replicates to estimate the cutting point of a dendrogram generated by a given linkage algorithm. In this way the procedure generates a partition within a hierarchical structure, a desirable property in the framework of microarray data analysis.

Considering the reviews of Tibshirani et al. [8], Lee et al. [1], Pollard and van der Laan [2], and Gentleman and Carey [3] regarding the problem of assessing the number of clusters in a dataset, we focused our comparison on the last four methods mentioned above (Gap, HOPACH, MClust, and gDGC) for estimating the number of clusters in a gene-expression matrix. However, we included the general-purpose methods as a reference. The comparison was done under a set of scenarios described in the following sections.

2. Methods

Let the dataset consist of $p$ features measured on $n$ independent observations; that is, $X$ is an $n \times p$ data matrix. Suppose that we have grouped the data into $k$ clusters. Let $U$ be an $n \times k$ cluster indicator matrix ($u_{ij} = 1$ if the $i$th observation belongs to the $j$th cluster, $u_{ij} = 0$ otherwise) and $\bar{X}$ the $k \times p$ matrix of cluster means:

$$\bar{X} = (U'U)^{-1}U'X.$$

Then, the pooled within-cluster sum of squares matrix is $W = (X - U\bar{X})'(X - U\bar{X})$ and the between-cluster sum of squares matrix is $B = (U\bar{X} - \mathbf{1}\bar{x}')'(U\bar{X} - \mathbf{1}\bar{x}')$, where $\bar{x}$ is the vector of grand means and $\mathbf{1}$ an $n \times 1$ vector of ones.

The within-cluster and the between-cluster sums of squares pooled over variables, for a given number of clusters $k$, are, respectively, $W_k = \operatorname{tr}(W)$ and $B_k = \operatorname{tr}(B)$.
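As a concrete illustration, the following R sketch computes $W_k$ and $B_k$ from a data matrix and a vector of cluster labels (the function name pooled_ss and its interface are ours, not part of any of the packages discussed):

```r
## Pooled within- (W_k) and between-cluster (B_k) sums of squares for a
## data matrix X (n x p) and a vector of cluster labels cl (length n).
pooled_ss <- function(X, cl) {
  X  <- as.matrix(X)
  cl <- as.character(cl)
  centers <- rowsum(X, cl) / as.vector(table(cl))  # k x p matrix of cluster means
  fitted  <- centers[cl, , drop = FALSE]           # each observation's cluster mean
  W <- crossprod(X - fitted)                       # within-cluster SS matrix
  B <- crossprod(sweep(fitted, 2, colMeans(X)))    # between-cluster SS matrix
  c(Wk = sum(diag(W)), Bk = sum(diag(B)))
}
```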

The following methods, proposed to estimate the optimal number of clusters in a dataset, are compared. They are identified by the name of the algorithm that implements them or by the initials of their authors.

CH. Calinski and Harabasz [4] based the selection of the number of clusters on the maximization of the between/within-cluster sums of squares ratio. The criterion is to choose the $k$ that maximizes

$$\mathrm{CH}(k) = \frac{B_k/(k-1)}{W_k/(n-k)}.$$
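A minimal sketch of this rule, reusing pooled_ss from above and assuming, for illustration, that the partitions come from cutting an average-linkage tree:

```r
## CH index over k = 2..K for an average-linkage hierarchical clustering.
ch_index <- function(X, K = 10) {
  hc <- hclust(dist(X), method = "average")
  n  <- nrow(X)
  sapply(2:K, function(k) {
    ss <- pooled_ss(X, cutree(hc, k))
    unname((ss["Bk"] / (k - 1)) / (ss["Wk"] / (n - k)))
  })
}
## estimated number of clusters: (2:10)[which.max(ch_index(X))]
```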

H. Hartigan [5] used the ratio between the within-cluster sums of squares of $k$ and $(k+1)$ clusters, suggesting the selection of $k$ as the minimum number of clusters for which the following statistic is less than 10:

$$H(k) = \left(\frac{W_k}{W_{k+1}} - 1\right)(n - k - 1).$$
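Under the same assumptions (an average-linkage tree and the pooled_ss helper above), Hartigan's rule can be sketched as:

```r
## Smallest k for which H(k) = (W_k / W_{k+1} - 1) * (n - k - 1) < 10.
hartigan_k <- function(X, K = 10) {
  hc <- hclust(dist(X), method = "average")
  n  <- nrow(X)
  Wk <- sapply(1:(K + 1), function(k) pooled_ss(X, cutree(hc, k))["Wk"])
  H  <- (Wk[1:K] / Wk[2:(K + 1)] - 1) * (n - (1:K) - 1)
  which(H < 10)[1]
}
```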

CCC. Sarle [6] introduced the cubic clustering criterion, based on the scaled $\ln[(1 - E(R^2))/(1 - R^2)]$, where $R^2$ is the proportion of variance accounted for by the clusters and $E(R^2)$ is the expected value of $R^2$ assuming that the data are uniformly distributed on a hypercube:

$$\mathrm{CCC} = \ln\left[\frac{1 - E(R^2)}{1 - R^2}\right]\frac{\sqrt{np^*/2}}{(0.001 + E(R^2))^{1.2}},$$

where $p^*$ is the dimensionality of the between-cluster variation. The criterion is to select the number of clusters that maximizes CCC. A maximum CCC value of less than 2 indicates that there is no evidence of the existence of clusters in the dataset.

CCCm. We also included a modified version of CCC for which the expected value of $R^2$ is calculated from a null distribution defined as a uniform distribution over a box aligned with the principal components of the data, as proposed by Tibshirani et al. [8] for the Gap statistic.
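The only ingredient that changes with respect to CCC is the reference distribution. A rough sketch of a Monte Carlo estimate of $E(R^2)$ under the PCA-aligned uniform null (our simplification; the full CCC scaling then follows Sarle [6]):

```r
## Estimate E(R^2) for k clusters by sampling uniform data over a box
## aligned with the principal components of X (B null replicates).
null_R2 <- function(X, k, B = 50) {
  X   <- scale(as.matrix(X), scale = FALSE)  # center the data
  V   <- svd(X)$v                            # principal axes
  rng <- apply(X %*% V, 2, range)            # box limits on the rotated scale
  mean(replicate(B, {
    Z  <- apply(rng, 2, function(r) runif(nrow(X), r[1], r[2]))
    Xs <- Z %*% t(V)                         # rotate the uniform sample back
    ss <- pooled_ss(Xs, cutree(hclust(dist(Xs), "average"), k))
    ss["Bk"] / (ss["Wk"] + ss["Bk"])         # R^2 of the null replicate
  }))
}
```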

Silh. Kaufman and Rousseeuw [7] introduced the Silhouette statistic, a measurement calculated for each observation, based on a standardized difference between the average distance $a(i)$ of the $i$th observation to the others in the same cluster and the average distance $b(i)$ to the observations in the nearest cluster:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}.$$

They proposed choosing the optimal number of clusters as the value maximizing the average silhouette. It is implemented in the function silcheck of the hopach R-package [13].
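For instance, with the cluster R-package (an alternative to silcheck; the function name below is ours):

```r
## Average silhouette width over k = 2..K; pick the maximizing k.
library(cluster)
best_k_sil <- function(X, K = 10) {
  d  <- dist(X)
  hc <- hclust(d, method = "average")
  s  <- sapply(2:K, function(k) mean(silhouette(cutree(hc, k), d)[, "sil_width"]))
  (2:K)[which.max(s)]
}
```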

Gap. Tibshirani et al. [8] used the gap statistic for estimating the number of clusters in a dataset. The gap statistic compares the log of the within-cluster sum of squares against its expected value under a suitable null distribution of the dataset:

$$\mathrm{Gap}(k) = E^*[\log(W_k)] - \log(W_k).$$

The criterion to select the number of clusters is, in this case, the smallest $k$ such that $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$, where $s_{k+1}$ is the standard deviation of the prediction of the gap when the number of clusters is $k+1$. For the calculation of $E^*[\log(W_k)]$, Tibshirani et al. [8] generate a null distribution from a uniform distribution over a box aligned with the principal components of the data. The authors argued that this way of generating the data takes into account the shape of the distribution of the original observations and makes the procedure rotationally invariant, as long as the clustering method itself is invariant.
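The gap statistic is implemented, for example, in cluster::clusGap; a sketch of its use with an average-linkage clusterer and the one-standard-error rule of [8] (assuming X is the matrix being clustered):

```r
library(cluster)
## FUNcluster must take (x, k) and return a list with a 'cluster' component.
hclus_fun <- function(x, k) list(cluster = cutree(hclust(dist(x), "average"), k))
gs    <- clusGap(as.matrix(X), FUNcluster = hclus_fun, K.max = 10, B = 100)
k_hat <- maxSE(gs$Tab[, "gap"], gs$Tab[, "SE.sim"], method = "Tibs2001SEmax")
```

clusGap's default null space ("scaledPCA") corresponds to the PCA-aligned box described above.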

HOPACH. Pollard and van der Laan [2] proposed another application of the Silhouette statistic: the Hierarchical Ordered Partitioning and Collapsing Hybrid (HOPACH) procedure, which iteratively applies a partitioning algorithm to produce a hierarchical tree of clusters. It is implemented in the hopach function of the hopach R-package [13]. At each node, a cluster is partitioned into two or more smaller clusters and, before the next partitioning step, any similar clusters are merged. The algorithm estimates the optimal number of clusters based on the median split silhouette (MSS) criterion. The function can be called passing to it either a data frame or a distance matrix. We tried both: passing the data frame of average gene expressions (HOPACHc) and the Mahalanobis distance matrix (HOPACHm). All other arguments were left at their default settings.
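A sketch of the two calls (meanGEM and mahalanobisDist stand for the genes × treatments matrix of mean expressions and the precomputed distance matrix; the output slot holding the estimated number of clusters may vary with the package version):

```r
library(hopach)
res_c <- hopach(meanGEM)                          # HOPACHc: data frame input
res_m <- hopach(meanGEM, dmat = mahalanobisDist)  # HOPACHm: precomputed distances
res_c$clustering$k                                # estimated number of clusters
```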

MClust. Fraley and Raftery [9, 10] proposed a clustering method based on the assumption that the dataset is a sample from a multivariate normal mixture. The method fits a number of models for a number of populations differing not only in the location parameters but also in the variance-covariance matrix, within a set of plausible simplified correlation structures. The model selection rule is based on the Bayesian Information Criterion. The estimated number of clusters in the data is obtained as part of the output. The input to this method is the matrix of average expression levels for each gene (rows) under the different experimental conditions (columns). The routine is implemented in R (mclust package). Within our simulation, it was called without any arguments other than the data frame.
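A minimal sketch of the call (meanGEM as above):

```r
library(mclust)
fit <- Mclust(meanGEM)  # BIC-based selection over models and cluster numbers
fit$G                   # estimated number of clusters
```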

gDGC. Valdano and Di Rienzo [11] calculated a cutting point for a dendrogram, generated by a given linkage algorithm, based on the null distribution of the root node of the binary tree produced by the clustering procedure. The node at which two mean vectors (or clusters of them) join has an associated measure that corresponds to the distance, calculated according to the linkage algorithm, between the mean vectors or clusters that the node joins. The node at which all mean vectors join to form a unique cluster is the root node. In the UPGMA algorithm, if $C_1$ and $C_2$ are two different clusters, the distance between them is defined as follows:

$$d(C_1, C_2) = \frac{1}{|C_1||C_2|}\sum_{x \in C_1}\sum_{y \in C_2} d(x, y),$$

where $d(x, y)$ is the square root of the Mahalanobis distance between the mean vectors $x$ and $y$. If $x$ and $y$ are coincident, then $d(x, y) = 0$.

The smallest value of $d$ will correspond to the pair of most similar mean vectors, and the node that is formed will be at that distance from the origin. The next distance is associated with the next node, which can join either two different mean vectors or the cluster previously formed and another mean vector. At the end of the clustering algorithm, the last union occurs at the greatest distance, referred to as the distance to the root node (Figure 1). This distance can be seen as a realization of a random variable $Q$. The $(1-\alpha)$-quantile of its distribution under the null hypothesis of equal population mean vectors can be used to construct a test of size $\alpha$. Given $q_{1-\alpha}$, the $\alpha$-level critical value, any observed root-node distance greater than $q_{1-\alpha}$ will lead to the rejection of the null hypothesis. An R routine that calculates critical points of the null distribution of $Q$ is freely available for download at http://agro.uncor.edu/~estad/gDGCQ.r. A friendly implementation of gDGC for its application to a gene-expression matrix can be found in the free software fgStatistics (http://sites.google.com/site/fgstatistics/).
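The authors' routine linked above computes these critical points; as a rough illustration of the idea only, under the simplifying assumptions of a known pooled covariance matrix S and multivariate normality, the null distribution of $Q$ could be approximated by Monte Carlo as follows:

```r
## (1 - alpha) quantile of the root-node height Q under H0: all gene mean
## vectors equal. g genes, t treatments, r replicates, S pooled covariance.
q_crit <- function(g, t, r, S, alpha = 0.05, B = 500) {
  Sinv <- solve(S)
  root <- replicate(B, {
    M  <- matrix(rnorm(g * t), g) %*% chol(S / r)   # simulated mean vectors
    D2 <- apply(M, 1, function(a) apply(M, 1, function(b)
             t(a - b) %*% Sinv %*% (a - b)))        # squared Mahalanobis distances
    max(hclust(as.dist(sqrt(D2)), method = "average")$height)  # Q: root height
  })
  quantile(root, 1 - alpha)  # cut the observed UPGMA dendrogram at this height
}
```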

3. Simulated Data

The primary output of a microarray experiment is the gene-expression matrix (GEM). It is composed of $g$ rows and $m$ columns, where $g$ is the number of "genes" evaluated and $m$ is the number of microarrays (treatments × replicates) used in the experiment. Usually $g$ is much bigger than $m$ and varies from hundreds to tens of thousands. Candidate genes are those genes that are differentially expressed among "treatments". The set of candidate genes is smaller than the original set of genes, with a size on the order of tens to hundreds of genes. This drastic reduction in the number of genes relies on the assumption that most of them remain unchanged under the experimental conditions contrasted.

To simulate the candidate-gene-expression matrices we considered two scenarios for the number of differentially expressed genes (100 and 300 genes), two levels for the number of clusters of genes with a similar profile among treatments (2 and 10), two levels for the number of treatment conditions (3 and 5), and two levels for the number of replicates (3 and 6). These numbers of genes, clusters, treatments, and replicates do not intend to cover all possibilities but represent common cases in microarray experiments. Crossing the number of differentially expressed genes (2), the number of clusters (2), the number of treatments (2), and the number of replicates (2), 16 scenarios were considered. For each scenario, 10 simulated candidate-gene-expression matrices (sGEM) were randomly generated.

Each sGEM was generated from the GEM of a self-self cDNA-microarray experiment dataset [14] according to Algorithm A1 described in the appendix. The algorithm relies on the availability of a residual gene-expression matrix (rGEM). This residual matrix was obtained from the GEM of the self-self experiment (s-sGEM) by centering by rows and columns and adding to each entry the mean of all entries. This way of obtaining an rGEM assumes that each entry $y_{ij}$ of the s-sGEM can be modeled as $y_{ij} = \mu + g_i + a_j + \varepsilon_{ij}$, where $\mu$ is a common mean, $g_i$ is the effect of the $i$th gene, $a_j$ is the $j$th microarray's effect, and $\varepsilon_{ij}$ is a random error with zero mean. The resulting rGEM was a 3830 (rows) by 10 (columns) matrix and is available at the following link: https://docs.google.com/leaf?id=0BxMg4dIPlsq7MzhhMGNjNzMtNGUwYS00NmYzLWI0NDctZjZlNTFiYTEzYWZm&hl=es.

Clusters were generated by randomly allocating the number of genes belonging to each cluster according to Algorithm A2 described in the appendix. A randomly generated profile of treatment effects, scaled by the common within standard deviation of each gene, was added to every gene in the same cluster. The nonscaled profile was generated uniformly between −3 and 3. In this way, differences among treatment means ranged between −3 and 3 times the common within standard deviation for a given gene.

To summarize the performance of the methods in assessing the optimal number of clusters, a linear model was fitted to the difference between the number of clusters in the dataset and its estimate. Hereafter, we will refer to this difference as the bias. The factors included in the model were the following: the method used to estimate the number of clusters (method), the true number of clusters ($k$), the number of genes ($G$), the number of treatments ($T$), and the number of replicates ($R$). Because each method was applied to the same simulated data, a dataset effect was included in the model as a random effect. Due to the number of terms involved in the adjusted model (main effects and their interactions), the Benjamini-Hochberg algorithm [15] was applied to adjust the raw $p$ values in order to control the false discovery rate. The significance level was 0.05. For the significant terms of the model, confidence intervals were calculated for their marginal means. The mixed model was fitted using the lme function (nlme R-package).
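A sketch of this model fit (the data frame and variable names are ours; the per-method variance function anticipates the heteroscedasticity reported in the Results):

```r
library(nlme)
## bias = true k minus estimated k; one row per method x simulated dataset
fit <- lme(bias ~ method * k * genes * treatments * replicates,
           random  = ~ 1 | dataset, data = sim,
           weights = varIdent(form = ~ 1 | method))  # per-method residual variance
tab   <- anova(fit)
p_adj <- p.adjust(tab[["p-value"]], method = "BH")   # FDR-adjusted p values
```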

4. Results

All the methods compared in this study (except MClust) can be applied to the same distance matrix used by the clustering algorithm. Because the gDGC method uses the Mahalanobis distance to measure the dissimilarity between mean vectors (genes), we decided to base our comparison on this matrix. The Mahalanobis distance is a convenient metric because it takes into account the variances and covariances of the attributes. The covariance matrix used to calculate the Mahalanobis distance is the common (pooled) within-gene covariance matrix.
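A sketch of this distance matrix (M is the genes × treatments matrix of mean expressions and S the pooled within-gene covariance; this is the same metric used in the gDGC sketch above):

```r
## Pairwise Mahalanobis distances between gene mean vectors.
mahal_dist <- function(M, S) {
  Sinv <- solve(S)
  D2 <- apply(M, 1, function(a) apply(M, 1, function(b)
          t(a - b) %*% Sinv %*% (a - b)))
  as.dist(sqrt(D2))
}
## e.g. hclust(mahal_dist(meanGEM, S_pooled), method = "average")
```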

Different linkage algorithms are analyzed separately. First we present the results for average linkage, then the results for complete linkage. Ward's algorithm was also included in the comparison, but its results are not shown because of its poor performance. Although MClust does not depend on the linkage algorithm, it appears in the comparisons under both the Average Linkage and Complete Linkage subsections.

4.1. Average Linkage

Table 1 summarizes the ANOVA for the fitted model when the true number of clusters ($k$) in the dataset is 2 and 10. In both cases, the best model included a variance function to take into account that the residual variance was much greater for HOPACH than for the other procedures. The residual standard deviation of HOPACHm and HOPACHc was around 10 ($k = 2$) to 12 ($k = 10$) times the common standard deviation of the other procedures.

Table 1 shows evidence of differences in the mean bias among the methods compared. These differences do not depend on the other factors when $k = 2$. Table 2 summarizes the performance of the methods in that case. It shows that no matter which input is used for the HOPACH method (the Mahalanobis distance matrix or the mean GEM), it produces the highest bias, about seven clusters above the true value. On the other hand, only CCCm, CCC, gDGC, and MClust had confidence intervals compatible with the unbiasedness hypothesis.

When the case of a moderate number of clusters ($k = 10$) is considered, the performance of the methods depends on the number of treatments. Table 3 shows the mean bias by method, grouped according to the number of treatments. The ranking of the methods is almost the same for $T = 3$ and $T = 5$. HOPACH overestimated whereas all other methods underestimated the number of clusters. However, as the number of treatments increases ($T = 5$), a differentiation in favour of gDGC and MClust becomes apparent.

4.2. Complete Linkage

Table 4 summarizes the ANOVA for the fitted model when the true number of clusters ($k$) in the dataset is 2 and 10. In both cases the best model included a variance function to take into account that the residual variance was much greater for HOPACH than for the other procedures. The residual standard deviation of HOPACHm and HOPACHc was around 10 ($k = 2$) to 13 ($k = 10$) times the common standard deviation of the other procedures.

When the true number of clusters in the dataset was 2, the highest-order interaction terms including method were method × $G$ and method × $T$. The means and 95% confidence intervals of the bias for each combination of method and number of genes and of method and number of treatments are shown in Tables 5 and 6, respectively.

As a general remark, the increase in the number of genes is followed by a decrease in the bias. However, there are important differences among methods depending on $G$. For $G = 100$, the methods which produced estimates compatible with the unbiasedness hypothesis were CCC, CCCm, gDGC, and MClust. For $G = 300$, those methods were HOPACHc, CH, gDGC, Silh, CCC, CCCm, and MClust.

Considering the results shown in Table 6, the increase in the number of treatments is followed by a decrease in bias. As in the previous case, there are differences in the performance of the methods depending on $T$. However, no matter the value of $T$, HOPACH always overestimated the number of clusters. On the side of the best-performing methods, the list contains Silh, CCC, CCCm, gDGC, and MClust. When $T = 5$, the previous list is augmented with CH.

When the true number of clusters in the dataset was 10, the highest-order interaction term including method was method × $G$ × $T$, which logically includes the also significant method × $G$ and method × $T$ interaction terms. There is also a significant second-order interaction given by method × $R$. The means and 95% confidence intervals of the bias for the combinations of method and number of replicates are shown in Table 7. The corresponding table for the combinations of method, number of genes, and number of treatments is shown in Table 8.

As a general remark, when the true number of clusters increases to a moderate number ($k = 10$), all methods underestimated the number of clusters, except HOPACH, which consistently overestimated it.

Although a significant interaction was found between the method and the number of replicates, Table 7 shows that there is no change in the ordering of the methods for either value of $R$. Gap, H, gDGC, and MClust were the least negatively biased methods.

To analyze the performance of the methods with regard to the method × $G$ × $T$ interaction term, Table 8 is divided into four blocks defined by the combinations of levels of $G$ and $T$. Within these blocks, methods were sorted in descending order of bias. Taking into account the bias in the four blocks, there are four methods that always have the least bias: Gap, H, gDGC, and MClust. Although their order changes in each block, the picture is the same. As in the other cases analyzed, HOPACH always overestimated, by far, the number of clusters in the dataset.

5. Discussion

Two scenarios were considered in this work: when the true number of clusters is very small and when it is moderate. In the first case (i.e., $k = 2$), some methods estimated the true number of clusters quite well, no matter the linkage algorithm. These methods were CCC, CCCm, gDGC, and MClust. Meanwhile, all the other methods produced overestimates. Within this group of methods, HOPACH (based either on the Mahalanobis distance or on the average gene-expression matrix) was, by far, the most biased.

The case in which the number of clusters is small is not the most challenging situation, because most clustering methods find the global structure. Moreover, in most microarray experiments the number of clusters will be greater than two. The problem is finding relatively small clusters in the presence of one or more larger clusters [2]. For a moderate number of clusters, such as 10, all methods gave negatively biased estimates of the number of clusters, except HOPACHm and HOPACHc, which were positively biased and very variable. The positive bias of HOPACH is consistent with the properties of the median split silhouette (MSS) criterion, which was developed to be more "aggressive" for finding small, homogeneous clusters in large datasets [13].

Within the negatively biased methods, and according to the simulated scenarios, the results for the average and complete linkage algorithms suggest that the least biased methods for assessing the number of clusters were MClust, gDGC, and Gap. However, considering all the scenarios, there are two methods that consistently appeared among the best groups: gDGC and MClust.

One disadvantage of gDGC compared to MClust is that it relies on the availability of replicates. However, in actual microarray applications there are always biological replicates, so in this context that limitation is not a problem. Although gDGC is based on the null distribution of the root node of a binary tree generated by a hierarchical clustering algorithm, and MClust is based on the modeling of a multivariate normal mixture, both are theoretically related. Their null model is that there is just one multivariate normal population. For this reason, both can give one cluster as a result. When the null model fails, MClust assumes that the dataset is a mixture of samples from several multivariate normal populations differing in their mean vectors and possibly in their covariance matrices, with the number of populations being a parameter to estimate. gDGC also assumes that, if the dataset is not a sample from a unique multivariate population, then it is a mixture of samples from several multivariate normal populations. In contrast to MClust, gDGC makes the simplifying assumption that there is a common covariance matrix, as in MANOVA. Nonetheless, an advantage of gDGC is that it can drop the assumption of multivariate normality and resample from an empirically estimated null distribution, at additional computational cost, of course.

Another point in favour of gDGC is that it is tied to a dendrogram, a common way to illustrate relationships among genes (e.g., in heatmaps). So, gDGC not only estimates the number of groups of genes having the same expression profile but also shows them using the intuitive idea of cutting a dendrogram, making its interpretation straightforward. Because the rule to cut the dendrogram is based on a hypothesis-testing framework, it allows the user to calibrate the power of the test by selecting the significance level of his/her choice.

gDGC and MClust are computer-intensive methods, and users will have to face the time cost of their implementations. However, it is possible, for common setups of the number of genes, replicates, number of treatments, and linkage algorithm, to speed up the gDGC algorithm by precalculating the appropriate percentile tables of the null distribution of its decision statistic.

In summary, among the methods compared in this study, there are no unbiased methods for estimating the number of clusters in a gene-expression matrix. However, within the negatively biased methods, MClust and gDGC are the best choices.

Appendix

Algorithm A1. For the generation of a simulated gene-expression matrix (sGEM), one has the following steps (an R transcription is given below).

(1) Initialize groups = number of treatments, genes = number of genes, and replicates = number of replicates, and set the indices $i$, $j$, and $k$ to 0.
(2) Let $i = i + 1$.
(3) Randomly choose an index $l$ that points to a row of the rGEM (residual matrix of gene expressions).
(4) Initialize $k = 0$.
(5) Let $j = 0$.
(6) Let $j = j + 1$.
(7) Randomly choose an index $m$ that points to a column of the rGEM.
(8) Set $k = k + 1$, which points to the columns of the sGEM (simulated matrix of gene expressions).
(9) Let $\mathrm{sGEM}[i, k] = \mathrm{rGEM}[l, m] + T[j, i]$ ($T$ is the matrix of treatment effects).
(10) If $k < j \cdot \mathrm{replicates}$, go to 7; else if $j < \mathrm{groups}$, go to 6; else if $i < \mathrm{genes}$, go to 2; else end.
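Read literally, Algorithm A1 amounts to the following R loop (our transcription of the reconstructed steps; Teff is the groups × genes matrix of treatment effects):

```r
## Build a genes x (groups * replicates) sGEM from residuals plus effects.
simulate_sGEM <- function(rGEM, Teff, genes, groups, replicates) {
  sGEM <- matrix(NA_real_, genes, groups * replicates)
  for (i in 1:genes) {
    l <- sample(nrow(rGEM), 1)                # step 3: random row of rGEM
    k <- 0                                    # step 4: column pointer of sGEM
    for (j in 1:groups) {                     # steps 5-6
      for (s in 1:replicates) {
        m <- sample(ncol(rGEM), 1)            # step 7: random column of rGEM
        k <- k + 1                            # step 8
        sGEM[i, k] <- rGEM[l, m] + Teff[j, i] # step 9: residual + treatment effect
      }
    }
  }
  sGEM
}
```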

Algorithm A2. Genes belonging to each cluster will be generated by randomly assigning rows from the rGEM according to the following algorithm.

(1) Let $k$ be the number of clusters to construct, $n$ the total number of candidate genes, and $L$ a list of integers of size $n$ indexing the genes.
(2) Initialize $L$ with the sequence $1, \ldots, n$.
(3) "Order at random" the elements of $L$.
(4) Select a set of $(k - 1)$ indexes between 1 and $n$. If some of these $(k - 1)$ indexes are equal, find another set of indexes.
(5) Sort the $(k - 1)$ indexes in ascending order: $c_1 < c_2 < \cdots < c_{k-1}$.
(6) $L[1]$ is the index of the first and $L[c_1]$ the index of the last gene belonging to cluster 1, and so on, until $L[c_{k-1} + 1]$ is the index of the first and $L[n]$ the index of the last gene belonging to cluster $k$.
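Equivalently, in R (our transcription; sampling the cut points without replacement enforces the distinctness requirement of step (4)):

```r
## Partition n genes into k clusters with random sizes (Algorithm A2).
assign_clusters <- function(n, k) {
  L     <- sample(n)                       # steps 2-3: random permutation of 1..n
  cuts  <- sort(sample(1:(n - 1), k - 1))  # steps 4-5: distinct, ascending cut points
  sizes <- diff(c(0, cuts, n))             # resulting cluster sizes
  split(L, rep(1:k, sizes))                # step 6: gene indices for clusters 1..k
}
```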

Acknowledgments

This work was partially supported by FONCyT-PICT 2005 Grant 32905 and PE INTA Grant AEBIO5471. The authors are grateful to Dr. Ruth A. Heinz for critical reading of this manuscript.