Abstract

Random selection of initial centroids (centers) is a fundamental defect of the K-means clustering algorithm, as the algorithm's performance depends on the initial centroids and it may end up in local optima. Various hybrid methods have been introduced to resolve this defect. Since no comparative study has yet compared these methods from various aspects, the present paper compared three hybrid versions of the K-means clustering algorithm based on the genetic algorithm, the minimum spanning tree, and the hierarchical clustering method. Although these three hybrid methods have received considerable attention in previous research, few studies have compared their results. Hence, seven quantitative datasets with different characteristics in terms of sample size, number of features, and number of classes were utilized in the present study. Eleven external and internal evaluation indices were also considered for comparing the methods. The data indicated that the hybrid methods converge to the final solution faster than the ordinary K-means method. Furthermore, the hybrid method based on hierarchical clustering converges to the optimal solution with fewer iterations than the other two hybrid methods. However, the hybrid methods based on the minimum spanning tree and the genetic algorithm are not always or even often more effective than the ordinary K-means method. Therefore, despite their computational complexity, these three hybrid methods did not lead to much improvement over the K-means method. However, a simulation study is required to compare the methods and complete the conclusion.

1. Introduction

Clustering is a branch of unsupervised learning. It is widely used as a first step to interpret data: samples are divided into groups whose members are similar to each other [1]. A good clustering algorithm should be efficient, reliable, and capable of determining relevant clusters [2]. The four well-known branches of crisp clustering are distribution-based, density-based, connection-based, and partition-based methods, represented by the EM (expectation-maximization) algorithm, DBScan (density-based spatial clustering of applications with noise), hierarchical clustering, and the K-means method, respectively [1]. Of course, there are other categories of clustering methods, such as fuzzy clustering algorithms (e.g., fuzzy C-means), which are beyond the scope of the present research.

K-means clustering is an important and popular technique in data mining. It is a partition-based clustering algorithm that starts from randomly selected points as the initial centroids (centers) and then updates these centroids in an iterative process until a convergence criterion is met. The simplicity of the K-means clustering method makes it a basic and popular method in different fields of research. An important property of this method is that it works well even when the clusters overlap [3]. It also works with high-volume data. However, the more clusters there are, the more likely K-means is to fail to find all clusters correctly [3]. In addition, the clusters created by this method are spherical and convex, its performance depends on the initial cluster centroids, and it often ends in a local optimum [3, 4]. To solve these problems, different hybrid methods have been proposed [5–25]. Some of them try to solve K-means problems by different methods [5–10, 12–20], and others use the simplicity of the K-means method to improve the performance of other clustering methods [11, 21–24]. The present paper evaluated the results of three well-known hybrid K-means methods, based on the minimum spanning tree (MST) [5], the genetic algorithm (GA) [6], and hierarchical clustering [7], on different datasets. The genetic algorithm is a good option for resolving the local-optimum problem of K-means and provides proper initial cluster centers [25]. Clustering based on MSTs is known for handling irregular cluster boundaries and for outlier detection [23]. MST-based clustering techniques have been widely used for efficient clustering [23, 24]. The combination of partition-based and hierarchical clustering methods may also strengthen both approaches and discard their disadvantages [7].

Meanwhile, an important task in cluster analysis is evaluating the results of a clustering method or comparing them to another clustering result. Many different validity measures have been proposed in the literature [26]. Among these, we applied eleven validity indices (internal, external, and relative) to judge and compare the results of the clustering methods. The analysis was therefore performed in two phases. In phase I, to investigate whether K-means is a proper clustering method for each dataset, the EM, DBScan, hierarchical, and K-means clustering methods were applied first. Then, in phase II, the three hybrid methods were tested on each dataset and compared with the results of phase I.

Accordingly, the organization of this paper is as follows. The ordinary K-means algorithm is briefly reviewed along with the three hybrid methods in Section 2, where the seven Internet datasets utilized in the present study are also introduced. In Section 3, four ordinary clustering algorithms (the K-means, hierarchical, DBScan, and EM algorithms) together with the three hybrid methods (the MST-based, GA-based, and hierarchical-based K-means methods) are applied to each dataset, and the results of eleven different external and internal evaluation indices are reported for comparison. Section 4 contains some discussion comparing these methods.

2. Materials and Methods

2.1. Materials

All of the hybrid methods considered in the present paper, despite their different underlying theories, aim to improve the K-means clustering method by eliminating the defect of randomly selected initial centroids. Their behavior can be influenced by various factors, such as the number of variables (features) in the dataset, the sample size, and even the number of labels (classes) in the data, and they can exhibit quite different results. Since these hybrid methods have not yet been compared on common datasets, seven web datasets with different characteristics were used to investigate their performance in the present paper. The data included three gene expression datasets relating to leukemia, prostate, and colon cancers; these are high-dimensional data with expression values for more than 20,000 genes and were downloaded from the Gene Expression Omnibus (GEO) database (Table 1). The other four datasets are well-known standard Internet data appropriate for clustering methods and have been used in many applied papers to measure the performance of clustering algorithms (for instance, [5, 6]). These data are available to all researchers for scientific research in the UCI (University of California Irvine) Center for Machine Learning and Intelligent Systems database (Table 1).

2.2. Methods
2.2.1. K-Means Clustering Method

The basic idea in K-means clustering is to define clusters such that the total within-cluster variation is minimized. There are many algorithms for the K-means clustering method; MacQueen's algorithm was used in the present paper [27], which defines the total within-cluster variation as the sum of the squared Euclidean distances between objects and centroids. Let $X = \{x_i\}$, $i = 1, \dots, n$, be the set of $n$ $d$-dimensional observations (points) to be clustered into a set of $K$ clusters, $C = \{c_k,\ k = 1, \dots, K\}$. The K-means algorithm finds a partition such that the squared error between the center (empirical mean) of a cluster and the points in the cluster is minimized. Let $\mu_k$ be the mean of cluster $c_k$. The sum of squared error (SSE) between $\mu_k$ and the points in cluster $c_k$ is defined as [28]

$$J(c_k) = \sum_{x_i \in c_k} \lVert x_i - \mu_k \rVert^2 . \quad (1)$$

The goal of K-means is to minimize the sum of the squared error over all $K$ clusters,

$$J(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \lVert x_i - \mu_k \rVert^2 . \quad (2)$$

In general, the algorithmic steps of this method are summarized as follows (Figure 1), with a minimal code sketch after the list:

(1) Initial cluster centroids are selected randomly from the observations.
(2) The distance between each observation and each cluster centroid is calculated, and the observation is assigned to the cluster whose centroid is closest.
(3) Cluster centroids are updated by averaging the observations contained in each cluster.
(4) The distance between each observation and the new cluster centroids is recalculated, and observations are reassigned to clusters based on the minimum distance to the centroids.
(5) Steps 3 and 4 are repeated until the cluster centroids no longer change and convergence occurs.
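To make these steps concrete, the following is a minimal NumPy sketch of the batch update described above; the function and variable names are our own illustration and are not taken from refs. [27, 28].

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal batch K-means following steps 1-5 above.

    X: (n, d) array of observations; k: number of clusters.
    Returns (centroids, labels, iterations used).
    """
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for it in range(1, max_iter + 1):
        # Steps 2 and 4: assign each observation to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            return new_centroids, labels, it
        centroids = new_centroids
    return centroids, labels, max_iter
```

In practice, library implementations such as scikit-learn's KMeans are preferable; the point here is only to show how the convergence loop depends on the randomly chosen starting centroids.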

2.2.2. Combination of K-Means Clustering Algorithm with Minimum Spanning Tree Method

Minimum spanning trees (MSTs) have been applied in data mining, pattern recognition, and machine learning for a long time [3]. MST-based clustering techniques usually lead to efficient clustering [23, 24]. Indeed, these hybrid clustering methods can identify clusters of arbitrary shape by removing inconsistent edges and can detect clusters of a heterogeneous nature. The MST-based clustering algorithm was proposed by Zahn [23], and several studies have since been conducted to improve it (e.g., [5, 23, 24]). Here, the MST is utilized as a preanalysis step to find the initial centroids for the K-means algorithm [5].

In graph theory terms, a dataset can be represented by a complete graph $G = (V, E)$, in which the number of vertices equals the number of points in the dataset. The weight of the edge between two vertices is the Euclidean distance between the corresponding points, computed from their feature (variable) vectors.

A tree is an undirected connected graph that does not contain any cycle. A spanning tree is a subgraph of a complete weighted graph that has all the properties of a tree and contains all the vertices of that graph. Among all spanning trees of a complete weighted graph, the minimum spanning tree is the one with the least total weight. In the present study, we followed the idea introduced by Yang et al. [5] and used the MST to initialize the K-means clustering algorithm.

Accordingly, the MST-based K-means clustering algorithm applied in the present study is as follows [5] (a simplified code sketch is given after Figure 2):

(1) The number of points ($n$ observations) and the number of clusters ($K$) are entered as the input parameters.
(2) The MST is generated using Prim's algorithm.
(3) The set $S$ is created, containing the skeleton points through which the most edges pass (the number of edges incident to a point is known as its degree). The members of $S$ are points that satisfy some specific criteria (see ref. [5] for details) and are important candidates for cluster centroids in the first stage of K-means clustering.
(4) The distance between any two skeleton points of the set $S$ is calculated (Equation (3)), where $d_i$ and $d_j$ are the degrees of skeleton points $s_i$ and $s_j$, respectively.
(5) The skeleton point with the highest degree is selected and entered into the set of initial centroids.
(6) The remaining skeleton points of $S$ that satisfy Equation (4) are added to this set one at a time.

Step 6 is repeated until the number of initial cluster centroids equals $K$.

Figure 2 describes this process in a flowchart.
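As an illustration of this initialization scheme, the sketch below uses SciPy's minimum spanning tree routine and selects the highest-degree MST vertices as initial centroids. The skeleton-point criteria of Yang et al. [5] (Equations (3) and (4)) are simplified here to a plain degree ranking, so this should be read as a sketch of the general idea rather than a faithful reimplementation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_init(X, k):
    """Pick k initial centroids from the highest-degree vertices of the MST.

    Simplified illustration: the degree/distance criteria of Equations
    (3)-(4) are replaced by a plain degree ranking.
    """
    D = squareform(pdist(X))            # complete weighted graph (Euclidean)
    mst = minimum_spanning_tree(D)      # sparse (n, n) MST, one direction per edge
    adj = (mst + mst.T).toarray() > 0   # symmetric adjacency of the MST
    degree = adj.sum(axis=1)            # degree of each vertex in the MST
    order = np.argsort(-degree)         # skeleton candidates, highest degree first
    return X[order[:k]]
```

The returned rows can then be passed to K-means as its starting centroids, for example via scikit-learn's KMeans(n_clusters=k, init=mst_init(X, k), n_init=1).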

2.2.3. Combination of K-Means Clustering with Genetic Algorithm

Genetic algorithms (GAs) in cluster analysis are usually used to determine the number of clusters automatically and to find initial centroids for K-means clustering [16]. Indeed, the genetic algorithm is a good option for solving the local-minimum problem of K-means [25]. Usually, the simplicity of the K-means algorithm and the search ability of the genetic algorithm are combined to provide a GA-based clustering algorithm, which has even attracted the attention of researchers in the health sciences (e.g., [17–21]).

The genetic algorithm is inspired by genetics and Darwin's theory of evolution and is based on survival of the fittest, that is, natural selection. A common application of genetic algorithms is function optimization: inspired by the evolutionary process of nature, these algorithms create a population of candidate solutions and, by acting on this population, converge toward an optimal individual or set. The hybrid method used in the present paper is a version of the K-means algorithm combined with the genetic algorithm that effectively solves the problem of random selection of initial centroids; results of simulation tests confirmed this claim [11]. This algorithm preserves all important properties of the K-means method and is also more robust to data containing outliers. In general, the steps of the GA-based K-means clustering algorithm are as follows (see ref. [6] for details), with a compact code sketch after the list:

(1) The input parameters are determined, including the initial population size (number of chromosomes), the number of iterations (number of generations), the number of clusters, and the operator rates.
(2) Chromosomes are randomly selected to generate the initial population, where each chromosome is a set of initial cluster centroids, with the restriction that the centroids within a chromosome must not coincide.
(3) A target function is calculated for each chromosome, and the fitness value is computed from it.
(4) Crossover, selection, and mutation operators are applied to generate the next generation.
(5) If the number of generations produced is less than the number specified by the user, the algorithm returns to step 3; otherwise, it proceeds to step 6.
(6) The fitness is calculated for each chromosome of the last generation; the best fitness of this generation is compared with the best fitness obtained in previous generations, and the larger one is selected based on the estimator function.
(7) Finally, the initial centroids encoded by the best chromosome are used, as in step 2, as the initial centroids of the K-means clustering method (Figure 3).
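The compact sketch below illustrates this scheme under our own assumptions: each chromosome encodes $K$ distinct observation indices used as candidate centroids, the fitness is the negative SSE of Equation (2), and tournament selection, one-point crossover, and point mutation produce each new generation. The operators and rates are illustrative choices, not the exact ones of ref. [6].

```python
import numpy as np

def sse(X, idx):
    """SSE (Equation (2)) of the partition induced by centroids X[idx]."""
    c = X[idx]
    labels = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
    return sum(((X[labels == j] - c[j]) ** 2).sum() for j in range(len(idx)))

def ga_init(X, k, pop=20, gens=50, p_mut=0.1, seed=0):
    """Return k initial centroids found by a simple GA (steps 1-7 above)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Step 2: random population of chromosomes with distinct centroids.
    popu = [rng.choice(n, size=k, replace=False) for _ in range(pop)]
    best, best_fit = popu[0], -np.inf
    for _ in range(gens):                      # step 5: loop over generations
        # Step 3: fitness = negative SSE (larger is better).
        fit = np.array([-sse(X, ch) for ch in popu])
        if fit.max() > best_fit:
            best_fit, best = fit.max(), popu[int(fit.argmax())].copy()
        # Step 4: tournament selection, one-point crossover, point mutation.
        nxt = []
        while len(nxt) < pop:
            i = rng.choice(pop, size=2, replace=False)
            j = rng.choice(pop, size=2, replace=False)
            p1, p2 = popu[i[fit[i].argmax()]], popu[j[fit[j].argmax()]]
            cut = int(rng.integers(1, k)) if k > 1 else 0
            child = np.concatenate([p1[:cut], p2[cut:]])
            if rng.random() < p_mut:
                child[rng.integers(k)] = rng.integers(n)
            if len(set(child)) == k:           # keep centroids distinct
                nxt.append(child)
        popu = nxt
    # Step 6: also evaluate the final generation.
    fit = np.array([-sse(X, ch) for ch in popu])
    if fit.max() > best_fit:
        best = popu[int(fit.argmax())]
    return X[best]                             # step 7: initial centroids
```

These centroids are then passed to K-means exactly as in the MST-based sketch above.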

2.2.4. Combination of K-Means Clustering Algorithm with Hierarchical Clustering Method

The hierarchical method is the second most important crisp clustering method in microarray technology. In this method, clusters are formed by calculating the similarity or distance between each pair of elements [27]. The number of clusters is determined by the user based on the height at which the clusters merge. The weak point of hierarchical clustering is its termination, and the most important problem of K-means is its initiation [7]. Therefore, the combination of these two methods leads to a hybrid method with interesting characteristics. In the present study, an agglomerative hierarchical clustering algorithm is first applied to the dataset to obtain initial information (the initial cluster centroids), and then the K-means algorithm is applied.

The steps of the hierarchical-based K-means method are summarized as follows [7] (a minimal code sketch follows Figure 4):

(1) An agglomerative hierarchical clustering method is applied to the data, and the resulting tree is cut into $K$ clusters.
(2) The centroid (mean) of each cluster is calculated, and the set of these centroids is formed.
(3) The K-means algorithm is run with the set obtained in step 2 as the initial centroids.

Figure 4 summarizes this algorithm through a flowchart.
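A minimal sketch of these three steps with SciPy and scikit-learn is given below; Ward linkage is an illustrative assumption, as ref. [7] may prescribe a different linkage.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

def hier_kmeans(X, k):
    """K-means initialized with centroids from agglomerative clustering."""
    # Step 1: agglomerative clustering; cut the tree into k clusters.
    labels = fcluster(linkage(X, method="ward"), t=k, criterion="maxclust")
    # Step 2: centroid (mean) of each preliminary cluster.
    centers = np.array([X[labels == j].mean(axis=0) for j in range(1, k + 1)])
    # Step 3: K-means started from these centroids.
    return KMeans(n_clusters=k, init=centers, n_init=1).fit(X)
```

Because the starting centroids are deterministic, repeated runs of this hybrid give the same partition, unlike K-means with random starts.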

2.2.5. Validation of Clustering Methods

To evaluate the results of clustering algorithms, cluster validation methods are used. These methods guard against interpreting random patterns in the data and also allow different clustering algorithms to be compared. A good validity measure should be invariant to changes in sample size, cluster size, and number of clusters [26].

In general, clustering evaluation indices are classified into three categories: internal, external, and relative. Internal validity indices measure the compactness, connectedness, and separation of each cluster, while external validity indices measure how well the results of a clustering match the ground truth (if available) or another clustering method [26]. Relative validity methods, in comparison, are used to determine optimal input parameters by varying their values (for instance, the number of clusters in K-means) and also to compare clustering methods.

The silhouette criterion (Si), the Dunn index, and the hybrid robustness-performance trade-off (RPT) index were applied in the present study for internal evaluation. External validity methods can be categorized into pair-counting, information-theoretic, and set-matching measures. Pair-counting measures (such as the Rand index (RI) and adjusted Rand index (ARI) used in our research) are based on counting the pairs of objects in a dataset on which two different partitions agree or disagree. For instance, if two objects that share a cluster in the first partition also share a cluster in the second partition, this pair is counted as an agreement [26].

Information-theoretic indices such as mutual information (MI) measure the information that two clusterings share; the variation of information (VI), a simple linear expression of MI, is applied in the present study. Set-matching indices, such as the accuracy (AC), the F-measure, and Hubert's Γ index (HI) utilized here, are based on pairing similar clusters in two partitions.
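Several of these indices are available in standard libraries. For example, the snippet below computes the silhouette, RI, ARI, and MI for a K-means clustering of the iris data with scikit-learn; the Dunn, RPT, VI, F-measure, and Hubert's Γ indices are not built into scikit-learn and require separate implementations.

```python
from sklearn import datasets, metrics
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)   # standardize before clustering
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Internal index: silhouette (values near one indicate good clustering).
print("Si :", metrics.silhouette_score(X, labels))
# External pair-counting indices against the known classes.
print("RI :", metrics.rand_score(y, labels))
print("ARI:", metrics.adjusted_rand_score(y, labels))
# External information-theoretic index.
print("MI :", metrics.mutual_info_score(y, labels))
```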

It should be noted that the optimal number of clusters in the present paper was determined by the majority rule over three methods, namely, the average silhouette criterion, the gap statistic, and the elbow method; the data were standardized prior to any clustering analysis.
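As an illustration of one of these three criteria, the average silhouette can be scanned over candidate cluster counts as in the sketch below; the gap statistic and elbow criterion would be computed analogously, with the majority vote taken over the three results.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_range=range(2, 16)):
    """Return the k in k_range that maximizes the mean silhouette width."""
    scores = {k: silhouette_score(
                  X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
              for k in k_range}
    return max(scores, key=scores.get)
```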

3. Results

To compare the performance of the three hybrid methods and the ordinary K-means method, seven freely downloadable Internet datasets were applied: "leukemia cancer," "prostate cancer," and "colon cancer" from the GEO site and "haberman," "iris," "wine," and "glass" from the UCI Center for Machine Learning and Intelligent Systems. Table 1 summarizes the description of these datasets.

To decrease the dimension of the gene expression datasets and find the important genes (attributes), the results of the article by Ram et al. [29] were used. They selected a subset of three or four genes as the important ones using a feature selection method based on the random forest model. The clustering methods (ordinary or hybrid) were applied to the selected subsets of these gene expression datasets.

It is necessary to mention that these datasets already contain class labels. Ignoring these labels, we obtained the optimal number of clusters (from among 2-15 clusters) for each dataset by the majority rule over the mean silhouette value, the elbow criterion, and the gap statistic.

Then, the data analysis was organized in two phases:

3.1. Phase I

To investigate whether K-means is an appropriate clustering method for each dataset, four ordinary clustering methods, namely, the K-means, DBScan, hierarchical, and EM algorithms, were first applied to the datasets. The mean silhouette value and the RPT criterion were then used to determine the best method for each dataset (Table 2). A mean silhouette value near one and a high RPT value indicate good clustering. Accordingly, the K-means clustering method was the best for only two of the seven datasets discussed in the present study, leukemia and colon cancer. Hierarchical clustering was the best for the prostate and haberman datasets, DBScan was the best for the iris and glass datasets, and the EM algorithm was the best for the wine dataset.

3.2. Phase II

The hybrid K-means methods were then applied to each dataset, and the results are summarized in Table 3. The higher the value of these evaluation criteria, the better the clustering algorithm, except for the SSE and VI indices (for which lower values are desirable). Figure 5 shows that all hybrid methods converge faster than the K-means method in terms of the number of iterations (the line belonging to K-means dominates the others).
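For reference, this iteration count can be read from scikit-learn's n_iter_ attribute; the sketch below contrasts a random start with a hierarchical start on the iris data and illustrates the measurement only, not our exact runs.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

X, _ = datasets.load_iris(return_X_y=True)
k = 3
# Informed start: centroids from an agglomerative pre-clustering (Section 2.2.4).
lab = fcluster(linkage(X, method="ward"), t=k, criterion="maxclust")
centers = np.array([X[lab == j].mean(axis=0) for j in range(1, k + 1)])

n_rand = KMeans(n_clusters=k, init="random", n_init=1, random_state=0).fit(X).n_iter_
n_hier = KMeans(n_clusters=k, init=centers, n_init=1).fit(X).n_iter_
print(n_rand, n_hier)   # iterations to convergence for each start
```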

Obviously, no single superior clustering method could be identified based on all evaluation criteria. However, depending on the purpose of a study, either internal or external validity indices may be the more important. According to the internal validity indices, the MST-based clustering method was the best for all datasets except leukemia, wine, and glass; for the former, the GA-based method, and for the two latter, the hierarchical-based method were the best hybrid methods (Table 3). However, the internal validity indices for the best hybrid method could not reach the values for the best ordinary clustering method determined in phase I (Tables 2 and 3), except for the two datasets (leukemia and colon cancer) for which K-means was the best ordinary method. According to the external validity indices, the MST-based method for leukemia and haberman, the GA-based method for prostate, and the hierarchical-based method for iris and glass were the best hybrid clustering methods. For the colon cancer and wine datasets, all three hybrid methods performed equally.

In total, the hybrid methods could not greatly improve the performance of the K-means clustering method in the present study. Meanwhile, although the results do not reveal any regular relationship between sample size, number of variables, or number of classes and the best hybrid method, the hierarchical-based method seems to work better with larger sample sizes and more variables (as in the wine and glass datasets).

4. Discussion

We have conducted a comparison study of three hybrid clustering methods that try to solve the random-centroid problem in K-means clustering [5–7]. Seven existing Internet datasets were applied to compare the methods, with eleven indices from different cluster validation approaches as the criteria for comparison. The hybrid methods, namely, the MST-based [5], GA-based [6], and hierarchical-based [7] K-means clustering methods, are three popular hybrids for correcting the random-centroid problem in K-means. However, there are other methods that try to improve K-means performance, such as principal component analysis [8], different rules for updating the new centroids [9–12], and online machine learning algorithms [13]. Meanwhile, some previous studies report improvements brought by K-means to other clustering methods [11, 21–24].

To the best of our knowledge, the MST-, GA-, and hierarchical-based K-means methods utilized in the present study have not previously been compared in any simulation or experimental study. The seven datasets used here differed in sample size, number of variables, and natural classes; hence, the three methods were compared from different aspects.

The results of this research indicated that the hybrid methods did not necessarily improve the ordinary K-means method; for some indices, they even performed worse than the ordinary K-means method (Table 3). The hybrid methods improved the ordinary K-means method only in the number of iterations needed to reach the final solution. In this regard, the hierarchical-based, MST-based, and GA-based clustering methods rank first to third in convergence rate (Figure 5).

In total, the hybrid methods could not greatly improve the performance of the K-means clustering method on the internal validity indices. On the external validity indices, however, these methods outperformed the K-means clustering method (Table 3).

Finally, since some previous studies reported better performance for these three hybrid methods than for the ordinary K-means clustering algorithm [5–7], simulation studies are recommended to compare these hybrid methods with K-means clustering in terms of the choice of initial centroids.

Data Availability

The data used to support the findings of this study have been deposited in the Gene Expression Omnibus repository for the leukemia, prostate, and colon cancer datasets (https://www.ncbi.nlm.nih.gov/gds) and in the University of California Irvine (UCI) repository for the haberman, iris, wine, and glass datasets (https://archive.ics.uci.edu/ml/datasets.php).

Disclosure

This article was extracted from Atefeh Bassirat’s Master of Science Thesis.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the grant number 98-20079 from the Shiraz University of Medical Sciences Research Council.