KC-Means: A Fast Fuzzy Clustering
A novel hybrid clustering method, named KC-Means clustering, is proposed to reduce the clustering time of the Fuzzy C-Means algorithm. The proposed method combines the K-Means and Fuzzy C-Means algorithms in two stages. In the first stage, the K-Means algorithm is applied to the dataset to find the centers of a fixed number of groups. In the second stage, the Fuzzy C-Means algorithm is applied to the centers obtained in the first stage. The proposed algorithm is then compared with other algorithms in terms of processing time and accuracy. In addition, the clustering algorithms are applied to a few benchmark datasets to verify their performance. Finally, a class of Minkowski distances is used to determine the influence of the distance on clustering performance.
Clustering is a method of grouping similar data into relevant categories, or clusters, that are distinctly different from one another. Being an unsupervised approach, it helps to recognize and extract hidden patterns within the data. The distance measure, such as the Euclidean and Manhattan distances, which are special cases of the Minkowski distance, plays an important role in clustering algorithms. Clustering techniques have several advantages: they require no domain knowledge or labeled data, and they can deal with a wide variety of data, including data with noise and outliers.
Clustering methods may be categorized into two general types: hard and soft. Hard clusters possess well-defined boundaries; examples include K-Means (KM) and hierarchical methods. To improve the processing time of fuzzy clustering, we propose a two-step hybrid method, K-Means Fuzzy C-Means (KCM) clustering, that combines the KM clustering algorithm with the Fuzzy C-Means (CM) algorithm.
We begin with a review of the current literature on classical and fuzzy clustering methods. Huang extended the KM algorithm to categorical domains. To decrease the computational complexity of the conventional CM clustering method, Chang et al. proposed a CM clustering method that uses the cluster center displacement between successive iterations. Volmurgan investigated the performance of two partition-based clustering methods, namely the KM and CM algorithms, by clustering randomly distributed data points. Havens et al. compared the efficacy of three different techniques for extending CM clustering to very large datasets. Panda et al. implemented clustering techniques in areas as diverse as medicine, business, engineering systems, and image processing. Grover studied a wide variety of fuzzy clustering methods, such as the CM, Possibilistic CM, and Fuzzy Possibilistic CM algorithms, and reported their advantages and drawbacks. Bora and Gupta conducted a comparative study of fuzzy and hard clustering methods. Finally, Fajardo et al. investigated the fuzzy clustering of certain spectra for the objective recognition of soil morphological horizons in soil profiles.
The present study proposes a hybrid clustering algorithm, named KCM, that combines the KM and CM algorithms in order to improve the processing time of the CM method. The performance of the KM, CM, and KCM techniques is then compared in terms of accuracy and processing time using simulated data from sub-Gaussian distributions. The methods are also applied to three standard real datasets to determine and compare the precision and accuracy of the investigated algorithms. Finally, KM, CM, and KCM are compared using Minkowski distances. The objective is to identify the combinations of clustering method and distance measure that give the highest precision, accuracy, and cluster quality in terms of compactness and distinctiveness.
2. Clustering Algorithms
By definition, clustering groups a sample of vectors into clusters, using an appropriate similarity criterion such as the distance from the center of the cluster.
2.1. K-Means Algorithm
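KM is one of the most popular clustering algorithms [10, 11]. Its clustering results are very sensitive to the positions of the initial cluster centers. Although it is efficient in clustering large datasets, it often terminates at a local optimum and applies only to numeric values. Consider a set of $n$ elements $X = \{x_1, \ldots, x_n\}$ and a set of $k$ centers $C = \{c_1, \ldots, c_k\}$, where $x_i, c_j \in \mathbb{R}^d$, $i = 1, \ldots, n$, $j = 1, \ldots, k$. We recall that the Minkowski distance of order $p$ between two points $a = (a_1, \ldots, a_d)$ and $b = (b_1, \ldots, b_d)$ is defined as follows:
$$d_p(a, b) = \left( \sum_{l=1}^{d} |a_l - b_l|^p \right)^{1/p}.$$
The Euclidean and Manhattan distances are two special cases of the Minkowski distance with $p = 2$ and $p = 1$, respectively. In the rest of the paper, "Minkowski distance" refers to a single fixed order $p$ other than these two special cases. The steps of the KM clustering algorithm are shown in Algorithm 1.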
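As a rough illustration of the Minkowski distance and of a KM iteration, the following Python sketch may help. It is not Algorithm 1 itself: it assumes the common Lloyd-style scheme (assign each element to its nearest center, then replace each center by the mean of its members) with initial centers drawn at random from the data. The names `minkowski`, `k_means`, `max_iter`, and `seed` are hypothetical and chosen only for this example.

```python
import random

def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def k_means(points, k, p=2, max_iter=100, seed=0):
    """Lloyd-style K-Means sketch (assumed scheme, not the paper's Algorithm 1)."""
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct elements drawn from the data.
    centers = [list(c) for c in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: nearest center under the chosen Minkowski distance.
        labels = [min(range(k), key=lambda j: minkowski(x, centers[j], p))
                  for x in points]
        # Update step: each center becomes the mean of its assigned elements.
        new_centers = []
        for j in range(k):
            members = [x for x, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # an empty cluster keeps its center
        if new_centers == centers:  # stop once the centers no longer move
            break
        centers = new_centers
    return centers, labels
```

Calling `k_means(points, k, p=1)` switches the assignment step to the Manhattan distance. The mean-based update is kept for every `p` in this sketch, which is a simplification.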
2.2. Fuzzy Clustering Algorithms
In KM clustering, data is divided into disjoint clusters, where each data element belongs to exactly one cluster. In fuzzy clustering, an object can belong to one or more clusters with associated probabilities. One of the most widely used fuzzy clustering methods is the CM algorithm, originally due to Dunn and later modified by Bezdek. The CM method attempts to partition a finite collection of elements $X = \{x_1, \ldots, x_n\}$ into a collection of $c$ fuzzy clusters with respect to some given criterion, where $x_k \in \mathbb{R}^d$ is an observation vector. A fuzzy $c$-partition of $X$ is a family of fuzzy subsets of $X$ denoted by $\mathcal{P} = \{A_1, \ldots, A_c\}$, which satisfies $A_i(x_k) \in [0, 1]$, $\sum_{i=1}^{c} A_i(x_k) = 1$ for all $k$, and $0 < \sum_{k=1}^{n} A_i(x_k) < n$ for all $i$, where $c$ is a positive integer. The problem of fuzzy clustering is to find a fuzzy $c$-partition and the associated cluster centers by which the structure of the data is represented as well as possible. To solve the problem of fuzzy clustering, this criterion needs to be formulated in terms of a performance index. The cluster centers $v_1, \ldots, v_c$ associated with the partition are calculated as follows:
$$v_i = \frac{\sum_{k=1}^{n} [A_i(x_k)]^m \, x_k}{\sum_{k=1}^{n} [A_i(x_k)]^m}, \quad i = 1, \ldots, c,$$
where $m > 1$ is a real number that governs the influence of membership grades and $v_i$ is viewed as the cluster center of the fuzzy class $A_i$. The performance index of a fuzzy $c$-partition $\mathcal{P}$, $J_m(\mathcal{P})$, is then defined in terms of the cluster centers using the formula
$$J_m(\mathcal{P}) = \sum_{k=1}^{n} \sum_{i=1}^{c} [A_i(x_k)]^m \, \lVert x_k - v_i \rVert^2.$$
This performance index measures the weighted sum of squared distances between cluster centers and elements in the corresponding fuzzy clusters. The goal of the CM clustering method is to find a fuzzy $c$-partition $\mathcal{P}$ that minimizes the performance index $J_m(\mathcal{P})$. In other words, the clustering problem is an optimization problem. The convergence properties of CM algorithms are theoretically important. The optimal cluster centers are the fixed points of CM clustering algorithms. The algorithm is limited by long computational time and sensitivity to noise, outliers, and the initial guess [17, 18]. The two steps in which the CM clustering algorithm differs from the KM algorithm are shown in Algorithm 2.
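The alternating structure of CM (update the centers from the memberships, then update the memberships from the centers) can be sketched in Python as below. This is a minimal illustration under stated assumptions, not Algorithm 2: it uses the Euclidean distance, the standard membership rule $u_{ik} = 1 / \sum_{j} (d_{ik}/d_{jk})^{2/(m-1)}$, and random initial memberships. The names `fuzzy_c_means`, `tol`, and `seed` are hypothetical.

```python
import random

def fuzzy_c_means(points, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Fuzzy C-Means sketch with Euclidean distance (assumed update rules).

    Returns (centers, u), where u[i][k] is the membership of point k in cluster i.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Assumption: random initial memberships, normalised to sum to 1 per point.
    u = [[rng.random() + 1e-9 for _ in range(n)] for _ in range(c)]
    for k in range(n):
        total = sum(u[i][k] for i in range(c))
        for i in range(c):
            u[i][k] /= total
    centers = []
    for _ in range(max_iter):
        # Center update: mean of the data weighted by membership to the power m.
        centers = []
        for i in range(c):
            w = [u[i][k] ** m for k in range(n)]
            centers.append([sum(w[k] * points[k][j] for k in range(n)) / sum(w)
                            for j in range(dim)])
        # Membership update: inverse-distance rule with exponent 2 / (m - 1).
        shift = 0.0
        for k in range(n):
            # Distances are clamped away from zero to avoid division by zero.
            d = [max(sum((a - b) ** 2 for a, b in zip(points[k], v)) ** 0.5, 1e-12)
                 for v in centers]
            for i in range(c):
                new = 1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1.0)) for j in range(c))
                shift = max(shift, abs(new - u[i][k]))
                u[i][k] = new
        if shift < tol:  # stop when no membership changes by more than tol
            break
    return centers, u
```

Each pass touches every point for every cluster, which is one way to see why CM is slower than KM on large datasets.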
3. A Hybrid Method: KC-Means Algorithm
A novel approach, called the KCM method, is proposed herein that combines the KM and CM methods. The combination is meant to overcome the limitations of both while retaining their advantages. One disadvantage of the CM method is its long computational time, whereas one advantage of the KM method is its speed. The goal of the hybrid method is to obtain a fuzzy method that is faster than CM while its accuracy remains close to that of CM. In the proposed technique, KM is first applied to the individual data objects to generate clusters, designated as middle-level clusters. Each cluster is then represented by its centroid. CM clustering is subsequently applied to those centroids to form the final clustering. The distance between two middle-level clusters is measured as the distance between their centroids. The hybrid method takes the final centers produced by KM as the dataset for CM, so that the number of observations in CM equals the number of centers produced by KM. Therefore, the CM stage runs on far fewer observations, and the KCM time is much less than the time of the CM method applied to the full dataset.
The hybrid method is best suited to large datasets, since it replaces clusters of observations with their centers as computed by KM. The performance of the proposed approach is evaluated by comparing it with the KM and CM algorithms in terms of both accuracy and processing time. It is shown that the proposed technique outperforms CM in processing time, yielding results in less time than the CM algorithm. Given a set of $n$ elements $X = \{x_1, \ldots, x_n\}$, where $x_i \in \mathbb{R}^d$, the steps of the KCM clustering algorithm are shown in Algorithm 3.
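The two-stage idea described above can be sketched in Python as follows. This is an illustrative outline rather than Algorithm 3: the compact `k_means` and `fuzzy_c_means` helpers use the Euclidean distance and a fixed number of iterations, and the names `kc_means`, `k_mid` (the number of middle-level clusters), and `seed` are hypothetical. The last step, in which every original point inherits the membership row of its middle-level centroid, is an assumption made for the example.

```python
import random

def dist(x, y):
    """Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def nearest(x, centers):
    """Index of the center closest to x."""
    return min(range(len(centers)), key=lambda j: dist(x, centers[j]))

def k_means(points, k, iters=50, seed=0):
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = [nearest(x, centers) for x in points]
        for j in range(k):
            members = [x for x, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, [nearest(x, centers) for x in points]

def memberships(points, centers, m):
    """Membership rows (one per point) from the inverse-distance rule."""
    rows = []
    for x in points:
        d = [max(dist(x, v), 1e-12) for v in centers]  # guard against zero distance
        rows.append([1.0 / sum((di / dj) ** (2.0 / (m - 1.0)) for dj in d)
                     for di in d])
    return rows

def fuzzy_c_means(points, c, m=2.0, iters=100, seed=0):
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, c)]
    dim = len(points[0])
    for _ in range(iters):
        u = memberships(points, centers, m)
        for i in range(c):
            w = [row[i] ** m for row in u]
            centers[i] = [sum(wk * x[j] for wk, x in zip(w, points)) / sum(w)
                          for j in range(dim)]
    return centers, memberships(points, centers, m)

def kc_means(points, k_mid, c, m=2.0, seed=0):
    # Stage 1: KM compresses the data into k_mid middle-level centroids.
    mid_centers, mid_labels = k_means(points, k_mid, seed=seed)
    # Stage 2: CM is run on the k_mid centroids only, not on the full dataset.
    centers, u_mid = fuzzy_c_means(mid_centers, c, m=m, seed=seed)
    # Assumption: each point inherits the memberships of its centroid.
    return centers, [u_mid[lab] for lab in mid_labels]
```

In this sketch the CM stage iterates over `k_mid` points instead of all `n` observations, which is where the time saving over plain CM would come from.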
Simulated datasets are used to evaluate the KM, CM, and KCM clustering methods. We use an external clustering evaluation criterion for comparisons. The Rand index is a criterion used to compare an induced clustering structure ($C_1$) with a given clustering structure ($C_2$) and is defined as follows:
$$R = \frac{a + d}{a + b + c + d},$$
where $a$, $b$, $c$, and $d$ are the numbers defined as follows:
(i) $a$ is the number of pairs of points belonging to the same cluster according to both $C_1$ and $C_2$.
(ii) $b$ is the number of pairs of points belonging to the same cluster according to $C_1$ but not $C_2$.
(iii) $c$ is the number of pairs of points belonging to the same cluster according to $C_2$ but not $C_1$.
(iv) $d$ is the number of pairs of points that do not belong to the same cluster according to either $C_1$ or $C_2$.
The quantities $a$ and $d$ can be interpreted as agreements and $b$ and $c$ as disagreements. The Rand index lies within the range $[0, 1]$, and the clustering performance is considered good if the Rand index is close to one [4, 13].
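For concreteness, the Rand index can be computed by counting agreements and disagreements over all pairs of points, using the standard form $R = (a + d)/(a + b + c + d)$. The sketch below is a direct, quadratic-time illustration; the function name `rand_index` and the representation of each clustering as a list of labels are choices made for this example only.

```python
from itertools import combinations

def rand_index(labels_1, labels_2):
    """Rand index between two clusterings given as equal-length label lists."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_1)), 2):
        same_1 = labels_1[i] == labels_1[j]  # pair together in clustering 1?
        same_2 = labels_2[i] == labels_2[j]  # pair together in clustering 2?
        if same_1 and same_2:
            a += 1  # agreement: together in both
        elif same_1:
            b += 1  # disagreement: together only in clustering 1
        elif same_2:
            c += 1  # disagreement: together only in clustering 2
        else:
            d += 1  # agreement: separated in both
    return (a + d) / (a + b + c + d)
```

Because only pair co-membership is compared, the index does not depend on how the clusters are numbered: relabeling a clustering leaves the value unchanged.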
We used R 3.3.3 on a PC with a Core i5-3210 CPU and 4 GB of RAM to run all experiments in the following sections. For a fair comparison, the termination conditions of the algorithms were left at the defaults of the standard R code.
5. Simulation Study
A $d$-dimensional random vector $X$ has a sub-Gaussian distribution with location vector $\mu$ and dispersion matrix $\Sigma$ if its characteristic function is of the form
$$\varphi_X(t) = \exp\left\{ i\, t^{\top}\mu - \left( \tfrac{1}{2}\, t^{\top}\Sigma\, t \right)^{\alpha/2} \right\}, \quad t \in \mathbb{R}^d,$$
where $\mu \in \mathbb{R}^d$, $0 < \alpha \le 2$, and $\Sigma$ is a positive definite matrix. When $\alpha$ equals 2, we get the multivariate normal distribution with covariance matrix $\Sigma$. If $\alpha > 1$, then $E(X) = \mu$. However, the expectation of $X$ does not exist for $\alpha \le 1$.
In this simulation study, a set of real and simulated data generated from the sub-Gaussian and multivariate normal distributions was used. For clustering the data with the proposed KCM method, the Euclidean, Manhattan, and Minkowski distances were used. In addition, the results obtained from the KM, CM, and KCM algorithms were compared in terms of processing time (in milliseconds) and accuracy. Datasets of 15000 observations with 30 attributes were generated with stability parameter $\alpha$ = 0.5, 1, 1.5, 2; for $\alpha$ = 2, the distribution is multivariate normal. The data were then partitioned into 5, 10, 15, 20, 25, 30, and 40 clusters. As previously mentioned, the KCM method is based on KM, for which the number of clusters must be specified in advance. The simulation results are shown in Figures 1 and 2, which give the accuracy and time of the KM, CM, and KCM methods with the Euclidean, Manhattan, and Minkowski distances for $\alpha$ = 0.5, 1, 1.5, 2.
We ran each algorithm 100 times and computed the average accuracy and processing time. We classify the results as follows.
Time. In general, the processing times of KM were less than those of the CM and KCM algorithms for all values of $\alpha$ and all numbers of clusters. CM recorded a long processing time compared with either the KM or KCM algorithm. The type of distance did not significantly affect the processing time of KM: the results obtained with the three distances were close to one another. In contrast, the processing time of CM and KCM with the Minkowski distance was longer than with the Euclidean and Manhattan distances.
Increasing $\alpha$ does not affect the processing time of KM and KCM, which is nearly the same for all $\alpha$. The processing time of CM decreases as $\alpha$ increases under the Euclidean distance but increases with $\alpha$ under the Manhattan distance. The speed of KM and CM decreases as the number of clusters increases, whereas the number of clusters has little effect on the speed of the KCM algorithm with the Euclidean and Manhattan distances.
Generally, the processing time of the KCM algorithm is less than that of the CM algorithm. For example, the processing times of CM and KCM for 40 clusters are shown in Table 1.
Accuracy. Distance type had no significant effect on the accuracy of the KM and CM algorithms, as almost the same results were obtained with each distance. However, accuracy increased with $\alpha$. The accuracy of the KCM algorithm also increased with $\alpha$. In general, the accuracy of KCM and CM with the Euclidean and Minkowski distances was better than with the Manhattan distance.
6. Comparison of Algorithms Using Real Data
In this section, the KM, CM, and KCM algorithms are tested for their performance on the Iris (150 × 4), Wine (178 × 13), and Lens (24 × 4) datasets. The Euclidean (Euc), Manhattan (Man), and Minkowski (Min) distance measures are used to see how they influence the overall clustering performance. The performance of the three techniques is compared based on the following measures:
(1) Precision = $TP / (TP + FP)$.
(2) Accuracy = $(TP + TN) / (TP + TN + FP + FN)$.
A true positive ($TP$) decision assigns two similar documents to the same cluster; a true negative ($TN$) decision assigns two dissimilar documents to different clusters. A false positive ($FP$) decision assigns two dissimilar documents to the same cluster. A false negative ($FN$) decision assigns two similar documents to different clusters. The experimental results indicating the performance of each technique on the three datasets are reported in Table 2.
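The pairwise decisions described above can be tallied directly from the true class labels and the cluster assignments. The following Python sketch assumes that two items are "similar" when they share a true class label and uses the standard definitions Precision = TP / (TP + FP) and Accuracy = (TP + TN) / (TP + TN + FP + FN). The function names `pair_counts`, `precision`, and `accuracy` are hypothetical.

```python
from itertools import combinations

def pair_counts(true_labels, cluster_labels):
    """Count pairwise decisions (TP, FP, FN, TN) of a clustering."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        similar = true_labels[i] == true_labels[j]         # same true class
        together = cluster_labels[i] == cluster_labels[j]  # same cluster
        if similar and together:
            tp += 1  # similar pair placed in the same cluster
        elif together:
            fp += 1  # dissimilar pair placed in the same cluster
        elif similar:
            fn += 1  # similar pair placed in different clusters
        else:
            tn += 1  # dissimilar pair placed in different clusters
    return tp, fp, fn, tn

def precision(tp, fp, fn, tn):
    """Share of same-cluster pairs that are truly similar."""
    return tp / (tp + fp) if tp + fp else 0.0

def accuracy(tp, fp, fn, tn):
    """Share of all pairwise decisions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)
```

For example, `precision(*pair_counts(y_true, y_cluster))` evaluates one clustering. Under this pairwise reading, the accuracy has the same form as the Rand index.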
On the Iris dataset, the clusters formed by KCM-Euc and KCM-Man had a greater average precision than those formed by KM and CM with the three distances. CM-Man recorded a greater accuracy than any combination involving KM or KCM. Distance and algorithm type had no significant effect on the accuracy. For the Lens dataset, the average precision was generally low with all the algorithms examined. This dataset, however, yielded acceptable accuracy values with the KM, CM, and KCM algorithms, but it did not exceed 0.50. With the Wine dataset, distance and algorithm type had a significant effect on the accuracy and average precision. The average precision did not exceed 0.70, and the highest average was recorded by KCM-Man. CM-Man recorded a greater accuracy than any combination involving KM or KCM.
In this paper, the two most famous clustering techniques, namely K-Means and Fuzzy C-Means, were investigated for their performance. To improve the processing time of the fuzzy clustering technique, a hybrid algorithm, named KCM, combining the KM and CM algorithms, was proposed.
It was found that the KM algorithm had shorter processing times than the CM and KCM algorithms for all values of $\alpha$ and all numbers of clusters. In addition, CM was observed to be slower than both KM and KCM. However, the processing times of CM with the Euclidean and Manhattan distances were observed to be shorter than with the Minkowski distance. The value of $\alpha$ did not affect the processing times of KM and KCM; the processing time of CM, however, decreased with increasing $\alpha$ under the Euclidean distance but increased under the Manhattan distance.
The accuracy of the KM, CM, and KCM algorithms increased with $\alpha$. Distance type had no significant effect on the accuracy of the KM and CM algorithms, but the accuracy of the KCM and CM algorithms with the Euclidean and Minkowski distances was better than with the Manhattan distance.
Using the real datasets revealed that the Iris dataset yielded higher precision values for clusters with the three distances. The clusters formed by the combined KCM-Euc were observed to be more distinct. Using the Lens dataset led to poor precision levels but acceptable accuracy values for all the combinations. With the Wine dataset, medium precision levels were achieved with all the combinations. CM-Man and KCM-Euc yielded the most compact clusters, while KCM-Man yielded the most distinct ones. In general, the Iris dataset not only formed the most compact and distinct clusters, but also yielded higher precision and accuracy levels for KM, CM, and KCM clusters with the three distances than did the Lens or Wine datasets.
Finally, we recall that the computation time of a clustering method depends on the algorithm and its implementation, the programming language, and the hardware. Therefore, the most suitable method should be chosen according to the complexity of the clustering problem at hand.
The codes and data are available upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
T. Volmurgan, “Austria performance comparison between K-means and fuzzy C-means,” Wulfenia Journal, vol. 19, pp. 234–241, 2012.
B. Panda, S. Sahoo, and S. K. Patnaik, “A comparative study of hard and soft clustering using swarm optimization,” International Journal of Scientific & Engineering Research, vol. 4, pp. 785–790, 2013.
M. R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, NY, USA, 1973.
G. Gan, C. Ma, and J. Wu, Data Clustering Theory: Algorithms and Applications, SIAM, Virginia, 2007.
J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, NY, USA, 1981.
G. J. Klir and B. Yuan, Fuzzy Sets and Fuzzy Logic: Theory and Applications, Prentice Hall, New York, NY, USA, 1995.
R. Suganya and R. Shanthi, “Fuzzy C-means algorithm - a review,” International Journal of Scientific and Research Publications, vol. 2, pp. 440–442, 2012.