Abstract

The significance and application of clustering extend over various fields. Clustering is an unsupervised process in data mining; therefore, the proper evaluation of the results and the measurement of the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measurement. Different types of indices are used to solve different types of problems, and the selection of an index depends on the kind of available data. This paper first proposes a Canonical PSO based K-means clustering algorithm, analyses some important clustering indices (intercluster and intracluster), and then evaluates the effects of those indices on a real-time air pollution database and on wholesale customer, wine, and vehicle datasets using the typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and Hierarchical clustering algorithms. The paper also describes the nature of the clusters and finally compares the performances of these clustering algorithms according to the validity assessment. It further identifies which of these algorithms is most desirable for producing properly compact clusters on these particular real-life datasets. In essence, it examines the behaviour of these clustering algorithms with respect to validation indices and presents the evaluation results in mathematical and graphical forms.

1. Introduction

Clustering is one of the best-known problems in data mining. It is the task of categorising objects having several attributes into different classes such that objects belonging to the same class are similar, while those assigned to different classes are not [1]. Several clustering algorithms have been proposed to date. Because clustering uses no prior information, suitable evaluation of the results is necessary. Evaluation means measuring the similarity between clusters and measuring the compactness of, and separation between, clusters [2]. Evaluation measurement is also proposed as a key feature in internal and external cluster validation indices [3]. Such measures can be used to compare the performance of different data clustering algorithms on different real-life datasets. These measures are usually tied to the type of criterion being considered in assessing the quality of a clustering method. Three different techniques are available to evaluate clustering results: external, internal, and relative [4]. Both internal and external criteria are based on statistical methods and have high computational demands. The external validity methods evaluate the clustering based on some user-specific intuitions [4]. The objective of this paper is to compare the different clustering schemes that have already been proposed [5] with the Canonical PSO based K-means clustering algorithm.

The rest of the paper is organized as follows. The Canonical PSO based K-means algorithm is proposed in Section 2, along with a brief description of some other existing clustering algorithms. Some popular and widely used validity indices are introduced in Section 3. Section 4 demonstrates the clustering compactness measurements on a toy example dataset using the K-means and DBSCAN clustering algorithms. Section 5 presents the clustering compactness measurements with experimental results and a comparison of the indices, Section 6 discusses these results, and Section 7 gives a brief conclusion of this paper. Interested readers may find some significant references at the end of the paper.

This paper mainly focuses on the implementation of the Canonical PSO based K-means algorithm. Some cluster validity indices (e.g., the Dunn index, DB index, Silhouette index, Rand index, and Mirkin index) and the ANOVA test are analysed. Each of these indices is evaluated individually on the air pollution, customer, wine, and vehicle datasets using the typical K-means, DBSCAN, Hierarchical, simple PSO based K-means, and Canonical PSO based K-means algorithms. The overall motivation of this paper is given in Figure 2.

2. Clustering Algorithms

This section introduces a new, modified Canonical PSO based K-means clustering algorithm and also briefly describes the other well-known algorithms used in this study.

2.1. Simple PSO Based K-Means Clustering

James Kennedy and Russell C. Eberhart originally proposed the Particle Swarm Optimization (PSO) algorithm for optimization. It is a population based, robust, and stochastic optimization technique, originally designed for balancing weights in neural networks [6]. In data clustering, it is possible to view the clustering problem as an optimization problem that locates the optimal centroids of the clusters rather than finding an optimal partition. This view offers the possibility of applying the PSO algorithm to the clustering problem. The use of the basic PSO technique in document clustering analysis has been proposed in many papers [7, 8].

In the PSO based K-means algorithm, the global search capability of the PSO algorithm and the fast convergence of the K-means algorithm are combined [9]. The algorithm yields better accuracy than existing algorithms. In the original PSO, at any instant each particle has a position and a velocity. At the beginning, a population of particles is initialized with random positions, denoted by vectors $x_i$, and random velocities $v_i$. The population of such particles is called a swarm, $S$. Each particle searches for the optimum. Each particle remembers the position in which it achieved its best result so far (its personal best) [7]. The particles in the swarm cooperate: they exchange information about what they have discovered in the places they have visited.

In each time step, a particle has to move to a new position. It does this by adjusting its velocity. Velocity is updated based on information obtained in previous steps of the algorithm.

This updating of velocity and position can be described by the following set of equations:
$$v_i(t+1) = v_i(t) + c_1 r_1 \big(p_i(t) - x_i(t)\big) + c_2 r_2 \big(g(t) - x_i(t)\big), \quad (1)$$
$$x_i(t+1) = x_i(t) + v_i(t+1), \quad (2)$$
where $i = 1, 2, \ldots, N$ and $t$ denotes the time step.

Here, $v_i(t+1)$ is the new velocity at time step $t+1$, $v_i(t)$ is the old velocity at time step $t$, $p_i(t)$ is the best position of each particle (its personal best), $g(t)$ is the best position of the swarm, $x_i(t+1)$ is the current position of each particle, and $x_i(t)$ is the old position of each particle.

$r_1$ and $r_2$ are random variables uniformly distributed within $[0, 1]$, and $c_1$ and $c_2$ are weighting factors, also called the cognitive and social parameters, respectively. In the first version of PSO, a single weight $c$, called the acceleration constant, was used instead of the two distinct weights in (1).
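For illustration, a minimal Python sketch of one update step of (1) and (2) is given below. It is not part of the original implementation; the array shapes, the parameter values $c_1 = c_2 = 1.49$, and the use of NumPy are assumptions made for this sketch.

import numpy as np

def pso_step(x, v, p_best, g_best, c1=1.49, c2=1.49, rng=np.random.default_rng()):
    # x, v, p_best have shape (n_particles, dim); g_best has shape (dim,).
    r1 = rng.uniform(size=x.shape)    # cognitive random factor r1
    r2 = rng.uniform(size=x.shape)    # social random factor r2
    v_new = v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)   # equation (1)
    x_new = x + v_new                                             # equation (2)
    return x_new, v_new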

2.2. Canonical PSO Based K-Means Clustering

Ozcan and Mohan (1999) focused on the early PSO model of (1) and (2) and showed that particles were actually moving on sinusoidal waves per coordinate of the search space, while randomness was offering a means to control their frequency and amplitude. The modified (canonical) PSO model is defined by the following equation:
$$v_i(t+1) = \chi\Big[v_i(t) + c_1 r_1 \big(p_i(t) - x_i(t)\big) + c_2 r_2 \big(g(t) - x_i(t)\big)\Big], \qquad x_i(t+1) = x_i(t) + v_i(t+1), \quad (3)$$
where $x_i$ are the particles (vector data) and $i = 1, 2, \ldots, N$.

Here $\chi$ is a parameter called the constriction coefficient or constriction factor, responsible for keeping the particle moving in the direction it was originally heading, while the rest of the parameters remain the same as in the previously described PSO models [9].
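The paper does not specify how $\chi$ is chosen; a commonly used choice (a sketch under the assumption $c_1 = c_2 = 2.05$, which is not stated in the paper) computes it from $\varphi = c_1 + c_2 > 4$:

import math

def constriction_factor(c1=2.05, c2=2.05):
    # chi = 2 / |2 - phi - sqrt(phi^2 - 4*phi)|, defined for phi = c1 + c2 > 4
    phi = c1 + c2
    return 2.0 / abs(2.0 - phi - math.sqrt(phi * phi - 4.0 * phi))

# constriction_factor() is approximately 0.729 for c1 = c2 = 2.05; the canonical
# update of (3) multiplies the whole bracketed velocity term by this chi.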

2.3. Proposed Algorithm

Input. $K$ = number of clusters; $T$ = number of iterations; Data = a real dataset.

Output. A set of clusters.
(1) Initialize each particle randomly by taking $K$ different data samples from the dataset as its initial cluster centers.
(2) Initialize the velocity and personal best position of each particle.
(3) Repeat for each particle:
(a) Calculate the performance of each particle based on the K-means fitness function
$$f = \sum_{j=1}^{K} \sum_{x_i \in C_j} d(x_i, z_j),$$
where $x_i$ denotes the $i$th data point, $z_j$ denotes the center of the $j$th cluster $C_j$, and $d(x_i, z_j)$ denotes the distance (e.g., Euclidean distance) of $x_i$ from $z_j$.
(b) Reassign the data vectors to the centroid vectors according to the fitness value.
(c) Update the personal best position.
(d) Update the global best position.
(e) Modify the velocity and position of each particle using formula (3) and generate the next solution.
(4) Repeat Step (3) until one of the following termination conditions is satisfied:
(a) the maximum number of iterations is exceeded;
(b) the average change of the centroid vectors is less than a predefined value.
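For concreteness, a compact Python sketch of the above procedure is given below. It is not the authors' implementation: the use of Euclidean distance, the constriction update of (3) with assumed parameter values, and the random sampling of data points as initial centroids are all assumptions made for illustration.

import numpy as np

def kmeans_fitness(data, centroids):
    # Sum over all points of the distance to the nearest candidate centroid (K-means objective).
    d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return d.min(axis=1).sum()

def canonical_pso_kmeans(data, k, n_particles=10, iters=150,
                         chi=0.729, c1=2.05, c2=2.05, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    n, _ = data.shape
    # Steps (1)-(2): each particle encodes k candidate centroids sampled from the data; zero velocity.
    pos = np.stack([data[rng.choice(n, size=k, replace=False)] for _ in range(n_particles)])
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_fit = np.array([kmeans_fitness(data, p) for p in pos])
    gbest = pbest[pbest_fit.argmin()].copy()
    for _ in range(iters):                                   # Step (4)(a): iteration limit
        gbest_old = gbest.copy()
        for i in range(n_particles):                         # Step (3)
            fit = kmeans_fitness(data, pos[i])               # (a) fitness of particle i
            if fit < pbest_fit[i]:                           # (c) personal best
                pbest_fit[i], pbest[i] = fit, pos[i].copy()
        gbest = pbest[pbest_fit.argmin()].copy()             # (d) global best
        r1 = rng.uniform(size=pos.shape)
        r2 = rng.uniform(size=pos.shape)
        vel = chi * (vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos))   # formula (3)
        pos = pos + vel                                      # (e) next candidate solutions
        if np.abs(gbest - gbest_old).mean() < tol:           # Step (4)(b): centroid change
            break
    # Step (b): final assignment of each data vector to its nearest global-best centroid.
    labels = np.linalg.norm(data[:, None, :] - gbest[None, :, :], axis=2).argmin(axis=1)
    return gbest, labels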

2.4. K-Means Algorithm

K-means clustering is a method of cluster analysis which aims to partition observations into K clusters based on some similarity/dissimilarity metric, where the value of K may or may not be known a priori [7]. The objective of the K-means clustering algorithm is to create a set of clusters that partitions the data into similar groups, so that samples with high similarity are placed in the same cluster and samples with low similarity are placed in different clusters [10].
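For reference, the typical K-means baseline can be run with a standard library call; the dataset and parameter values below are illustrative placeholders, not those used in the experiments.

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 6))     # stand-in for one of the real datasets
km = KMeans(n_clusters=3, n_init=10, max_iter=150, random_state=0).fit(X)
print(km.cluster_centers_.shape)                       # (3, 6): one centroid per cluster
print(np.bincount(km.labels_))                         # cluster sizes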

2.5. Hierarchical Algorithm

In hierarchical clustering, clusters are generated by grouping data that show similar patterns across a range of samples located near each other. Hierarchical clustering calculates all pairwise distances between samples and experiments and merges the pairs that are most similar [9].
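A hierarchical (agglomerative) clustering of this kind can be sketched with SciPy; the average linkage and the cut into three clusters are illustrative assumptions rather than the paper's settings.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(300, 6))     # stand-in for one of the real datasets
Z = linkage(X, method="average", metric="euclidean")   # merge the most similar pairs step by step
labels = fcluster(Z, t=3, criterion="maxclust")        # cut the dendrogram into 3 clusters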

2.6. DBSCAN Algorithm

This is a density-based clustering algorithm that produces a partitional clustering in which the number of clusters is determined automatically by the algorithm. DBSCAN requires two parameters: the neighbourhood radius ε (eps) and the minimum number of points required to form a dense region (minpts) [11]. It starts with an arbitrary point that has not been visited. This point's ε-neighbourhood is retrieved, and if it contains sufficiently many points, a cluster is started; otherwise, the point is labelled as noise. If a point is found to be a dense part of a cluster, its ε-neighbourhood is also part of that cluster [12]. Hence, all points found within the ε-neighbourhood are added to the cluster, as are the points in their own ε-neighbourhoods when those points are also dense. This process continues until the density-connected cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or of noise.
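A minimal DBSCAN sketch follows; eps = 0.5 and min_samples = 5 are illustrative defaults, not the eps = 25 and minpts = 65 used later for the real data.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(300, 6))     # stand-in for one of the real datasets
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_                                    # label -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, np.sum(labels == -1))                # number of clusters found and noise count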

3. Clustering Validity Indices

In this paper the behaviour of several clustering validity indices has been examined (Figure 1). Some of them are clearly illustrated in this section. These indices are used to measure the “goodness” of a clustering result compared with results created by other clustering algorithms, or by the same algorithm using different parameter values [13].

3.1. Silhouette Index

The Silhouette index is used to analyse the clusters formed by the different clustering algorithms (Tables 6, 7, and 8). The silhouette validation technique calculates the silhouette width for each sample, the average silhouette width for each cluster, and the overall average silhouette width for the whole dataset [14]. The average silhouette width can be applied to evaluate clustering validity and can also be used to decide how good the selected number of clusters is.

To construct the silhouettes, the following formula is used:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}},$$
where $a(i)$ is the average dissimilarity of the $i$th object to all other objects in the same cluster and $b(i)$ is the minimum, over the other clusters, of the average dissimilarity of the $i$th object to all objects in that cluster [9].

If the silhouette value $s(i)$ is close to 1, the sample is “well-clustered” and was assigned to a very appropriate cluster. If the silhouette value is about zero, the sample could equally well be assigned to the nearest neighbouring cluster, since it lies equally far from both clusters. If the silhouette value is close to −1, the sample is “misclassified.” The overall average silhouette width for the entire plot is simply the average of $s(i)$ over all objects in the whole dataset.
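The quantities described above can be obtained directly from a standard implementation; the data and clustering below are placeholders used only to show the calls.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.random.default_rng(0).normal(size=(300, 6))               # stand-in for a real dataset
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
s_i = silhouette_samples(X, labels)   # s(i) for every sample, in [-1, 1]
s_avg = silhouette_score(X, labels)   # overall average silhouette width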

3.2. Davies-Bouldin Index (DB)

The Davies-Bouldin index (DB) can be calculated by the following formula:
$$DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right),$$
where $n$ is the number of clusters, $c_i$ is the centroid of cluster $i$, $\sigma_i$ is the average distance of all elements in cluster $i$ to centroid $c_i$, and $d(c_i, c_j)$ is the distance between centroids $c_i$ and $c_j$ [15]. Since algorithms that produce clusters with low intracluster distances (high intracluster similarity) and high intercluster distances (low intercluster similarity) yield a low Davies-Bouldin index, the clustering algorithm that produces the collection of clusters with the smallest Davies-Bouldin index is considered the best algorithm according to this criterion [9].
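A sketch of how the DB index is computed in practice is given below; the data and clustering are placeholders, and the library call follows the centroid-based definition above.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X = np.random.default_rng(0).normal(size=(300, 6))     # stand-in for one of the real datasets
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(davies_bouldin_score(X, labels))                 # smaller values indicate better clustering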

3.3. Dunn Index

The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio of the minimal intercluster distance to the maximal intracluster distance. For each cluster partition, the Dunn index can be calculated by the following formula:
$$D = \frac{\min_{1 \le i < j \le n} d(i, j)}{\max_{1 \le k \le n} d'(k)},$$
where $d(i, j)$ represents the distance between clusters $i$ and $j$ and $d'(k)$ measures the intracluster distance of cluster $k$. The intercluster distance $d(i, j)$ between two clusters may be any of a number of distance measures, such as the distance between the centroids of the clusters [15]. Similarly, the intracluster distance $d'(k)$ may be measured in a variety of ways, such as the maximal distance between any pair of elements in cluster $k$. Since the internal criterion seeks clusters with high intracluster similarity and low intercluster similarity, algorithms that produce clusters with a high Dunn index are more desirable [16].
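Because the intercluster and intracluster distances can be chosen in several ways, the sketch below fixes one such choice (minimum pointwise intercluster distance and maximal within-cluster pairwise distance); it is an illustrative implementation, not the paper's.

import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    # Dunn index with the minimum pointwise intercluster distance in the numerator
    # and the maximum within-cluster pairwise distance in the denominator.
    clusters = [X[labels == c] for c in np.unique(labels)]
    inter = min(cdist(a, b).min()
                for i, a in enumerate(clusters) for b in clusters[i + 1:])
    intra = max(pdist(c).max() for c in clusters if len(c) > 1)
    return inter / intra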

3.4. Rand Index

The Rand index computes how similar the clusters (returned by the clustering algorithm) are to the benchmark classifications. One can also view the Rand index as a measure of the percentage of correct decisions made by the algorithm [16].

Given a set $S$ of $n$ elements and two partitions of $S$ to compare, $X = \{X_1, \ldots, X_r\}$, a partition of $S$ into $r$ subsets, and $Y = \{Y_1, \ldots, Y_s\}$, a partition of $S$ into $s$ subsets, the Rand index $R$ is
$$R = \frac{a + b}{a + b + c + d} = \frac{a + b}{\binom{n}{2}},$$
where $a$ is the number of pairs of elements in $S$ that are in the same set in $X$ and in the same set in $Y$, $b$ is the number of pairs of elements in $S$ that are in different sets in $X$ and in different sets in $Y$, $c$ is the number of pairs of elements in $S$ that are in the same set in $X$ and in different sets in $Y$, and $d$ is the number of pairs of elements in $S$ that are in different sets in $X$ and in the same set in $Y$.
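The pair-counting definition above translates directly into code; the example labels are hypothetical.

from itertools import combinations

def rand_index(labels_x, labels_y):
    # R = (a + b) / C(n, 2): fraction of element pairs on which the two partitions agree.
    pairs = combinations(zip(labels_x, labels_y), 2)
    agree = sum((x1 == x2) == (y1 == y2) for (x1, y1), (x2, y2) in pairs)
    n = len(labels_x)
    return agree / (n * (n - 1) / 2)

print(rand_index([0, 0, 1, 1, 2], [1, 1, 0, 0, 0]))   # 0.8: the partitions disagree on 2 of the 10 pairs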

3.5. Mirkin Index

The Mirkin index is also known as the equivalence mismatch distance. For two clusterings $C = \{C_1, \ldots, C_k\}$ and $C' = \{C'_1, \ldots, C'_l\}$ of the same set, it is defined by
$$M(C, C') = \sum_{i=1}^{k} |C_i|^2 + \sum_{j=1}^{l} |C'_j|^2 - 2 \sum_{i=1}^{k} \sum_{j=1}^{l} |C_i \cap C'_j|^2 .$$

Let $X$ be a finite set with cardinality $|X| = n$. A clustering $C$ is a set $\{C_1, \ldots, C_k\}$ of nonempty disjoint subsets of $X$ such that their union equals $X$. A clustering $C'$ is a refinement of a clustering $C$ if every cluster of $C'$ is contained in some cluster of $C$; formally,
$$\forall\, C'_i \in C' \;\; \exists\, C_j \in C : C'_i \subseteq C_j .$$
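A sketch of the Mirkin (equivalence mismatch) distance via the contingency table of the two partitions, under the standard pair-counting formulation assumed above, is given below; the example labels are hypothetical.

import numpy as np

def mirkin_distance(labels_x, labels_y):
    # Equivalence mismatch distance between two partitions of the same finite set.
    x = np.asarray(labels_x)
    y = np.asarray(labels_y)
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    cont = np.zeros((xi.max() + 1, yi.max() + 1), dtype=int)   # contingency table |C_i ∩ C'_j|
    np.add.at(cont, (xi, yi), 1)
    return ((cont.sum(axis=1) ** 2).sum() + (cont.sum(axis=0) ** 2).sum()
            - 2 * (cont ** 2).sum())

print(mirkin_distance([0, 0, 1, 1, 2], [1, 1, 0, 0, 0]))   # 4, which equals n(n-1)(1 - Rand index) here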

3.6. ANOVA Test

The “analysis of variance” (ANOVA) test is used to compare three or more groups or conditions in an experiment. A one-way ANOVA can help to find out whether the means of the groups/conditions are significantly different from one another or whether they are relatively the same. The null hypothesis typically corresponds to a general or default position. One-way ANOVA is a simple special case of the linear model. The one-way ANOVA form of the model is
$$y_{ij} = \alpha_{.j} + \varepsilon_{ij},$$
where $y_{ij}$ is a matrix of observations in which each column represents a different group, $\alpha_{.j}$ is a matrix whose columns are the group means (the “dot $j$” notation means that $\alpha_{.j}$ applies to all rows of the $j$th column; that is, the value $\alpha_{.j}$ is the same for all $i$), and $\varepsilon_{ij}$ is a matrix of random disturbances.

The standard ANOVA table has six columns:
(i) the source of the variability;
(ii) the sum of squares (SS) due to each source;
(iii) the degrees of freedom (df) associated with each source;
(iv) the mean squares (MS) for each source, which is the ratio SS/df;
(v) the $F$-statistic, which is the ratio of the mean squares;
(vi) the probability value, which is derived from the cdf of $F$.

The ANOVA test makes the following assumptions about the data:
(i) all sample populations are normally distributed;
(ii) all sample populations have equal variance;
(iii) all observations are mutually independent.
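A one-way ANOVA of the kind used later in Section 5 can be run as follows; the five groups of silhouette values below are simulated stand-ins, not the paper's measurements.

import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# One group of 150 silhouette values per clustering algorithm (simulated stand-in data).
groups = [rng.normal(loc=m, scale=0.05, size=150) for m in (0.45, 0.47, 0.52, 0.40, 0.55)]
F, p = f_oneway(*groups)      # one-way ANOVA across the five groups
print(F, p)                   # the null hypothesis of equal means is rejected when p < alpha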

4. Mathematical Illustration

Every logical experiment needs a well-defined mathematical illustration. In this section, the Davies-Bouldin (DB) index and the Dunn index are applied to three commonly used clustering algorithms (typical K-means, DBSCAN, and Hierarchical) with the help of mathematical examples.

Example 1. Let us assume there is a set of 15 data objects: 10, 12, 15, 7, 22, 29, 31, 3, 7, 5, 1, 4, 12, 11, and 10.

Solution

K-Means Clustering

Step 1. In the first step, the typical K-means algorithm is applied to this dataset. Suppose the total number of clusters is 3 and assume the initial centroids are Clust1 = 10, Clust2 = 22, and Clust3 = 1. Now use the Manhattan distance measure to calculate the following.

First iteration: see Table 1 and Figure 3.

Now, the new centroids are Clust1 = 10.5, Clust2 = 27.3, and Clust3 = 3.25.

Second iteration: see Table 2.

No change occurs.

So those three clusters are the final clusters, and based on them two popular indexes, the Davies-Bouldin (DB) index and the Dunn index, are calculated below.

Davies-Bouldin Index. Consider the final clusters obtained above with their centroids, where $\sigma_i$ denotes the average (Manhattan) distance of the elements of cluster $i$ to its centroid and $d(c_i, c_j)$ denotes the distance between centroids $c_i$ and $c_j$, as in the formula of Section 3.2.

Substituting these values into the DB formula gives the Davies-Bouldin index for this K-means partition.

Dunn Index. The simple form of the Dunn index calculation is
$$D = \frac{d_{\min}}{d_{\max}},$$
where $d_{\min}$ is the minimum distance between two points belonging to different clusters and $d_{\max}$ is the maximum distance between any two points selected from the same cluster [13].

According to this form, the calculated values are as shown in Table 3.

The Dunn index then follows by taking the ratio of the minimum intercluster distance to the maximum intracluster distance listed in Table 3.
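The arithmetic of this example can be verified with a short script; the sketch below recomputes both indexes for the final K-means partition derived above (the particular intercluster/intracluster distance choices are assumptions, so the resulting values need not match Table 3 exactly).

import numpy as np
from itertools import combinations

clusters = [np.array([10, 12, 15, 7, 7, 12, 11, 10]),   # Clust1 (centroid 10.5)
            np.array([22, 29, 31]),                      # Clust2 (centroid ~27.33)
            np.array([3, 5, 1, 4])]                      # Clust3 (centroid 3.25)

cents = [c.mean() for c in clusters]
sigma = [np.abs(c - m).mean() for c, m in zip(clusters, cents)]     # average distance to centroid

db = np.mean([max((sigma[i] + sigma[j]) / abs(cents[i] - cents[j])
                  for j in range(3) if j != i) for i in range(3)])

d_min = min(abs(a - b) for i, j in combinations(range(3), 2)
            for a in clusters[i] for b in clusters[j])              # minimum intercluster distance
d_max = max(c.max() - c.min() for c in clusters)                    # maximum intracluster distance
dunn = d_min / d_max
print(db, dunn)   # a lower DB and a higher Dunn indicate a more compact, better separated partition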

DBSCAN Clustering. Now the DBSCAN clustering algorithm is applied to the same dataset, assuming a minimum number of points (minpts) of 3 and a cluster radius (eps) of 1 cm. Initially there are three clusters with the same means as before; they are shown in Figure 4.

According to the DBSCAN clustering criteria, we have the following.

Here two data objects, 15 and 31, are treated as outliers because they lie at a distance ≥ eps (1 cm) from their neighbouring points, and since the minimum-points condition is not satisfied by Clust3, the objects of Clust3 also act as outliers. However, the objects of Clust3 together with the objects 15 and 31 all form one group that satisfies both DBSCAN criteria; it is also called the cluster of noise.

Now, in the same way, the DB and Dunn indexes are calculated on these final clusters.

These values may change depending on the assumed eps and minpts values.

Evaluation. After checking the values of the indexes, it can be said that for this particular set of data objects the typical K-means clustering gives better compactness and better separability than DBSCAN clustering (since a lower DB value and a higher Dunn value indicate more desirable results). In most such cases DBSCAN gives worse results than K-means clustering because it forms arbitrarily shaped clusters.

5. Result Analysis

5.1. Datasets
5.1.1. Air Pollution Dataset

The experimental air pollution database is taken from the http://www.wbpcb.gov.in/ website for the Kolkata district, covering the period from 01/07/2011 to 01/05/2012. The database contains different air pollutant measurements such as benzene, CO2 (carbon dioxide), NO2 (nitrogen dioxide), O3 (ozone), PM (particulate matter), and SO2 (sulphur dioxide).

5.1.2. Customer Dataset

The wholesale customer dataset is taken from the UCI Machine Learning Repository. The dataset contains eight attributes: fresh products, milk, grocery, frozen, detergents, delicatessen, channel, and region.

5.1.3. Wine Dataset

The wine dataset is taken from the UCI Machine Learning Repository. The dataset contains thirteen attributes: alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline.

5.1.4. Vehicle Dataset

The vehicle dataset is taken from the UCI Machine Learning Repository. The dataset contains eighteen attributes: compactness, circularity, distance circularity area, radius ratio, Pr.Axis aspect ratio, max. length aspect ratio, scatter ratio, elongatedness, Pr.Axis rectangularity, max. length rectangularity area, scaled variance along major axis, scaled variance along minor axis, scaled radius of gyration, skewness about major axis, skewness about minor axis, kurtosis about minor axis, kurtosis about major axis, and hollows ratio. Here four classes are considered: Opel, Saab, Bus, and Van. The four real datasets and their main characteristics are shown in Table 4. The experiment is based on 150 iterations and five clustering algorithms.

5.2. Experimental Analysis

In the first analysis, the accuracy of the different clustering methods is measured using the silhouette index, as depicted in Tables 2, 3, 4, and 5. In each case, 150 iterations are considered. A one-way ANOVA test is applied to the silhouette index values at different iterations, and the null hypothesis has been evaluated and reported in Tables 9, 10, 11, and 12. The accuracy comparison of all clustering algorithms on the different datasets is given in Table 13.

The results of the ANOVA test are shown in Tables 9, 10, 11, and 12 and Figures 5, 6, 7, and 8.

The reason for performing an ANOVA test is to see whether there is any difference between groups on some variable. If the probability value is smaller than the chosen significance level, we reject the null hypothesis. Here, in all cases, the probability value is smaller than the significance level; therefore, we reject the null hypothesis: the means of the five algorithms are not all equal, that is, at least one of the means is different. However, the ANOVA test does not tell where the difference lies; a t-test is required to test each pair of means.

6. Discussion

From Table 13 it can be concluded that the Canonical PSO based K-means algorithm provides the most desirable results compared to the other clustering algorithms stated in Section 2, while DBSCAN clustering provides the worst results (based on the initial values eps = 25 and minpts = 65). In this part, the results of these algorithms are also evaluated by applying other well-known indexes (the Dunn, Davies-Bouldin, Rand, and Mirkin indices).

From Table 14 the results of this experiment can be summarised as follows.
(1) According to the Dunn index, an algorithm that produces clusters with a high Dunn index is more desirable. Here the Canonical PSO based K-means algorithm produces better results than the others (0.7132 > 0.6542 > 0.6152 > 0.5922 > 0.3984).
(2) According to the Davies-Bouldin (DB) index, the algorithm that produces the collection of clusters with the smallest Davies-Bouldin index is considered the best. Here again the Canonical PSO based K-means algorithm produces better results than the others (0.2074 < 0.3283 < 0.3344 < 0.5484 < 0.5816).
(3) According to the Rand index, an algorithm that produces clusters with a high Rand index is more suitable and compact. Here the Canonical PSO based K-means algorithm produces better results than the others (0.7825 > 0.6812 > 0.4687 > 0.3702 > 0.3022).
(4) The Mirkin index gives its smallest value when the number of clusters attains the optimal level. Here the Canonical PSO based K-means algorithm again produces much better results than the others (0.1123 < 0.2312 < 0.3451 < 0.4802 < 0.7167).
The least effective clustering algorithm in this experiment is not uniquely determined by the indices in Table 14; however, after an average-case analysis it can be observed that the DBSCAN algorithm produces less proper clusters than the others (although this depends entirely on the initial assumptions for the minpts and eps values of the DBSCAN algorithm). All these results and analyses are represented in pictorial form in Figure 9.

The same approach and results also hold for the other three datasets (customer, wine, and vehicle). The detailed analysis of the results is tabulated in Tables 15, 16, and 17, and the corresponding pictorial forms are given in Figures 10, 11, and 12.

7. Conclusion and Future Work

Although there are several proposals for validity indices in the literature, most of them succeed only in certain situations and rely on certain conditions. This paper introduces a new concept, the Canonical PSO based K-means algorithm, and also presents ample comparisons of six popular clustering validity approaches. All these analyses are done on a few real-life datasets containing air pollutant, wholesale customer, wine, and vehicle data. Under certain assumptions, the proposed algorithm produces more desirable clustering results than the other existing algorithms described in this paper.

Several extensions of the present work are in progress. Studies of other clustering algorithms and other indices, and of their effectiveness in measuring proper, compact clusters, are also underway. In future work, we plan to amalgamate the DBSCAN and Hierarchical algorithms with the PSO approach to seek better ways of producing improved clustering output.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.