Abstract

The cluster evaluation process is of great importance in machine learning and data mining. Evaluating clustering quality shows how competent a proposed approach or algorithm is. Nevertheless, evaluating the quality of any cluster is still an issue. Although many cluster validity indices have been proposed, new approaches that measure clustering quality more accurately are still needed, because most existing approaches measure cluster quality correctly only when the shape of the cluster is spherical, and very few clusters in the real world are spherical. Therefore, in this study, a new Validity Index for Arbitrary-Shaped Clusters based on the kernel density estimation (the VIASCKDE Index) was proposed to overcome this issue. In the VIASCKDE Index, we used the separation and compactness of each data point to support arbitrary-shaped clusters and utilized the kernel density estimation (KDE) to give more weight to the denser areas in the clusters to support cluster compactness. To evaluate the performance of our approach, we compared it to state-of-the-art cluster validity indices. Experimental results demonstrate that the VIASCKDE Index outperforms the compared indices.

1. Introduction

Clustering approaches are unsupervised learning techniques that separate data into groups called clusters according to the similarities and dissimilarities among the data [1, 2]. DBSCAN [3], k-means [4], BIRCH [5], Spectral Clustering [6], Agglomerative Clustering [7], HDBSCAN [8], Affinity Propagation [9], and OPTICS [10] are some examples, and they are used in many fields such as pattern recognition [11–13], machine learning [14–16], data mining [17, 18], web mining [1, 19], bioinformatics [20, 21], and streaming data mining [22, 23]. On the other hand, measuring the performance of any proposed clustering approach is also an important issue, because each algorithm has its own point of view and the results of different clustering techniques vary. To address this problem, cluster validation analysis and cluster validation indices have emerged. These approaches are generally used for two purposes: measuring the performance of clustering algorithms and guiding clustering algorithms by finding the optimum number of clusters.

Cluster validation indices are divided into two main categories: internal and external indices. In external indices, the true class labels are compared with the labels assigned by the proposed algorithm to measure performance; therefore, to use these indices, true class labels are needed. The Purity [24], Rand Index [25], Adjusted Rand Index [26], Accuracy, Precision and Recall [27], F-Measure [28], and NMI [29] can be given as examples of these types of indices. In internal indices, on the other hand, actual class labels are not needed to measure cluster quality. In these indices, the evaluation of clustering performance is based on how similar the data in the same cluster are to each other, known as compactness, and how dissimilar the data in different clusters are from each other, known as separation. The Silhouette Index (SI) [30], Dunn Index [31], Davies–Bouldin (DB) [32], Calinski-Harabasz (CH) [33], Xie-Beni (XB) [34], S_Dbw [35], and RMSSTD [36] can be mentioned as primary cluster validity indices. In addition, there are many newer cluster validity indices such as the CVNN [37], CVDD [38], DSI [39], SCV [40], and AWCD [41].

The main problem of the majority of state-of-the-art cluster validity indices is that they measure cluster quality correctly only when the shapes of the clusters are spherical. For example, the Silhouette Index (SI) uses the mean distances of each data point in a cluster to evaluate quality. Similarly, Davies–Bouldin (DB) uses cluster diameters and cluster centroids, and Calinski-Harabasz (CH) uses the squares of intracluster and intercluster distances. All these calculations are ideal if the shape of the cluster is spherical. However, only a minority of real-world clusters are spherical. Additionally, if the shape is arbitrary, these indices cannot measure cluster quality correctly, because the center of gravity of a cluster lies in its middle only if the shape is spherical.

Similar to our approach, there is another kernel density estimation-based cluster validation index, named the Mclus [42]. In the Mclus, the authors used a mode estimation function to assess cluster quality. This mode function allows the index to assess cluster quality by adopting interpoint distance measures that can be defined to have a probability density function. To evaluate clusterings with more than one cluster (K > 1), they applied the mode estimation procedure to the interpoint distances between the data members, which are assumed to have a probability density function. In this study, on the other hand, we propose a novel internal Validity Index for Arbitrary-Shaped Clusters based on the kernel density estimation (the VIASCKDE Index). We aim to calculate cluster quality accurately by using the compactness and separation of each data point to support arbitrary-shaped clusters, and by using the kernel density estimation (KDE) to weight denser regions in the clusters to support cluster compactness. The advantages of our new approach can be listed as follows:
(i) The VIASCKDE Index can evaluate arbitrary-shaped clusters correctly.
(ii) It weights denser regions to support the compactness of clusters.
(iii) It is suitable for all types of clustering techniques, especially density-based algorithms.
(iv) It can be used for micro-cluster-based approaches.
(v) It achieves greater performance when compared with state-of-the-art techniques.

The rest of this paper is organized as follows: Section 2 reviews the related studies. Section 3 explains the problem with existing works and the need for the proposed approach. Section 4 gives the details of the VIASCKDE Index, while Section 5 compares the experimental results with state-of-the-art approaches on real and synthetic datasets. Section 6 provides the discussion of the results, and Section 7 presents the conclusion of the study.

2. Related Works

Among cluster validation techniques, internal methods do not need the actual class labels. Cluster validation is performed by calculating the similarities within clusters and the differences between clusters produced by the model, to reveal how consistent the produced clusters are [43]. As mentioned above, in internal methods, cluster quality is evaluated in terms of two concepts [44]:
(1) Compactness: it states how close the data in the same cluster are to each other. Closer data mean better clustering.
(2) Separation: it evaluates how far the clusters are from each other. In clustering evaluation, clusters are expected to be as far from each other as possible.

The illustration of these two concepts is presented in Figure 1, and their combination is expressed in Eq. (1), where α and β are the weights.

There are many internal methods proposed in the literature. In this section, we focused on the validation indices that are relevant to our approach. To make definitions shorter and more understandable, the general definitions are as follows:

Let $X=\{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^d$ be a dataset containing $n$ points in a $d$-dimensional space, where $x_i \in \mathbb{R}^d$. $X$ is partitioned into $k$ disjoint clusters (where $C_i$ is a cluster and $i=1,2,3,\ldots,k$), and $n_i$ data points are in cluster $C_i$. The cluster center, i.e., the gravity center of cluster $C_i$, is the mean of the data belonging to $C_i$ and is calculated by $c_i = \frac{1}{n_i}\sum_{x \in C_i} x$, while the mean of the whole dataset is calculated by $\bar{X} = \frac{1}{n}\sum_{x \in X} x$. In the present study, the mentioned distance is the Euclidean distance; for any two data points $x$ and $y$ of the dataset, the Euclidean distance between them is expressed as $d_e(x, y)$. In light of this information, we can briefly list the main internal cluster validity indices as follows:

Silhouette Index (SI) [30]: as given in Figure 2, the compactness value of a data point in any cluster is calculated by measuring the distance from that point to every other point in the same cluster. The compactness of the cluster, notated as a(x), is then calculated as the mean compactness of all the data the cluster contains. The average of the distances from a data point to the elements of the nearest cluster to which it does not belong gives the separation value of that point; the separation value of the cluster, notated as b(x), is found by averaging the separation values of all the data of the cluster. From these, the SI value of the model can be calculated. The equations to calculate SI, a(x), and b(x) are given in equations (2)–(4), respectively. The SI value lies in [−1, +1], where −1 means the worst clustering and +1 the best.
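As a quick, hedged illustration (the toy data below are ours, not from the study), scikit-learn's silhouette_score implements the computation described by equations (2)–(4):

import numpy as np
from sklearn.metrics import silhouette_score

# two compact, well-separated toy clusters (illustrative data)
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.9, 1.0], [1.0, 0.9]])
labels = np.array([0, 0, 1, 1])
print(silhouette_score(X, labels))  # close to +1, i.e., near-ideal clustering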

Dunn Index (DI) [31]: the DI calculates the success of the model based on compactness and the separation between the clusters. To do this, the DI value of a cluster is calculated from the distance to the closest cluster and its own diameter. Let $d_{\min}$ be the closest distance between clusters $C_i$ and $C_j$, and let $\mathrm{diam}(C_l)$ be the diameter of cluster $C_l$; these two values are calculated by $d_{\min} = \min_{x \in C_i,\, y \in C_j} d_e(x, y)$ and $\mathrm{diam}(C_l) = \max_{x, y \in C_l} d_e(x, y)$. Knowing $d_{\min}$ and $\mathrm{diam}(C_l)$, the DI of the model is calculated by equation (5). The larger the resulting value, the more successful the clustering.
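Since scikit-learn provides no Dunn Index, the following is a minimal O(n²) sketch assembled directly from the definitions of d_min and diam above (the function and variable names are ours):

import numpy as np
from sklearn.metrics import pairwise_distances

def dunn_index(X, labels):
    clusters = [X[labels == k] for k in np.unique(labels)]
    # diam(C_l): the largest intra-cluster distance over all clusters
    diam = max(pairwise_distances(C).max() for C in clusters if len(C) > 1)
    # d_min: the smallest distance between points of two different clusters
    d_min = min(pairwise_distances(Ci, Cj).min()
                for i, Ci in enumerate(clusters)
                for Cj in clusters[i + 1:])
    return d_min / diam  # larger values indicate more successful clustering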

Calinski-Harabasz (CH) [33]: the CH calculates the compactness and separation values via the mean of the squares of the intercluster and intracluster distances. The CH index value is calculated by (6); the goal is to make the result as large as possible.

Davies–Bouldin (DB) [32]: the compactness value is calculated from the mean variance of the data in each cluster, while the separation value is calculated from the distance between the center of a cluster and the center of the closest one. Let avg(Ci), which is calculated by (7), be the average distance of the data in cluster i to the cluster center; the DB index is then calculated by (8). Lower DB values indicate better clustering.
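Both the CH and the DB are available in scikit-learn, so a hedged side-by-side check on a toy clustering reads (data illustrative):

import numpy as np
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

X = np.array([[0.0, 0.0], [0.1, 0.1], [0.9, 1.0], [1.0, 0.9]])
labels = np.array([0, 0, 1, 1])
print(calinski_harabasz_score(X, labels))  # CH: the larger, the better
print(davies_bouldin_score(X, labels))     # DB: the lower, the better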

S_Dbw Index [35]: the S_Dbw calculates the compactness value of the clusters from the standard deviations (σ) of the data that each cluster contains, and the separation value from the distances between the cluster centers. The S_Dbw is an index that considers the density of clusters. Letting den denote cluster density, the index is, in essence, the sum of two terms, $S\_Dbw(k) = \mathrm{Scat}(k) + \mathrm{Dens\_bw}(k)$, where Scat(k) is the average scattering (normalized variance) within the clusters and Dens_bw(k) is the inter-cluster density.

Distance-based Separability Index (DSI) [39]: the DSI is another approach that measures cluster quality by means of intercluster and intracluster distances. Let $C_i$ and $C_j$ be two clusters with $N_i$ and $N_j$ data points, respectively. The intracluster distance set of cluster $C_i$ is the set $\{d_e(x, y) : x, y \in C_i,\ x \neq y\}$, as given in equation (13). Similarly, the intercluster distance set is built from the distances of data pairs drawn from clusters $C_i$ and $C_j$. To compute the DSI, the Kolmogorov–Smirnov (KS) test is utilized.

Let $s_i$ be the Kolmogorov–Smirnov statistic of cluster $C_i$, computed between its intracluster distance set and the intercluster distance set, and let $s_j$ be the corresponding statistic of $C_j$; the DSI of these two clusters is then the result of the following equation:

$$\mathrm{DSI}(C_i, C_j) = \frac{s_i + s_j}{2}.$$
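A hedged sketch of this computation for a pair of clusters, using scipy.stats.ks_2samp as the KS test (the pairing and the final averaging reflect our reading of the DSI definition; names are ours):

import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import pairwise_distances

def dsi_pair(Ci, Cj):
    # intracluster distance sets: upper triangle, self-distances excluded
    d_ii = pairwise_distances(Ci)[np.triu_indices(len(Ci), k=1)]
    d_jj = pairwise_distances(Cj)[np.triu_indices(len(Cj), k=1)]
    d_ij = pairwise_distances(Ci, Cj).ravel()  # intercluster distance set
    s_i = ks_2samp(d_ii, d_ij).statistic       # KS statistic for cluster Ci
    s_j = ks_2samp(d_jj, d_ij).statistic       # KS statistic for cluster Cj
    return (s_i + s_j) / 2                     # averaged as the pair's DSI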

RMSSTD [36]: the root-mean-square standard deviation (RMSSTD) evaluates clustering quality by measuring the homogeneity of the clusters and is commonly used for hierarchical clustering. Let the dataset consist of $k$ clusters, let $p$ be the number of independent variables (attributes), let $\bar{x}_{ij}$ be the mean of the data in variable $j$ and cluster $i$, and let $n_{ij}$ be the number of data in variable $j$ and cluster $i$. RMSSTD is measured by equation (12):

$$\mathrm{RMSSTD} = \left( \frac{\sum_{i=1}^{k}\sum_{j=1}^{p} \sum_{x \in C_i} (x_j - \bar{x}_{ij})^2}{\sum_{i=1}^{k}\sum_{j=1}^{p} (n_{ij} - 1)} \right)^{1/2}.$$

A lower RMSSTD means better clustering.

3. Statement of the Problem

Although many approaches have been proposed, analysis of cluster quality is still an open issue. Because there are many clustering approaches in the literature and they differ from each other in many aspects, no cluster validation technique can evaluate the quality of all produced clusters precisely. Nevertheless, some approaches have been widely used for this task, including the Silhouette Index, Dunn Index, Davies–Bouldin, Calinski-Harabasz, and S_Dbw. Although these indices are used commonly, each of them has a specific problem with cluster validation, as given in Table 1. For example, a significant part of the proposed cluster validity indices assumes that the shapes of clusters are spherical. In fact, only a minority of real-world clusters are spherical, as the examples in Figure 3 show. The SI is one such index: it cannot achieve a good score if the shape of the cluster is not spherical. The DB and the CH identify clusters that are compact and well separated; however, in the real world, very few clusters are of that shape. The DI is better than the DB and the CH when the clusters are not well separated, but it encounters computational cost issues when the number of clusters or the dimensionality is high, and it is affected by noisy data, which increases the diameter. As for the S_Dbw, although it was proposed as a density-supported validity index and scores well on compact, well-separated clusters, it is affected by the distribution of the data. The DSI, being a density-based cluster validity index, is good at dealing with arbitrary-shaped clusters and can evaluate such cluster quality successfully; however, it too is affected when clusters are too close. Likewise, the RMSSTD encounters problems when the clusters are close to each other. Further examples of the shape-related problems that existing indices come across could be listed.

Another problem with existing cluster validation indices is that they assume all the data in a cluster have a homogeneous distribution. However, the data inside a cluster mostly form regions of different densities, as seen in Figure 4 (darker areas mean denser regions), and the data in the same cluster may not have a homogeneous distribution, as can be seen in Figure 4(b). Therefore, an approach that considers the density of data within clusters is still needed to support cluster compactness. Although the S_Dbw and the DSI are two examples of cluster validity indices that take the density of clusters into consideration, they do not examine the density regions inside the clusters. Such indices are useful for discovering the shapes of clusters; however, some regions may be denser than others inside a cluster, and these indices do not take that into account. Giving more weight to denser regions can make an index more accurate, because it supports compactness. In the present study, we propose a new cluster validity index that can discover arbitrary-shaped clusters and weight the denser regions by using the kernel density estimation, which is explained in Section 4.2.

4. Proposed Cluster Validity Index: A Novel Internal Cluster Validity Index for Arbitrary-Shaped Clusters Based on the Kernel Density Estimation (The VIASCKDE Index)

4.1. Basic Idea

In the present study, a new cluster validation index, named shortly the VIASCKDE (Validity Index for Arbitrary-Shaped Clusters based on the Kernel Density Estimation) Index, is proposed. The VIASCKDE Index is not affected by cluster shape and can therefore make a realistic evaluation of clustering performance regardless of the clusters' shape. Unlike existing cluster validation indices, our index calculates the compactness and separation values of a cluster from the compactness and separation values of each data point computed separately. In other words, it calculates the compactness and separation values of the cluster over the distances between data points, independent of parameters such as the cluster center, because in nonspherical clusters the distance of a data point to its closest neighbor is more important than its distance to the cluster center. As can be seen in the example given in Figure 5, the closest data point in the cluster that x belongs to is used when calculating the compactness value for the data point x. Similarly, the separation value of x is calculated from the distance to the closest data point of a cluster that x does not belong to. A sketch of these two per-point values is given below.
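The following minimal NumPy sketch shows these two per-point quantities for a single data point x_i; the function name and structure are ours, purely for illustration:

import numpy as np

def a_and_b(X, labels, i):
    d = np.linalg.norm(X - X[i], axis=1)   # distances from x_i to all points
    same = labels == labels[i]
    same[i] = False                        # exclude x_i itself
    a = d[same].min()                      # compactness: closest point of own cluster
    b = d[labels != labels[i]].min()       # separation: closest point of another cluster
    return a, b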

As mentioned before, another problem with existing cluster validity indices is the assumption that the data inside a cluster have a homogeneous distribution, even if the shape of the cluster is arbitrary. Consequently, they give each data point of the cluster the same weight, whereas, as presented in Figure 4, the distribution of data inside the same cluster may vary. Therefore, a method that considers this situation is needed. To overcome this problem, we propose a weighting method based on the kernel density estimation (KDE), which is detailed in the next section.

4.2. Kernel Density Estimation-Based Weighting

In the literature, there are two types of distribution estimation methods: parametric and nonparametric. Parametric methods, for example the Gaussian distribution, assume that the data are gathered around the center and that the majority of the data lie within a circle whose radius is the standard deviation; that is, the distribution curve has only one peak. Recall that the univariate normal distribution, with mean $\mu$ and variance $\sigma^2$, has the probability density function

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad -\infty < x < \infty.$$

Nonparametric distribution estimation methods, on the other hand, allow more than one peak on the curve. Let $\mathbf{x}$ be an $n$-dimensional vector that has a multivariate Gaussian (or normal) distribution with the $n$-dimensional mean vector $\boldsymbol{\mu}$, and let $\Sigma$ be the $n \times n$ covariance matrix. The multivariate Gaussian density is calculated as follows:

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right).$$

The kernel density estimation (KDE) is a nonparametric density estimator. It is also used to analyze existing data to decide where incoming data should be placed. For this ability, it is commonly used in many areas such as data analysis in healthcare services, artificial intelligence applications, the stock market, and many others [2]. In Figure 6, the bars represent the histogram and the orange line represents the KDE computed over it. When analyzing and representing data, the KDE figures out the distribution of the data according to one of various kernel functions, which are given in Figure 7; each has its own characteristic and equation. In mathematical form, the KDE is the function

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right),$$

where $K(\cdot)$ is one of the kernel functions given in Figure 7, the most commonly used being the Gaussian. These are smooth functions, and the bandwidth $h > 0$ controls the amount of smoothing. The KDE smooths each data point $X_i$, one after the other, until reaching the final density estimate.
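The estimator above translates almost line by line into code; this is a sketch of the univariate Gaussian case only, not the library routine used later in the experiments:

import numpy as np

def kde_at(x, samples, h=0.05):
    u = (x - samples) / h                         # (x - X_i) / h for every sample
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel K(u)
    return K.sum() / (len(samples) * h)           # (1 / nh) * sum over all samples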

In addition to estimating the density function of univariate data, as in the example given in Figure 6, we can apply the KDE to multivariate datasets. In this case, we must use a kernel function that can process a multidimensional dataset; such a kernel is typically constructed as a product kernel or by a radial basis approach. Let $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ denote an independent random sample of size $n$ drawn from a multivariate random variable with density $f$ defined on $\mathbb{R}^d$. In the following, we consider only the two-dimensional case without loss of generality; thus, each sample point is given by $\mathbf{x}_i = (x_i, y_i)$, where $x_i$ and $y_i$ denote the x and y coordinates, respectively. The multivariate kernel density estimator at point $\mathbf{x}$ is given by

$$\hat{f}_H(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} K_H(\mathbf{x} - \mathbf{x}_i),$$

where $K_H(\cdot)$ is a multivariate kernel function and $H$ denotes a symmetric positive definite bandwidth matrix.

Although the KDE is a nonparametric probability density estimator for handling inhomogeneous distributions, we can also use it as a weighting function to support the compactness of clusters. Since the KDE value of any data point is the summation of the kernel contributions of the data around it, the weight of a data point close to the edges of the data distribution is expected to be low, while the KDE value of a data point near the center is expected to be high. Therefore, the KDE can be used as a function for weighting the data; in our approach, doing so supports the compactness of the cluster regardless of its shape. Namely, we used the KDE to weight each data point, giving more importance to the data in the denser regions, and we calculated the weight of each data point, WKDE, according to its KDE value. For example, suppose we want to find the WKDE values for the data points x1=30 and x2=40 in the example dataset given in Figure 6. WKDE for x1 would be 0.007, while WKDE for x2 would be 0.05, which is very high compared to the other one. That makes our approach superior to existing cluster validity indices, which ignore the distribution of data within the same cluster: other density-based approaches would weight x1 and x2 equally in this example, which would be incorrect.
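A hedged sketch of this WKDE weighting, using scipy.stats.gaussian_kde and a min-max rescaling that mirrors the MinMaxNormalization step of Algorithm 1 (the toy sample is ours; the exact values 0.007 and 0.05 above come from the data of Figure 6, which we do not reproduce here):

import numpy as np
from scipy.stats import gaussian_kde

x = np.random.default_rng(1).normal(40, 5, size=200)     # toy univariate sample
dens = gaussian_kde(x)(x)                                # KDE value of each point
w_kde = (dens - dens.min()) / (dens.max() - dens.min())  # weights scaled to [0, 1]
# points near the dense center get weights near 1; points at the edges near 0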

4.3. Definitions and Equations

In light of these explanations, let us explain the details of the VIASCKDE Index.

Definition 1. (CoSeD, Compactness and Separation value of a Data point): the CoSeD is the compactness and separation value of a single data point. To calculate it, the WKDE value of the data point, explained in Section 4.2, is calculated first. Let a(x) (compactness) be the distance from x to the closest data point of the cluster $C_i$ to which x belongs, and let b(x) (separation) be the distance from x to the closest data point of a cluster $C_j$ to which x does not belong; the compactness and separation value of the data point x, CoSeD(x), is then calculated by the following equation:

$$\mathrm{CoSeD}(x) = W_{\mathrm{KDE}}(x) \cdot \frac{b(x) - a(x)}{\max(a(x), b(x))}.$$

Definition 2. (CoSeC, Compactness and Separation value of a Cluster): the CoSeC value is the average of the CoSeD values of the data points owned by the cluster. The CoSeC value of cluster $C_i$ is calculated by equation (18), where $C_i$ is the cluster to which the data points x belong and $n_i$ is the number of data points that cluster $C_i$ possesses:

$$\mathrm{CoSeC}_i = \frac{1}{n_i} \sum_{x \in C_i} \mathrm{CoSeD}(x). \tag{18}$$

Definition 3. (the VIASCKDE, the value of the overall clustering): let k be the number of clusters, let $n_j$ be the number of data points that cluster $C_j$ possesses, and let $\mathrm{CoSeC}_j$ be the value of cluster $C_j$ calculated in equation (18); the VIASCKDE Index value is then calculated by equation (19), as the cluster-size-weighted mean of the CoSeC values:

$$\mathrm{VIASCKDE} = \frac{\sum_{j=1}^{k} n_j \cdot \mathrm{CoSeC}_j}{\sum_{j=1}^{k} n_j}. \tag{19}$$

The VIASCKDE value lies in [−1, +1], where +1 refers to the best possible value and −1 to the worst.

4.4. The Algorithm

Let Gaussian_KDE be a function that calculates the KDE, and let MinMaxNormalization be a function that normalizes data to the range [0, 1]. The CoSeD and CoSeC values were explained in Section 4.3. In light of this information and the equations given in the previous section, the pseudocode of the VIASCKDE Index is given in Algorithm 1.

Input: X, labels
Output: VIASCKDE
KDE_X ← Gaussian_KDE(X.T)
for k = 1 to size(unique(labels)) do
 data_of_cluster_k ← X[labels = k]   ►data belonging to cluster k
 data_not_in_k ← X[labels ≠ k]   ►data not belonging to cluster k
 kde_k ← MinMaxNormalization(KDE_X[labels = k])   ►KDE values of the data of cluster k, min-max normalized
 for i = 1 to size(data_of_cluster_k) do
  a_i ← closest_data(data_of_cluster_k)   ►distance from the i-th data point to the closest one in cluster k
  b_i ← closest_data(data_not_in_k)   ►distance from the i-th data point to the closest one outside cluster k
  CoSeD_i ← kde_k[i]·[(b_i − a_i)/max(a_i, b_i)]   ►compactness and separation of data point i
 CoSeC_k ← mean(CoSeD)   ►compactness-separation of cluster k
VIASCKDE ← Σ_k (n_k·CoSeC_k) / Σ_k n_k   ►overall VIASCKDE value of the clustering
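The following is a runnable Python sketch of Algorithm 1, assuming scipy.stats.gaussian_kde as the Gaussian_KDE function and the size-weighted averaging of equation (19) as reconstructed above; the authors' reference implementation is at https://github.com/senolali/VIASCKDE, so this is only an illustration:

import numpy as np
from scipy.stats import gaussian_kde
from sklearn.metrics import pairwise_distances

def viasckde_index(X, labels):
    kde_x = gaussian_kde(X.T)(X.T)            # KDE value of every data point
    cosec, sizes = [], []
    for k in np.unique(labels):
        if k == -1:
            continue                          # skip DBSCAN-style noise (our choice)
        in_k = labels == k
        if in_k.sum() < 2 or (~in_k).sum() < 1:
            continue                          # a(x) and b(x) both need neighbors
        w = kde_x[in_k]                       # W_KDE: min-max normalized KDE values
        w = (w - w.min()) / (w.max() - w.min() + 1e-12)
        d_in = pairwise_distances(X[in_k])
        np.fill_diagonal(d_in, np.inf)        # ignore self-distances
        a = d_in.min(axis=1)                  # closest point inside the own cluster
        b = pairwise_distances(X[in_k], X[~in_k]).min(axis=1)  # closest outside point
        cosed = w * (b - a) / np.maximum(a, b)  # CoSeD of each point
        cosec.append(cosed.mean())              # CoSeC of cluster k, equation (18)
        sizes.append(in_k.sum())
    sizes = np.asarray(sizes, dtype=float)
    # equation (19): cluster-size-weighted mean of the CoSeC values
    return float(np.sum(sizes * np.asarray(cosec)) / sizes.sum())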
4.5. Computational Complexity

Let k be the number of clusters in the dataset, let n be the number of data points the clusters possess, and let d be the number of features each data point possesses; the time complexity of the VIASCKDE Index is then $O(kn^2d)$, since it calculates the distance from each data point to all others. This means that, in terms of the dataset size, the complexity of the proposed approach is $O(n^2)$, which is acceptable when compared with the complexities of the other indices given in Table 1.

5. Experimental Study

5.1. Development Environment

To demonstrate the effectiveness of the VIASCKDE Index (https://github.com/senolali/VIASCKDE) in the experimental studies, the data were processed using the Python language in the Anaconda Spyder environment. Various facilities of the Scikit-learn library, such as the DBSCAN, Spectral Clustering, and HDBSCAN implementations and the metrics module, were used. The datasets were imported with the Pandas library, mathematical operations were performed with the NumPy library, and visualization was carried out with the matplotlib library. All experiments and comparisons were performed on a computer with 16 GB RAM, an Intel i7 processor, and the Windows 11 operating system.

5.2. Used Datasets

To measure the performance of the proposed approach, we performed an experimental study on both synthetic and real datasets. Since the main purpose of our approach is to measure performance on nonspherical clusters, artificial datasets containing clusters of different shapes were used; some of them are demonstrated in Figure 3. In addition to these synthetic datasets, real datasets that are frequently used in the clustering field were also used for testing. Details of the datasets used in the comparison process are provided in Table 2. Additionally, as given in Figure 8, some imbalanced datasets were used to analyze the performance of our cluster validation index on imbalanced data distributions.

5.3. Experimental Procedure

For the experimental study, we used the procedure given below. First, to ensure that all features are within the same range and to make parameter selection easier, the data were normalized using the min-max normalization demonstrated in (20). In addition, the ARI (Adjusted Rand Index) was used as the ground-truth method to evaluate the performance of the cluster validation indices, by comparing the cluster labels produced by the clustering algorithm with the actual cluster labels. The reason we chose the ARI is that the generated cluster labels do not need to be identical to the actual cluster labels. For example, assume the clustering algorithm produced the cluster labels {1,1,1,2,2,2} and the actual labels are {2,2,2,4,4,4}. The accuracy value for this situation would be 0%, while the ARI value would be 100%, which is the correct result.
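This worked example can be checked directly with scikit-learn, whose adjusted_rand_score ignores the actual label values and scores the matching grouping as perfect:

from sklearn.metrics import adjusted_rand_score

print(adjusted_rand_score([2, 2, 2, 4, 4, 4], [1, 1, 1, 2, 2, 2]))  # prints 1.0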

The procedure established in the testing process is as follows (a code sketch of this loop is given after the list):
Step #1: Select one of the algorithms (DBSCAN, HDBSCAN, and Spectral Clustering).
Step #2: Run the algorithm with randomly selected parameters on one of the selected datasets.
Step #3: Evaluate the quality of the clusters produced by the selected algorithm with the clustering validation indices (SI, DI, CH, DB, S_Dbw, DSI, RMSSTD, and VIASCKDE).
Step #4: Calculate the VIASCKDE Index for the produced clusters and check whether this is the best result so far. If it is, accept this value as the best one for the VIASCKDE Index. Then do the same for the other indices.
Step #5: To test each index sufficiently, go to Step #2 and repeat the cycle 100 times. When the cycle is completed, go to Step #6.
Step #6: Calculate the ARI value that corresponds to the most successful value obtained by each of the clustering validity indices, including our proposed approach.
Step #7: Compare the ARI values obtained by all cluster validity indices. Consider the one with the highest ARI value the most competent for this dataset.
Step #8: Go to Step #2 and repeat the same operations for the next dataset. When all datasets have been processed, go to Step #9.
Step #9: If all algorithms have been processed, finish the procedure; otherwise, go to Step #1.
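The loop below is a hedged sketch of Steps 1–7 for a single algorithm/dataset pair with the DBSCAN; the parameter ranges and the helper name best_by_index are illustrative assumptions, not values from the study:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score

def best_by_index(X, y_true, index_fn, n_trials=100, seed=0):
    rng = np.random.default_rng(seed)
    best_score, best_ari = -np.inf, None
    for _ in range(n_trials):                   # Step #5: repeat the cycle 100 times
        eps = rng.uniform(0.01, 0.5)            # Step #2: randomly selected parameters
        min_pts = int(rng.integers(2, 20))
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
        if len(set(labels)) < 2:
            continue                            # most indices are undefined for one cluster
        score = index_fn(X, labels)             # Steps #3-#4: evaluate and track the best
        if score > best_score:
            best_score = score
            best_ari = adjusted_rand_score(y_true, labels)  # Step #6: ARI of the best run
    return best_ari                             # Step #7: compared across the indices

# e.g., best_by_index(X, y, viasckde_index), using the sketch from Section 4.4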

5.4. Experimental Study
5.4.1. The Selection of Density Distribution Estimation Method

We performed experimental studies on the datasets to decide which density distribution estimation method should be selected, parametric or nonparametric. We selected the Gaussian method as the parametric method and the KDE as the nonparametric method. We carried out the experiments with the procedure given in Section 5.3, using the DBSCAN with randomly selected parameters. The kernel="gaussian" and h=0.05 were the parameters of the KDE-based VIASCKDE Index, while the Gaussian was the method of the parametric VIASCKDE Index. According to the obtained results, the Gaussian-based method was the best in 15 datasets, while the KDE-based method was the best in 17 datasets, as demonstrated in Table 3. Therefore, we selected the KDE-based method as the weighting function for our approach.

5.4.2. The Kernel Selection for KDE

As mentioned in Section 4.2, there are various kernels in the literature; the Gaussian, cosine, linear, tophat, and exponential kernels are examples, and they affect the smoothness of the estimated distribution. We performed the experiments with the procedure provided in Section 5.3, where the parameters of the DBSCAN algorithm were selected randomly, choosing a different kernel in each experimental run. As can be seen in Table 4, the Gaussian kernel was the best on all of the selected datasets when the bandwidth was 0.05.

5.4.3. Bandwidth Selection for the KDE

One of the most important parameters of the KDE is the bandwidth (h), which has a direct effect on the results. When h is too small, many wiggly structures appear on the density curve; when h is too large, the bumps on the curve are smoothed out, as shown in Figure 9. To find the best bandwidth for our approach, we performed experimental studies with the procedure given in Section 5.3, testing different bandwidth values on some of the datasets provided in Table 2. The best bandwidth was found to be 0.05, as can be seen in Table 5, when the kernel was the Gaussian.

5.4.4. The Tests on Both Synthetic and Real Datasets

In this section, experimental works were executed on both synthetic and real datasets. To detect nonspherical clusters in the test process, the DBSCAN, Spectral Clustering, and HDBSCAN were used. The DBSCAN algorithm takes two parameters (MinPts: the clustering threshold value, and ε: the accessibility distance), Spectral Clustering takes one parameter as input (n_clusters: the number of clusters) if affinity="nearest_neighbors", and HDBSCAN takes two parameters (min_cluster_size: the minimum cluster size, and min_samples). To test each algorithm with different parameters, we applied the random search method within the procedure given in Section 5.3, using each cluster validity index in turn as the guiding method to reach better clustering results. As the example given in Figure 10 shows, each index favored different results, meaning that the cluster validation performance of each one is also different. According to the obtained results, our index was the best one. The performance of each index on all datasets is presented in the following tables for each clustering algorithm (Tables 6–14).

6. Evaluation of the Results and Discussion

In our approach, we used the compactness and separation values of each data point to support arbitrary-shaped clusters. On its own, this tends to divide spherical clusters into small partitions; to cope with this issue, we used a density estimation method to support the compactness of clusters. In the literature, there are two types of density estimation methods: parametric and nonparametric. To decide which one is best for our approach, we carried out experiments on the datasets using the DBSCAN as the clustering algorithm. According to the experimental study, the nonparametric method was better than the parametric one, as the results in Table 3 show. After deciding on the nonparametric method, we selected the kernel density estimation as the nonparametric density estimator in order to support multivariate data (Table 4).

The second point worth discussing is the selection of the parameters of the kernel density estimation: the kernel method and the bandwidth. To find the best parameters, we conducted separate experiments for each parameter using the procedure given in Section 5.3, with the DBSCAN run with randomly selected parameters. As can be seen in Tables 4 and 5, the Gaussian was the best kernel method and h=0.05 was the best bandwidth. These were the parameter values used in the experimental studies in which our approach was compared with the other indices.

One of the advantages of the proposed VIASCKDE Index is that it can realistically evaluate clustering performance regardless of cluster shape. To test the success of our index on different cluster types, we used the DBSCAN, Spectral Clustering, and HDBSCAN algorithms with the procedure given in Section 5.3. The highest ARI values found as the best value by each index are given in Tables 11, 12, and 14. As can be seen in the tables, the VIASCKDE Index reached the highest ARI values on most of the datasets: in 47 of the 60 experiments, as given in Table 15. In addition, the ARI value of our index was very high even when it was not the index with the highest ARI value. Moreover, when our index was compared with the two density-based indices, the S_Dbw and the DSI, better results were obtained, as demonstrated in Table 15.

The other important advantage of our approach is that it considers the density of each cluster independently. For example, the Aggregation dataset has a nonhomogeneous density, as can be seen in Figure 4, and each cluster may also have a nonhomogeneous distribution, as shown in Figure 4(b). Our approach does not assume that all the data inside a cluster have a homogeneous distribution, and it does not weight each data point equally: it gives more importance to the data in the denser regions by multiplying them by a coefficient determined by the KDE. Doing so supports the compactness of clusters; in other words, this is what allowed our index to obtain better results.

Since the VIASCKDE Index has a density-based approach, it can also be used to evaluate the performance of algorithms that are based on a microcluster structure, which is used by the majority of density-based clustering algorithms. Such algorithms use the center of each microcluster as the actual data in the offline phase; therefore, the VIASCKDE Index can also be used to evaluate the performance of micro-cluster-based clustering algorithms.

7. Conclusion and Future Works

In the present study, we proposed a cluster validation index, called the VIASCKDE Index, to validate the quality of both spherical and nonspherical clusters. Our approach draws its strength from considering the distribution of data inside the clusters by using the KDE. Doing so supports the compactness of clusters irrespective of the cluster center, and thus the cluster may take an arbitrary shape. Most cluster validity indices in the literature can only make a realistic cluster quality evaluation when the cluster shape is spherical; however, in many instances, the cluster shape is not spherical. Our proposed approach calculates the compactness and separation values based only on the data points, which makes it possible to evaluate cluster quality irrespective of shape. Experimental studies revealed that the VIASCKDE Index reached the highest ARI values on most of the datasets, meaning that the proposed approach was the most successful among those compared. In the future, we plan to carry out studies to decrease the runtime complexity of the proposed index.

Data Availability

Python implementation of the proposed index is shared on GitHub (https://github.com/senolali/VIASCKDE).

Conflicts of Interest

The authors declare that they have no conflicts of interest.