Abstract
The cluster evaluation process is of great importance in machine learning and data mining. Evaluating the quality of the produced clusters shows how competent a proposed approach or algorithm is. Nevertheless, evaluating cluster quality is still an open issue. Although many cluster validity indices have been proposed, new approaches that measure clustering quality more accurately are needed, because most existing approaches measure cluster quality correctly only when the clusters are spherical, and very few clusters in the real world are spherical. Therefore, a new Validity Index for Arbitrary-Shaped Clusters based on kernel density estimation (the VIASCKDE Index) was proposed in this study to overcome this issue. In the VIASCKDE Index, we used the separation and compactness of each data point to support arbitrary-shaped clusters and utilized kernel density estimation (KDE) to give more weight to the denser areas of the clusters, which supports cluster compactness. To evaluate the performance of our approach, we compared it with state-of-the-art cluster validity indices. Experimental results demonstrated that the VIASCKDE Index outperforms the compared indices.
1. Introduction
Clustering approaches are unsupervised learning techniques that separate data into groups called clusters according to the similarities and dissimilarities among the data [1, 2]. DBSCAN [3], k-means [4], BIRCH [5], Spectral Clustering [6], Agglomerative Clustering [7], HDBSCAN [8], Affinity Propagation [9], and OPTICS [10] are some examples, and they are used in many fields such as pattern recognition [11–13], machine learning [14–16], data mining [17, 18], web mining [1, 19], bioinformatics [20, 21], and streaming data mining [22, 23]. Measuring the performance of a proposed clustering approach is also an important issue, because each algorithm has its own point of view and the results of each clustering technique vary. To address this problem, cluster validation analysis and cluster validation indices have emerged. These approaches are generally used for two purposes: measuring the performance of clustering algorithms, and guiding clustering algorithms by finding the optimum number of clusters.
Cluster validation indices are divided into two main categories: internal and external indices. In external indices, the true class labels are compared with the labels assigned by the proposed algorithm to measure the performance. Therefore, these indices require true class labels. Purity [24], the Rand Index [25], the Adjusted Rand Index [26], Accuracy, Precision and Recall [27], the F-Measure [28], and NMI [29] are examples of this type of index. Internal indices, on the other hand, do not need actual class labels to measure the quality of clusters. In these indices, the evaluation of clustering performance is based on how similar the data in the same cluster are to each other, known as compactness, and how dissimilar the data in different clusters are from each other, known as separation. The Silhouette Index (SI) [30], Dunn Index [31], Davies–Bouldin (DB) [32], Calinski-Harabasz (CH) [33], Xie-Beni (XB) [34], S_Dbw [35], and RMSSTD [36] can be mentioned as primary cluster validity indices. Besides, there are many newer cluster validity indices such as the CVNN [37], CVDD [38], DSI [39], SCV [40], and AWCD [41].
The main problem of the majority of state-of-the-art cluster validity indices is that they measure cluster quality correctly only when the clusters are spherical. As an example, the Silhouette Index (SI) uses the means of the distances of each data point in the cluster to evaluate quality. Similarly, Davies–Bouldin (DB) uses cluster diameters and cluster centroids, and Calinski-Harabasz (CH) uses the squares of the intracluster and intercluster distances. All of these calculations are ideal if the cluster is spherical. However, only a minority of real-world clusters are spherical. If the shape is arbitrary, these indices cannot measure the cluster quality correctly, because the center of gravity of a cluster lies in its middle only if the shape is spherical.
Similar to our approach, there is another kernel density estimation-based cluster validation index, named M_{clus} [42]. In M_{clus}, the authors used a mode estimation function to assess cluster quality. This mode function allows the index to assess cluster quality by adopting interpoint distance measures that can be defined to have a probability density function. To evaluate a clustering with more than one cluster (K > 1), they applied the mode estimation procedure to the interpoint distances between the data members, which are assumed to have a probability density function. In this study, on the other hand, we proposed a novel Internal Validity Index for Arbitrary-Shaped Clusters based on kernel density estimation (the VIASCKDE Index). We aimed to calculate the cluster quality accurately by using the compactness and separation of each data point to support arbitrary-shaped clusters, and by using kernel density estimation (KDE) to weight the denser regions of the clusters in support of their compactness. The advantages of our new approach can be listed as follows:
(i) The VIASCKDE Index can evaluate arbitrary-shaped clusters correctly
(ii) It weights denser regions to support the compactness of clusters
(iii) It is suitable for all types of clustering techniques, especially for density-based algorithms
(iv) It can be used for micro-cluster-based approaches
(v) It has greater performance when compared with state-of-the-art techniques
The rest of this paper is organized as follows: in Section 2, the related studies are reviewed. In Section 3, the problem with existing works and the need for the proposed approach are explained. Details about the VIASCKDE Index are given in Section 4, and the experimental results are compared with the state-of-the-art approaches on real and synthetic datasets in Section 5. After that, a discussion of the results is provided in Section 6. Finally, the conclusion of the study is presented in Section 7.
2. Background and Related Works
Internal cluster validation methods do not need the actual class labels. The validation is done by calculating the similarities within clusters and the differences between clusters produced by the model, to reveal how consistent the produced clusters are [43]. As mentioned above, internal methods evaluate cluster quality in terms of two concepts [44]:
(1) Compactness: it states how close the data in the same cluster are to each other. Closer data mean better clustering.
(2) Separation: it evaluates how far the clusters are from each other. In clustering evaluation, clusters are expected to be as far from each other as possible.
The illustration of these two concepts is presented in Figure 1, while the equation is demonstrated in Eq. (1). Here, α and β are the weights.
There are many internal methods proposed in the literature. In this section, we focused on the validation indices that are relevant to our approach. To make definitions shorter and more understandable, the general definitions are as follows:
Let X = {x_{1}, x_{2},…,x_{n}} ⊂ R^{d} be a dataset containing n points in a d-dimensional space, with x_{i} ∈ R^{d}. X is partitioned into k disjoint clusters (where C_{i} is a cluster and i = 1,2,3,…,k), and cluster C_{i} contains n_{i} data points. The cluster center c_{i}, which is the center of gravity of cluster C_{i}, is the mean of the data that belong to C_{i} and is calculated by c_{i} = (1/n_{i}) ∑_{x ∈ C_{i}} x, while the mean of the whole dataset is calculated by c = (1/n) ∑_{i=1}^{n} x_{i}. In the present study, the distance used is the Euclidean distance; for any two data points x and y of the dataset, the Euclidean distance between them is expressed as d_{e}( x , y ). In light of this information, we can briefly list the main internal cluster validity indices as follows:
Silhouette Index (SI) [30]: as given in Figure 2, the compactness value of a data point in a cluster is calculated by measuring the distance from that data point to every other data point in the same cluster. Then, the compactness of the cluster, denoted a(x), is calculated as the mean of the compactness values of all the data in the cluster. The average of the distances to the elements of the nearest cluster to which the data point does not belong gives the separation value of that data point. After that, the separation value of the cluster, denoted b(x), is found by calculating the mean of the separation values of all the data in the cluster. From these, we can calculate the SI value, which is the cluster validity index of the model. The equations to calculate SI, a(x), and b(x) are given in equations (2)–(4), respectively. The SI value lies in [−1, +1], where −1 means the worst clustering and +1 means the best clustering.
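As a minimal illustration of the description above, the following sketch computes the SI in its standard per-point form, s(x) = (b(x) − a(x)) / max(a(x), b(x)), averaged over all points. The function name `silhouette_index` and the convention of scoring singleton clusters as 0 are assumptions made for this example, not part of the original study.

```python
import math


def silhouette_index(points, labels):
    """Mean silhouette value over all points (standard per-point form)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a(x): mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, hence the n - 1 divisor)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b(x): mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_index(points, [0, 0, 1, 1])  # compact, well separated
bad = silhouette_index(points, [0, 1, 0, 1])   # clusters cut across the gap
```

Because a(x) and b(x) are means over whole clusters, the score rewards compact, roughly spherical groups, which is the behavior discussed in Section 3.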
Dunn Index (DI) [31]: the DI calculates the success of the model based on compactness and the separation between the clusters. To do this, the DI value of a cluster is calculated from the distance to the closest cluster and its own diameter. Let d_{min} be the smallest distance between two clusters C_{i} and C_{j}, and let diam(C_{l}) be the diameter of the cluster C_{l}; the values of these two variables are calculated by d_{min} = min_{x ∈ C_{i}, y ∈ C_{j}, i ≠ j} d_{e}( x , y ) and diam(C_{l}) = max_{x, y ∈ C_{l}} d_{e}( x , y ). Given the values of d_{min} and diam(C_{l}), the DI of the model is calculated by equation (5). The larger the resulting value, the more successful the clustering is.
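The two quantities described above can be sketched as follows, assuming the common form of the Dunn Index in which the smallest between-cluster distance is divided by the largest cluster diameter. The function name `dunn_index` and the handling of the degenerate all-singleton case are assumptions for this example.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by the largest diameter."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the Dunn Index needs at least two clusters")

    # d_min: closest pair of points taken from two different clusters
    d_min = min(
        math.dist(x, y)
        for ci, cj in combinations(clusters.values(), 2)
        for x in ci
        for y in cj
    )
    # largest diam(C_l): farthest pair of points inside any single cluster
    max_diam = max(
        math.dist(x, y)
        for members in clusters.values()
        for x, y in combinations(members, 2)
    ) if any(len(m) > 1 for m in clusters.values()) else 0.0

    # assumed convention: every cluster is a single point, so diameter is 0
    return d_min / max_diam if max_diam > 0 else math.inf


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
di = dunn_index(points, [0, 0, 1, 1])
```

Both the minimum and the maximum are taken over all point pairs, which is why the DI becomes expensive for large datasets and is sensitive to a single noisy point that inflates a diameter.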
Calinski-Harabasz (CH) [33]: the CH calculates compactness and separation values via the mean of the squares of the interclass and intraclass distances. The CH index value is calculated by equation (6). In the CH index, the goal is to make the result as large as possible.
Davies–Bouldin (DB) [32]: the compactness value is calculated over the mean of the variance of the data in each cluster. The separation value, on the other hand, is calculated over the distance from the center of the cluster to the center of the closest one. Let avg(C_{i}), which is calculated by equation (7), be the average of the distances of each data point in cluster i to the cluster center; the DB value is then calculated by equation (8).
S_Dbw Index [35]: The S_Dbw calculates the compactness value of the clusters over the standard deviations (σ) of the data that the cluster has. On the other hand, it calculates the separation value by the distance between the centers of the clusters. The S_Dbw index is a type of index that considers the density of clusters. Let den be the density of the cluster, and the S_Dbw index value is calculated with the following equations:
Distance-based Separability Index (DSI) [39]: the DSI is another approach that measures cluster quality by means of intercluster and intracluster distances. Let C_{i} and C_{j} be two clusters that have N_{i} and N_{j} data points, respectively. The intracluster distance set of cluster C_{i} is the set given in equation (13). Moreover, the intercluster distance set is built from the distances between data pairs drawn from clusters C_{i} and C_{j}. To compute the DSI, the Kolmogorov–Smirnov (KS) test is utilized.
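Let s_{i} be the Kolmogorov–Smirnov test statistic of cluster C_{i}, which is calculated between the intracluster distance set of C_{i} and the intercluster distance set, and let s_{j} be that of C_{j}; the DSI of these two clusters is the result of the following equation: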
RMSSTD [36]: the root-mean-square standard deviation (RMSSTD) aims to calculate the clustering quality by measuring the homogeneity of clusters. It is commonly used for hierarchical clustering. Let the dataset consist of k clusters, let p be the number of independent variables, let x̄_{ij} be the mean of the data in variable j and cluster i, and let n_{ij} be the number of data in variable j and cluster i. RMSSTD is measured by equation (12). A lower RMSSTD means better clustering.
3. Statement of the Problem
Although many approaches have been proposed, analysis of cluster quality is still an issue. The many clustering approaches in the literature differ from each other in many aspects. Therefore, no cluster validation technique can evaluate the quality of all produced clusters precisely. However, some approaches have been used for this task, including the Silhouette Index, Dunn Index, Davies–Bouldin, Calinski-Harabasz, and S_Dbw. Although these indices are commonly used, each of them has a specific problem with cluster validation, as given in Table 1. For example, a significant part of the proposed cluster validity indices assume that clusters are spherical. In fact, only a minority of real-world clusters are spherical, as some of the examples in Figure 3 show. The SI is an example of this kind of index: it cannot achieve a good score if the cluster is not spherical. The DB and the CH, on the other hand, identify clusters that are compact and well separated, but very few real-world clusters have that shape. Similarly, although it is better than the DB and the CH when the clusters are not well separated, the DI encounters computational cost issues when the number of clusters or the dimensionality is high. Besides, it is affected by noisy data, which increases the diameter. As for the S_Dbw, although it is proposed as a density-supported validity index and gets a good score with compact and well-separated clusters, it is affected by the distribution of the data. The DSI, being a density-based clustering validity index, is good at dealing with arbitrary-shaped clusters and can successfully evaluate their quality. However, the DSI is also affected when clusters are too close to each other. Likewise, the RMSSTD encounters problems when the clusters are close to each other.
Further examples of the cluster-shape problems that existing indices encounter could be given.
Another problem with existing cluster validation indices is that they assume that all the data in a cluster have a homogeneous distribution. However, the data inside a cluster mostly contain regions of different densities, as seen in Figure 4 (darker areas mean denser regions). Moreover, the data in the same cluster may not have a homogeneous distribution, as can be seen in Figure 4(b). So, an approach that considers the density of the data in the clusters is still needed to support the compactness of the cluster. Although the S_Dbw and the DSI are two examples of cluster validity indices that take the density of clusters into consideration, they do not examine the density regions inside the clusters. These kinds of indices are useful for discovering the shapes of clusters. However, some regions inside a cluster may be denser than others, and these indices do not take this into account. Giving more weight to denser regions may make the approach more accurate, because it supports compactness. In the present study, we proposed a new cluster validity index that can discover arbitrary-shaped clusters and weight the denser regions by using kernel density estimation, which is explained in Section 4.2.
4. Proposed Cluster Validity Index: A Novel Internal Cluster Validity Index for Arbitrary-Shaped Clusters Based on the Kernel Density Estimation (The VIASCKDE Index)
4.1. Basic Idea
In the present study, a new cluster validation index, named the VIASCKDE (Validity Index for Arbitrary-Shaped Clusters based on the Kernel Density Estimation) Index, was proposed. The VIASCKDE Index is not affected by cluster shape, and thus it can make a realistic evaluation of clustering performance regardless of the shapes of the clusters. Unlike the existing cluster validation indices, our index calculates the compactness and separation values of a cluster from the compactness and separation values of each data point separately. In other words, it calculates the compactness and separation values of the cluster over the distances between data points, independent of parameters such as the cluster center, because, in nonspherical clusters, the distance from a data point to its closest data point is more important than its distance to the cluster center. As can be seen in the example given in Figure 5, the closest data point in the cluster to which x belongs is used when calculating the compactness value for the data point x. Similarly, the separation value of x is calculated from the distance to the closest data point of a cluster to which x does not belong.
As mentioned before, another problem with existing cluster validity indices is that they assume that the data inside a cluster are homogeneously distributed, even if the shape of the cluster is arbitrary. Therefore, they give every data point of the cluster the same weight, whereas, as presented in Figure 4, the distribution of the data inside the same cluster may vary. A new method that considers this situation is therefore needed. To overcome this problem, we proposed a weighting method based on kernel density estimation (KDE), which is detailed in the next section.
4.2. Kernel Density Estimation-Based Weighting
In the literature, there are two types of distribution estimation methods: parametric and nonparametric. Among parametric methods, for example, the Gaussian distribution assumes that the data are gathered around the center and that the majority of the data lie in a circle whose radius is the standard deviation. This means that the distribution curve has only one peak. It is important to keep in mind that the univariate normal distribution, with mean µ and variance σ^{2}, has the probability density function
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^{2}}{2\sigma^{2}}\right),$$
where −∞ < x < ∞. In nonparametric distribution estimation methods, on the other hand, it is assumed that the curve may have more than one peak. Let x be an n-dimensional vector that has a multivariate Gaussian (or normal) distribution with the n-dimensional mean vector µ, and let ∑ be the n × n covariance matrix. The multivariate Gaussian density is calculated as follows:
$$f(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^{T}\,\Sigma^{-1}(x - \mu)\right)$$
Kernel density estimation (KDE) is a nonparametric density estimator. It is also used to analyze existing data in order to decide where incoming data should be placed. Because of this ability, it is commonly used in many areas such as data analysis procedures in healthcare services, artificial intelligence applications, and the stock market [2]. In Figure 6, the bars represent the histogram and the orange line represents the KDE calculated over it. In analyzing the data, the KDE figures out the distribution of the data according to one of various kernel functions, which are given in Figure 7. Each one has its own characteristics and equation. In mathematical formulation, the KDE is the function
$$\hat{f}_{h}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_{i}}{h}\right),$$
where K(.) is one of the functions given in Figure 7. The most commonly used one is the Gaussian function. K is a smooth function, and the bandwidth h > 0 controls the amount of smoothing. The KDE smooths each data point X_{i} in turn and sums the results to obtain the final density estimate.
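A minimal sketch of a univariate KDE with a Gaussian kernel is given below, assuming the standard estimator that averages K((x − X_i)/h)/h over the sample. The function names `gaussian_kernel` and `kde` and the toy sample are illustrative only.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, sample, h):
    """Kernel density estimate at x: (1 / (n h)) * sum_i K((x - X_i) / h)."""
    if h <= 0:
        raise ValueError("the bandwidth h must be positive")
    n = len(sample)
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (n * h)


# Toy sample: three points packed around 1.0 and one isolated point at 5.0.
sample = [0.9, 1.0, 1.1, 5.0]
dense = kde(1.0, sample, h=0.5)   # estimate inside the dense group
sparse = kde(5.0, sample, h=0.5)  # estimate at the isolated point
```

The estimate is higher inside the dense group than at the isolated point, which is the property that the weighting scheme in this section relies on.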
In addition to estimating the density function of univariate data, as in the example given in Figure 6, we can apply the KDE to multivariate datasets. In this case, we have to use a kernel function that can process a multidimensional dataset. To achieve this, the kernel function should be constructed by a product kernel or a radial basis approach. Let X_{1}, X_{2},…, X_{n} denote an independent random sample of size n drawn from a multivariate random variable with density f defined on R^{d}. In the following example, we consider only the two-dimensional case, without loss of generality. Thus, X_{i} is given by X_{i} = (X_{i1}, X_{i2})^{T}, where X_{i1} and X_{i2} denote the x and y coordinates, respectively. The multivariate kernel density estimator at point x is given by
$$\hat{f}_{H}(x) = \frac{1}{n} \sum_{i=1}^{n} |H|^{-1/2}\, K\left(H^{-1/2}(x - X_{i})\right),$$
where K(.) is a multivariate kernel function and H denotes a symmetric positive definite bandwidth matrix.
Although the KDE is a nonparametric density estimator intended for inhomogeneous distributions, we can also use it as a weighting function to support the compactness of clusters. Since the KDE value at a data point sums the contributions of the data around it, points close to the edges of the data distribution are expected to receive less weight, while points near the center receive more. Therefore, the KDE can be used as a weighting function for the data. In our approach, doing so supports the compactness of the cluster regardless of its shape. Namely, we used the KDE to weight each data point so that more importance is given to the data in the denser regions. The weight of each data point, W_{KDE}, is calculated from its KDE value. For example, let us assume we want to find the W_{KDE} values for the data x_{1} = 30 and x_{2} = 40 in the example dataset given in Figure 6. W_{KDE} would be 0.007 for x_{1} and 0.05 for x_{2}, which is much higher. This makes our approach superior to existing clustering validity indices, which ignore the distribution of data within the same cluster. Other density-based approaches would weight x_{1} and x_{2} equally in this example, which would be incorrect.
4.3. Definitions and Equations
In light of these explanations, let us explain the details of the VIASCKDE Index.
Definition 1. (CoSeD: Compactness and Separation Value of a Data Point): the CoSeD can be described as the compactness and separation value of a single data point. To calculate this value, the W_{KDE} value of each data point, which is explained in Section 4.2, is calculated first. Let a( x ) (compactness) be the distance from x to the closest data point of the cluster C_{i} to which x belongs, and let b( x ) (separation) be the distance from x to the closest data point of a cluster C_{j} to which x does not belong; the compactness and separation value of the data point x, CoSeD( x ), is then calculated by the following equation:
Definition 2. (CoSeC: Compactness and Separation Value of a Cluster): the CoSeC value is the average of the CoSeD values of the data points owned by the cluster. The CoSeC value of the cluster C_{i} is calculated by equation (18), where C_{i} is the cluster to which the data point x belongs, and n is the number of data points that cluster C_{i} possesses.
Definition 3. (the VIASCKDE, the Value of Overall Clustering): let k be the number of clusters, let n_{j} be the number of data points that cluster C_{j} possesses, and let CoSeC_{j} be the value of cluster C_{j}, which is calculated in equation (18); the VIASCKDE Index value is then calculated by equation (19). The VIASCKDE value is expected to lie in [−1, +1], where +1 refers to the best possible value and −1 refers to the worst possible value.
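As a hedged reading of Definitions 1–3, the three quantities could be written as follows. The silhouette-like ratio in CoSeD and the size-weighted mean in the last line are assumptions chosen to be consistent with the definitions and with the stated range of [−1, +1] (given that W_{KDE} is normalized to [0, 1]); they may differ in detail from equations (17)–(19).

```latex
% a(x): distance from x to its nearest neighbour inside its own cluster C_i
% b(x): distance from x to the nearest point that belongs to another cluster
\begin{align}
  \mathrm{CoSeD}(x)   &= W_{KDE}(x)\,\frac{b(x) - a(x)}{\max\{a(x),\, b(x)\}} \\
  \mathrm{CoSeC}(C_i) &= \frac{1}{n} \sum_{x \in C_i} \mathrm{CoSeD}(x) \\
  \mathrm{VIASCKDE}   &= \frac{\sum_{j=1}^{k} n_j \,\mathrm{CoSeC}_j}{\sum_{j=1}^{k} n_j}
\end{align}
```

Under this reading, a point whose nearest same-cluster neighbor is much closer than its nearest foreign neighbor scores close to its weight W_{KDE}(x), and a point that sits closer to another cluster scores negatively.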
4.4. The Algorithm
Let Gaussian_KDE be a function that calculates the KDE, and let MinMaxNormalization be a function that normalizes the data to the range [0, 1]. The CoSeD and CoSeC values were explained in Section 4.3. In light of this information and the equations given in the previous section, the pseudocode of the VIASCKDE Index is given in Algorithm 1.
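The steps described in Sections 4.2 and 4.3 can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: CoSeD is taken in the silhouette-like form W_KDE(x) · (b − a) / max(a, b), the KDE weights are min-max normalized within each cluster, singleton clusters are skipped, and the function names merely mirror the ones mentioned in the text.

```python
import math


def gaussian_kde(points, h):
    """Gaussian KDE evaluated at every sample point (isotropic bandwidth h)."""
    n, d = len(points), len(points[0])
    norm = n * (h * math.sqrt(2.0 * math.pi)) ** d
    return [
        sum(math.exp(-0.5 * (math.dist(p, q) / h) ** 2) for q in points) / norm
        for p in points
    ]


def min_max_normalization(values):
    """Rescale values to [0, 1]; a constant list maps to all ones (assumed)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def viasckde(points, labels, h=0.05):
    """Size-weighted mean of the per-cluster CoSeC values."""
    cluster_ids = sorted(set(labels))
    if len(cluster_ids) < 2:
        return float("nan")  # separation is undefined with one cluster
    density = gaussian_kde(points, h)
    weighted_sum, total = 0.0, 0
    for c in cluster_ids:
        members = [i for i, l in enumerate(labels) if l == c]
        others = [i for i, l in enumerate(labels) if l != c]
        if len(members) < 2:
            continue  # assumed: a(x) is undefined for a singleton cluster
        # assumption: W_KDE is normalized separately inside each cluster
        weights = min_max_normalization([density[i] for i in members])
        cosed = []
        for w, i in zip(weights, members):
            # a: nearest neighbour in the same cluster; b: nearest foreign point
            a = min(math.dist(points[i], points[j]) for j in members if j != i)
            b = min(math.dist(points[i], points[j]) for j in others)
            m = max(a, b)
            cosed.append(w * (b - a) / m if m > 0 else 0.0)
        cosec = sum(cosed) / len(cosed)
        weighted_sum += len(members) * cosec
        total += len(members)
    return weighted_sum / total if total else float("nan")


points = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (1.0, 1.0), (1.0, 1.1), (1.1, 1.0)]
good = viasckde(points, [0, 0, 0, 1, 1, 1])
bad = viasckde(points, [0, 1, 0, 1, 0, 1])
```

Only nearest-neighbor distances are used, so no cluster center enters the computation; the price is the pairwise distance evaluation discussed in Section 4.5.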

4.5. Computational Complexity
Let k be the number of clusters in the dataset, let n be the number of data points in the dataset, and let d be the number of features of each data point; the time complexity of the VIASCKDE Index is then O(kn^{2}d), since it calculates the distance from each data point to all others. Treating k and d as constants, the complexity of the proposed approach is O(n^{2}). This is acceptable when compared with the complexities of the other indices given in Table 1.
5. Experimental Study
5.1. Development Environment
To demonstrate the effectiveness of the VIASCKDE Index (https://github.com/senolali/VIASCKDE) in the experimental studies, the data were processed using the Python language in the Anaconda Spyder environment. Several components of the scikit-learn library, such as DBSCAN, Spectral Clustering, HDBSCAN, and the metrics module, were used. The datasets were imported with the Pandas library, and mathematical operations were performed with the NumPy library. Visualizations were produced with the matplotlib library. All experiments and comparisons were performed on a computer with 16 GB RAM, an Intel i7 processor, and the Windows 11 operating system.
5.2. Used Datasets
To measure the performance of the proposed approach, we performed an experimental study on both synthetic and real datasets. Since the main purpose of our approach is to measure the quality of nonspherical clusters, artificial datasets containing clusters of different shapes were used. Some of these datasets are shown in Figure 3. In addition to the synthetic datasets, real datasets that are frequently used in the clustering field were also used for testing. Details of the datasets used in the comparison are provided in Table 2. Additionally, as given in Figure 8, some imbalanced datasets were used to analyze the performance of our cluster validation index on imbalanced data distributions.
5.3. Experimental Procedure
For the experimental study, we used the procedure given below. First, to ensure that all features lie in the same range and to make it easier to determine parameters, the data were normalized using the min-max normalization given in equation (20). In addition, the ARI (Adjusted Rand Index) was used as the ground-truth measure to evaluate the performance of the cluster validation indices, by comparing the cluster labels produced by the clustering algorithm with the actual cluster labels. We chose the ARI because it does not require the generated cluster labels to be the same values as the actual cluster labels. For example, let us assume the clustering algorithm produced the cluster labels {1,1,1,2,2,2} and the actual labels are {2,2,2,4,4,4}. The accuracy for this situation would be 0%, while the ARI would be 100%, which is the correct result.
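The label example above can be checked with a short sketch. The ARI below follows the standard pair-counting (contingency table) definition; the helper names `adjusted_rand_index` and `accuracy` are illustrative and not taken from the study.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """ARI from pair counts of the contingency table."""
    n = len(true_labels)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        raise ValueError("need at least two samples")
    # pairs that share a cell, a row (true class) or a column (predicted cluster)
    cell_pairs = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    row_pairs = sum(comb(c, 2) for c in Counter(true_labels).values())
    col_pairs = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = row_pairs * col_pairs / total_pairs
    maximum = 0.5 * (row_pairs + col_pairs)
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (cell_pairs - expected) / (maximum - expected)


def accuracy(true_labels, pred_labels):
    """Fraction of positions where the two label values are identical."""
    hits = sum(t == p for t, p in zip(true_labels, pred_labels))
    return hits / len(true_labels)


actual = [2, 2, 2, 4, 4, 4]
produced = [1, 1, 1, 2, 2, 2]
acc = accuracy(actual, produced)             # label values never coincide
ari = adjusted_rand_index(actual, produced)  # the grouping is identical
```

Accuracy compares label values position by position, whereas the ARI only asks whether each pair of points is grouped together in both labelings, so it is invariant to how the clusters are numbered.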
The procedure established in the testing process is as follows:
Step #1: Select one of the algorithms (DBSCAN, HDBSCAN, and Spectral Clustering).
Step #2: Test the algorithm with randomly selected parameters on one of the selected datasets.
Step #3: Evaluate the quality of the clusters that were produced by the selected algorithm with the clustering validation indices (SI, DI, CH, DB, S_Dbw, DSI, RMSSTD, and VIASCKDE).
Step #4: Calculate the VIASCKDE Index for the produced clusters and check whether this is the best result so far. If it is, accept this value as the best one for the VIASCKDE Index. Then, do the same operation for the other indices.
Step #5: To test each index sufficiently, go to Step #2 and repeat the cycle 100 times. When the cycle is completed, go to Step #6.
Step #6: Calculate the ARI value that corresponds to the most successful value obtained for each of the clustering validity indices, including our proposed approach.
Step #7: Compare the ARI values calculated for all cluster validity indices. Consider the index with the highest ARI value as the most competent one for this dataset.
Step #8: Go to Step #2 and do the same operations for a new dataset. If all datasets have been processed, go to Step #9.
Step #9: If all algorithms have been processed, finish the procedure; otherwise, go to Step #1.
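Steps #2 to #6 of the procedure can be sketched as a generic random search. Everything below the `best_labels_by_index` function is a toy stand-in invented for this example (a one-dimensional threshold "clustering", a mean-gap "validity index", and the plain Rand index in place of the ARI), so that the sketch runs without external libraries.

```python
import random


def best_labels_by_index(data, cluster_fn, sample_params, index_fn,
                         higher_is_better=True, n_trials=100, seed=0):
    """Steps #2-#5: keep the labelling that the validity index rates best."""
    rng = random.Random(seed)
    best_score, best_labels = None, None
    for _ in range(n_trials):
        labels = cluster_fn(data, **sample_params(rng))  # Step #2
        score = index_fn(data, labels)                   # Step #3
        if score != score:                               # NaN: index undefined
            continue
        improved = best_score is None or (
            score > best_score if higher_is_better else score < best_score)
        if improved:                                     # Step #4
            best_score, best_labels = score, labels
    return best_score, best_labels


# --- hypothetical stand-ins, only to make the sketch runnable ---
def threshold_clustering(data, threshold):
    return [int(x > threshold) for x in data]


def sample_threshold(rng):
    return {"threshold": rng.uniform(-1.0, 6.0)}


def mean_gap_index(data, labels):
    groups = {}
    for x, label in zip(data, labels):
        groups.setdefault(label, []).append(x)
    if len(groups) < 2:
        return float("nan")
    means = [sum(g) / len(g) for g in groups.values()]
    return abs(means[0] - means[1])


def pair_agreement(true_labels, pred_labels):
    """Plain Rand index, standing in for the ARI of Step #6."""
    n = len(true_labels)
    agree = total = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_true = true_labels[i] == true_labels[j]
            same_pred = pred_labels[i] == pred_labels[j]
            agree += same_true == same_pred
            total += 1
    return agree / total


data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
truth = [0, 0, 0, 1, 1, 1]
score, labels = best_labels_by_index(
    data, threshold_clustering, sample_threshold, mean_gap_index)
quality = pair_agreement(truth, labels)  # Step #6
```

Running the same loop once per validity index and comparing the resulting external scores corresponds to Step #7; note that the ground-truth labels are used only after the index has picked its preferred labeling.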
5.4. Experimental Study
5.4.1. The Selection of Density Distribution Estimation Method
We performed experimental studies on the datasets to decide which density distribution estimation method should be selected, parametric or nonparametric. We selected the Gaussian method as the parametric method and the KDE as the nonparametric method. We carried out the experiments with the procedure given in Section 5.3, using DBSCAN with randomly selected parameters. The KDE-based VIASCKDE Index used kernel = “Gaussian” and h = 0.05, while the parametric variant of the VIASCKDE Index used the Gaussian distribution. According to the results, the Gaussian-based method was the best on 15 datasets and the KDE-based method was the best on 17 datasets, as demonstrated in Table 3. Therefore, we selected the KDE-based method as the weighting function for our approach.
5.4.2. The Kernel Selection for KDE
As mentioned in Section 4.2, there are various kernels in the literature. The Gaussian, cosine, linear, tophat, and exponential kernels are examples, and they affect the smoothness of the estimated distribution. We carried out the experiments with the procedure provided in Section 5.3, with the parameters of the DBSCAN algorithm selected randomly, and tested each kernel in turn. As can be seen in Table 4, the Gaussian kernel was the best on all of the selected datasets when the bandwidth was 0.05.
5.4.3. Bandwidth Selection for the KDE
One of the most important parameters of the KDE is the bandwidth (h). It has a direct effect on the results. When h is too small, the density curve contains many wiggly structures. On the other hand, when h is too large, the bumps on the curve are smoothed out, as shown in Figure 9. To find the best bandwidth for our approach, we carried out experimental studies with the procedure given in Section 5.3, testing different bandwidth values on some of the datasets provided in Table 2. As can be seen in Table 5, the best bandwidth was found to be 0.05 when the kernel was the Gaussian.
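The effect of the bandwidth can be illustrated by counting the local maxima of a univariate Gaussian KDE curve for a very small and a very large h. The sample, the evaluation grid, and the function names below are made up for this illustration and are unrelated to the datasets in Table 2.

```python
import math


def kde_curve(sample, h, grid):
    """Gaussian KDE of the sample evaluated at every grid position."""
    c = 1.0 / (len(sample) * h * math.sqrt(2.0 * math.pi))
    return [
        c * sum(math.exp(-0.5 * ((g - x) / h) ** 2) for x in sample)
        for g in grid
    ]


def count_peaks(values):
    """Number of strict local maxima in a sequence of curve values."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    )


# Two loose groups of four points each (illustrative data).
sample = [1.0, 1.3, 1.7, 2.0, 6.0, 6.4, 6.7, 7.0]
grid = [i * 0.05 for i in range(-40, 201)]  # from -2.0 to 10.0

wiggly = count_peaks(kde_curve(sample, 0.05, grid))  # h far too small
smooth = count_peaks(kde_curve(sample, 10.0, grid))  # h far too large
```

With the small bandwidth every sample point produces its own bump, while the large bandwidth merges both groups into a single broad mode; a useful bandwidth lies between these extremes.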
5.4.4. The Tests on Both Synthetic and Real Datasets
In this section, experiments were executed on both synthetic and real datasets. To detect nonspherical clusters in the test process, DBSCAN, Spectral Clustering, and HDBSCAN were used. The DBSCAN algorithm uses two parameters (MinPts: the clustering threshold value, and ε: the accessibility distance), Spectral Clustering uses one input parameter (n_clusters: the number of clusters) if affinity = “nearest_neighbors,” and HDBSCAN uses two parameters (min_cluster_size: the minimum size of a cluster, and min_samples). To test each algorithm with different parameters, we applied a random search within the procedure given in Section 5.3. In this procedure, each cluster validity index was used as the guiding criterion for reaching better clustering results. As the example in Figure 10 shows, each index led to different results, which means that the cluster validation performance of each index is also different. According to the obtained results, our index was the best one. The performance of each index on all datasets is presented in the following tables for each clustering algorithm (Tables 6–14).
6. Evaluation of the Results and Discussion
In our approach, we used the compactness and separation values of each data point to support arbitrary-shaped clusters. As a consequence, our approach tended to divide spherical clusters into small partitions. To cope with this issue, we used a density estimation method to support the compactness of clusters. In the literature, there are two types of density estimation methods: parametric and nonparametric. To decide which one is the best for our approach, we carried out experiments on the datasets using DBSCAN as the clustering algorithm. According to the experimental study, the nonparametric method was better than the parametric method, as can be seen in Table 3. After deciding that the nonparametric method was the best for our approach, we selected kernel density estimation as the nonparametric density estimation method because it supports multivariate data (Table 4).
The second point worth discussing is the selection of the parameters of the kernel density estimation, which has two: the kernel function and the bandwidth. To find the best values, we ran a separate experiment for each parameter, following the procedure given in Section 5.3 with DBSCAN and randomly selected parameters. As can be seen in Tables 4 and 5, the Gaussian kernel and the bandwidth h = 0.05 gave the best results. These values were then used in the experimental studies that compare our approach with the other indices.
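To make the roles of the two parameters concrete, the following minimal sketch evaluates a kernel density estimate with an isotropic Gaussian kernel and bandwidth h = 0.05. It is a textbook formulation written in plain Python for illustration, not the code used in the experiments; the function name `gaussian_kde` and the sample points are assumptions.

```python
import math


def gaussian_kde(samples, query, bandwidth=0.05):
    """Density estimate at `query` from `samples` with an isotropic Gaussian
    kernel: f(x) = 1 / (n h^d) * sum_i K((x - x_i) / h)."""
    d = len(query)
    # Normalisation n * h^d * (2 pi)^(d / 2) of the d-dimensional Gaussian kernel.
    norm = len(samples) * (bandwidth * math.sqrt(2.0 * math.pi)) ** d
    total = 0.0
    for s in samples:
        sq_dist = sum((q - v) ** 2 for q, v in zip(query, s))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / norm


# Illustrative data: four points packed together and one isolated point.
samples = [(0.0, 0.0), (0.01, 0.0), (0.0, 0.01), (0.01, 0.01), (0.5, 0.5)]
dense = gaussian_kde(samples, (0.0, 0.0))
isolated = gaussian_kde(samples, (0.5, 0.5))
print(dense > isolated)
```

The bandwidth controls how far each sample spreads its influence: a small h such as 0.05 keeps the estimate local, whereas a larger h smooths the difference between dense and sparse regions.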
One of the advantages of the proposed VIASCKDE Index is that it can realistically evaluate clustering performance regardless of the cluster shape. To test our index on different cluster types, we used the DBSCAN, Spectral Clustering, and HDBSCAN algorithms with the procedure given in Section 5.3. The ARI values of the results that each index selected as best are given in Tables 11, 12 and 14. As can be seen in the tables, the VIASCKDE Index reaches the highest ARI value in 47 of the 60 experiments, as summarized in Table 15. In addition, where it was not the index with the highest ARI value, its ARI value was still very high. It also obtained better results than the two density-based indices, S_Dbw and DSI, as shown in Table 15.
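Since the comparison rests on the ARI, a compact sketch of how the adjusted Rand index is computed from two labelings may be useful. This is the standard pair-counting formula written in plain Python; the function name and the example labelings are assumptions made for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between a reference labeling and a clustering."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs of points that share a cluster in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a cluster in each labeling taken separately.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
split_cluster = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
print(same_partition, split_cluster)
```

The ARI depends only on which points are grouped together, so renaming the cluster labels does not change its value, while splitting a true cluster lowers it.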
The other important advantage of our approach is that it considers the density of each cluster independently. For example, the Aggregation dataset has a nonhomogeneous density, as can be seen in Figure 4, and each cluster may also have a nonhomogeneous distribution, as shown in Figure 4(b). Our approach therefore neither assumes that the data inside a cluster are homogeneously distributed nor weights each data point equally. It gives more importance to the data in denser regions by multiplying the value of each data point by a coefficient obtained from the KDE. Doing so supports the compactness of clusters and, in our experiments, led the index to better results.
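The per-cluster weighting can be sketched as follows. The sketch assumes that the KDE value of each data point is already available and that the coefficient is obtained by min-max normalising these values separately inside each cluster; both the normalisation and the averaging of the weighted values per cluster are assumptions made for this illustration and should be checked against the definition of the index. The function names and numbers are hypothetical.

```python
def density_weights(densities, labels):
    """Min-max normalise the density values separately inside each cluster."""
    weights = [0.0] * len(densities)
    for cluster in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == cluster]
        lo = min(densities[i] for i in idx)
        hi = max(densities[i] for i in idx)
        for i in idx:
            # A cluster with constant density gets weight 1 everywhere.
            weights[i] = 1.0 if hi == lo else (densities[i] - lo) / (hi - lo)
    return weights


def weighted_cluster_scores(scores, densities, labels):
    """Average of weight * score inside each cluster (illustrative aggregation)."""
    weights = density_weights(densities, labels)
    result = {}
    for cluster in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == cluster]
        result[cluster] = sum(weights[i] * scores[i] for i in idx) / len(idx)
    return result


# Hypothetical values: cluster 0 is much denser than cluster 1.
labels = [0, 0, 0, 1, 1]
densities = [8.0, 6.0, 4.0, 0.4, 0.2]
scores = [0.9, 0.8, 0.2, 0.7, 0.1]  # per-point compactness/separation values
weights = density_weights(densities, labels)
per_cluster = weighted_cluster_scores(scores, densities, labels)
print(weights, per_cluster)
```

Because the normalisation is done per cluster, the densest point of a sparse cluster receives the same weight as the densest point of a dense cluster, so clusters with different overall densities are treated on equal terms.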
Since the VIASCKDE Index takes a density-based approach, it can also be used to evaluate the performance of microcluster-based clustering algorithms. The majority of density-based clustering algorithms use a microcluster structure, and such algorithms treat the center of each microcluster as the actual data in the offline phase, so the index can be computed on those centers.
7. Conclusion and Future Works
In the present study, we proposed a cluster validation index, called the VIASCKDE Index, to validate the quality of both spherical and nonspherical clusters. Our approach draws its strength from considering the distribution of data inside the clusters by using the KDE. Doing so supports the compactness of clusters irrespective of the cluster center, and thus the clusters can have arbitrary shapes. Most of the cluster validity indices in the literature can evaluate cluster quality realistically only when the cluster shape is spherical; in many instances, however, it is not. Our proposed approach calculates the compactness and separation values based only on the data, which makes it possible to evaluate cluster quality irrespective of shape. Experimental studies revealed that the VIASCKDE Index reached the highest ARI values on most of the datasets, which indicates that it was the most successful of the compared indices. In future work, we plan to reduce the runtime complexity of the proposed index.
Data Availability
A Python implementation of the proposed index is shared on GitHub (https://github.com/senolali/VIASCKDE).
Conflicts of Interest
The authors declare that they have no conflicts of interest.