Density Peaks Clustering Based on Feature Reduction and Quasi-Monte Carlo

Hu, Zhihui; Wei, Xiaoran; Han, Xiaoxu; Kou, Guang; Zhang, Haoyu; Liu, Xueyi; Bai, Yefei

doi:https://doi.org/10.1155/2022/8046620

Scientific Programming

Research Article | Open Access

Volume 2022 | Article ID 8046620 | https://doi.org/10.1155/2022/8046620

Academic Editor: Jiangbo Qian

Received 19 Jul 2021

Revised 03 Nov 2021

Accepted 07 Dec 2021

Published 06 Jan 2022

Abstract

Density peaks clustering (DPC) is a well-known density-based clustering algorithm that handles nonspherical clusters well. However, DPC has high computational and space complexity when calculating the local density $\rho$ and the distance $\delta$, which makes it suitable only for small-scale data sets. In addition, the performance of DPC on high-dimensional data still needs to be improved: high-dimensional data not only make the data distribution more complex but also lead to more computational overhead. To address these issues, we propose an improved density peaks clustering algorithm that combines feature reduction with a data sampling strategy. Specifically, features of the high-dimensional data are automatically extracted by principal component analysis (PCA), auto-encoder (AE), and t-distributed stochastic neighbor embedding (t-SNE). Next, to reduce the computational overhead, we propose a novel data sampling method for the low-dimensional feature data. Firstly, the data distribution in the low-dimensional feature space is estimated by a Quasi-Monte Carlo (QMC) sequence with low-discrepancy characteristics. Then, the representative QMC points are selected according to their cell densities. Next, the selected QMC points are used to calculate $\rho$ and $\delta$ instead of the original data points. In general, the number of selected QMC points is much smaller than the size of the initial data set. Finally, a two-stage classification strategy based on the clustering results of the QMC points is proposed to classify the original data set. Compared with current works, our proposed algorithm reduces the computational complexity from $O(n^2)$ to $O(Nn)$, where $N$ denotes the number of selected QMC points and $n$ is the size of the original data set, typically $N \ll n$. Experimental results demonstrate that the proposed algorithm can effectively reduce the computational overhead and improve the model performance.

1. Introduction

With the advent of the era of big data, the importance of data mining is increasingly prominent [1]. As an unsupervised learning method, clustering is widely used in many different fields including image processing, medicine, and archaeology. There are various classical clustering algorithms, such as K-means [2], DBSCAN [3], and AP [4]. According to different standards, clustering algorithms are classified into different categories. Generally speaking, clustering algorithms are divided into partition-based methods, hierarchy-based methods, density-based methods, and grid-based methods.

In recent years, a new density peaks clustering (DPC) algorithm has been proposed [5]. It is a typical density-based clustering algorithm with two main advantages. The first is that DPC relies on a decision graph to select the cluster centers. Specifically, DPC draws the decision graph of the data set by defining a local density $\rho$ and a distance $\delta$ for each data point. Then, DPC determines the cluster centers based on the decision graph. The obtained cluster centers have two characteristics: (1) the local density of a cluster center is large, and no point in its neighborhood has a greater density; (2) the distance between a cluster center and other data points with a higher density is relatively large. Hence, the cluster centers are data points with both high local density and high distance, which are called density peaks. The second advantage is that DPC can deal with clusters of arbitrary shape and does not need the number of categories to be determined in advance.

Although DPC has achieved good performance in many situations, it still has some drawbacks. Firstly, DPC needs to calculate the local density and distance of each data point, which makes the computational complexity $O(n^2)$ for a data set of $n$ points. This expensive computational overhead limits the application of DPC to large-scale data sets. To address this issue, the study in [6] proposed a distributed density peaks clustering algorithm (EDDPC). EDDPC distributes large-scale data sets over MapReduce and integrates local results to approximate the final results. However, EDDPC is a distributed algorithm and is not suitable for single-CPU scenarios. The study in [7] proposed a density-based and grid-based clustering algorithm (DGB). Instead of calculating distances between all data points, only a smaller number of grid points are calculated. However, DGB is only suitable for dealing with low-dimensional data sets. In general, the data distribution in high-dimensional space may be more complex and contain more noise. Although the methods in [8, 9] are proposed to filter the noise, the additional operations increase the computational overhead.

To address the above problems, an improved density peaks clustering algorithm combining feature reduction and a data sampling strategy is proposed in this paper. Firstly, the original data feature space is compressed by some classical feature reduction methods. Then, the low-dimensional feature data are sampled by a super-uniform Quasi-Monte Carlo sequence, and the selected high-density Quasi-Monte Carlo points are used to replace the original data points for clustering. Finally, we perform a two-stage strategy to determine the category of the original data. The proposed method has the following advantages:
(1) The proposed algorithm reduces the computational complexity from $O(n^2)$ to $O(Nn)$, where $N$ and $n$ represent the number of selected QMC points and the size of the original data set, respectively. In general, $N \ll n$.
(2) Through feature reduction, the proposed algorithm reduces the noise from the original data and decreases the complexity of the high-dimensional feature space.
(3) Extensive experiments demonstrate the effectiveness of our proposed algorithm in terms of computational overhead and model performance.

2. Related Work

2.1. Feature Reduction

Feature reduction refers to mapping the data from a high-dimensional feature space to a low-dimensional space. The features of the high-dimensional data are extracted by a linear or nonlinear transformation. Hence, efficient low-dimensional features of the original data set can be obtained by various feature reduction methods. An ideal low-dimensional feature should retain as much classification information as possible while filtering the noise.

Generally speaking, feature reduction can be divided into linear and nonlinear methods. Principal component analysis (PCA) is a classical linear feature reduction method [10]. PCA transforms a group of possibly correlated variables into linearly uncorrelated variables by an orthogonal transformation. Auto-encoder (AE) and t-distributed stochastic neighbor embedding (t-SNE) are nonlinear feature reduction methods. AE can be regarded as a self-supervised method that consists of an encoder and a decoder [11]. The input data are mapped to the hidden layer by the encoder, while the decoder transforms the hidden layer features back to the input. Its goal is to combine some high-order features to reconstruct the input itself. The t-SNE is a machine learning method for feature reduction based on stochastic neighbor embedding (SNE) [12]. t-SNE maps high-dimensional data to two or more dimensions and alleviates the crowding problem in the process of feature reduction. All the above methods have been applied in many fields [13–15].

2.2. Density Peaks Clustering

Density peaks clustering (DPC) is proposed in [5], and it can efficiently deal with data sets of arbitrary shape without specifying the number of clusters in advance. The cluster center selected by DPC has two characteristics: (1) the local density of the cluster center should be larger than the local density of its neighbors; (2) the cluster center should be far away from other data points with a higher local density. To describe these characteristics, DPC defines two concepts for each data point $x_i$: the local density $\rho_i$ and the minimum distance $\delta_i$. The local density $\rho_i$ is formulated as

$$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c), \qquad \chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \geq 0 \end{cases} \tag{1}$$

where $d_{ij}$ represents the distance between $x_i$ and $x_j$. $d_c$ is the intercept (cutoff distance), which is the only artificially defined parameter in DPC. In the code provided by [5], $d_c$ is formulated as

$$d_c = d_{(\mathrm{round}(N \cdot p / 100))} \tag{2}$$

where $N$ is the size of the distance matrix, which defines the distance between any pair of data points, $d_{(k)}$ is the $k$-th smallest of these distances, and $p$ is a small percentage (the authors of [5] suggest about 1 to 2). When the data set is small, the Gaussian kernel function is used to calculate $\rho_i$, which is then formulated as

$$\rho_i = \sum_{j \neq i} \exp\left(-\left(\frac{d_{ij}}{d_c}\right)^2\right) \tag{3}$$

In addition, $\delta_i$ is formulated as

$$\delta_i = \begin{cases} \min_{j:\, \rho_j > \rho_i} d_{ij}, & \text{if some } \rho_j > \rho_i \\ \max_{j} d_{ij}, & \text{otherwise} \end{cases} \tag{4}$$

DPC draws the decision graph based on $\rho$ and $\delta$. Then, DPC selects the data points with both large $\rho$ and large $\delta$ as the cluster centers and assigns the remaining data points to the nearest class. DPC is a simple and efficient algorithm, and a series of follow-up works have been carried out [16–22]. However, DPC requires a huge computational overhead. The computational complexity of DPC is $O(n^2)$, which makes it unsuitable for large-scale data sets. To address this problem, a feasible strategy is to sample the data set [23]. Our work is based on the sampling strategy to reduce the computational overhead.
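The definitions of local density, minimum distance, and center selection above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the reference implementation of [5]: the function name `density_peaks` is hypothetical, the cutoff kernel is used for the density, ties in density are broken by index, centers are picked by the largest product of density and distance (a common stand-in for reading the decision graph by eye), and each remaining point follows its nearest neighbor of higher density.

```python
import math

def density_peaks(points, d_c, n_centers):
    """Toy density peaks clustering with the cutoff kernel (illustrative only)."""
    if n_centers < 1:
        raise ValueError("n_centers must be at least 1")
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density: number of other points closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    # Visit points from highest to lowest density (assumption: ties broken by index).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    parent = [-1] * n  # nearest point that comes earlier in `order`
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])  # the global density peak gets the largest distance
        else:
            j = min(order[:rank], key=lambda k: dist[i][k])
            delta[i], parent[i] = dist[i][j], j
    # Assumption: choose centers by the largest rho * delta instead of a manual pick.
    centers = sorted(order, key=lambda i: -rho[i] * delta[i])[:n_centers]
    labels = [-1] * n
    for cluster_id, i in enumerate(centers):
        labels[i] = cluster_id
    for i in order:  # parents are always labeled before their children
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return rho, delta, labels

square_a = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5)]
square_b = [(10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
rho, delta, labels = density_peaks(square_a + square_b, d_c=1.0, n_centers=2)
print(labels)
```

The nested distance table makes the quadratic cost in the number of points explicit, which is the overhead the sampling strategy aims to reduce.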

2.3. Quasi-Monte Carlo

As a statistical simulation method, the Monte Carlo method has been widely used in machine learning. The Quasi-Monte Carlo method is similar to the Monte Carlo method, but there are theoretical differences between them. The strength of the Quasi-Monte Carlo method is that it generates a deterministic super-uniformly distributed sequence (called a low-discrepancy sequence in mathematics) instead of the pseudo-random sequence generated by the Monte Carlo method. The Quasi-Monte Carlo method has been widely used in the field of machine learning [24, 25]. Specifically, the study in [24] utilizes the Quasi-Monte Carlo method to reduce the computational overhead that occurs in the parameter optimization process of neural networks. The study in [25] generates a Quasi-Monte Carlo sequence to perform the feature map and obtains low-rank features. Similarly, we generate a Quasi-Monte Carlo sequence for data sampling. Next, we briefly describe the Quasi-Monte Carlo sequence.

The Quasi-Monte Carlo sequence is a deterministic super-uniformly distributed sequence with low discrepancy. It has the property that any sufficiently long subsequence is uniformly distributed in the feature space. The most widely used Quasi-Monte Carlo sequences include the Halton sequence [26], the Faure sequence [27], and Niederreiter's sequence [28]. In our work, the Halton sequence is selected to perform the sampling strategy.

The Halton sequence is one of the standard low-discrepancy sequences, which is used to generate super-uniformly distributed quasi-random numbers. Compared with the pseudo-random numbers generated by the Monte Carlo method, the Halton sequence is mathematically proved to fluctuate less. Specifically, the approximation error of the Halton sequence is determined by the discrepancy of the sequence $\{x_1, \ldots, x_m\}$. The approximation error is bounded as follows:

$$|\varepsilon| \leq V(f)\, D_m$$

where $\varepsilon$ is the error term, $V(f)$ is the Hardy-Krause variation of the function $f$, and $D_m$ is the discrepancy of $\{x_1, \ldots, x_m\}$.

Because the order of $D_m$ is $O((\log m)^s / m)$ in dimension $s$, the approximation error order of the Quasi-Monte Carlo method is $O((\log m)^s / m)$. Similarly, the error order of the pseudo-random sequence is $O(m^{-1/2})$. Comparing these error orders, the error order of the Quasi-Monte Carlo method is smaller than that of the Monte Carlo method. Note that the above discussion only gives an upper bound on the approximation error. In practice, the Halton sequence often converges faster than this upper bound suggests. Generally speaking, the Quasi-Monte Carlo method greatly speeds up the convergence compared with the Monte Carlo method, and the quasi-random numbers it generates are more uniform.

The Monte Carlo method generates the pseudo-random numbers, and the Quasi-Monte Carlo method generates the quasi-random numbers. Figure 1 shows the comparison between the quasi-random numbers and the pseudo-random numbers on a two-dimensional plane. As shown in Figure 1, the pseudo-random numbers are not uniformly distributed in some places. However, the Halton sequence is highly uniformly distributed in the whole space. Intuitively, the Quasi-Monte Carlo method may be more comprehensive, while the Monte Carlo method has more blank areas. Hence, this paper adopts the Halton sequence to sample the original data and further proposes a new density peaks clustering algorithm.
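To make the construction of the Halton sequence concrete, the following minimal sketch generates Halton points with the radical inverse function. The function names and the choice of bases 2 and 3 (the usual choice for two dimensions) are illustrative assumptions; the paper does not state which bases or implementation it uses.

```python
def radical_inverse(index, base):
    """Mirror the base-`base` digits of `index` about the radix point."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result

def halton(count, bases=(2, 3)):
    """Return the first `count` Halton points, one coordinate per base."""
    return [tuple(radical_inverse(i, b) for b in bases) for i in range(1, count + 1)]

for point in halton(5):
    print(point)
```

Because every point is a deterministic function of its index, the same call always returns the same points, unlike a pseudo-random generator with a varying seed.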

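Figure 1: Comparison of quasi-random numbers (Halton sequence) and pseudo-random numbers on a two-dimensional plane; panels (a) and (b).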

3. Description of the Algorithm

In this section, a novel improved density peaks clustering algorithm based on the Quasi-Monte Carlo method (QMC-DPC) is proposed to improve the performance of DPC. Specifically, the proposed method includes two components: the feature reduction module and the data sampling module.

3.1. The Feature Reduction Module

In this module, we aim to reduce the feature dimension of data sets. The original data set $X$ is transformed into a low-dimensional feature data set $X'$ by various feature reduction methods, where the dimension of $X'$ is smaller than that of $X$. Our goal is to retain as much of the original information as possible while reducing the dimension of the data.

In practice, we utilize both linear and nonlinear feature reduction methods, namely PCA, AE, and t-SNE. Firstly, we perform zero-mean normalization on $X$. For each feature $j$ of $X$, we calculate the mean $\mu_j$ and the standard deviation $\sigma_j$. Hence, we obtain the normalized values $(x_{ij} - \mu_j) / \sigma_j$. Then, PCA, AE, and t-SNE are applied to the normalized data set. For PCA, we choose a number of principal components that is smaller than the original dimension of the data set, except for two-dimensional data sets, whose original dimension we keep. For AE, we use three layers: an encoder, a decoder, and a hidden layer. The dimension of the encoder and decoder equals the dimension of the input data, and the number of hidden layer units equals the reduced dimension. For the input data, we select the hidden layer features as $X'$. For t-SNE, the similarity between data points is measured by probability instead of Euclidean distance. Specifically, the similarity of data points in the original feature space is calculated by a Gaussian joint probability, while the heavy-tailed Student t-distribution is used to measure the similarity in the low-dimensional space. Then, we minimize the KL divergence to obtain the reduced features $X'$. Figure 2 shows the two-dimensional features obtained by PCA, AE, and t-SNE on Waveform and Landsat. The original dimensions of Waveform and Landsat are more than 20. From Figure 2, it can be seen that the low-dimensional features mapped from the higher-dimensional data are distinguishable. In Section 4, we will discuss how to select the feature reduction method by experimental analysis.
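The zero-mean normalization step can be sketched as follows. This is an illustrative sketch only: the helper name `zscore_columns` is hypothetical, and the use of the population standard deviation and the handling of constant features (left at zero) are assumptions, since the text does not specify them.

```python
import statistics

def zscore_columns(rows):
    """Standardize each feature (column) to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Assumption: population std; a constant column gets divisor 1.0 to avoid 0/0.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

data = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
for row in zscore_columns(data):
    print(row)
```

After this step every non-constant feature contributes on the same scale, which matters for distance-based methods such as t-SNE and for the radius used later in the sampling module.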

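Figure 2: Two-dimensional features obtained by PCA, AE, and t-SNE on Waveform and Landsat; panels (a) and (b).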

3.2. The Data Sampling Module

Although we compress the feature dimension of data sets through the feature reduction module, the computational complexity of DPC is still $O(n^2)$. In this module, we aim to reduce the time overhead of DPC. Hence, an improved density peaks clustering algorithm based on the super-uniform Quasi-Monte Carlo sequence (QMC-DPC) is proposed. In summary, we utilize the super-uniform Quasi-Monte Carlo sequence to sample the low-dimensional feature space of the data set. Then, the representative Quasi-Monte Carlo points are used to calculate $\rho$ and $\delta$ instead of the original data. Generally speaking, the number $N$ of selected Quasi-Monte Carlo points is much smaller than the size $n$ of the original data set. The detailed description of QMC-DPC is given in the following.

Specifically, we first define two basic concepts as follows:
(1) Circular data unit $C_i$: the circle with the Quasi-Monte Carlo point $q_i$ as the center and radius $r$
(2) Unit density $\rho(C_i)$: the number of data points contained in the circular data unit $C_i$

Assume that $X'$ is the low-dimensional feature data set obtained by the feature reduction module. We generate $M$ Quasi-Monte Carlo points in the feature space. With the Quasi-Monte Carlo points as the centers, the corresponding circular data units $C_i$ are determined under an appropriate radius $r$ (when the data set is small, the value of $r$ is set by experiment). Then, according to whether $C_i$ contains data points or not, the circular data units are divided into two categories: the nonempty unit set $\{C_i : \rho(C_i) > 0\}$ and the empty unit set $\{C_i : \rho(C_i) = 0\}$. Next, since an empty unit does not contain any data, the empty unit set and the corresponding Quasi-Monte Carlo points are eliminated. The effect is shown in Figure 3.
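The screening of circular data units described above can be sketched as a brute-force count. The function name `nonempty_units` is hypothetical, and treating points exactly on the circle boundary as inside the unit is an assumption; the text does not say whether the boundary is included.

```python
import math

def nonempty_units(data, qmc_points, r):
    """Return (qmc_point, unit_density) for every unit that contains data."""
    kept = []
    for q in qmc_points:
        # Unit density: number of data points within radius r of the QMC point.
        density = sum(1 for x in data if math.dist(x, q) <= r)
        if density > 0:  # empty units and their QMC points are discarded
            kept.append((q, density))
    return kept

data = [(0.10, 0.10), (0.12, 0.10), (0.90, 0.90)]
qmc = [(0.10, 0.10), (0.50, 0.50), (0.90, 0.85)]
print(nonempty_units(data, qmc, r=0.1))
```

In this toy input the middle Quasi-Monte Carlo point has no data within its radius, so only two of the three points survive the screening.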

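Figure 3: Effect of eliminating the empty circular data units and their Quasi-Monte Carlo points; panels (a) and (b).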

As shown in Figure 3, the remaining nonempty Quasi-Monte Carlo points are distributed around the sample points, while the removed empty Quasi-Monte Carlo points are far from the sample points. Hence, the distribution of the original data set can be sampled by the nonempty Quasi-Monte Carlo points. Furthermore, the local density of the original data set can be estimated by the unit density. Therefore, it is reasonable to utilize the nonempty Quasi-Monte Carlo points, instead of the original data points, to calculate the local density $\rho$ and the minimum distance $\delta$. Next, for all the nonempty Quasi-Monte Carlo points (assuming that their number is $N$, there is generally $N \ll n$), the distances between all pairs of nonempty Quasi-Monte Carlo points are calculated to obtain the distance matrix $D$:

$$D = (d_{ij})_{N \times N}$$

where $D$ is a symmetric matrix whose diagonal elements are zero, and $d_{ij}$ is the distance between the $i$-th and $j$-th nonempty Quasi-Monte Carlo points. Let $D_{\mathrm{sort}}$ denote the elements of $D$ in ascending order. When $N$ is too small, $d_c$ may be zero, which removes the effect of the intercept. Hence, we remove the zero elements in $D_{\mathrm{sort}}$ and take the first distance from the remaining elements as $d_c$. Then, we use equations (3) and (4) to calculate $\rho$ and $\delta$ of each nonempty Quasi-Monte Carlo point and draw the decision graph. Figure 4 shows the decision graphs of QMC-DPC and DPC on Waveform.
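One way to read the intercept selection above is sketched below: sort the pairwise distances, drop the zeros, and take a position in what remains. The function name `cutoff_distance` and the `percent` parameter are hypothetical; the position rule follows the percentage convention of the reference DPC code as an assumption, and with a small `percent` it reduces to taking the first remaining distance. Each pair is counted once here rather than twice as in the full symmetric matrix.

```python
import math

def cutoff_distance(points, percent=2.0):
    """Pick d_c from the sorted nonzero pairwise distances (illustrative sketch)."""
    dists = sorted(
        d
        for i, p in enumerate(points)
        for q in points[i + 1:]
        if (d := math.dist(p, q)) > 0.0  # zero distances are removed
    )
    if not dists:
        raise ValueError("all points coincide; no nonzero distance is available")
    # Assumption: 1-based position at `percent` of the remaining distances.
    position = max(1, round(len(dists) * percent / 100))
    return dists[position - 1]

points = [(0, 0), (0, 0), (3, 4), (6, 8)]
print(cutoff_distance(points))
```

Removing the zeros first guarantees a strictly positive intercept, so the kernels in equations (1) and (3) remain well defined even when duplicate points are present.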

Figure 4: Decision graphs of QMC-DPC and DPC on Waveform; panels (a) to (d).

As shown in Figure 4, the density peaks obtained by QMC-DPC are easier to distinguish than those of DPC, especially on the low-dimensional features generated by AE and t-SNE. Meanwhile, the decision graph of QMC-DPC contains fewer data points than that of DPC. Specifically, QMC-DPC (PCA), QMC-DPC (AE), and QMC-DPC (t-SNE) calculate 2742, 2499, and 2989 data points in the decision graph, respectively, while DPC calculates 5000. The above discussion further supports the effectiveness of the Quasi-Monte Carlo sampling method, which can be summarized in three aspects: (1) Owing to the super uniformity of the Quasi-Monte Carlo sequence, the data sampling is more comprehensive, which reduces the bias, as illustrated in Figure 1. (2) The number of selected nonempty Quasi-Monte Carlo points is small, which greatly reduces the time and space overhead, as illustrated in Figure 3. (3) Based on the $\rho$ and $\delta$ of the original data, data points located in dense areas are difficult to distinguish because their $\rho$ and $\delta$ are similar. In contrast, Quasi-Monte Carlo points essentially sample the local density, so the distinction between the selected nonempty Quasi-Monte Carlo points is enlarged. Finally, according to the nearest distance principle, we propose a two-stage classification strategy:
(i) The density peaks are selected as the class centers, and the remaining nonempty Quasi-Monte Carlo points are assigned to the nearest density peak. The first step obtains the clustering results of all nonempty Quasi-Monte Carlo points.
(ii) The data points of $X'$ are assigned to the nearest nonempty Quasi-Monte Carlo point. As the feature mapping is unique, the classification result of $X'$ is equivalent to the classification result of $X$. The second step obtains the final clustering results of all data points of $X$.
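The two-stage classification strategy can be sketched with a brute-force nearest-neighbor search. The names `nearest` and `two_stage_classify` are hypothetical, the density peaks are assumed to be given as indices into the list of nonempty Quasi-Monte Carlo points, and ties in distance go to the first candidate.

```python
import math

def nearest(point, candidates):
    """Index of the candidate closest to `point` (the first one wins ties)."""
    return min(range(len(candidates)), key=lambda k: math.dist(point, candidates[k]))

def two_stage_classify(data, qmc_points, peak_indices):
    """Label QMC points by nearest peak, then data points by nearest QMC point."""
    peaks = [qmc_points[i] for i in peak_indices]
    # Stage (i): every nonempty QMC point joins its nearest density peak.
    qmc_labels = [nearest(q, peaks) for q in qmc_points]
    # Stage (ii): every data point inherits the label of its nearest QMC point.
    data_labels = [qmc_labels[nearest(x, qmc_points)] for x in data]
    return qmc_labels, data_labels

qmc = [(0, 0), (1, 0), (10, 0), (11, 0)]
data = [(0.4, 0.1), (9.6, 0.2), (5.4, 0.0)]
print(two_stage_classify(data, qmc, peak_indices=[0, 3]))
```

Stage (ii) compares each data point with each retained Quasi-Monte Carlo point, so its cost is proportional to the product of the two counts rather than to the square of the data size.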

After the above discussion, QMC-DPC is depicted in Algorithm 1 and the whole process is shown in Figure 5.

Algorithm 1: QMC-DPC
Input:
The data set: $X$
Output:
Clustering results
Steps:
(1)	Perform feature reduction on $X$ to obtain the low-dimensional feature data $X'$;
(2)	Generate Quasi-Monte Carlo points and determine the circular data unit of each point on $X'$;
(3)	Count the unit density for each circular data unit and generate the nonempty unit set and the empty unit set;
(4)	Calculate the distance matrix $D$ based on the nonempty Quasi-Monte Carlo points and remove the zero elements in $D$. Sort the remaining elements and determine the intercept $d_c$;
(5)	Calculate $\rho$ and $\delta$ for each nonempty Quasi-Monte Carlo point by equations (3) and (4);
(6)	Draw the decision graph to select the cluster centers and determine the number of clusters;
(7)	According to the principle of nearest distance, assign the remaining nonempty Quasi-Monte Carlo points;
(8)	Assign the data points to the class of the nearest nonempty Quasi-Monte Carlo point;
(9)	Return the clustering results.

3.3. Algorithm Complexity Analysis

The key of DPC is to draw the decision graph based on $\rho$ and $\delta$. Our work retains this way of choosing cluster centers, but QMC-DPC only calculates $\rho$ and $\delta$ for the nonempty Quasi-Monte Carlo points that remain after the screening, making the computational complexity far less than that of DPC.

For a data set of $n$ points, DPC takes $O(n^2)$ space to store the distance matrix. The space complexity of QMC-DPC mainly includes $O(M)$ to generate the $M$ Quasi-Monte Carlo points, $O(N)$ to retain the $N$ nonempty Quasi-Monte Carlo points, and $O(N^2)$ to store the distance matrix of the nonempty Quasi-Monte Carlo point pairs. Therefore, the space complexity of QMC-DPC is $O(M + N + N^2)$. When $n$ is large, there is $N \ll n$ in general. However, when $n$ is relatively small, the space complexity of QMC-DPC becomes larger due to generating the Quasi-Monte Carlo points.

When calculating $\rho$ and $\delta$, DPC needs to calculate the distance matrix with a time complexity of $O(n^2)$. After selecting the cluster centers, the time complexity of classifying the data points is also $O(n^2)$. Therefore, the time complexity of the DPC algorithm is $O(n^2)$. The time complexity of QMC-DPC mainly includes the cost of calculating the unit density of the generated Quasi-Monte Carlo points, $O(N^2)$ to calculate $\rho$ and $\delta$ of the $N$ nonempty Quasi-Monte Carlo points, the cost of classifying the nonempty Quasi-Monte Carlo points, and $O(Nn)$ to classify the $n$ data points. The time complexity of the QMC-DPC algorithm is the sum of these terms. In general, the numbers of generated and nonempty Quasi-Monte Carlo points are both much smaller than $n$, making the time complexity of QMC-DPC less than that of DPC. However, when $n$ is relatively small, the time cost of QMC-DPC is more than that of DPC. In the experiments, we will further show that even with the addition of the feature reduction module, the proposed algorithm still has an advantage in running time.

4. Experiment and Analysis

4.1. Experimental Setup

To verify the performance of QMC-DPC, the proposed method is compared with related clustering algorithms, including DPC-KNN-PCA [17], SNN-DPC [18], DLORE-DP [16], DPC [5], AP [4], DBSCAN [3], and K-means [2]. The nearest neighbor number is set to 4 in SNN-DPC. The ratio of low-density points in DLORE-DP is set to 0.2. For DBSCAN, one parameter is set to 3 and the other is left empty. K-means needs the number of classes to be specified in advance. The data sets adopted in this section fall into two major categories: unlabeled data sets and labeled data sets. The details of these data sets are listed in Table 1. All labeled data sets are UCI data sets. Among the unlabeled data sets, Flame, Aggregation, and S2 are synthetic data sets. KDD is a biological data set, which is used to verify the superiority of our proposed algorithm on large-scale data sets with high-dimensional features.

Four evaluation criteria are adopted to evaluate the model performance on labeled data sets, i.e., Accuracy (Acc), F-measure (F), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI). These criteria are described as follows. Assume that $X = \{x_1, \ldots, x_n\}$ is the data set, and that $y_i$ and $\hat{y}_i$ represent the real label and the predicted label of $x_i$, respectively. Acc is denoted as

$$\mathrm{Acc} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{y_i = \mathrm{map}(\hat{y}_i)\}$$

where $\mathrm{map}(\cdot)$ is a permutation mapping function, which uses the Hungarian algorithm to match the predicted labels with the real labels.
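The role of the permutation mapping in Acc can be illustrated with a small sketch. Note the substitution: instead of the Hungarian algorithm, this example simply tries every one-to-one mapping, which gives the same optimum but is only practical for a handful of clusters. The function name `clustering_accuracy` is hypothetical, and the sketch assumes there are no more predicted clusters than real classes.

```python
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels):
    """Best-match accuracy over all one-to-one label mappings (brute force)."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    if len(pred_ids) > len(true_ids):
        raise ValueError("this sketch assumes no more clusters than classes")
    best_hits = 0
    for perm in permutations(true_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))  # predicted label -> real label
        hits = sum(1 for t, p in zip(true_labels, pred_labels) if mapping[p] == t)
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)

print(clustering_accuracy([0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0]))
```

The example shows why the mapping is needed: the predicted cluster identifiers are arbitrary, so a clustering that is mostly right can still disagree with the real labels in name.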

The F-measure is the harmonic mean of precision $P$ and recall $R$. $P$ is the ratio between the number of correct positive results and the number of all positive results returned by the classifier. $R$ is the ratio between the number of correct positive results and the number of all data that should have been identified as positive. $T_i$ is the set of all data that should have been classified as positive. $C_j$ is the set of all positive results identified by the classifier. $P$, $R$, and F-measure are defined by the following equations:

$$P(T_i, C_j) = \frac{|T_i \cap C_j|}{|C_j|}, \qquad R(T_i, C_j) = \frac{|T_i \cap C_j|}{|T_i|}, \qquad F(T_i, C_j) = \frac{(\beta^2 + 1)\, P(T_i, C_j)\, R(T_i, C_j)}{\beta^2 P(T_i, C_j) + R(T_i, C_j)}$$

where $\beta$ is a nonnegative real number that is set to 1. For the set $T_i$ induced by each real label, the best-matching cluster among all $C_j$ is selected to give its $F$ value:

$$F(T_i) = \max_{j} F(T_i, C_j)$$

Then, we use the weighted average of $F(T_i)$ to get the final $F$ value:

$$F = \sum_{i} \frac{|T_i|}{n} F(T_i)$$

The Normalized Mutual Information (NMI) measures the information that the predicted labels $\hat{Y}$ share with the ground truth $Y$. NMI is defined as the mutual information $I(\hat{Y}; Y)$ between the clustering result and the ground truth, normalized by $H(\hat{Y})$ and $H(Y)$, which denote the entropy of the clustering result and of the ground truth, respectively.

The Adjusted Rand Index (ARI) is an extension of the Rand Index (RI). ARI is defined as the following equation:

$$\mathrm{ARI} = \frac{\mathrm{RI} - E[\mathrm{RI}]}{\max(\mathrm{RI}) - E[\mathrm{RI}]}$$

where $\mathrm{RI} = (a + b)/(a + b + c + d)$, $a$ denotes the data pairs that are in the same class in $Y$ and in the same class in $\hat{Y}$, and $b$ denotes the data pairs that are in different classes in $Y$ and in different classes in $\hat{Y}$. $c$ denotes the data pairs that are in different classes in $Y$ and in the same class in $\hat{Y}$. $d$ denotes the data pairs that are in the same class in $Y$ and in different classes in $\hat{Y}$. The value of ARI is in the range $[-1, 1]$. The upper bound of all four evaluation criteria is 1. The larger these criteria are, the better the clustering results are.
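The pair counting behind ARI can be sketched directly. The function name `adjusted_rand_index` is hypothetical; the four counters use descriptive names for the four kinds of data pairs, and the closed-form expression in terms of pair counts is a standard equivalent of the chance-corrected definition, used here as an assumption about how the index is evaluated.

```python
from itertools import combinations

def adjusted_rand_index(true_labels, pred_labels):
    """ARI from the four pair counts (same/different in truth and prediction)."""
    same_same = same_diff = diff_same = diff_diff = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            same_same += 1
        elif same_true:
            same_diff += 1
        elif same_pred:
            diff_same += 1
        else:
            diff_diff += 1
    denom = ((same_same + same_diff) * (same_diff + diff_diff)
             + (same_same + diff_same) * (diff_same + diff_diff))
    if denom == 0:
        return 1.0  # both partitions are trivial (one cluster or all singletons)
    return 2.0 * (same_same * diff_diff - same_diff * diff_same) / denom

print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 1]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The second call shows that, unlike the plain Rand Index, the adjusted index can become negative when the agreement is worse than expected by chance.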

In the feature reduction module, some parameters are set in advance. For t-SNE, the learning rate is 500, the perplexity is 30, and the number of epochs is 800. For AE, the learning rate is 0.01, the optimizer is Adam, and the number of epochs is 300.

4.2. Experimental Results on Labeled Data Sets

In this section, the 9 UCI data sets in Table 1 are used to verify the performance of QMC-DPC. All data are normalized to the range [0, 1]. To avoid extreme cases, each algorithm is run 10 times and the average results are recorded. The values of the evaluation criteria are shown in Table 2, and the best values are highlighted in bold. The relevant parameters of QMC-DPC are recorded in Table 3.

As shown in Table 2, our proposed algorithm is superior to the other algorithms on the whole. Acc indicates the ratio of the number of correctly predicted samples to the total number of samples. In terms of Acc, QMC-DPC achieves the highest performance on all data sets except Waveform and Landsat. In particular, QMC-DPC is 33.6% and 34.3% higher than DPC on Zoo and Pima, respectively. F-measure indicates the degree of matching between the predicted labels and the true labels of the data set, and is the weighted harmonic mean of precision and recall. In terms of F-measure, QMC-DPC achieves the highest performance on nearly half of the data sets. NMI quantifies the similarity between the predicted labels and the true labels, which measures the robustness of the algorithm. In terms of NMI, QMC-DPC achieves the highest performance on all data sets except Landsat, Pima, and Zoo. In particular, QMC-DPC is 21.4% higher than DPC on Waveform. ARI is used to measure the degree of coincidence of the two data distributions. In terms of ARI, QMC-DPC achieves the highest performance on all data sets except Breast and Landsat. The ARI value of QMC-DPC is 73.2% higher than that of DPC on Zoo. In addition, the evaluation criterion values of QMC-DPC (PCA), QMC-DPC (AE), and QMC-DPC (t-SNE) are similar, and the model performance is better than that of DPC on the whole. The above results indicate that the combination of the feature reduction module and the data sampling module can improve the model performance.

4.3. Experimental Results of Unlabeled Data Sets

Since there are no real labels for the unlabeled data sets, the evaluation criteria Acc, F-measure, NMI, and ARI cannot be applied to them. To compare the performance on the unlabeled data sets, the Silhouette Coefficient (SC) and the Calinski-Harabasz index (CH) are used as evaluation criteria. For SC, we first calculate the silhouette coefficient $s(i)$ for each data point $x_i$:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$

where $a(i)$ is the average dissimilarity between data point $x_i$ and the other data points in the same class, and $b(i)$ is the minimum value of the average dissimilarity between data point $x_i$ and the other categories. Next, we obtain the silhouette coefficient of the data set based on $s(i)$:

$$\mathrm{SC} = \frac{1}{n} \sum_{i=1}^{n} s(i)$$

where $n$ is the number of all data points. The value of SC is in the range [−1, 1]. The larger the SC value is, the better the clustering result is.
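The Silhouette Coefficient can be sketched as follows. The function name `silhouette_coefficient` is hypothetical, Euclidean distance is assumed as the dissimilarity, and a point that is alone in its cluster contributes zero, which is a common convention rather than something stated in the text.

```python
import math

def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points (Euclidean dissimilarity assumed)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    total = 0.0
    for i, point in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # assumption: a singleton cluster contributes s(i) = 0
        a = sum(math.dist(point, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(math.dist(point, points[j]) for j in members) / len(members)
            for label, members in clusters.items()
            if label != labels[i]
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_coefficient(points, [0, 0, 1, 1]))
print(silhouette_coefficient(points, [0, 1, 0, 1]))
```

The first labeling groups nearby points and scores close to 1, while the second mixes the two groups and scores below zero.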

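CH is defined as follows:

$$\mathrm{CH} = \frac{\mathrm{tr}(B_k) / (k - 1)}{\mathrm{tr}(W_k) / (n - k)}$$

where $B_k = \sum_{q=1}^{k} n_q (c_q - c)(c_q - c)^{T}$, $n_q$ is the number of data points in class $q$, $c_q$ is the average of the data points in class $q$, and $c$ is the average of all data points. $W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^{T}$, where $C_q$ is the set of data points in class $q$ and $k$ is the number of clusters. The larger the CH value is, the better the clustering result is.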
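Since only the traces of the two scatter matrices are needed, CH can be computed from squared distances to the cluster means. The sketch below is illustrative; the function name `calinski_harabasz` is hypothetical, and it rejects inputs for which the ratio is undefined.

```python
def calinski_harabasz(points, labels):
    """CH index from between-cluster and within-cluster sums of squares."""
    n, dim = len(points), len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)

    def mean(pts):
        return [sum(p[d] for p in pts) / len(pts) for d in range(dim)]

    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    overall = mean(points)
    between = within = 0.0
    for members in clusters.values():
        center = mean(members)
        between += len(members) * sqdist(center, overall)  # trace of B_k
        within += sum(sqdist(p, center) for p in members)  # trace of W_k
    if k < 2 or k >= n or within == 0.0:
        raise ValueError("CH is undefined for this partition")
    return (between / (k - 1)) / (within / (n - k))

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(calinski_harabasz(points, [0, 0, 1, 1]))
```

Compact, well-separated clusters make the numerator large and the denominator small, which is why larger CH values indicate better clustering.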

In this section, three synthetic data sets and KDD are selected to verify the performance of QMC-DPC. Flame, Aggregation, and S2 are the classical synthetic data sets. KDD is a large-scale data set with high-dimensional features. Table 4 shows the SC and CH of all algorithms on unlabeled data sets. The best values are highlighted in bold. The relevant parameters of the QMC-DPC are recorded in Table 3.

As shown in Table 4, our proposed method obtains the best clustering results on the whole, especially QMC-DPC (AE). The “—” in Table 4 indicates that the algorithm cannot be executed because it exceeds the virtual memory. For SC, QMC-DPC (t-SNE) is higher than DPC on Flame, and DPC obtains the same results as our proposed method on Aggregation and S2. DPC-KNN-PCA also obtains the same results as our proposed method on S2. In general, QMC-DPC (AE) and QMC-DPC (t-SNE) achieve better performance than QMC-DPC (PCA), except on KDD. Limited by the t-SNE method, QMC-DPC (t-SNE) fails to perform clustering on KDD. In Section 4.6, we will make a further comprehensive analysis. In addition, we visualize the classification results on the synthetic data sets. Figure 6 shows the classification results on Aggregation and S2.

4.4. Experimental Results of Running Time

In this subsection, we further verify that our proposed method can effectively reduce the computational overhead. We select data sets with more than 2000 data points and record the running time in Table 5.

As shown in Table 5, compared with DPC, SNN-DPC, and AP, QMC-DPC achieves the best running time. The running time of QMC-DPC is at least 34.47%, 61.80%, 25.59%, and 50.85% lower than that of DPC on Segment, Waveform, Landsat, and S2, respectively. Generally speaking, the larger the data size, the more time is saved. For KDD, QMC-DPC (PCA) and QMC-DPC (AE) obtain results, while QMC-DPC (t-SNE) exceeds the available memory, which is a limitation of the t-SNE method. DPC, SNN-DPC, and AP also exceed the memory on KDD. This further confirms the effectiveness of our method. How to choose among QMC-DPC (PCA), QMC-DPC (AE), and QMC-DPC (t-SNE) will be discussed in Section 4.6. In addition, the running times of QMC-DPC (PCA) and QMC-DPC (AE) are close, whereas the computational overhead of QMC-DPC (t-SNE) is higher than both. The reason is that t-SNE requires a huge computational overhead, while the auto-encoder has only a shallow structure and does not contain a large number of training parameters. Furthermore, we compare the time complexity of our proposed method with that of the baseline methods. The results are recorded in Table 6, where the complexities are expressed in terms of the number of data points $n$, the number of cluster categories, the number of neighbors, the number of iterations, and the number of selected Quasi-Monte Carlo points $N$. Although the time complexity of QMC-DPC is quadratic in form, $N$ is much smaller than $n$ in practice. Hence, the time overhead of QMC-DPC is significantly reduced, which is consistent with Table 5.

4.5. Experimental Results of Sensitivity Analysis

In this section, we conduct parameter sensitivity analysis from multiple aspects, such as how feature dimensions affect model performance and running time. Specifically, we first calculate Acc, F, NMI, and ARI on UCI data sets where the feature dimension is in the range [16, 24]. The final results are recorded in Tables 7–10, respectively.
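As a reminder of what one of these external indices measures, the sketch below computes the adjusted Rand index from two label lists using only the Python standard library. It assumes that ARI here refers to the usual chance-corrected pair-counting index and that at least two samples are given; the function name `adjusted_rand_index` is hypothetical and this is not the evaluation code used in the experiments.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected agreement between two partitions of the same samples."""
    n = len(labels_true)
    # Contingency counts: how many samples share a (true, predicted) label pair.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2         # agreement of a perfect match
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both partitions
    return (sum_cells - expected) / (max_index - expected)
```

The index equals 1 when the two partitions agree up to a renaming of the labels and is close to 0, or negative, when the agreement is no better than chance.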

From Tables 7 to 10, it can be seen that the performance of the model decreases slightly overall as the dimension increases. This is caused by the loss of information introduced by the sampling strategy: as the dimension increases, the data distribution becomes more complex and is harder to cover with the same number of sample points. There are two ways to reduce the information loss caused by sampling: (1) increase the number of Quasi-Monte Carlo points and (2) appropriately increase the radius of the circular data unit. With the first method, the cost of generating and storing the Quasi-Monte Carlo points grows with the number of points, which increases both the time and the space overhead. With the second method, the choice of radius is very important. When the radius is so large that a data unit contains the entire data set, QMC-DPC effectively performs no sampling. Since the main purpose of this paper is to reduce the time overhead of DPC, we give priority to the second method.
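The trade-off between the two remedies can be seen in a small sketch of the sampling step. The code below is a simplified reading of the method, not the authors' implementation: it assumes a Halton sequence as the low-discrepancy generator, data already scaled to the unit square, and a plain count threshold for selecting representatives. The names `halton_points`, `select_representatives`, `radius`, and `min_count` are hypothetical.

```python
import math

def halton(index, base):
    """Radical inverse of `index` in `base` (one coordinate of a Halton point)."""
    result, f = 0.0, 1.0 / base
    while index > 0:
        result += f * (index % base)
        index //= base
        f /= base
    return result

def halton_points(count, bases=(2, 3)):
    """First `count` Halton points in the unit cube, one prime base per axis."""
    # Index 0 is skipped because it maps to the origin.
    return [tuple(halton(i, b) for b in bases) for i in range(1, count + 1)]

def select_representatives(data, qmc_points, radius, min_count=1):
    """Keep QMC points whose circular unit holds at least `min_count` data points."""
    selected = []
    for q in qmc_points:
        # Cell density: number of data points inside the circle around q.
        density = sum(1 for x in data if math.dist(q, x) <= radius)
        if density >= min_count:
            selected.append((q, density))
    return selected
```

Generating more points makes the loop over `qmc_points` longer and the stored list larger, whereas enlarging `radius` only changes how many data points each unit captures. In the limit where the radius covers the whole data set, every Quasi-Monte Carlo point is selected with the same density, so the selection no longer discriminates between dense and sparse regions.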

In addition, we further study the impact of the feature dimension on model performance and running time, with the feature dimension extended to the range [2, 9]. KDD is selected for this part, and the results are shown in Figure 7. As shown in Figure 7, QMC-DPC (AE) and QMC-DPC (PCA) achieve high performance in terms of SC. In contrast, both variants have a poor CH value when the feature dimension is 7, although the CH value increases sharply when the feature dimension is 9. The reason is that more Quasi-Monte Carlo points are generated to execute the sampling strategy, and the corresponding running time also increases considerably. The relevant parameters on KDD are recorded in Table 11.
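For readers unfamiliar with internal indices, the following sketch shows how a silhouette-style score can be computed without labels. It assumes that SC denotes the mean silhouette coefficient, which this section does not state explicitly, and that at least two clusters with distinct points are present; the function name `silhouette_coefficient` is hypothetical.

```python
import math

def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points (requires at least two clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the zero distance from p to itself does not change the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Values near 1 indicate compact, well-separated clusters, while negative values indicate points that sit closer to another cluster than to their own.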


4.6. Algorithm Summary

Based on the above experiments, we give a comprehensive discussion of QMC-DPC. As shown in Tables 2 and 4, QMC-DPC achieves the best performance overall. On the UCI data sets, QMC-DPC (PCA), QMC-DPC (AE), and QMC-DPC (t-SNE) obtain the highest value 8, 7, and 9 times, respectively. On the unlabeled data sets, they obtain the highest value 0, 6, and 3 times, respectively. QMC-DPC combined with nonlinear feature reduction methods therefore achieves better performance overall. In terms of running time, our proposed method is clearly superior. In particular, on a large-scale data set such as KDD, QMC-DPC achieves good performance, while most other baselines cannot be executed because they run out of memory. This further verifies the effectiveness of our method.

From Section 4.5, we find that the feature dimension affects the model performance and that the evaluation criteria are affected differently. In general, the model performance decreases as the feature dimension increases because of the information lost in sampling. Section 4.5 proposes two ways to overcome this problem: generating more Quasi-Monte Carlo points and increasing the radius of the data unit. Both aim to expand the sampling area. In our proposed method, we make a trade-off between running time and model performance by generating fewer Quasi-Monte Carlo points and setting fewer iterations for t-SNE and AE, which reduces the running time at some cost in model performance. We also increase the radius to reduce the information loss.

We summarize the following views on QMC-DPC:
(i) In general, we choose QMC-DPC combined with a nonlinear feature reduction method, such as QMC-DPC (AE) or QMC-DPC (t-SNE). When dealing with a large-scale data set, we prefer QMC-DPC (AE).
(ii) To reduce information loss, we give priority to expanding the radius of the data unit. Only then do we consider adding Quasi-Monte Carlo points.

In addition, there are still directions to explore for our proposed algorithm in the future, which are summarized as follows:
(i) Selecting the feature dimension is currently a heuristic task. In future work, we hope to build a multi-layer auto-encoder and construct the loss function based on hidden layer features. We aim to design the auto-encoder as a multi-task neural network.
(ii) We hope to propose a more comprehensive sampling method to reduce the loss of information. We can take the sample point itself as the center for sampling and then filter out the data samples in sparse areas. Finally, we need a strategy for the classification of outliers.

5. Conclusion

In this paper, a new density peaks clustering algorithm with high computational efficiency is proposed. The original feature space is compressed by different feature reduction methods. We then sample the reduced feature space based on the super-uniformly distributed sequence generated by the Quasi-Monte Carlo method. Our work effectively overcomes the high computational overhead of DPC while improving the model performance. Theoretically, the time complexity no longer grows quadratically with the size of the original data set but with the number of selected Quasi-Monte Carlo points, which is typically much smaller. The experimental results show that QMC-DPC improves the model performance of DPC and that the reduction in time overhead becomes greater as the data set size increases.

Data Availability

The data used to support the findings of this study were supplied by https://archive.ics.uci.edu/ml/index.php.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Major Research Project of National Natural Science Foundation of China (No. 91948303).


Copyright

Copyright © 2022 Zhihui Hu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
