Abstract
Density peaks clustering (DPC) is a well-known density-based clustering algorithm that handles non-spherical clusters well. However, DPC has high computational and space complexity in calculating the local density $\rho$ and the distance $\delta$, which makes it suitable only for small-scale data sets. In addition, for clustering high-dimensional data, the performance of DPC still needs to be improved. High-dimensional data not only make the data distribution more complex but also lead to more computational overhead. To address the above issues, we propose an improved density peaks clustering algorithm that combines feature reduction and a data sampling strategy. Specifically, features of the high-dimensional data are automatically extracted by principal component analysis (PCA), autoencoder (AE), and t-distributed stochastic neighbor embedding (t-SNE). Next, in order to reduce the computational overhead, we propose a novel data sampling method for the low-dimensional feature data. Firstly, the data distribution in the low-dimensional feature space is estimated by a Quasi-Monte Carlo (QMC) sequence with low-discrepancy characteristics. Then, representative QMC points are selected according to their cell densities. Next, the selected QMC points are used to calculate $\rho$ and $\delta$ instead of the original data points. In general, the number of selected QMC points is much smaller than the size of the initial data set. Finally, a two-stage classification strategy based on the clustering results of the QMC points is proposed to classify the original data set. Compared with current works, our proposed algorithm reduces the computational complexity from $O(n^2)$ to $O(m^2)$, where $m$ denotes the number of selected QMC points and $n$ is the size of the original data set, typically $m \ll n$. Experimental results demonstrate that the proposed algorithm can effectively reduce the computational overhead and improve the model performance.
1. Introduction
With the advent of the era of big data, the importance of data mining has become increasingly prominent [1]. As an unsupervised learning method, clustering is widely used in many fields, including image processing, medicine, and archaeology. There are various classical clustering algorithms, such as K-means [2], DBSCAN [3], and AP [4]. According to different standards, clustering algorithms are classified into different categories. Generally speaking, clustering algorithms are divided into partition-based, hierarchy-based, density-based, and grid-based methods.
In recent years, a new density peaks clustering (DPC) algorithm was proposed [5]. It is a typical density-based clustering algorithm with notable advantages. One advantage is that DPC relies on a decision graph to select the cluster centers. Specifically, DPC draws the decision graph of the data set by defining the local density $\rho$ and the distance $\delta$. Then, DPC determines the cluster centers based on the decision graph. The obtained cluster centers have two characteristics: (1) the local density of a cluster center is large, and the density of its neighborhood is not greater than its own; (2) the distance between a cluster center and any data point with a higher density is relatively large. Hence, the cluster centers are data points with both high local density and high distance, which are called density peaks. Another advantage is that DPC can not only deal with clusters of arbitrary shape but also does not require the number of categories to be specified in advance.
Although DPC has achieved good performance in many situations, it still has some drawbacks. Firstly, DPC needs to calculate the local density and distance of each data point, which makes the computational complexity $O(n^2)$. The expensive computational overhead limits the application of DPC to large-scale data sets. To address this issue, the study in [6] proposed a distributed density peaks clustering algorithm (EDDPC). EDDPC aggregates large-scale data sets in MapReduce and integrates local results to approximate the final results. However, EDDPC is a distributed algorithm and is not suitable for single-CPU scenarios. The study in [7] proposed a density-based and grid-based clustering algorithm (DGB). Instead of calculating the distances between all data points, only a smaller number of grid points are calculated. However, DGB is only suitable for low-dimensional data sets. In general, the data distribution in high-dimensional space may be more complex and contain more noise. Although the methods in [8, 9] were proposed to filter the noise, the additional operations increase the computational overhead.
To address the above problems, an improved density peaks clustering algorithm combining feature reduction and a data sampling strategy is proposed in this paper. Firstly, the original feature space is compressed by some classical feature reduction methods. Then, the low-dimensional feature data are sampled by a super-uniformly distributed Quasi-Monte Carlo sequence, and the selected high-density Quasi-Monte Carlo points are used to replace the original data points for clustering. Finally, we perform a two-stage strategy to determine the category of each original data point. The proposed method has the following advantages:
(1) The proposed algorithm reduces the computational complexity from $O(n^2)$ to $O(m^2)$, where $m$ and $n$ represent the number of selected QMC points and the size of the original data set, respectively. In general, $m \ll n$.
(2) Through feature reduction, the proposed algorithm reduces the noise from the original data and decreases the complexity of the high-dimensional feature space.
(3) Extensive experiments have demonstrated the effectiveness of our proposed algorithm in terms of computational overhead and model performance.
2. Related Work
2.1. Feature Reduction
Feature reduction means mapping data from a high-dimensional feature space to a low-dimensional space. The features of the high-dimensional data are extracted by a linear or nonlinear transformation. Hence, efficient low-dimensional features of the original data set can be obtained by various feature reduction methods. An ideal low-dimensional feature should retain the classification information as much as possible while filtering out the noise.
Generally speaking, feature reduction methods can be divided into linear and nonlinear methods. Principal component analysis (PCA) is a classical linear method [10]. PCA transforms a group of possibly correlated variables into linearly uncorrelated variables by an orthogonal transformation. Autoencoder (AE) and t-distributed stochastic neighbor embedding (t-SNE) are nonlinear methods. AE can be regarded as a self-supervised model consisting of an encoder and a decoder [11]. The input data are mapped to the hidden layer by the encoder, while the decoder transforms the hidden-layer features back to the input. Its goal is to combine high-order features to reconstruct the input itself. t-SNE is a machine learning method for feature reduction based on stochastic neighbor embedding (SNE) [12]. t-SNE maps high-dimensional data to two or more dimensions and alleviates the crowding problem in the process of feature reduction. All the above methods have been applied in many fields [13–15].
2.2. Density Peaks Clustering
Density peaks clustering (DPC) was proposed in [5], and it can efficiently deal with data sets of arbitrary shape without specifying the cluster number in advance. The cluster centers selected by DPC have two characteristics: (1) the local density of a cluster center should be larger than the local density of its neighbors; (2) data points with low local density should be far away from data points with high local density. To describe these characteristics, DPC defines two quantities for each data point $x_i$: the local density $\rho_i$ and the minimum distance $\delta_i$. The local density $\rho_i$ is formulated as

$$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c), \qquad \chi(x) = \begin{cases} 1, & x < 0, \\ 0, & x \ge 0, \end{cases} \tag{1}$$

where $d_{ij}$ represents the distance between $x_i$ and $x_j$, and $d_c$ is the cutoff distance, which is the only artificially defined parameter in DPC. In the code provided by [5], $d_c$ is formulated as

$$d_c = d_{\lceil 0.02 M \rceil}, \tag{2}$$

where the pairwise distances are sorted in ascending order and $M$ is the size of the distance matrix, which stores the distance between every pair of data points. When the data set is small, the Gaussian kernel function is used to calculate $\rho_i$:

$$\rho_i = \sum_{j \neq i} \exp\left(-\frac{d_{ij}^2}{d_c^2}\right). \tag{3}$$

In addition, $\delta_i$ is formulated as

$$\delta_i = \begin{cases} \min\limits_{j:\, \rho_j > \rho_i} d_{ij}, & \text{if } \exists j \text{ such that } \rho_j > \rho_i, \\ \max\limits_{j} d_{ij}, & \text{otherwise.} \end{cases} \tag{4}$$
DPC draws the decision graph based on $\rho$ and $\delta$. Then, DPC selects the data points with both large $\rho$ and large $\delta$ as the cluster centers and assigns each remaining data point to the same class as its nearest neighbor of higher density. DPC is a simple and efficient algorithm, and a series of follow-up works have been carried out [16–22]. However, DPC requires a huge computational overhead. The computational complexity of DPC is $O(n^2)$, which makes it unsuitable for large-scale data sets. To address this problem, a feasible strategy is to sample the data set [23]. Our work is based on the sampling strategy to reduce the computational overhead.
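To make these definitions concrete, the sketch below (an illustrative Python reimplementation under our own naming, not the authors' released code) computes the cutoff-kernel local density and the minimum distance to a higher-density point; the function name `dpc_rho_delta` and the `percent` parameter are our assumptions:

```python
import numpy as np

def dpc_rho_delta(X, percent=2.0):
    """Compute the DPC local density (cutoff kernel) and minimum distance delta."""
    n = len(X)
    # pairwise Euclidean distance matrix
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    # cutoff distance d_c: distance at the `percent`% position of sorted pair distances
    pair_dists = np.sort(d[np.triu_indices(n, k=1)])
    dc = pair_dists[int(round(len(pair_dists) * percent / 100.0))]
    rho = (d < dc).sum(axis=1) - 1          # exclude the point itself
    delta = np.zeros(n)
    order = np.argsort(-rho)                # indices by decreasing density
    delta[order[0]] = d[order[0]].max()     # highest-density point gets the max distance
    for k in range(1, n):
        i = order[k]
        delta[i] = d[i, order[:k]].min()    # distance to nearest higher-density point
    return rho, delta
```

Density peaks then show up as the points where both returned arrays are simultaneously large.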
2.3. QuasiMonte Carlo
As a statistical testing method, the Monte Carlo method has been widely used in machine learning. The Quasi-Monte Carlo method is similar to the Monte Carlo method, but there are theoretical differences between them. The superiority of the Quasi-Monte Carlo method lies in generating a deterministic, super-uniformly distributed sequence (called a low-discrepancy sequence in mathematics) instead of the pseudorandom sequence generated by the Monte Carlo method. The Quasi-Monte Carlo method has been widely used in the field of machine learning [24, 25]. Specifically, the study in [24] utilizes the Quasi-Monte Carlo method to reduce the computational overhead that occurs in the parameter optimization process of neural networks. The study in [25] generates a Quasi-Monte Carlo sequence to perform the feature map and obtain low-rank features. Similarly, we generate a Quasi-Monte Carlo sequence for data sampling. Next, we briefly describe the Quasi-Monte Carlo sequence.
A Quasi-Monte Carlo sequence is a deterministic, super-uniformly distributed sequence with low discrepancy. It has the property that any long subsequence is uniformly distributed in the feature space. The most widely used Quasi-Monte Carlo sequences include the Halton sequence [26], the Faure sequence [27], and Niederreiter's sequence [28]. In our work, the Halton sequence is selected to perform the sampling strategy.
The Halton sequence is one of the standard low-discrepancy sequences, and it is used to generate super-uniformly distributed random numbers. Compared with the pseudorandom numbers generated by the Monte Carlo method, it is mathematically proved that the volatility of the Halton sequence is smaller. Specifically, the approximation error of the Halton sequence is determined by the discrepancy of the sequence $\{x_k\}$. The approximation error is bounded by the Koksma–Hlawka inequality:

$$|\varepsilon| \le V(f)\, D_N^*,$$

where $\varepsilon$ is the error term, $V(f)$ is the Hardy–Krause variation of the function $f$, and $D_N^*$ is the discrepancy of $\{x_k\}$.
Because the order of $D_N^*$ is $O((\log N)^s / N)$ for an $s$-dimensional Halton sequence of length $N$, the approximation error order of the Quasi-Monte Carlo method is $O((\log N)^s / N)$. In comparison, the error order of a pseudorandom sequence is $O(N^{-1/2})$. Comparing the two, the error order of the Quasi-Monte Carlo method is smaller than that of the Monte Carlo method. Note that the above discussion only gives an upper bound on the approximation error. In fact, the convergence rate of the Halton sequence is much faster than the rate suggested by the upper bound. Generally speaking, the Quasi-Monte Carlo method converges much faster than the Monte Carlo method, and the random numbers it generates are more uniform.
The Monte Carlo method generates pseudorandom numbers, and the Quasi-Monte Carlo method generates quasirandom numbers. Figure 1 shows a comparison between the two on a two-dimensional plane. As shown in Figure 1, the pseudorandom numbers are not uniformly distributed in some places, whereas the Halton sequence is highly uniformly distributed across the whole space. Intuitively, the coverage of the Quasi-Monte Carlo method is more comprehensive, while the Monte Carlo method leaves more blank areas. Hence, this paper adopts the Halton sequence to sample the original data and further proposes a new density peaks clustering algorithm.
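As an illustration of how such a sequence is built, the following sketch generates Halton points by the standard radical-inverse construction (the function name and the choice of the first primes as bases are ours; each dimension uses a distinct prime base):

```python
import numpy as np

def halton(n_points, dim):
    """Generate the first n_points of the Halton sequence in [0, 1)^dim."""
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
    assert dim <= len(primes), "add more prime bases for higher dimensions"
    seq = np.empty((n_points, dim))
    for d in range(dim):
        base = primes[d]
        for i in range(n_points):
            # radical inverse of (i + 1) in the given base:
            # reverse the base-`base` digits of the integer around the radix point
            f, r, k = 1.0, 0.0, i + 1
            while k > 0:
                f /= base
                r += f * (k % base)
                k //= base
            seq[i, d] = r
    return seq
```

For example, the first coordinates (base 2) come out as 1/2, 1/4, 3/4, 1/8, …, which is exactly the van der Corput sequence underlying Figure 1's uniform coverage.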
3. Description of the Algorithm
In this section, a novel improved density peaks clustering algorithm based on the Quasi-Monte Carlo method (QMC-DPC) is proposed to improve the performance of DPC. Specifically, the proposed method includes two components: the feature reduction module and the data sampling module.
3.1. The Feature Reduction Module
In this module, we aim to reduce the feature dimension of the data sets. The original data set $X \in \mathbb{R}^{n \times d}$ will be transformed into $X' \in \mathbb{R}^{n \times d'}$ by various feature reduction methods, where $d' < d$. Our goal is to retain the original information as much as possible while reducing the dimension of the data.
In practice, we utilize both linear and nonlinear feature reduction methods, namely PCA, AE, and t-SNE. Firstly, we perform zero-mean normalization on $X$. For each feature $j$, we calculate the mean $\mu_j$ and the standard deviation $\sigma_j$ and obtain the normalized values $x'_{ij} = (x_{ij} - \mu_j)/\sigma_j$. Then, PCA, AE, and t-SNE are applied to the normalized data set. For PCA, we choose a number of principal components smaller than the original dimension of the data set (except for two-dimensional data sets, for which we keep the original dimension). For AE, we use an AE with three layers: an encoder, a hidden layer, and a decoder. The dimensions of the encoder and decoder equal $d$, and the number of hidden-layer units equals $d'$. For the input data, we take the hidden-layer features as the reduced representation $X'$. For t-SNE, the similarity between data points is measured by probability instead of Euclidean distance. Specifically, the similarity of data points in the original feature space is modeled by a Gaussian joint probability, while the heavy-tailed Student t-distribution is used in the low-dimensional space. Then, we minimize the KL divergence between the two distributions to obtain the reduced features $X'$. Figure 2 shows the two-dimensional features obtained by PCA, AE, and t-SNE on Waveform and Landsat, whose original dimensions exceed 20. From Figure 2, it can be seen that the low-dimensional features mapped from the higher-dimensional data are distinguishable. In Section 4, we will discuss how to select the feature reduction method through experimental analysis.
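A minimal sketch of this module using scikit-learn is given below. The AE branch is omitted since it would need a neural-network library; the t-SNE settings mirror the parameters listed later in the experimental setup, and the function name `reduce_features` is our assumption:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def reduce_features(X, method="pca", n_components=2):
    """Z-score normalize X, then map it into a low-dimensional feature space."""
    Xn = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature
    if method == "pca":
        return PCA(n_components=n_components).fit_transform(Xn)
    if method == "tsne":
        # the experiments use learning rate 500, perplexity 30, and 800 epochs
        return TSNE(n_components=n_components, learning_rate=500,
                    perplexity=30).fit_transform(Xn)
    raise ValueError(f"unknown method: {method}")
```

The PCA branch is deterministic and cheap, while the t-SNE branch is itself iterative and costly; this asymmetry reappears in the running-time comparison of Section 4.4.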
3.2. The Data Sampling Module
Although we compress the feature dimension of the data sets through the feature reduction module, the computational complexity of DPC is still $O(n^2)$. In this module, we aim to reduce the time overhead of DPC. Hence, an improved density peaks clustering algorithm based on a super-uniformly distributed Quasi-Monte Carlo sequence (QMC-DPC) is proposed. In summary, we utilize the Quasi-Monte Carlo sequence to sample the low-dimensional feature space of the data set. Then, the representative Quasi-Monte Carlo points are used to calculate $\rho$ and $\delta$ instead of the original data points. Generally speaking, the number of selected Quasi-Monte Carlo points $m$ is much smaller than the size of the original data set $n$. A detailed description of QMC-DPC is given in the following.
Specifically, we first define two basic concepts:
(1) Circular data unit $C_i$: the circle with the Quasi-Monte Carlo point $q_i$ as its center and radius $r$.
(2) Unit density $\rho_{C_i}$: the number of data points contained in the circular data unit $C_i$.
Assume that $X'$ is the low-dimensional feature data set obtained by the feature reduction module. We generate $N$ Quasi-Monte Carlo points in the feature space. With the Quasi-Monte Carlo points as centers, the corresponding circular data units $C_i$ are determined under an appropriate radius $r$ (when $n$ is small, $r$ is determined empirically). Then, according to whether $C_i$ contains data points or not, the circular data units are divided into two categories: the nonempty unit set, whose units satisfy $\rho_{C_i} > 0$, and the empty unit set, whose units satisfy $\rho_{C_i} = 0$. Since an empty unit does not contain any data, the empty unit set and the corresponding Quasi-Monte Carlo points are eliminated. The effect is shown in Figure 3.
As shown in Figure 3, the remaining nonempty Quasi-Monte Carlo points are distributed around the sample points, while the removed empty Quasi-Monte Carlo points are far from them. Hence, the distribution of the original data set can be approximated by the nonempty Quasi-Monte Carlo points. Furthermore, the local density of the original data set can be estimated by the unit density $\rho_{C_i}$. Therefore, it is reasonable to use the nonempty Quasi-Monte Carlo points instead of the original data points to calculate the local density $\rho$ and the minimum distance $\delta$. Next, for all the nonempty Quasi-Monte Carlo points (assuming their number is $m$; generally, $m \ll n$), the distances between all pairs of nonempty Quasi-Monte Carlo points are calculated to obtain the distance matrix $D \in \mathbb{R}^{m \times m}$, where $D$ is a symmetric matrix whose diagonal elements are zero. Let $L$ be the ascending ordering of all elements in $D$. When $r$ is too small, the computed $d_c$ may be zero, which means the cutoff distance loses its function. Hence, we remove the zero elements in $L$ and take the distance at the cutoff position among the remaining elements as $d_c$. Then, we use equations (3) and (4) to calculate $\rho$ and $\delta$ for each nonempty Quasi-Monte Carlo point and draw the decision graph. Figure 4 shows the decision graphs of QMC-DPC and DPC on Waveform.
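The screening step described above can be sketched as follows (an illustrative implementation; the function name `qmc_sample`, the default counts, and the use of `scipy.stats.qmc` as the Halton generator are our assumptions):

```python
import numpy as np
from scipy.stats import qmc

def qmc_sample(X_low, n_qmc=1000, radius=0.05):
    """Keep the Halton points whose circular unit contains at least one data point.

    Returns the nonempty QMC points and their unit densities.
    """
    lo, hi = X_low.min(axis=0), X_low.max(axis=0)
    halton_points = qmc.Halton(d=X_low.shape[1], seed=0).random(n_qmc)
    Q = lo + halton_points * (hi - lo)          # scale [0,1)^d points to the data range
    # unit density: number of data points within `radius` of each QMC point
    d = np.linalg.norm(Q[:, None, :] - X_low[None, :, :], axis=-1)
    density = (d <= radius).sum(axis=1)
    keep = density > 0                          # drop the empty circular units
    return Q[keep], density[keep]
```

The surviving points both approximate the data distribution and carry a density estimate, so the subsequent decision graph can be drawn over the $m$ survivors instead of all $n$ data points.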
As shown in Figure 4, the density peaks obtained by QMC-DPC are easier to distinguish than those of DPC, especially on the low-dimensional features generated by AE and t-SNE. Meanwhile, the number of data points in the decision graph of QMC-DPC is smaller than that of DPC. Specifically, QMC-DPC (PCA), QMC-DPC (AE), and QMC-DPC (t-SNE) calculate 2742, 2499, and 2989 points in the decision graph, respectively, while DPC calculates 5000. The above discussion further demonstrates the effectiveness of the Quasi-Monte Carlo sampling method, which can be summarized in three aspects: (1) Owing to the super-uniformity of the Quasi-Monte Carlo sequence, the data sampling is more comprehensive, which reduces bias; this is illustrated in Figure 1. (2) The number of selected nonempty Quasi-Monte Carlo points is small, which greatly reduces the time and space overhead; this is illustrated in Figure 3. (3) In terms of $\rho$ and $\delta$, data points located in dense areas are difficult to distinguish because their $\rho$ and $\delta$ values are similar. In contrast, the Quasi-Monte Carlo points essentially sample the local density, so the distinction between the selected nonempty Quasi-Monte Carlo points is enlarged. Finally, according to the nearest-distance principle, we propose a two-stage classification strategy:
(i) The density peaks are selected as the class centers, and each remaining nonempty Quasi-Monte Carlo point is assigned to the nearest density peak. The first stage obtains the clustering results of all nonempty Quasi-Monte Carlo points.
(ii) Each data point of $X'$ is assigned to the class of its nearest nonempty Quasi-Monte Carlo point. As the feature mapping is unique, the classification result of $X'$ is equivalent to the classification result of $X$. The second stage obtains the final clustering results of all data points of $X$.
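The two-stage strategy can be sketched in a few lines (our own illustrative code; `Q` holds the nonempty QMC points, `peak_idx` the indices of the density peaks chosen from the decision graph, and `X_low` the low-dimensional data):

```python
import numpy as np

def two_stage_assign(Q, peak_idx, X_low):
    """Stage 1: assign each nonempty QMC point to its nearest density peak.
    Stage 2: give each original data point the label of its nearest QMC point."""
    peaks = Q[peak_idx]
    d_qp = np.linalg.norm(Q[:, None, :] - peaks[None, :, :], axis=-1)
    qmc_labels = d_qp.argmin(axis=1)                                  # stage 1
    d_xq = np.linalg.norm(X_low[:, None, :] - Q[None, :, :], axis=-1)
    return qmc_labels[d_xq.argmin(axis=1)]                            # stage 2
```

Only stage 2 touches all $n$ points, and it needs just one nearest-neighbor query per point against the $m$ QMC points, which is where the overall cost saving comes from.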
After the above discussion, QMC-DPC is depicted in Algorithm 1, and the whole process is shown in Figure 5.

3.3. Algorithm Complexity Analysis
The key of DPC is to draw the decision graph based on $\rho$ and $\delta$. Our work retains this way of choosing cluster centers, but QMC-DPC only calculates $\rho$ and $\delta$ for the nonempty Quasi-Monte Carlo points that survive the screening, making the computational complexity far less than that of DPC.
For a data set of size $n$, DPC takes $O(n^2)$ space to store the distance matrix. The space complexity of QMC-DPC mainly includes: $O(N)$ to generate the Quasi-Monte Carlo points, $O(m)$ to retain the nonempty Quasi-Monte Carlo points, and $O(m^2)$ to store the distance matrix of the nonempty Quasi-Monte Carlo point pairs. Therefore, the space complexity of QMC-DPC is $O(N + m^2)$. When $n$ is large, $N + m^2 \ll n^2$ in general. However, when $n$ is relatively small, the space complexity of QMC-DPC may become larger because of the generated Quasi-Monte Carlo points.
When calculating $\rho$ and $\delta$, DPC needs to compute the distance matrix with a time complexity of $O(n^2)$. After selecting the cluster centers, the time complexity of classifying the data points is $O(n)$. Therefore, the time complexity of DPC is $O(n^2)$. The time complexity of QMC-DPC mainly includes: $O(Nn)$ to calculate the unit densities of the Quasi-Monte Carlo points, $O(m^2)$ to calculate $\rho$ and $\delta$ for the nonempty Quasi-Monte Carlo points, and $O(m)$ to classify the nonempty Quasi-Monte Carlo points plus $O(mn)$ to classify the data points. Therefore, the time complexity of QMC-DPC is $O(Nn + m^2 + mn)$. In general, $m \ll n$ and $N \ll n$, making the time complexity of QMC-DPC less than that of DPC. However, when $n$ is relatively small, the time cost of QMC-DPC may exceed that of DPC. In the experiments, we further show that even with the addition of the feature reduction module, the proposed algorithm still has a time advantage.
4. Experiment and Analysis
4.1. Experimental Setup
To verify the performance of QMC-DPC, the proposed method is compared with related clustering algorithms, including DPC-KNN-PCA [17], SNN-DPC [18], DLORE-DP [16], DPC [5], AP [4], DBSCAN [3], and K-means [2]. The number of nearest neighbors is set to 4 in SNN-DPC. The ratio of low-density points in DLORE-DP is set to 0.2. For DBSCAN, the parameter $MinPts$ is set to 3. K-means needs the number of classes to be specified in advance. The data sets adopted in this section fall into two major categories: unlabeled data sets and labeled data sets. The details of these data sets are listed in Table 1. The labeled data sets are all UCI data sets. Among the unlabeled data sets, Flame, Aggregation, and S2 are synthetic data sets. KDD is a biological data set, which is used to verify the superiority of our proposed algorithm on large-scale, high-dimensional data sets.
Four evaluation criteria are adopted to evaluate the model performance on the labeled data sets, i.e., accuracy (Acc), F-measure (F), normalized mutual information (NMI), and adjusted Rand index (ARI). These criteria are described as follows. Assume that $X$ is the data set, and $Y$ and $\hat{Y}$ represent the real labels and the predicted labels, respectively. Acc is denoted as

$$\mathrm{Acc} = \frac{1}{n} \sum_{i=1}^{n} \delta\big(y_i, \mathrm{map}(\hat{y}_i)\big),$$

where $\delta(a, b) = 1$ if $a = b$ and 0 otherwise, and $\mathrm{map}(\cdot)$ is a permutation mapping function that uses the Hungarian algorithm to match the predicted labels with the real labels.
The F-measure is the harmonic mean of precision $P$ and recall $R$. $P$ is the ratio between the number of correct positive results and the number of all positive results returned by the classifier. $R$ is the ratio between the number of correct positive results and the number of all data points that should have been identified as positive. Let $A$ be the set of data points that should have been classified as positive and $B$ the set of positive results identified by the classifier. $P$, $R$, and the F-measure are defined by the following equations:

$$P = \frac{|A \cap B|}{|B|}, \qquad R = \frac{|A \cap B|}{|A|}, \qquad F_\beta = \frac{(1+\beta^2)\, P R}{\beta^2 P + R},$$

where $\beta$ is a nonnegative real number that is set to 1. For each real class $c_i$, the predicted cluster $k_j$ with the largest F value is selected to give its value:

$$F(c_i) = \max_j F_\beta(c_i, k_j).$$

Then, we use the weighted average of $F(c_i)$ to get the final value:

$$F = \sum_i \frac{|c_i|}{n}\, F(c_i).$$
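For illustration, Acc with the Hungarian matching can be computed as below (a sketch built on `scipy.optimize.linear_sum_assignment`; the function name `clustering_accuracy` is our assumption):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Acc under the optimal label permutation found by the Hungarian algorithm."""
    classes = np.unique(np.concatenate([y_true, y_pred]))
    k = len(classes)
    # w[i, j] = number of points with predicted label i and true label j
    w = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        w[np.searchsorted(classes, p), np.searchsorted(classes, t)] += 1
    row, col = linear_sum_assignment(-w)   # maximize the matched counts
    return w[row, col].sum() / len(y_true)
```

Because cluster labels are arbitrary, a relabeled but otherwise perfect clustering still scores 1.0 under this criterion.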
The normalized mutual information (NMI) measures the information that the predicted labels $\hat{Y}$ share with the ground truth $Y$. NMI is defined as the following equation:

$$\mathrm{NMI}(Y, \hat{Y}) = \frac{2\, I(Y; \hat{Y})}{H(Y) + H(\hat{Y})},$$

where $I(Y; \hat{Y})$ is the mutual information between the clustering result and the ground truth, and $H(\hat{Y})$ and $H(Y)$ denote the entropy of the clustering result and the ground truth, respectively.
The adjusted Rand index (ARI) is an extension of the Rand index (RI). ARI is defined as the following equation:

$$\mathrm{ARI} = \frac{\mathrm{RI} - E[\mathrm{RI}]}{\max(\mathrm{RI}) - E[\mathrm{RI}]}, \qquad \mathrm{RI} = \frac{a + b}{a + b + c + d},$$

where $a$ denotes the number of data pairs that are in the same class in $Y$ and in the same class in $\hat{Y}$, $b$ denotes the pairs that are in different classes in $Y$ and in different classes in $\hat{Y}$, $c$ denotes the pairs that are in different classes in $Y$ but in the same class in $\hat{Y}$, and $d$ denotes the pairs that are in the same class in $Y$ but in different classes in $\hat{Y}$. The value of ARI is in the range [−1, 1]. The upper bound of these evaluation criteria is 1: the larger the criteria are, the better the clustering results are.
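In practice, both criteria are available in scikit-learn; the snippet below (a usage sketch, not part of the original evaluation code) shows that they are invariant to label permutation:

```python
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]   # a perfect clustering under a different labeling
nmi = normalized_mutual_info_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)
# both criteria reach their upper bound of 1 for a perfect clustering
```

ARI additionally corrects for chance, so a random labeling scores near 0 rather than near the RI baseline.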
In the feature reduction module, some parameters are set in advance. For t-SNE, the learning rate is 500, the perplexity is 30, and the number of epochs is 800. For AE, the learning rate is 0.01, the optimizer is Adam, and the number of epochs is 300.
4.2. Experimental Results on Labeled Data Sets
In this section, the 9 UCI data sets in Table 1 are used to verify the performance of QMC-DPC. All data are normalized to [0, 1]. To avoid extreme cases, each algorithm is run 10 times and the average results are recorded. The values of the evaluation criteria are shown in Table 2, with the best values highlighted in bold. The relevant parameters of QMC-DPC are recorded in Table 3.
As shown in Table 2, our proposed algorithm is superior to the other algorithms on the whole. Acc indicates the ratio of correctly predicted samples to total samples. In terms of Acc, QMC-DPC achieves the highest performance on all data sets except Waveform and Landsat. In particular, QMC-DPC is 33.6% and 34.3% higher than DPC on Zoo and Pima, respectively. The F-measure indicates the matching degree between the predicted labels and the true labels, as the weighted harmonic mean of precision and recall. In terms of the F-measure, QMC-DPC achieves the highest performance on nearly half of the data sets. NMI quantifies the similarity between the predicted labels and the true labels and reflects the robustness of the algorithm. In terms of NMI, QMC-DPC achieves the highest performance on all data sets except Landsat, Pima, and Zoo. In particular, QMC-DPC is 21.4% higher than DPC on Waveform. ARI measures the degree of coincidence between two label distributions. In terms of ARI, QMC-DPC achieves the highest performance on all data sets except Breast and Landsat. The ARI value of QMC-DPC is 73.2% higher than that of DPC on Zoo. In addition, the criterion values of QMC-DPC (PCA), QMC-DPC (AE), and QMC-DPC (t-SNE) are similar, and their performance is better than that of DPC on the whole. These results indicate that the combination of the feature reduction module and the data sampling module can improve the model performance.
4.3. Experimental Results of Unlabeled Data Sets
Since there are no real labels for the unlabeled data sets, the criteria Acc, F-measure, NMI, and ARI cannot be applied to them. To compare performance on the unlabeled data sets, the silhouette coefficient (SC) and Calinski–Harabasz (CH) criteria are used. For SC, we first calculate the silhouette coefficient $s(i)$ for each data point $x_i$:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}},$$

where $a(i)$ is the average dissimilarity between the data point $x_i$ and the other data points in the same class, and $b(i)$ is the minimum, over the other classes, of the average dissimilarity between $x_i$ and the points of that class. Next, we obtain the silhouette coefficient for the whole data set based on $s(i)$:

$$\mathrm{SC} = \frac{1}{n} \sum_{i=1}^{n} s(i),$$

where $n$ is the number of data points. The value of SC is in the range [−1, 1]. The larger the SC value is, the better the clustering result is.
CH is defined as follows:

$$\mathrm{CH} = \frac{\operatorname{tr}(B_k)/(k-1)}{\operatorname{tr}(W_k)/(n-k)}, \qquad B_k = \sum_{q=1}^{k} n_q (c_q - c)(c_q - c)^{T}, \qquad W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^{T},$$

where $n_q$ is the number of data points in class $C_q$, $c_q$ is the mean of the data points in class $C_q$, $c$ is the mean of all data points, and $k$ is the number of clusters. The larger the CH value is, the better the clustering result is.
In this section, three synthetic data sets and KDD are selected to verify the performance of QMC-DPC. Flame, Aggregation, and S2 are classical synthetic data sets. KDD is a large-scale data set with high-dimensional features. Table 4 shows the SC and CH of all algorithms on the unlabeled data sets, with the best values highlighted in bold. The relevant parameters of QMC-DPC are recorded in Table 3.
As shown in Table 4, our proposed method obtains the best clustering results on the whole, especially QMC-DPC (AE). The "—" in Table 4 indicates that the algorithm could not execute because it exceeded the virtual memory. For SC, QMC-DPC (t-SNE) is higher than DPC on Flame, while DPC obtains the same results as our proposed method on Aggregation and S2. DPC-KNN-PCA also obtains the same results as our proposed method on S2. In general, QMC-DPC (AE) and QMC-DPC (t-SNE) achieve better performance than QMC-DPC (PCA), except on KDD. Limited by the t-SNE method, QMC-DPC (t-SNE) fails to perform clustering on KDD. In Section 4.6, we will make a further comprehensive analysis. In addition, we visualize the classification results on the synthetic data sets. Figure 6 shows the classification results on Aggregation and S2.
4.4. Experimental Results of Running Time
In this subsection, we further verify that our proposed method can effectively reduce the computational overhead. We select data sets with more than 2000 data points and record the running time in Table 5.
As shown in Table 5, compared with DPC, SNN-DPC, and AP, QMC-DPC achieves the best running-time performance. QMC-DPC is at least 34.47%, 61.80%, 25.59%, and 50.85% faster than DPC on Segment, Waveform, Landsat, and S2, respectively. Generally speaking, the larger the data size, the more time is saved. For KDD, QMC-DPC (PCA) and QMC-DPC (AE) obtain results, while QMC-DPC (t-SNE) exceeds memory; this is a limitation of the t-SNE method. In addition, DPC, SNN-DPC, and AP also exceed memory, which further confirms the effectiveness of our method. How to choose among QMC-DPC (PCA), QMC-DPC (AE), and QMC-DPC (t-SNE) will be discussed in Section 4.6. It can also be seen that the running times of QMC-DPC (PCA) and QMC-DPC (AE) are close, while the computational overhead of QMC-DPC (t-SNE) is higher than that of the other two. The reason is that t-SNE requires a huge computational overhead, while the autoencoder has only a shallow structure and does not contain a large number of training parameters. Furthermore, we compare the time complexity of our proposed method with that of the baseline methods; the results are recorded in Table 6. Here, we denote the number of data points by $n$, the number of cluster categories by $k$, the number of neighbors by $K$, the number of iterations by $t$, and the number of selected Quasi-Monte Carlo points by $m$. Although the time complexity of QMC-DPC is quadratic in $m$, $m$ is much smaller than $n$ in practice. Hence, the time overhead of QMC-DPC is significantly reduced, as Table 5 also shows.
4.5. Experimental Results of Sensitivity Analysis
In this section, we conduct a parameter sensitivity analysis from multiple aspects, such as how the feature dimension affects model performance and running time. Specifically, we first calculate Acc, F, NMI, and ARI on the UCI data sets with the feature dimension in the range [16, 24]. The final results are recorded in Tables 7–10, respectively.
From Tables 7 to 10, it can be seen that the performance of the model decreases slightly with the increase of dimension on the whole. This is caused by the loss of information from the sampling strategy as the dimension increases: with more dimensions, the data distribution becomes more complex. To address this issue, there are two ways to reduce the information loss caused by sampling: (1) increase the number of Quasi-Monte Carlo points, and (2) appropriately increase the radius $r$ of the circular data unit. If we adopt the first method, the complexity of QMC-DPC in generating and storing the Quasi-Monte Carlo points is $O(N)$, so the time and space overhead grows as the number of Quasi-Monte Carlo points increases. If the second method is adopted, the selection of the radius $r$ is very important: when $r$ is so large that a unit contains the entire data set, QMC-DPC no longer performs any sampling. Since the main purpose of this paper is to reduce the time overhead of DPC, we give priority to the second method.
In addition, we further study the impact of the feature dimension on model performance and running time, with the feature dimension extended to [2, 9]. In this part, KDD is selected, and the results are shown in Figure 7. As shown in Figure 7, QMC-DPC (AE) and QMC-DPC (PCA) achieve high performance in terms of SC. In contrast, QMC-DPC (AE) and QMC-DPC (PCA) have poor CH values when the feature dimension is 7, while the CH value increases sharply when the feature dimension is 9. The reason is that when we generate more Quasi-Monte Carlo points to execute the sampling strategy, the corresponding running time also increases to a great extent. The relevant parameters on KDD are recorded in Table 11.
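For the unlabeled setting evaluated in Figure 7, SC (silhouette coefficient) and CH (Calinski-Harabasz index) can be computed with scikit-learn. The snippet below is only an illustration of how these two criteria are obtained: it uses two synthetic, well-separated blobs and k-means labels in place of the KDD features and QMC-DPC assignments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(0)
# Two well-separated toy blobs stand in for reduced features.
X = np.vstack([rng.normal(0.0, 0.3, (100, 2)),
               rng.normal(3.0, 0.3, (100, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

sc = silhouette_score(X, labels)         # SC in [-1, 1], higher is better
ch = calinski_harabasz_score(X, labels)  # CH >= 0, higher is better
print(f"SC={sc:.3f}, CH={ch:.1f}")
```

Both criteria are internal (label-free) measures, which is why they are used here instead of Acc, F, NMI, and ARI when ground-truth labels are unavailable.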
4.6. Algorithm Summary
Based on the above experiments, we give a comprehensive discussion of QMC-DPC. Specifically, as shown in Tables 2 and 4, QMC-DPC achieves the best performance on the whole. On the UCI data sets, QMC-DPC (PCA), QMC-DPC (AE), and QMC-DPC (t-SNE) obtain the highest values 8, 7, and 9 times, respectively. On the unlabeled data sets, they obtain the highest values 0, 6, and 3 times, respectively. Obviously, QMC-DPC combined with nonlinear feature reduction methods achieves better performance on the whole. In terms of running time, our proposed method is clearly superior. Especially when dealing with a large-scale data set such as KDD, QMC-DPC achieves good performance, while most other baselines cannot be executed because they run out of memory. This further verifies the effectiveness of our method. From Section 4.5, we find that the feature dimension has an impact on the model performance, and the various evaluation criteria are affected differently. In general, the model performance decreases as the feature dimension increases, owing to the information loss caused by sampling. In Section 4.5, we propose two methods to mitigate this problem: generating more Quasi-Monte Carlo points and increasing the radius of the circular data unit. The purpose of both methods is to expand the sampling area. For our proposed method, we make a trade-off between running time and model performance: we generate fewer Quasi-Monte Carlo points and set fewer iterations for t-SNE and AE. These choices reduce the running time at the cost of some model performance. In addition, we increase the radius to reduce the information loss.
We summarize the following views on QMC-DPC:
(i) In general, we choose QMC-DPC combined with a nonlinear feature reduction method, such as QMC-DPC (AE) or QMC-DPC (t-SNE). When dealing with a large-scale data set, we prefer QMC-DPC (AE).
(ii) To reduce information loss, we give priority to expanding the radius; secondly, we consider adding Quasi-Monte Carlo points.
In addition, there are still directions to explore for our proposed algorithm in the future, which are summarized as follows:
(i) How to select the feature dimension is currently a heuristic step. In future work, we hope to build a multilayer autoencoder and construct the loss function based on the hidden-layer features, designing the autoencoder as a multi-task neural network.
(ii) We hope to propose a more comprehensive sampling method to reduce the loss of information. We can take each sample point itself as the center for sampling and then filter out the data samples in sparse areas. Finally, we need a strategy for the classification of outliers.
5. Conclusion
In this paper, a new density peaks clustering algorithm with high computational efficiency is proposed. The original feature space is compressed by different feature reduction methods. We then sample the reduced feature space based on the super-uniformly distributed sequence generated by the Quasi-Monte Carlo method. Our work effectively overcomes the high computational overhead of DPC while improving the model performance. Theoretically, the time complexity is reduced from O(n²) to O(m²), where m is the number of selected QMC points, n is the size of the original data set, and m ≪ n. The experimental results show that QMC-DPC improves the model performance of DPC while greatly reducing the time overhead, and this advantage grows with the data set size.
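As a back-of-the-envelope illustration of this complexity reduction, the following arithmetic compares the number of pairwise distances each approach evaluates; the sizes n and m are hypothetical and not taken from the experiments.

```python
# Pairwise-distance work for plain DPC vs. QMC-DPC (hypothetical sizes).
n, m = 100_000, 500          # n data points, m selected QMC points, m << n
dpc_ops = n * (n - 1) // 2   # O(n^2): distance pairs over the full data set
qmc_ops = m * (m - 1) // 2   # O(m^2): pairs when clustering QMC points instead
print(f"DPC: {dpc_ops:,} pairs; QMC-DPC: {qmc_ops:,} pairs "
      f"(~{dpc_ops // qmc_ops:,}x fewer)")
```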
Data Availability
The data used to support the findings of this study are available at https://archive.ics.uci.edu/ml/index.php.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the Major Research Project of National Natural Science Foundation of China (No. 91948303).