Abstract

This study proposes a novel method to calculate the density of data points based on K-nearest neighbors and Shannon entropy. A variant of tissue-like P systems with active membranes is introduced to realize the clustering process. The new variant of tissue-like P systems can improve the efficiency of the algorithm and reduce the computational complexity. Finally, experimental results on synthetic and real-world datasets show that the new method is more effective than other state-of-the-art clustering methods.

1. Introduction

Clustering is an unsupervised learning method that aims to divide a given population into several groups or classes, called clusters, in such a way that similar objects are put into the same group and dissimilar objects are put into different groups. Clustering methods generally fall into five categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods [1]. Partitioning and hierarchical methods can find spherical-shaped clusters but do not perform well on arbitrarily shaped clusters. Density-based clustering methods [2] can be used to overcome this problem, since they model clusters as dense regions and cluster boundaries as sparse regions. Three representative density-based clustering approaches are DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), and DENCLUE (DENsity-based CLUstEring).

Usually, in clustering algorithms, an objective function measuring the clustering quality is optimized by an iterative process, which can lead to low efficiency. To address this, the density peaks clustering (DPC) algorithm was proposed by Rodriguez and Laio [3] in 2014. This method obtains the clusters in a single step, regardless of the shape and dimensionality of the data space. DPC is based on the idea that cluster centers are characterized by a higher density than their surrounding regions and by a relatively large distance from points with higher densities. Scholars have studied the DPC algorithm extensively, but it still has several challenges that need to be addressed. First, the local density of the data points is affected by the cutoff distance, which in turn influences the clustering results. Second, the number of clusters needs to be decided by users, and the manual selection of the cluster centers can influence the clustering result. Cong et al. [4] proposed a clustering model for high-dimensional data based on DPC that accomplishes clustering simply and directly for data with more than six dimensions and arbitrary shapes; however, its clustering effect is not ideal when the classes differ greatly in order of magnitude. Xu et al. [5] introduced a novel approach, called the density peaks clustering algorithm based on grid (DPCG), but this method still relies on user experience for the choice of cluster centers. Bie et al. [6] proposed a fuzzy-CFSFDP method for adaptively and effectively selecting the cluster centers. Du et al. [7] proposed a new DPC algorithm using geodesic distances, and Du et al. [8] also proposed an FN-DP (fuzzy neighborhood density peaks) clustering algorithm; however, these methods cannot select cluster centers automatically, and FN-DP spends much time on the calculation of the similarity matrix. Hou and Cui [9] introduced a density normalization step to address the problem that large-density clusters may be partitioned into multiple parts while small-density clusters may be merged with other clusters. Xu et al. proposed an FDPC algorithm based on a novel merging strategy motivated by support vector machines [10], but it has higher complexity and still requires users to select the cluster centers. Liu et al. [11] proposed a shared-nearest-neighbor-based clustering by fast search and find of density peaks (SNN-DPC) algorithm. Based on the consistency assumptions made for semisupervised learning algorithms, some scholars also made consistency assumptions for density-based clustering: the first is local consistency, meaning that nearby points are likely to have similar local densities, and the second is global consistency, meaning that points in the same high-density area (the same structure, i.e., the same cluster) are likely to have the same label [12]. This method also cannot find the cluster centers automatically. Although many studies on DPC have been reported, several problems remain to be addressed.

Membrane computing, proposed by Pǎun [13] as a new branch of natural computing, abstracts computational models from the structures and functions of biological cells and from the collaboration between organs and tissues. Membrane computing mainly includes three basic computational models: the cell-like P system, the tissue-like P system, and the neural-like P system. In the computation process, each cell is treated as an independent unit, the units operate independently without interfering with one another, and the entire membrane system works in a maximally parallel manner. Over the past years, many variants of membrane systems have been proposed [14–18], including membrane algorithms for solving global optimization problems. In recent years, applications of membrane computing have attracted much attention from researchers [19–22]. There are also other applications; for example, membrane systems have been used to solve multiobjective fuzzy clustering problems [23], unsupervised learning problems [24], automatic fuzzy clustering problems [25], and fault diagnosis of power systems [26]. Liu et al. [27] proposed an improved Apriori algorithm based on an evolution-communication tissue-like P system. Liu and Xue [28] introduced a P system on simplices. Zhao et al. [29] proposed a spiking neural P system with neuron division and dissolution.

Based on previous works, the main motivation of this work is to use membrane systems to develop a framework for a density peak clustering algorithm. A new method of calculating the density of the data points is proposed based on the K-nearest neighbors and Shannon entropy. A variant of the tissue-like P system with active membranes is used to realize the clustering process. The new P system model can improve efficiency and reduce computational complexity. Experimental results show that this method is more effective and accurate than state-of-the-art methods.

The rest of this paper is organized as follows. Section 2 describes the basic DPC algorithm and the tissue-like P system. Section 3 introduces the tissue-like P system with active membranes for DPC based on the K-nearest neighbors and Shannon entropy and describes the clustering procedure. Section 4 reports experimental results on synthetic datasets and UCI datasets. Conclusions are drawn and future research directions are outlined in Section 5.

2. Preliminaries

2.1. The Original Density Peak Clustering Algorithm

Rodriguez and Laio [3] proposed the DPC algorithm in 2014. This algorithm is based on the idea that cluster centers have higher densities than their surrounding regions and that the distances among cluster centers are relatively large. It has three important parameters: the first is the local density $\rho_i$ of data point $x_i$, the second is the minimum distance $\delta_i$ between data point $x_i$ and any other data point with higher density, and the third is the product $\gamma_i = \rho_i \delta_i$ of the other two. The first two parameters correspond to the two assumptions of the DPC algorithm. One assumption is that a cluster center has a higher density than its surrounding region. The other assumption is that a cluster center has a larger distance from points in other clusters than from points in its own cluster. In the following, the computations of $\rho_i$ and $\delta_i$ are discussed in detail.

Let $X = \{x_1, x_2, \ldots, x_n\}$ be a dataset with $n$ data points. Each $x_i$ has $m$ attributes, so $x_{ij}$ is the $j$th attribute of data point $x_i$. The Euclidean distance between the data points $x_i$ and $x_j$ can be expressed as follows:

$$d_{ij} = d(x_i, x_j) = \sqrt{\sum_{k=1}^{m}\left(x_{ik} - x_{jk}\right)^2}. \tag{1}$$
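
As a concrete illustration of (1), the following short Python sketch computes the pairwise Euclidean distance matrix for a toy dataset; the function and variable names are illustrative and not taken from the paper.

import numpy as np

def distance_matrix(X):
    # Pairwise Euclidean distances d_ij for an (n, m) data array X, as in (1)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])   # toy data: 4 points in 2-D
D = distance_matrix(X)                                           # D[i, j] = d(x_i, x_j)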

The local density $\rho_i$ of data point $x_i$ is defined as

$$\rho_i = \sum_{j \neq i} \chi\left(d_{ij} - d_c\right) \tag{2}$$

with

$$\chi(x) = \begin{cases} 1, & x < 0, \\ 0, & x \geq 0, \end{cases} \tag{3}$$

where $d_c$ is the cutoff distance. In fact, $\rho_i$ is the number of data points whose distance to data point $x_i$ is smaller than $d_c$. The minimal distance $\delta_i$ between data point $x_i$ and any other data point with a higher density is given by

$$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij}. \tag{4}$$

For the point with the highest density, $\delta_i$ is conventionally set to the maximum distance from $x_i$ to any other point [3].
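
Continuing the sketch above, the cutoff-kernel density in (2)-(3) and the distance to the nearest higher-density point in (4) could be computed as follows; dc is a user-chosen cutoff distance, and the treatment of the highest-density point follows the convention of [3].

import numpy as np

def local_density_cutoff(D, dc):
    # rho_i: number of points j != i with d_ij < dc (equations (2)-(3))
    return (D < dc).sum(axis=1) - 1          # subtract 1 for the point itself (d_ii = 0)

def delta_distances(D, rho):
    # delta_i: distance to the nearest point with strictly higher density (equation (4));
    # the densest point conventionally receives the maximum distance, as in [3]
    n = len(rho)
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = D[i].max() if higher.size == 0 else D[i, higher].min()
    return delta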

After $\rho_i$ and $\delta_i$ are calculated for each data point $x_i$, a decision graph with $\delta_i$ on the vertical axis and $\rho_i$ on the horizontal axis can be plotted. This graph can be used to find the cluster centers; each remaining data point is then assigned to the same cluster as its nearest neighbor with higher density [3].
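
The decision-graph step can be sketched in the same style: points are ranked by the product of density and distance, the top few are taken as centers, and every remaining point inherits the label of its nearest already-labeled neighbor of higher density, as in the original DPC [3]. The number of centers is supplied by hand here, mirroring the manual center selection discussed above.

import numpy as np

def dpc_assign(D, rho, delta, n_centers):
    # Rank points by gamma = rho * delta and take the top n_centers as cluster centers
    gamma = rho * delta
    centers = np.argsort(-gamma)[:n_centers]
    labels = np.full(len(rho), -1)
    labels[centers] = np.arange(n_centers)
    # Visit points from highest to lowest density; an unlabeled point inherits the
    # label of its nearest already-visited (denser) neighbour, as in [3]
    order = np.argsort(-rho)
    for pos, i in enumerate(order):
        if labels[i] == -1:
            denser = order[:pos]
            if denser.size == 0:
                labels[i] = 0            # densest point was not chosen as a center (edge case)
            else:
                labels[i] = labels[denser[np.argmin(D[i, denser])]]
    return labels

With the toy data from the earlier sketch, dpc_assign(D, rho, delta, n_centers=2) returns a cluster label for every point.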

The computation of the local densities of the data points is a key factor for the effectiveness and efficiency of DPC. There are many other ways to calculate the local densities. For example, the local density of $x_i$ can be computed using the Gaussian kernel in (5) [3]:

$$\rho_i = \sum_{j \neq i} \exp\left(-\frac{d_{ij}^2}{d_c^2}\right). \tag{5}$$

The form in (5) is suitable for “small” datasets. In practice, however, it is difficult to judge whether a dataset is small or large, and when (5) is used to calculate the local density, the results can be greatly affected by the cutoff distance $d_c$.

Each component on the right side of (5) is a Gaussian function. Figure 1 visualizes a reference exponential function together with two Gaussian functions with different values of $d_c$. The blue and red curves are the Gaussian curves for the two values of $d_c$; the curve with the smaller value of $d_c$ declines more quickly than the curve with the larger value of $d_c$. Comparing the reference curve (the yellow dash-dotted curve) with the Gaussian curves shows that the Gaussian values are greater at small distances but decay faster at large distances. This means that if the value of $d_c$ has to be chosen manually for the density calculation, the calculated densities will be influenced by the selected value. This analysis shows that the parameter $d_c$ has a strong effect on the calculated results, and the density in (5) is therefore directly influenced by the cutoff distance $d_c$. To eliminate the influence of the cutoff distance and give a uniform metric for datasets of any size, Du et al. [30] proposed a K-nearest neighbor based density. The local density in Du et al. [30] is given by

$$\rho_i = \exp\left(-\frac{1}{K}\sum_{x_j \in \mathrm{KNN}(x_i)} d_{ij}^2\right), \tag{6}$$

where $K$ is an input parameter and $\mathrm{KNN}(x_i)$ is the set of the $K$ nearest neighbors of data point $x_i$. However, this method does not consider the influence of the position of a data point on its own density. Therefore, the current study proposes a novel method to calculate the densities.
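
For comparison, both alternative density estimates can be written compactly; (5) is the Gaussian kernel from [3], while the KNN-based form follows the description of [30] above, with the exact exponent being an assumption made here for illustration.

import numpy as np

def density_gaussian(D, dc):
    # Equation (5): rho_i = sum over j != i of exp(-(d_ij / dc)^2)
    K = np.exp(-(D / dc) ** 2)
    np.fill_diagonal(K, 0.0)
    return K.sum(axis=1)

def density_knn(D, k):
    # KNN-style density in the spirit of [30]: larger when the k nearest
    # neighbours are close; the exact exponent is an assumption made here
    knn_d = np.sort(D, axis=1)[:, 1:k + 1]      # drop column 0 (the point itself)
    return np.exp(-(knn_d ** 2).mean(axis=1))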

2.2. The Tissue-Like P System with Active Membrane

A tissue-like P system has a graph structure. The nodes of the graph correspond to the cells and the environment of the tissue-like P system, whereas the edges of the graph represent the channels for communication between the cells. The tissue-like P system is slightly more complicated than the cell-like P system. Each cell has its own state, and only a state that meets the requirements specified by the rules can be changed. The basic framework of the tissue-like P system used in this study is shown in Figure 2.

A P system with active membranes is a construct

$$\Pi = (O, Q, H, w_1, \ldots, w_m, E, ch, s, R, i_o),$$

where:

(1) $O$ is the alphabet of all objects that appear in the system;
(2) $Q$ represents the states;
(3) $H$ is the set of labels of the membranes;
(4) $w_1, \ldots, w_m$ are the initial multisets of objects in cells 1 to $m$;
(5) $E$ is the set of objects present in an arbitrary number of copies in the environment;
(6) $ch$ is the set of channels between cells and between cells and the environment;
(7) $s_{(i,j)}$ is the initial state of the channel $(i, j)$;
(8) $R$ is a finite set of symport/antiport and membrane rules of the following forms, with $h \in H$, $a, b, c \in O$, and $v \in O^{*}$:
(i) $[a \to v]_h$ (object evolution rules: an object is evolved into another multiset of objects inside a membrane);
(ii) $a[\ ]_h \to [b]_h$ (send-in communication rules: an object is introduced into a membrane and may be modified during the process);
(iii) $[a]_h \to [\ ]_h\, b$ (send-out communication rules: an object is sent out of the membrane and may be modified during the process);
(iv) $[a]_h \to [b]_{h_1}[c]_{h_2}$, with $h_1, h_2 \in H$ (division rules for elementary membranes: the membrane is divided into two membranes with possibly different labels; the object specified in the rule is replaced by possibly new objects in the two new membranes; and the remaining objects are duplicated in the process);
(9) $i_o$ is the output cell.
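
To make the four rule types concrete, the toy sketch below represents a membrane as a multiset of symbol objects and applies one rule at a time; real P systems apply rules nondeterministically and in a maximally parallel manner, so this is only an illustration of the rule forms, and all names are invented here.

from collections import Counter

class Membrane:
    def __init__(self, label, objects=()):
        self.label = label
        self.objects = Counter(objects)      # a multiset of symbol objects

def evolve(m, a, v):
    # (i) [a -> v]_h : one copy of a is rewritten into the multiset v inside membrane m
    if m.objects[a] > 0:
        m.objects[a] -= 1
        m.objects.update(v)

def send_in(env, m, a, b):
    # (ii) a[ ]_h -> [b]_h : an object a enters membrane m from outside, possibly renamed to b
    if env[a] > 0:
        env[a] -= 1
        m.objects[b] += 1

def send_out(m, env, a, b):
    # (iii) [a]_h -> [ ]_h b : an object a leaves membrane m, possibly renamed to b
    if m.objects[a] > 0:
        m.objects[a] -= 1
        env[b] += 1

def divide(m, a, b, c):
    # (iv) [a]_h -> [b]_h1 [c]_h2 : membrane m splits; the triggering object a is replaced
    # by b and c, and all remaining objects are duplicated in both daughter membranes
    rest = m.objects.copy()
    if rest[a] > 0:
        rest[a] -= 1
    m1, m2 = Membrane(m.label + "_1", rest), Membrane(m.label + "_2", rest)
    m1.objects[b] += 1
    m2.objects[c] += 1
    return m1, m2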

The biggest difference between a cell-like P system and a tissue-like P system is that each cell can communicate with the environment in the tissue-like P system, but only the skin membrane can communicate with the environment in the cell-like P system. This does not mean that any two cells in the tissue-like P system can communicate with each other. If there is no direct communication channel between the two cells, they can communicate through the environment indirectly.

3. The Proposed Method

3.1. Density Metric Based on the K-Nearest Neighbors and Shannon Entropy

DPC still has some defects. In particular, the current DPC algorithm has the obvious shortcoming that the value of the cutoff distance $d_c$ needs to be set manually in advance, and this value largely affects the final clustering results. In order to overcome this shortcoming, a new method is proposed to calculate the density metric based on the K-nearest neighbors and Shannon entropy.

K-nearest neighbors (KNN) is usually used to measure the local neighborhood of an instance in the fields of classification, clustering, local outlier detection, etc. The aim of this approach is to find the K nearest neighbors of a sample among the N samples, and the distances between points are generally Euclidean distances. Let $\mathrm{KNN}(x_i)$ be the set of the $K$ nearest neighbors of a point $x_i$; it can be expressed as

$$\mathrm{KNN}(x_i) = \left\{x_j \in X \mid d(x_i, x_j) \leq d\left(x_i, x_i^{(K)}\right)\right\}, \tag{7}$$

where $d(x_i, x_j)$ is the Euclidean distance between $x_i$ and $x_j$ and $x_i^{(K)}$ is the $K$th nearest neighbor of $x_i$. The local region measured by KNN is often termed the K-nearest neighborhood, which is in fact a circular or spherical area of radius $d(x_i, x_i^{(K)})$. Therefore, purely KNN-based methods cannot handle datasets whose clusters have nonspherical distributions, and they usually give poor clustering results on datasets with clusters of different shapes.
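
A brief sketch of how the neighbor sets in (7) can be obtained from the distance matrix; indices rather than the points themselves are returned, which is an implementation convenience rather than part of the definition.

import numpy as np

def knn_indices(D, k):
    # For each point i, return the indices of its k nearest neighbours (excluding i itself),
    # i.e. the set KNN(x_i) of equation (7) expressed through indices
    order = np.argsort(D, axis=1)       # row i sorted by increasing distance to x_i
    return order[:, 1:k + 1]            # column 0 is i itself (distance 0)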

Shannon entropy measures the degree of disorder (uncertainty) of a system. The more unstable the system is, the larger the value of the Shannon entropy, and vice versa. The Shannon entropy of a set of objects $U = \{u_1, u_2, \ldots, u_n\}$, represented by $H(U)$, is given by

$$H(U) = -\sum_{i=1}^{n} p_i \log p_i, \tag{8}$$

where $p_i$ is the probability of object $u_i$ appearing in $U$. When $H(U)$ is used to measure the distance between the clusters, the smaller the value of $H(U)$ is, the better the clustering result is. Therefore, the Shannon entropy is introduced into the K-nearest neighbor density calculation, so that the final density not only considers the distance metric but also accounts for the influence of the position of a data point on its own density.
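
Equation (8) can be evaluated directly from a probability vector, as in the small helper below; the example values are arbitrary.

import numpy as np

def shannon_entropy(p):
    # H = -sum_i p_i * log(p_i) over a probability vector p, as in (8); zero entries are skipped
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

print(shannon_entropy([0.5, 0.5]))    # larger: a less "stable" (more uniform) distribution
print(shannon_entropy([0.9, 0.1]))    # smaller: one outcome dominates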

The decision graph, however, is based on the product $\gamma_i = \rho_i \delta_i$, and a larger value of $\gamma_i$ makes it easier to choose the correct cluster centers. Since a smaller entropy indicates a better clustering, the reciprocal form of the Shannon entropy is adopted so that larger values again indicate better center candidates. In addition, the metrics of $\rho_i$ and $\delta_i$ may be inconsistent, which directly leads to $\rho_i$ and $\delta_i$ playing different roles in the calculation of the decision graph. Hence, it is necessary to normalize $\rho_i$ and $\delta_i$.

The specific calculation method is as follows. First, the local density $\rho_i$ of data point $x_i$ is calculated as in (10), where $x_i$ and $x_j$ are data points and $\rho_i$ is the density of data point $x_i$. Next, the density of data point $x_i$ is normalized as in (11). Finally, the density metric that uses the idea of the K-nearest neighbor method is defined as in (12).

To guarantee the consistency of the metrics of $\rho_i$ and $\delta_i$, $\delta_i$ also needs to be normalized.
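
Because (10)-(12) are not reproduced above, the sketch below only illustrates the general recipe described in this subsection under explicit assumptions: a KNN-based density is weighted by the reciprocal of a Shannon entropy term computed over each point's neighborhood, and both the density and the distance are min-max normalized so that they contribute comparably to the decision graph. The particular choice of probabilities (normalized neighbor distances) is an assumption made here for illustration and is not the paper's exact definition.

import numpy as np

def minmax(v):
    # Rescale a vector to [0, 1] so that rho and delta share a common metric
    v = np.asarray(v, dtype=float)
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)

def entropy_weighted_density(D, k):
    # Illustrative combination only: a KNN-style density divided by the Shannon entropy
    # of assumed probabilities p_ij derived from the k nearest neighbour distances
    knn_d = np.sort(D, axis=1)[:, 1:k + 1]
    base = np.exp(-(knn_d ** 2).mean(axis=1))                      # KNN-based density term
    p = knn_d / np.maximum(knn_d.sum(axis=1, keepdims=True), 1e-12)
    H = -(p * np.log(np.where(p > 0, p, 1.0))).sum(axis=1)         # Shannon entropy per point
    return base / np.maximum(H, 1e-12)                             # reciprocal-entropy weighting

# Both decision-graph quantities are then normalized before use:
# rho_n, delta_n = minmax(rho), minmax(delta); gamma = rho_n * delta_n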

3.2. Tissue-Like P System with Active Membranes for Improved Density Peak Clustering

In the following, a tissue-like P system with active membranes for density peak clustering, called KST-DPC, is proposed. As mentioned before, the dataset with $n$ data points is represented by $X = \{x_1, x_2, \ldots, x_n\}$. Before any specific calculation of the DPC algorithm is performed, the Euclidean distance between each pair of data points in the dataset is calculated and the result is stored in the form of a matrix. The initial configuration of this P system is shown in Figure 3.

When the system is initialized, the objects representing data point $x_i$ are placed in membrane $i$ for $1 \leq i \leq n$, and membrane $n+1$ contains only $\lambda$, where $\lambda$ means there is no object. First, the Euclidean distance between data points $x_i$ and $x_j$ (represented by $d_{ij}$ for $1 \leq i, j \leq n$) is calculated with rule 1. Note that the matrix is symmetric, with $d_{ij} = d_{ji}$ and $d_{ii} = 0$. The results are stored as the distance matrix, also called the dissimilarity matrix,

$$D = \begin{pmatrix} d_{11} & d_{12} & \cdots & d_{1n} \\ d_{21} & d_{22} & \cdots & d_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & d_{nn} \end{pmatrix}.$$

At the beginning, there are $n+1$ membranes in the P system. After the distances are calculated, the distance objects $d_{i1}, d_{i2}, \ldots, d_{in}$ are placed in membrane $i$ for $1 \leq i \leq n$. In the next step, the densities of the data points are calculated by rule 2. Then the send-in and send-out communication rules are used to calculate the values of $\rho_i$, $\delta_i$, and $\gamma_i$ and to put them in membrane $i$ for $1 \leq i \leq n$. Next, according to the sorted values of $\gamma_i$ for $1 \leq i \leq n$, the number of clusters can be determined. The division rule of the active membranes is used to split membrane $n+1$ into $K$ membranes, as shown in Figure 4. The cluster centers are put in membranes $n+1$ to $n+K$, respectively. Finally, each remaining data point is put into the membrane whose cluster center is closest to it. Up to this point, the clusters are obtained.

The main steps of KST-DPC are summarized in Algorithm 1.

Inputs: dataset X, parameter K
Output: Clusters
Step 1: The objects representing data point $x_i$ are in membrane $i$ for $1 \leq i \leq n$,
and object $\lambda$ is in membrane $n+1$;
Step 2: Compute the Euclidean distance matrix $D$ by rule 1;
Step 3: Compute the local densities of the data points by rule 2 and
normalize them using (10) and (11);
Step 4: Calculate $\rho_i$ and $\delta_i$ for each data point $x_i$ using (12) and (4) in every
membrane $i$, respectively;
Step 5: Calculate $\gamma_i = \rho_i \delta_i$ for all $x_i$ in membrane $n+1$, sort the values in
descending order, and select the top K values as the initial cluster centers, so as to
determine the centers of the clusters;
Step 6: Split membrane $n+1$ into K membranes by the division rules; the new membranes
are numbered from $n+1$ to $n+K$;
Step 7: Put the cluster centers in membranes $n+1$ to $n+K$, respectively;
Step 8: Assign each remaining point to the membrane with the nearest cluster center;
Step 9: Return the clustering result.
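
The membrane-level steps of Algorithm 1 run in parallel; the sequential Python sketch below only mirrors what the system computes in steps 2-8, reusing the helper functions from the earlier sketches (distance_matrix, minmax, entropy_weighted_density, and delta_distances) and, as the algorithm states, treating the K points with the largest decision values as cluster centers. It is an illustration of the data flow, not an implementation of the P system itself.

import numpy as np

# distance_matrix, minmax, entropy_weighted_density, and delta_distances
# are as defined in the sketches of Sections 2.1 and 3.1.

def kst_dpc_sketch(X, K):
    D = distance_matrix(X)                                 # Step 2: dissimilarity matrix
    rho = minmax(entropy_weighted_density(D, K))           # Steps 3-4: density (assumed form)
    delta = minmax(delta_distances(D, rho))                # Step 4: delta as in (4)
    gamma = rho * delta                                    # Step 5: decision values
    centers = np.argsort(-gamma)[:K]                       # top-K points become cluster centers
    labels = np.argmin(D[:, centers], axis=1)              # Step 8: nearest-center assignment
    return labels, centers
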
3.3. Time Complexity Analysis of KST-DPC

As usual, computations in the cells of the tissue-like P system can be carried out in parallel. Because of this parallel implementation, the generation of the dissimilarity matrix, the computation of the data point densities, the calculation of the final densities, the calculation of $\delta_i$ and $\gamma_i$, and the sorting of the $\gamma_i$ values each require only a small number of computation steps (constant or at most linear in $n$), and the final cluster assignment needs one more computation step. In contrast, the time complexity of DPC-KNN is $O(n^2)$. Compared with DPC-KNN, KST-DPC reduces the time complexity by trading time complexity for space complexity, since the distance objects are held and processed in $n$ membranes simultaneously. The above analysis indicates that the overall time complexity of KST-DPC is lower than that of DPC-KNN.

4. Test and Analysis

4.1. Data Sources

Experiments on six synthetic datasets and four real-world datasets are carried out to test the performance of KST-DPC. The synthetic datasets are from http://cs.uef.fi/sipu/datasets/. These datasets are commonly used as benchmarks to test the performance of clustering algorithms. The real-world datasets used in the experiments are from the UCI Machine Learning Repository [31]. These datasets are chosen to test the ability of KST-DPC in identifying clusters having arbitrary shapes without being affected by noise, size, or dimensions of the datasets. The numbers of features (dimensions), data points (instances), and clusters vary in each of the datasets. The details of the synthetic and real-world datasets are listed in Tables 1 and 2, respectively.

The performance of KST-DPC was compared with those of the well-known clustering algorithms SC [32], DBSCAN [33], and DPC-KNN [28, 34]. The codes for SC and DBSCAN are provided by their authors. The DPC code is based on the original code provided by Rodriguez and Laio [3] and is optimized by using matrix operations instead of iterative loops to reduce the running time.

The performances of the above clustering algorithms are measured by clustering quality, namely Accuracy (Acc) and Normalized Mutual Information (NMI). These are very popular measures for testing the performance of clustering algorithms; the larger the values, the better the results, and the upper bound of both measures is 1.
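
Acc is commonly computed after finding the best one-to-one mapping between predicted and true labels (for example with the Hungarian algorithm), and NMI is available in scikit-learn. A minimal sketch, assuming scipy and scikit-learn are installed and that labels are integers starting from 0:

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    # Best-map accuracy: find the label permutation that maximizes agreement
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1            # labels assumed to be 0..k-1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    rows, cols = linear_sum_assignment(-cost)               # maximize matched counts
    return cost[rows, cols].sum() / len(y_true)

# acc = clustering_accuracy(y_true, labels)
# nmi = normalized_mutual_info_score(y_true, labels)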

4.2. Experimental Results on the Synthetic Datasets

In this subsection, the performances of KST-DPC, DPC-KNN, DBSCAN, and SC are reported on the six synthetic datasets. The clustering results of the four algorithms on the six synthetic datasets are color coded and displayed in two-dimensional spaces, as shown in Figures 5–10. The results of the four clustering algorithms on each dataset are shown as four parts in a single figure. The cluster centers found by the KST-DPC and DPC-KNN algorithms are marked in the figures with different colors. For DBSCAN, it is not meaningful to mark the cluster centers because they are chosen randomly. Each clustering algorithm was run multiple times on each dataset, and the best result of each algorithm is displayed.

The performance measures of the four clustering algorithms on the six synthetic datasets are reported in Table 3. In Table 3, the column “Par” for each algorithm indicates the parameters the users need to set. KST-DPC and DPC-KNN have only one parameter, K, the number of nearest neighbors to be prespecified. In this paper, the value of K is determined as a percentage of the number of data points, following the method in [34]. For each dataset, the percentage of data points used in the KNN computation was adjusted multiple times to find the value giving the best final clustering; because many experiments were performed, only the best results are listed in Tables 3 and 4. To be consistent with the other parameters in the table, the percentages are converted directly into specific K values. DBSCAN has two input parameters, the neighborhood radius Eps and the minimum number of points MinPts. The SC algorithm needs the true number of clusters. C1 in Table 3 refers to the number of cluster centers found by the algorithms. The performance measures, including Acc and NMI, are presented in Table 3 for the four clustering algorithms on the six synthetic datasets.
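
The conversion from a neighborhood percentage to a concrete K mentioned above is a simple rounding step; the percentage used below is only an example, not a value reported in the paper.

def k_from_percentage(n_points, pct):
    # Convert a neighbourhood size given as a fraction of the dataset into an integer K
    return max(1, round(n_points * pct))

print(k_from_percentage(312, 0.02))    # e.g. 2% of the 312-point Spiral dataset gives K = 6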

The Spiral dataset has 3 clusters with 312 data points embracing each other. Table 3 and Figure 5 show that KST-DPC, DPC-KNN, DBSCAN, and SC can all find the correct number of clusters and obtain the correct clustering results. All the benchmark values are 1.00, reflecting that all four algorithms perform perfectly on the Spiral dataset.

The Compound dataset has 6 clusters with 399 data points. From Table 3 and Figure 6, it is obvious that KST-DPC can find the ideal clustering result, DBSCAN cannot find the right clusters, and DPC-KNN and SC cannot find the cluster centers. Because DPC has a special assignment strategy [3], it may assign data points erroneously to clusters once a data point with a higher density is assigned to an incorrect cluster. For this reason, some data points belonging to cluster 1 are incorrectly assigned to cluster 2 or 3, as shown in Figures 6(b)–6(d). DBSCAN has some prespecified parameters that can heavily affect the clustering results; as shown in Figure 6(c), two clusters are merged into one cluster on two occasions. KST-DPC obtained Acc and NMI values higher than those obtained by the other algorithms.

The Jain dataset has two clusters with 373 data points in a 2-dimensional space. The experimental results of the 4 algorithms are shown in Table 3 and the clustering results are displayed in Figure 7. KST-DPC, DBSCAN, and SC obtain correct results, and both of their benchmark values are 1.00, whereas DPC-KNN assigns some points that should belong to the bottom cluster to the upper cluster. Although all four clustering algorithms can find the correct number of clusters, KST-DPC, DBSCAN, and SC are more effective because they put all the data points into the correct clusters.

The Aggregation dataset has 7 clusters with different sizes and shapes, with two pairs of clusters connected to each other. Figure 8 shows that both the KST-DPC and DPC-KNN algorithms can effectively find the cluster centers and the correct clusters, except that one individual data point is put into an incorrect cluster by DPC-KNN. Table 3 shows that the benchmark values of KST-DPC are all 1.00 and those of DPC-KNN are close to 1.00. SC can also recognize all clusters, but its values of Acc and NMI are lower than those of DPC-KNN. DBSCAN did not find all clusters and could not separate the clusters connected to each other.

The R15 dataset has 15 clusters containing 600 data points. The clusters are slightly overlapping and are distributed randomly in a 2-dimensional space. One cluster lies in the center of the space and is closely surrounded by seven other clusters. The experimental results of the 4 algorithms are shown in Table 3 and the clustering results are displayed in Figure 9. KST-DPC and DPC-KNN can both find the correct cluster centers and assign almost all data points to their corresponding clusters. SC also obtained a good experimental result, but DBSCAN did not find all clusters.

The D31 dataset has 31 clusters and contains 3100 data points. These clusters are slightly overlapping and are distributed randomly in a 2-dimensional space. The experimental results of the 4 algorithms are shown in Table 3 and the clustering results are displayed in Figure 10. The values of Acc and NMI obtained by KST-DPC are all 1.00, which shows that KST-DPC obtained perfect clustering results on the D31 dataset. DPC-KNN and SC obtained results similar to those of KST-DPC on this dataset, but DBSCAN was not able to find all clusters.

4.3. Experimental Results on the Real-World Datasets

This subsection reports the performances of the clustering algorithms on the four real-world datasets. The varying sizes and dimensions of these datasets are useful in testing the performance of the algorithms under different conditions.

The number of clusters, Acc, and NMI are also used to measure the performances of the clustering algorithms on these real-world datasets. The experimental results are reported in Table 4, with the best results for each dataset shown in italics. The symbol “--” indicates that there is no value for that entry.

The Vertebral dataset consists of 2 clusters and 310 data points. As Table 4 shows, the value of Acc obtained by KST-DPC is equal to that obtained by DPC-KNN, but the value of NMI obtained by KST-DPC is lower than that obtained by DPC-KNN. No values of Acc and NMI were obtained by SC. All algorithms could find the right number of clusters.

The Seeds dataset consists of 210 data points and 3 clusters. Results in Table 4 show that KST-DPC obtained the best values of Acc and NMI, whereas DBSCAN obtained the worst. All four clustering algorithms could find the right number of clusters.

The Breast Cancer dataset consists of 699 data points and 2 clusters. The results on this dataset in Table 4 show that all four clustering algorithms could find the right number of clusters. KST-DPC obtained Acc and NMI values of 0.8624 and 0.4106, respectively, which are higher than those obtained by the other clustering algorithms. The results also show that DBSCAN has the worst performance on this dataset, while SC did not produce results on these benchmarks.

The Banknotes dataset consists of 1372 data points and 2 clusters. From Table 4, it is obvious that KST-DPC got the best values of Acc and NMI among all four clustering algorithms. The values of Acc and NMI obtained by KST-DPC are 0.8434 and 0.7260, respectively. Larger values of these benchmarks indicate that the experimental results obtained by KST-DPC are closer to the true results than those obtained by the other clustering algorithms.

All these experimental results show that KST-DPC outperforms the other clustering algorithms: it obtained larger values of Acc and NMI than the other clustering algorithms.

5. Conclusion

This study proposed a density peak clustering algorithm based on the K-nearest neighbors, Shannon entropy, and tissue-like P systems. It uses the K-nearest neighbors and Shannon entropy to calculate the density metric, which overcomes DPC's shortcoming of having to set the value of the cutoff distance $d_c$ in advance. The tissue-like P system is used to realize the clustering process. The analysis demonstrates that the overall time taken by KST-DPC is shorter than that taken by DPC-KNN and the traditional DPC. Synthetic and real-world datasets were used to verify the performance of the KST-DPC algorithm. Experimental results show that the new algorithm can obtain ideal clustering results on most of the datasets and outperforms the three other clustering algorithms referenced in this study.

However, the parameter K in the K-nearest neighbors method is prespecified, and currently there is no technique available to set this value automatically. Choosing a suitable value for K is a future research direction. Moreover, other methods can be used to calculate the densities of the data points, and optimization techniques can also be employed to further improve the effectiveness of DPC.

Data Availability

The synthetic datasets are available at http://cs.uef.fi/sipu/datasets/ and the real-world datasets are available at http://archive.ics.uci.edu/ml/index.php.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (nos. 61876101, 61802234, and 61806114), the Social Science Fund Project of Shandong Province (nos. 16BGLJ06 and 11CGLJ22), the China Postdoctoral Science Foundation (nos. 2017M612339 and 2018M642695), the Natural Science Foundation of Shandong Province (no. ZR2019QF007), the China Postdoctoral Special Funding Project (no. 2019T120607), and the Youth Fund for Humanities and Social Sciences of the Ministry of Education (no. 19YJCZH244).