Abstract

As one of the typical clustering approaches, heuristic clustering is characterized by its flexibility in feature integration. This paper proposes a heuristic algorithm based on cognitive feature integration. The proposed algorithm employs nonparametric density estimation and maximum likelihood estimation to integrate global and local cognitive features and finally outputs satisfying clustering results. The new approach possesses great expansibility, which enables the supplement of priors and the adjustment of misclassifications during the clustering process. The advantages of the new approach are as follows: (1) it is effective in recognizing stable clustering results without priors given in advance; (2) it can be applied to complex data sets and is not restricted by the density and shape of the clusters; and (3) it is effective in noise and outlier recognition and does not need the elimination of noises and outliers in advance. Experiments on synthetic and real data sets exhibit the better performance of the new algorithm.

1. Introduction

Clustering is an automatic process of partitioning a data set with a proper similarity measurement. In this process, data in the same group have maximum similarity, while data in different groups have minimum similarity. As an unsupervised learning method, clustering is greatly influenced by the similarity measurement and is closely related to priors in application fields. Clustering is extensively applied in fields such as biology [1, 2], computer vision [3, 4], geological exploration [5–7], and information retrieval [8], because it shows excellent advantages in automatic grouping.

Clustering has developed into many types of algorithms through its successful application in these fields. According to the clustering process, it is generally classified into two types: agglomerative clustering and partitional clustering. Agglomerative clustering initially treats each sample as a separate group and then fuses clusters according to certain principles, thus forming proper data groups. Partitional clustering redistributes the existing data groups to form proper clusters, as in k-means [9] and fuzzy c-means [5–7].

According to different similarity measurements, clustering can also be subdivided into many types: local density-based clustering, clustering based on density estimation, clustering based on matrix calculation, clustering based on graph calculation, grid-based clustering, and so on.

Clustering based on local density features is a typical class of clustering algorithms. Simple calculation and high efficiency are the advantages of this type of algorithm, as in clustering with density peaks (CDP) [10] and its generalized versions [11, 12]. This type of algorithm endows every datum with a local density feature and redefines the centroid, or defines a cluster-merging mechanism, to obtain clustering results. Such algorithms are effective, but for some complicated data sets it is difficult to find centroids when the decision graph is poor.

Density-based clustering algorithms constitute a significant proportion of clustering research, such as mean shift [13] and clustering by scale space filtering (CSSF) [14]. This type of algorithm employs nonparametric density estimation to estimate the distribution of the data set. Areas of high density represent centroids, while areas of relatively low density represent boundaries. Centroids and boundaries are then found with gradient descent or level sets, and the data are automatically divided into groups. Density-based algorithms are effective in dealing with Gaussian-type data sets and are widely applied in protein sequencing and computer vision. Such algorithms depend highly on density estimation and gradient descent, so they are not effective for data sets with arbitrary density features and arbitrary shapes.

Other typical clustering algorithms are spectral clustering [15, 16] and subspace clustering [17]. Spectral clustering constructs a similarity-based Laplacian matrix and conducts dimensionality reduction through eigenvalue decomposition; the clustering result is then obtained by k-means in the low-dimensional space. Spectral clustering fully employs characteristic mapping and dimensionality reduction, which enables the algorithm to be effective in dealing with manifold-structured data sets. However, its high computational complexity and sensitivity to parameters make it difficult to apply to large-scale data sets.

There are two problems in clustering: one is the definition of similarity, including Euclidean and non-Euclidean dissimilarity; the other is the treatment of similarity. Different algorithms employ different methods for similarity treatment. The first problem is subjective because it is greatly related to priors in application fields. The second problem lies in centroid estimation or boundary description. For most clustering algorithms, clustering mistakes made in the middle of the process propagate to the final clustering result, which means that the ability of error correction needs to be improved.

This paper proposes a heuristic clustering algorithm based on cognitive feature capturing. The proposed algorithm is a centroid learning process that captures data structure features with a new type of similarity measurement to obtain better clustering results. In the feature description, three cognitive features are described: neighbourhood, density difference, and connectivity. A similarity measurement (a new kernel function) is established to capture the three mentioned features. For clustering, the paper proposes a heuristic algorithm based on centroid learning. The proposed algorithm possesses great expansibility, which enables the supplement of priors and the adjustment of misclassifications during the clustering process. The proposed algorithm, to some extent, weakens the subjective dependence on the similarity measurement and allows misclassification adjustment and prior intervention in the process of clustering.

In Section 2, we introduce a new similarity measurement model and explain its way of capturing clustering structures. Section 3 mainly establishes a heuristic clustering model based on local centroid learning. Execution of the algorithm and grouping strategy are given in Section 4. Sections 5 and 6 present experiments and conclusion.

2. Similarity Definition for Cognitive Feature

In clustering, cognitive features such as neighbourhood, density difference, and connectivity are of great significance for the identification of clusters. Samples at a closer distance tend to be recognized as belonging to one cluster. Samples with a greater density difference tend to be recognized as belonging to different clusters, while samples with little density difference can extend a cluster continuously. Capturing these features helps in the recognition of clusters and always depends on the similarity measurement.

For a data set $X=\{x_1,x_2,\dots,x_N\}$, the similarity of an arbitrary point $x_i$ with a neighbour $x_j$ can be regarded as the probability of $x_j$ belonging to the local centroid $x_i$ and is expressed as

\[
p(x_j\mid x_i)=\frac{1}{Z_i}\exp\!\left(-\frac{d^2(x_i,x_j)}{\sigma_i\sigma_j}\right)g(\rho_i,\rho_j), \tag{1}
\]

where $Z_i$ is a normalization coefficient, so as to satisfy $\sum_{x_j\in N(x_i)}p(x_j\mid x_i)=1$, and $d(x_i,x_j)$ represents the Euclidean distance between the two data points. The local scale $\sigma_i$ is a function of $x_i$ and is defined as

\[
\sigma_i=\max\{r_i,\varepsilon\}, \tag{2}
\]

where $\varepsilon>0$ is a small constant, $r_i$ is the radius of the effective nearest neighbourhood, and $\widehat{N}(x_i)$ is the effective nearest neighbourhood of $x_i$:

\[
\widehat{N}(x_i)=\left\{x_j\in N(x_i)\mid d(x_i,x_j)\le\bar{d}\right\}, \tag{3}
\]

where $N(x_i)$ is the $K$-nearest neighbourhood of $x_i$ and $\bar{d}$ is the average $K$-nearest-neighbour distance over the data set. The radius $r_i$ in Equation (2) is defined as

\[
r_i=\max_{x_j\in\widehat{N}(x_i)}d(x_i,x_j). \tag{4}
\]

The exponent in Equation (1) is a type of radial basis function. Combined with the local scale parameter, such a function can measure the similarity of neighbours with different density features. For a given $\sigma_i\sigma_j$, $x_i$ and $x_j$ have greater similarity when the Euclidean distance between them is smaller. Equation (2) gives the affecting region of centroid $x_i$. When all neighbours duplicate together, the scale $\sigma_i$ is set to the nonzero small constant $\varepsilon$ to avoid the denominator being zero. Meanwhile, the effective nearest neighbourhood in Equations (2)–(4) is able to eliminate the influence of noises and outliers: when $\widehat{N}(x_i)=\varnothing$, $\sigma_i=\varepsilon$, and $x_i$ has a weak relation with the other samples, which means $x_i$ can be recognized as a noise point or outlier.
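As a concrete illustration, the following Python sketch computes the effective nearest neighbourhood and the local scale of Equations (2)–(4); the choice of the global mean K-nearest-neighbour distance as the cut-off, and all function and variable names, are assumptions made for illustration rather than the authors' implementation.

# Sketch of Equations (2)-(4): local scale from the effective nearest neighbourhood.
# The cut-off d_bar (global mean K-NN distance) is an assumed choice.
import numpy as np

def local_scales(X, K=8, eps=1e-6):
    """Return sigma_i and the K-NN index matrix for every sample in X (N x D)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise Euclidean distances
    order = np.argsort(D, axis=1)
    nn_idx = order[:, 1:K + 1]                                  # K nearest neighbours (self excluded)
    nn_dist = np.take_along_axis(D, nn_idx, axis=1)
    d_bar = nn_dist.mean()                                      # assumed cut-off for "effective" neighbours
    sigma = np.full(len(X), eps)                                # eps keeps sigma_i nonzero (Eq. (2))
    for i in range(len(X)):
        eff = nn_dist[i][nn_dist[i] <= d_bar]                   # effective neighbourhood of x_i (Eq. (3))
        if eff.size > 0:
            sigma[i] = max(eff.max(), eps)                      # radius r_i (Eq. (4)), floored by eps
    return sigma, nn_idx

A sample with an empty effective neighbourhood keeps the small scale eps, so its similarity to every other sample is negligible, which matches the noise and outlier behaviour described above.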

In Equation (1), $g(\rho_i,\rho_j)$ can be expressed as

\[
g(\rho_i,\rho_j)=
\begin{cases}
1, & \rho_i\approx\rho_j,\\[4pt]
\dfrac{\min(\rho_i,\rho_j)}{\max(\rho_i,\rho_j)}, & \text{otherwise},
\end{cases} \tag{5}
\]

where "otherwise" represents a significant difference between the two scalars. According to significance testing in statistics, the significance level $\alpha$ is commonly set as 0.01 or 0.05. The local density $\rho_i$ can be obtained through the neighbours:

\[
\rho_i=\frac{K}{\sum_{x_j\in N(x_i)}d(x_i,x_j)}. \tag{6}
\]

Equation (5) weakens the similarity between points with a greater density difference, so as to recognize noises and boundaries, and preserves the connectivity through points with a smaller density difference.
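A minimal sketch of the full similarity of Equations (1), (5), and (6) is given below. It reuses local_scales from the previous sketch; the particular density estimator, the density-difference gate, and the parameter alpha are assumed forms consistent with the description, not the authors' exact definitions.

# Sketch of Equations (1), (5), (6): row-normalized similarity with a density-difference gate.
import numpy as np

def similarity_matrix(X, K=8, alpha=0.05, eps=1e-12):
    sigma, nn_idx = local_scales(X, K=K)                        # local scales from the previous sketch
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nn_dist = np.take_along_axis(D, nn_idx, axis=1)
    rho = K / (nn_dist.sum(axis=1) + eps)                       # local density from neighbours (assumed Eq. (6))
    ratio = np.minimum(rho[:, None], rho[None, :]) / np.maximum(rho[:, None], rho[None, :])
    gate = np.where(ratio >= alpha, 1.0, ratio)                 # weaken similarity across large density gaps (assumed Eq. (5))
    P = np.exp(-D ** 2 / (sigma[:, None] * sigma[None, :])) * gate
    mask = np.zeros_like(P, dtype=bool)                         # keep only the K nearest neighbours of each row
    np.put_along_axis(mask, nn_idx, True, axis=1)
    P = np.where(mask, P, 0.0)
    P /= P.sum(axis=1, keepdims=True) + eps                     # normalization Z_i of Eq. (1)
    return P

With alpha set to 0 the gate is always 1, so no density difference is treated as significant, which corresponds to switching off noise and boundary recognition in the experiments of Section 5.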

Similarity measurement (1) can better capture features like nearest neighbourhood, density difference, and connectivity in clustering and can be applied in heuristic clustering to obtain satisfying results.

3. Procedure of Centroid Estimation

As one of the most important operations in clustering algorithms, centroid estimation continuously searches for the centroids of a local area or of the whole data set. In the data set $X$, an arbitrary sample $x_i$ can be regarded as an initial centroid, and $p(x_j\mid x_i)$ is the probability of the nearest neighbour $x_j$ belonging to centroid $x_i$. Then, the probability distribution of sample $x_i$ is expressed as

\[
p(x_i)=\prod_{x_j\in N(x_i)}p(x_j\mid x_i). \tag{7}
\]

If the samples are independent and identically distributed, the distribution of the data set can be expressed as

\[
p(X)=\prod_{i=1}^{N}p(x_i)=\prod_{i=1}^{N}\prod_{x_j\in N(x_i)}p(x_j\mid x_i). \tag{8}
\]

The log likelihood can be expressed as

\[
\ln p(X)=\sum_{i=1}^{N}\sum_{x_j\in N(x_i)}\ln p(x_j\mid x_i). \tag{9}
\]

The partial derivative of the log likelihood with respect to $x_i$ is

\[
\frac{\partial\ln p(X)}{\partial x_i}\propto\sum_{x_j\in N(x_i)}p(x_j\mid x_i)\,(x_j-x_i). \tag{10}
\]

Although $\sigma_i$ is a function of the samples in Equation (10), it is computed once in advance and can be viewed as a constant. Let $\partial\ln p(X)/\partial x_i=0$; the estimated value of the centroid can then be expressed as

\[
\hat{x}_i=\sum_{x_j\in N(x_i)}p(x_j\mid x_i)\,x_j. \tag{11}
\]

Equation (11) is a weighted sum of the influence on the centroid from the nearest neighbours of the initial centroid $x_i$ and is used to estimate the centroid. The probability in Equation (11) can be regarded as a normalization and can be expressed as

\[
\hat{x}_i=\sum_{x_j\in N(x_i)}\frac{K(x_i,x_j)}{\sum_{x_l\in N(x_i)}K(x_i,x_l)}\,x_j, \tag{12}
\]

where

\[
K(x_i,x_j)=\exp\!\left(-\frac{d^2(x_i,x_j)}{\sigma_i\sigma_j}\right)g(\rho_i,\rho_j). \tag{13}
\]

Equation (12) describes a process of estimating local centroids, and Equation (13) is a generalized radial basis function. Equations (12) and (13) can be developed into a kind of heuristic clustering algorithm that combines non-Euclidean similarity and centroid learning.
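Under these definitions, one step of local centroid estimation (Equation (12)) is simply a weighted average of each sample's neighbours. The short sketch below reuses similarity_matrix from Section 2 and is only illustrative.

# Sketch of Equation (12): each sample moves to the weighted mean of its neighbours.
import numpy as np

def centroid_step(X, K=8):
    P = similarity_matrix(X, K=K)   # row i holds the normalized weights p(x_j | x_i)
    return P @ X                    # weighted sums: hat(x)_i = sum_j p(x_j | x_i) x_j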

4. Heuristic Clustering Based on Centroid Learning

Combining the local centroid estimation (12) with the similarity measurement (13) makes good use of the information of the nearest neighbourhood, local density, and connectivity, and effectively obtains satisfying clustering results.

The new heuristic clustering algorithm is a process of searching for local density maxima (centroids) and is expressed as

\[
x_i^{(t+1)}=\sum_{x_j\in N(x_i^{(t)})}\frac{K\!\left(x_i^{(t)},x_j^{(t)}\right)}{\sum_{x_l\in N(x_i^{(t)})}K\!\left(x_i^{(t)},x_l^{(t)}\right)}\,x_j^{(t)}, \tag{14}
\]

where $t$ is the iteration step and the initial sample is $x_i^{(0)}=x_i$. In the process of clustering, each sample is an initial centroid, and each centroid can also be a neighbour of other centroids.
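A possible realization of iteration (14), reusing the sketches above, is the fixed-point loop below; the stopping tolerance and the iteration cap are assumptions added for the sketch.

# Sketch of Equation (14): repeated local centroid estimation with adaptive weights.
import numpy as np

def heuristic_clustering(X, K=8, max_iter=200, tol=1e-5):
    Y = X.copy()                                  # every sample starts as its own centroid
    for _ in range(max_iter):
        P = similarity_matrix(Y, K=K)             # weights recomputed on the current centroids
        Y_next = P @ Y                            # x_i^(t+1) = sum_j p(x_j | x_i^(t)) x_j^(t)
        if np.linalg.norm(Y_next - Y) < tol:      # centroids have stopped moving
            return Y_next
        Y = Y_next
    return Y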

With the gradient in Equation (10), the proposed heuristic clustering algorithm can also be expressed as

\[
x_i^{(t+1)}=x_i^{(t)}+\eta\sum_{x_j\in N(x_i^{(t)})}p\!\left(x_j^{(t)}\mid x_i^{(t)}\right)\left(x_j^{(t)}-x_i^{(t)}\right), \tag{15}
\]

where $\eta$ is the step size. Different from the adaptive step size implicit in Equation (14), the step size in Equation (15) has to be set manually. However, the fusion of feature information can be controlled more easily in this form of the heuristic clustering.
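The manually stepped variant of Equation (15) differs from the sketch above only in how far each sample moves per iteration; a sketch, assuming a step size eta in (0, 1]:

# Sketch of Equation (15): gradient-style update with a manually chosen step size eta.
import numpy as np

def heuristic_clustering_gradient(X, K=8, eta=0.5, max_iter=200, tol=1e-5):
    Y = X.copy()
    for _ in range(max_iter):
        P = similarity_matrix(Y, K=K)
        Y_next = Y + eta * (P @ Y - Y)            # move a fraction eta toward the local centroid estimate
        if np.linalg.norm(Y_next - Y) < tol:
            return Y_next
        Y = Y_next
    return Y

Since each row of P sums to 1, setting eta = 1 recovers iteration (14), which is consistent with the two forms being expressions of the same method.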

Heuristic clustering algorithms (14) and (15) are essentially different expressions of the same method; both describe a process of continuous centroid estimation. The convergence of the algorithms depends on the similarity measurement matrix. The clustering process of (14) can be expressed in matrix form as

\[
X^{(t+1)}=P\,X^{(t)}, \tag{16}
\]

where $P=\left[p(x_j\mid x_i)\right]_{N\times N}$ is the row-normalized matrix based on Equation (13) and $X^{(t)}$ is the matrix whose rows are the samples at step $t$. Since the similarity matrix is positive definite and every row sum is 1, the eigenvalues of the matrix satisfy

\[
0<\lambda_N\le\cdots\le\lambda_2\le\lambda_1=1. \tag{17}
\]

Each sample can be expressed in the vector space spanned by the eigenvectors $v_1,\dots,v_N$ of $P$; writing $X^{(0)}=\sum_{k=1}^{N}v_k c_k^{\mathsf T}$ with coefficient vectors $c_k$, a specific sample after $t$ iterations can be expressed through

\[
X^{(t)}=P^{t}X^{(0)}=\sum_{k=1}^{N}\lambda_k^{t}\,v_k c_k^{\mathsf T}. \tag{18}
\]

After $t$ iterations, the distance between two arbitrary points can be expressed as

\[
d\!\left(x_i^{(t)},x_j^{(t)}\right)=\left\|\sum_{k=1}^{N}\lambda_k^{t}\left(v_{k,i}-v_{k,j}\right)c_k^{\mathsf T}\right\|, \tag{19}
\]

where $v_{k,i}$ denotes the $i$th component of $v_k$. According to Equations (16)–(19), when the similarity matrix is positive semidefinite, the terms with $\lambda_k<1$ vanish as $t$ grows and the eigenvectors with $\lambda_k=1$ are constant on each group of mutually similar samples, so that

\[
\lim_{t\to\infty}d\!\left(x_i^{(t)},x_j^{(t)}\right)=0 \tag{20}
\]

for samples belonging to the same group.
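The convergence argument in Equations (16)–(20) can be checked numerically on a toy example; the synthetic data and the use of a fixed similarity matrix are assumptions made only for this check, and similarity_matrix is the sketch from Section 2.

# Numerical check of (16)-(20): eigenvalues of the row-normalized similarity matrix lie
# near [0, 1], so repeated multiplication shrinks the within-group spread.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),      # two well-separated Gaussian groups
               rng.normal(5.0, 0.3, (20, 2))])
P = similarity_matrix(X, K=5)                       # kept fixed for this check
print(np.abs(np.linalg.eigvals(P)).max())           # largest eigenvalue magnitude, about 1
Y = X.copy()
for _ in range(100):
    Y = P @ Y                                       # Equation (16)
print(Y[:20].std(axis=0), Y[20:].std(axis=0))       # within-group spread tends toward 0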

The heuristic clustering algorithm (14) or (15), based on feature capturing, generates clusters gradually in the continuous clustering process. A cluster can be expressed as

\[
C_k=\left\{x_i\in X\;\middle|\;\lim_{t\to\infty}x_i^{(t)}=c_k\right\}, \tag{21}
\]

where $c_k$ is the common convergence point (local centroid) shared by the samples of the $k$th cluster.
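In practice, the set in Equation (21) can be formed by merging converged positions that coincide within a small tolerance; the tolerance below is an assumed implementation detail.

# Sketch of Equation (21): samples whose converged positions coincide share a cluster label.
import numpy as np

def assign_clusters(Y, tol=1e-2):
    labels = np.full(len(Y), -1, dtype=int)
    centers = []                                   # one representative position per cluster
    for i, y in enumerate(Y):
        for c, center in enumerate(centers):
            if np.linalg.norm(y - center) < tol:
                labels[i] = c
                break
        else:                                      # no existing centre close enough: open a new cluster
            centers.append(y)
            labels[i] = len(centers) - 1
    return labels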

In clustering, the number of clusters reduces constantly until only one cluster, or the needed number of clusters, is reserved. In this process, since the variation of the clusters is regular, some clusterings remain stable for a rather long period, which is called the survival period of the clustering and is expressed as

\[
T_k=\left|\left\{t\;\middle|\;\mathrm{NMI}\!\left(C^{(t)},C^{(t+1)}\right)=1,\ C^{(t)}\ \text{contains}\ k\ \text{clusters}\right\}\right|, \tag{22}
\]

where $\mathrm{NMI}$ (normalized mutual information) [18] is a clustering index that can evaluate the clustering result and $C^{(t)}$ denotes the partition at step $t$. $\mathrm{NMI}=1$ represents that the two clustering results are the same, i.e., that the clustering result remains stable over that period.
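A sketch of the stability criterion (22) is given below, using scikit-learn's normalized mutual information to detect iterations over which the partition does not change; the bookkeeping is an illustrative assumption rather than the authors' implementation.

# Sketch of Equation (22): count, for each cluster number k, how long the partition survives unchanged.
from sklearn.metrics import normalized_mutual_info_score as nmi

def survival_periods(label_history):
    """label_history: list of label arrays, one per iteration (e.g. from assign_clusters)."""
    periods = {}
    for prev, curr in zip(label_history, label_history[1:]):
        k = len(set(curr))
        if nmi(prev, curr) > 1.0 - 1e-9:           # NMI = 1: the two partitions are identical
            periods[k] = periods.get(k, 0) + 1     # extend the survival period of k clusters
    return periods                                  # the most stable k has the longest period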

5. Experiments

The heuristic clustering algorithm based on centroid learning is a process of centroid estimation, in which the centroids of local areas are searched for constantly until satisfying results are obtained. The centroid learning process makes good use of the clustering structure information of the data set and performs better in dealing with data sets of manifold structure and various density structures. In this section, the proposed algorithm is tested on synthetic and real data sets and is also applied to volume rendering to show its good performance.

For the heuristic clustering algorithm based on centroid learning, the nearest-neighbourhood parameter $K$ is a key parameter. Although the clustering result is not sensitive to $K$, it is influenced by $K$ to some extent. If $K$ is extremely small, many more clusters will be obtained because the detailed structures of the data set are emphasized, while an extremely large $K$ will form only one big cluster because details are ignored. Extensive experiments show that a moderate fixed $K$ is sufficient for medium-sized data sets, while $K$ is set as 2% of the data set scale for large data sets. The other two parameters are the significance testing parameter $\alpha$ and the small constant $\varepsilon$. The constant $\varepsilon$ plays a role similar to a termination condition and can be an arbitrary small constant, and $\alpha$ is the significance level in the significance test, which is often set as 5%.

The clustering process of a data set is shown in Figure 1, in which the data set contains 5 satisfying clusters. The number of clusters decreases during the clustering procedure, while the consideration of detailed structure information is weakened. Usually, a better clustering result is obtained when whole and local structure features are considered with a proper compromise. In the clustering process of the new approach, 5 is given as the most stable number of clusters, which is in line with the cognitive recognition of the data set. The clustering result with 6 clusters can also be viewed as a stable result, in which an outlier is regarded as a cluster.

Four synthetic data sets with various density and shape structures are employed to test the efficiency of the new algorithm. The four data sets in Figures 2(a), 2(b), 2(c), and 2(d) are named "Airplane," "Anchor," "Ring," and "Swissroll," respectively. The nearest-neighbourhood parameter $K$ is set as 8, 8, 8, and 20, respectively, and the significance parameter $\alpha$ is set as 0 because the algorithm does not need noise and outlier recognition here. Figure 2 shows the clustering results of the new algorithm. Table 1 shows the clustering results of other algorithms, including clustering by scale space filtering (CSSF) [14], chameleon [19], Spectral-Ng [15], and clustering with density peaks (CDP) [10]. CSSF is based on multiscale feature recognition. Chameleon and spectral algorithms perform better in manifold data set clustering. Although CDP is efficient on many types of data sets, it is difficult to select the cut-off parameter, as the decision graph is sensitive to this parameter. The comparison of clustering results in Table 1 shows that the new algorithm obtains more satisfying results.

Four real data sets are employed to test the new clustering algorithm: pen-digit, iris, noise iris, and noise USPS-01. Pen-digit and iris (the whole training data sets) are from the UCI machine learning repository; the pen-digit data set contains 7494 images with 16 attributes, and the iris data set contains 150 samples with 4 attributes. Noise iris is composed of the iris data set plus 5% uniform random samples. The "USPS-01" data set contains 2200 images of the '0' and '1' classes drawn from the USPS handwritten digits, and "noise USPS-01" is "USPS-01" contaminated with 2% residual USPS samples.

The clustering results are compared with those of NRSC [20], Spectral-Ng, STSC [21], and chameleon to show their robustness to noise. The comparison results measured with NMI are shown in Figure 3. The results in Figure 3 show that the five algorithms are robust to outliers to some extent. The new algorithm has more advantages than the other four algorithms in noise robustness. Chameleon performs well on the pen-digit data set. However, the overall performance of the new clustering algorithm is superior to that of the other four algorithms, as shown in Figure 3.

Clustering is applied to volume rendering to improve the separability of structures with similar attributes. Volume rendering is a technique used to display a 2D projection of a 3D discretely sampled data set, typically a 3D scalar field. A typical 3D data set is a group of 2D slice images acquired by a CT, MRI, or micro-CT scanner. Clustering can be used in the scalar field to separate similar structures, and the clustering result in the scalar field is then transformed into the visualized image.

The new clustering algorithm is employed to separate adjacent structures in two volume rendering examples. One is an engine in industry, and the other is knee joints in medical treatment. The results of volume rendering with the new clustering are shown in Figure 4. Figure 4(a) shows part of an adjacent structure (the part in the red rectangle in Figure 4(b)) in the scalar field. The scanning technique leads to outliers around adjacent parts in the scalar field, which is difficult for clustering. Figure 4(b) shows the visualization of separating similar structures, where different parts are labelled with different colours according to the clustering results in the scalar field. Figure 4(c) shows the visualization of knee joints acquired with a CT scanner, and the visualization of the separated structures is shown in Figure 4(d). The experiment on volume rendering exhibits the good performance of the new approach.

6. Conclusions

This paper proposes a heuristic clustering algorithm based on cognitive feature capturing. The new approach is effective in integrating the cognitive features of a data set and conducts clustering based on these features. The proposed approach is not restricted by the density and shape features of the data set, especially for manifold-structured data after dimensionality reduction. Heuristic clustering shows advantages in the integration of data structure and is worth further research concerning feature information feedback and the effective usage of data features in the clustering process.

Data Availability

The pen-digit and iris data sets are from the UCI machine learning repository, and the USPS handwritten digits data set is from https://cs.nyu.edu/~roweis/data.html. The other data sets used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The study presented in this article is supported by the National Science Foundation of China, Research Grants no. 61305070 and no. 61703001.