Abstract

The structure of a data set, especially its density features, is of critical importance in identifying clusters. In this paper, we present a clustering algorithm based on density consistency, which uses a filtering process to identify data points with the same structure features and classify them into the same cluster. The method is not restricted by cluster shape or data dimensionality, and it is robust to noise and outliers. Extensive experiments on synthetic and real-world data sets validate the proposed clustering algorithm.

1. Introduction

As an important stage of the data mining process, data clustering divides data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects in other groups. Representing data by a few clusters necessarily loses fine detail but achieves a compact representation. Clustering is a basis of pattern recognition and knowledge discovery, and a vast literature on clustering techniques has appeared since the last century. These techniques are widely applied in image segmentation [1], information retrieval [2], text mining [3], image quantization [4], and so on.

Clustering algorithms can be broadly divided into two categories: partitional approaches [5, 6] and hierarchical approaches [7–9]. The partitional approach generally obtains a partition of the data set by minimizing a cost function, which is commonly designed to prefer clusters with maximal intra-cluster similarity and minimal inter-cluster similarity. Existing algorithms of this kind include FORGY [10], ISODATA [11], WISH [5, 6], and so forth.

Hierarchical clustering algorithms are among the most commonly used clustering techniques and are especially useful in biology and medicine. The basic idea of such algorithms is to successively group a data set into clusters in a hierarchical fashion. Hierarchical clustering algorithms are able to produce multilevel clusters, which is necessary for applications in biology and other domains. Examples of hierarchical clustering algorithms are CURE [7], ROCK [8], and Chameleon [9].

Density-based algorithms constitute another classical family of clustering algorithms. One type assumes a distribution for the whole data set, while the other relies on local density features to mine the structure of the data set.

The first type assumes a distribution for the whole data set: different distribution functions are supposed on regions of different density, commonly via nonparametric density estimation. Using a kernel function to estimate the distribution, the clusters of a data set are obtained by repeatedly moving the data toward the local maxima of the estimated density. Density-based clustering algorithms of this kind include the well-known Mean Shift [12–15], CSSF [16], and so on. Their computation is simple and the number of clusters need not be known in advance. However, they depend strongly on the quality of the density estimation. Without any prior knowledge of the structure of the data set, it is difficult to provide a density estimate that fits the data well, so the clustering results may be unsatisfactory for real data sets or images.

The second type describes the structure of the data set by mining local density features; clustering then agglomerates data points with the same density feature. For a data set, the density difference of each data point can be measured with its $k$-NN neighborhood: a larger $k$-NN neighborhood radius indicates that the local density is low, that is, the data points in that region are sparse. Data points classified into the same cluster should have similar local density values, or similar $k$-NN neighborhood radii. The work in [17] and DBScan [18] are of this kind. Methods of this type need not assume a distribution for the whole data set and are not restricted by the shape of the data set. However, they need to compare density differences in order to agglomerate the data points.

The present research proposes a new clustering algorithm based on density consistency, which clusters a data set through a filtering process. By describing the local structure features of the data set, data points with the same features can be classified into the same cluster. The new algorithm is not restricted by cluster shape or data dimensionality, and it is robust to noise and outliers.

The remainder of this paper is organized as follows. In Section 2 we introduce some useful structure features of data sets, which are important for clustering problems. Section 3 introduces the clustering process, including the modeling of filters, the feature extraction process, and the feature integration process, followed by a top-down process. The experiments and conclusion are given in Sections 4 and 5.

2. Density Feature of Data Sets for Clustering

Structures of data sets and clustering are closely related. Clusters are commonly considered to be made of discrete data points organized according to certain structure features, which include both local and large-scale structure features. Local structure is mainly described by the spatial relationship between nearby data points, also called the neighborhood relationship. This local spatial relationship is the most primitive and elementary cue for clustering. As shown in Figure 1(a), there are many tiny clusters that may be extracted at the local scale even though the data set follows a uniform distribution and has only one cluster at the large scale.

The large-scale structure features mainly include density, connectedness, and direction, which we explain below.

Density Feature
Density characterizes the distribution of a data set and can be measured by the number of data points in a certain volume of the pattern space. According to the proximity and similarity laws in psychology, a region of a data set with higher density is prone to be recognized as a cluster by human eyes. See Figure 1(b) for an example, where two regions of data with different densities are very naturally perceived as two different clusters, despite the fact that they are very close to each other. A uniformly distributed data set is naturally perceived as featureless because no visual difference is observed (see Figure 1(a)).

Connectedness Feature
Neighboring regions with local structure features may connect to each other and form curve or manifold structures, as in Figure 1(c). By the Gestalt continuity law, connected data are easily perceived as the same cluster, whereas disconnected data are prone to be perceived as different clusters.

Direction Feature
If local regions connect to each other and share the same principal direction, that principal direction is called a direction feature of the data set; see Figure 1(d) for an example, which has two different direction features.
The local and large-scale structure features are the most typical features for valid clustering of a data set. Local structure features form large-scale structure features through a bottom-up process, while the large-scale features provide feedback information through a top-down process. In the next section, we introduce a clustering process that extracts and integrates these features into clusters.

3. Clustering Process

Since the data set to be clustered has many native structure features, in this section we propose feature extraction and feature integration methods to obtain valid clustering results.

Let $X$ be a pattern matrix and $x$ be a pattern. The data set to be clustered can be viewed as a kind of imaginary image with its own native structure features, and the clustering problem is then viewed as a cognition problem. To extract the structure features, we employ filtering methods from the theory of vision research. Given a stimulus, the response of a neuron in the primary visual system can be measured as

$F = g_\theta * I$,  (3.1)

where $*$ is the convolution operation, $I$ is the stimulus (an image or imaginary image), $g_\theta$ is a series of filters (or kernel functions), and $\theta$ is a set of parameters. In (3.1), $F$ is the response of the neuron to the stimulus $I$; this filtering response is called a feature of $I$.

Data sets may have many native structure features, which need to be extracted based on (3.1) with different filters (or kernels). These filters should be determined automatically to deal with different structure features of different data sets. Next we introduce how to construct filters automatically based on these salient structural features.

3.1. Modeling of Filters to Extract Local Structure Features

As in image processing, each data point corresponds to the center of a filter, and for the clustering problem we model a new type of low-pass filter

$w_{ij} = \exp\big(-(x_i - x_j)^T \Sigma_i^{-1} (x_i - x_j)\big)$, for $x_j \in N(x_i)$,  (3.2)

where $N(x_i)$ is the $k$-NN of $x_i$ obtained with a new special method and $x_j$ is the $j$th neighbor of data point $x_i$. The matrix $\Sigma_i$ is a covariance matrix that controls the similarity measurement according to the neighborhood relationship and the structure of the data set.

Determining this special neighborhood for each data point is very important for the new clustering algorithm; the aim is to depict the neighborhood according to density differences. Below we introduce how to determine the neighbor set $N(x_i)$ for each data point and how to scale the neighborhood according to the matrix $\Sigma_i$.

3.1.1. Determination of Neighborhood

Unlike the classical $k$-NN strategy, we use a new method to find the nearest neighborhood of each data point. First, we initialize $k$ nearest neighborhood data points [19], in which each newly added data point is the nearest to the already collected set; then we select the proper neighbors among these initial $k$ data points.

Definition 3.1 (acceptable neighborhood radius). Each data point $x_i$ has $k$ initial neighbors, and $d_j$ is the distance at which the $j$th new nearest neighbor is added while the set already contains $x_i$ and the previously added points. The acceptable neighborhood radius $r_i$ of the data point $x_i$ is then defined from these adding distances as in (3.3).
Take Figure 2(a) as an example, in which we find the 4 nearest data points of a point $x$. First, the data point nearest to $x$ is found, at distance $d_1$; then $x$ and this neighbor are viewed as a whole, and the second data point added is the one nearest to this set (which may differ from the second nearest neighbor of $x$ itself), at distance $d_2$. Proceeding in this way, the adding distances $d_1, \ldots, d_4$ determine the acceptable neighborhood radius of $x$.

With this definition, each data point corresponds to a neighborhood radius. In order to select the true neighbors of a data point, the neighbors should have features similar to those of the data point. The following consistency criterion is therefore built up.

Definition 3.2 (internal consistency criterion (ICC)). If data point $x_j$ is a neighbor of data point $x_i$ according to the Euclidean distance, they must satisfy the two conditions in (3.4), where $d_j$ is the $j$th adding distance in (3.3) and $\alpha$ is a parameter.

The aim of the ICC is to filter out data points with a different density. With this criterion, the neighbors of each data point are determined by the structure of the data set, and the neighbor-selection mechanism itself explores that structure. In Figure 2(b), a data point has several neighbors according to the classical $k$-NN, but none of them satisfy the two criteria in (3.4); hence the data point has no neighbor and is an outlier. The two criteria can also detect the density structure shown in Figure 2(c): among the 6 initial nearest neighbors of a data point, the first three satisfy the criteria and are accepted as neighbors, while the remaining three do not and are rejected, so the number of accepted nearest neighbors is 3. Note that the number of accepted nearest neighbors may be smaller than the initial number $k$, and that the ICC-based selection does not depend on the order in which the nearest neighbors are considered.
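
To make the neighborhood construction concrete, the following is a minimal Python sketch of the sequential neighbor search of Definition 3.1 together with a simple internal consistency check. The specific acceptance rule used here (an adding distance may not exceed $\alpha$ times the mean of the previously accepted adding distances) is an assumption standing in for the criteria of (3.4), which are not reproduced above.

```python
import numpy as np

def icc_neighbors(X, i, k=6, alpha=2.0):
    """Sketch of neighbor selection for point X[i].

    Step 1: grow a candidate set of k points, each new candidate being the
    point closest to the already collected set (Definition 3.1), and record
    its "adding distance".
    Step 2: keep only candidates whose adding distance is consistent with
    the earlier ones (a stand-in for the ICC of Definition 3.2).
    """
    n = X.shape[0]
    collected = [i]                       # start from the point itself
    candidates, add_dists = [], []
    remaining = set(range(n)) - {i}
    for _ in range(min(k, n - 1)):
        # distance from every remaining point to the collected set
        best_j, best_d = None, np.inf
        for j in remaining:
            d = min(np.linalg.norm(X[j] - X[m]) for m in collected)
            if d < best_d:
                best_j, best_d = j, d
        collected.append(best_j)
        remaining.remove(best_j)
        candidates.append(best_j)
        add_dists.append(best_d)

    # Hypothetical consistency filter: reject candidates whose adding
    # distance is far larger than the running mean of accepted distances.
    neighbors, kept = [], []
    for j, d in zip(candidates, add_dists):
        if not kept or d <= alpha * np.mean(kept):
            neighbors.append(j)
            kept.append(d)
    return neighbors, add_dists
```

Under the paper's actual criteria (3.4), even the first candidate may be rejected, which is how an isolated point ends up with no accepted neighbors and is treated as an outlier, as in Figure 2(b).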

3.1.2. Local Scale and Direction of Neighborhoods

Combined with its neighbors, each data point has a neighborhood, which is important to depict the interaction between the point and each of its neighbors. The matrix $\Sigma_i$ in (3.2) not only provides the scale restriction but also endows each data point with a direction feature. The matrix can be made concrete by employing the covariance matrix

$C_i = \frac{1}{m_i}\sum_{x_j \in N(x_i)} (x_j - \bar{x}_i)(x_j - \bar{x}_i)^T$,  (3.5)

where $m_i$ is the number of neighbors in the neighborhood and $\bar{x}_i$ is their mean. The local scale matrix $\Sigma_i$ in (3.2) is then obtained from $C_i$ as in (3.6). Whenever $C_i$ is irreversible, we add a small positive constant to each of its diagonal elements.

With the local scale matrix in (3.2), the neighborhood of each data point has a local scale that restricts the similarity (or relationship) between each pair of data points to a proper scope. Meanwhile, the eigenvector of $C_i$ with the largest eigenvalue is the local direction at the point, which provides an additional agglomeration criterion and feedback information for refining clusters.
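
As an illustration of how the local scale matrix and the filter of (3.2) might be assembled, the sketch below computes a neighborhood covariance as in (3.5), regularizes it when it is singular, and evaluates a Mahalanobis-type Gaussian weight between a point and one of its neighbors. The exact functional form of the filter and the identification of $\Sigma_i$ with the raw covariance are assumptions made only for illustration.

```python
import numpy as np

def local_scale_matrix(X, i, neighbors, eps=1e-6):
    """Neighborhood covariance used as the local scale matrix (cf. (3.5), (3.6)).

    `neighbors` is the index list returned by the neighbor-selection step.
    A small constant is added to the diagonal when the matrix is singular,
    as described in the text.
    """
    pts = X[neighbors]
    mean = pts.mean(axis=0)
    C = (pts - mean).T @ (pts - mean) / max(len(neighbors), 1)
    if np.linalg.matrix_rank(C) < C.shape[0]:
        C = C + eps * np.eye(C.shape[0])
    return C

def filter_weight(x_i, x_j, C):
    """Hypothetical low-pass filter response between x_i and a neighbor x_j:
    a Gaussian of the Mahalanobis distance under the local scale matrix C."""
    diff = x_i - x_j
    return float(np.exp(-diff @ np.linalg.solve(C, diff)))

def local_direction(C):
    """Eigenvector of C with the largest eigenvalue: the local direction feature."""
    vals, vecs = np.linalg.eigh(C)
    return vecs[:, np.argmax(vals)]
```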

3.2. Agglomeration Process of Clusters

In this subsection, we introduce the agglomeration process of clusters, in which different clustering features are extracted, from local features into large-scale features and then into clusters. The main process contains three steps: feature extraction with local similarity, feature cognition with large-scale structure features, and re-identification with a top-down process. In the feature extraction procedure, the data set is successively filtered (convolved) with multiscale filters, yielding various feature representations of the data set. In the feature cognition procedure, the clustering result is obtained by integrating various large-scale features. In the top-down process, the clusters are rechecked to form satisfactory ones, handling noisy data points, manifold clusters, clusters with a principal direction, and so on.

3.2.1. Extraction of Local Scale Features with Filtering Process

Local scale features embody the local structures of a data set and are based on local similarity. Neighboring data points with close relationships form local structure features, such as the density, shape, and principal direction of the neighborhood. In order to extract these local scale features automatically, we use a series of self-tuning filters to agglomerate local neighborhood data points.

Let $F^{(t)}$ be the feature of the data set extracted at the $t$th layer, with $F^{(0)}$ simply corresponding to the data matrix $X$. Then $F^{(t+1)}$ can be expressed as

$F^{(t+1)} = W F^{(t)}$,  (3.7)

where $W$ is a data-driven filtering matrix whose elements are obtained from the filter responses of (3.2) as in (3.8). The filtering matrix $W$ is a sparse matrix, in which each element $W_{ij}$ represents the relationship between $x_i$ and $x_j$; if they are not neighbors, the element is zero.

The representation (3.7) defines a scale space $\{F^{(t)}\}$, which we call the discrete scale space (DSS) of the data set deduced from its features. For definiteness, we assume that the feature extraction procedure is accomplished within $T$ layers of processing. Furthermore, it can be proved that $F^{(t)}$ defined by (3.7) is finally stabilized as $t$ increases. So we can also assume that $T$ is so large that $F^{(t)}$ reaches a stationary state when $t = T$, at which point data points with strong relationships have shrunk to the same destination and form a cluster.
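
A compact sketch of the filtering process of (3.7) is given below. It assembles a sparse, row-normalized filtering matrix from the pairwise weights (row normalization is an assumption; the paper's (3.8) is not reproduced above) and iterates it until the features stop changing, at which point points that have shrunk to the same destination are grouped into elementary clusters.

```python
import numpy as np
from scipy.sparse import csr_matrix

def filtering_process(X, neighbor_lists, weight_fn, max_iter=100, tol=1e-5):
    """Iterate F <- W F (cf. (3.7)) with a sparse, data-driven filter matrix W.

    neighbor_lists[i] is the accepted neighbor index list of point i;
    weight_fn(i, j) returns the filter response between points i and j.
    """
    n = X.shape[0]
    rows, cols, vals = [], [], []
    for i in range(n):
        nbrs = list(neighbor_lists[i]) + [i]        # include the point itself
        w = np.array([weight_fn(i, j) if j != i else 1.0 for j in nbrs])
        w = w / w.sum()                              # assumed row normalization
        rows += [i] * len(nbrs)
        cols += nbrs
        vals += w.tolist()
    W = csr_matrix((vals, (rows, cols)), shape=(n, n))

    F = X.copy()
    for _ in range(max_iter):
        F_next = W @ F
        if np.linalg.norm(F_next - F) < tol:
            break
        F = F_next

    # Points that converged to (nearly) the same destination form one cluster.
    labels, centers = -np.ones(n, dtype=int), []
    for i in range(n):
        for c, ctr in enumerate(centers):
            if np.linalg.norm(F[i] - ctr) < 10 * tol:
                labels[i] = c
                break
        else:
            centers.append(F[i])
            labels[i] = len(centers) - 1
    return labels, F
```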

3.2.2. Integration of Large Scale Feature Clusters

At local scales, data points with close relationships are grouped together to form local structure features, which are rudiments of clusters. These rudimentary clusters can then be grouped into final clusters according to certain aggregative properties. Next we introduce the integration of large-scale features.

Two similar regions (clusters) merge together, with the precondition that the two clusters are close and that their average distances are almost equal. The distance between two sets is denoted by $D(\cdot,\cdot)$. For the distance between two clusters we use a modified set distance: the $m$ smallest pairwise distances between the two clusters are selected and their average is taken as the distance between the clusters. In this paper, $m$ is set to 3 to make the algorithm more robust. The average distance of a cluster is then defined as follows.

Definition 3.3 (average distance of a cluster). Let $C_i$ be the $i$th cluster obtained with the filtering process and let $x_j \in C_i$ with neighborhood $N(x_j)$; then the average distance of cluster $C_i$ is defined as in (3.9).
As discussed above, two clusters $C_i$ and $C_j$ merge together when the two criteria in (3.10) are satisfied: the two clusters are close, and their average distances are almost equal.
With the definition of the average distance and the distance between two close clusters, together with the agglomerating criteria (3.10), the local structure features integrate into large-scale structure features, such as connectedness and direction features. This is a rough, bottom-up agglomeration, in which different local structure features may merge together. A top-down process is then needed to analyze whether these local features should actually be merged.
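
The agglomeration of elementary clusters can be sketched as follows. The inter-cluster distance averages the $m$ smallest pairwise distances ($m = 3$ in the paper), and the merging test below (clusters are merged when they are close relative to their average distances and their average distances are similar) is an assumed concrete form of the two criteria in (3.10); the functions and the parameter beta are illustrative.

```python
import numpy as np

def avg_cluster_distance(X, members, neighbor_lists):
    """Average distance of a cluster: mean distance from each member to its
    accepted neighbors (an assumed reading of Definition 3.3 / (3.9))."""
    dists = []
    for i in members:
        for j in neighbor_lists[i]:
            dists.append(np.linalg.norm(X[i] - X[j]))
    return np.mean(dists) if dists else 0.0

def cluster_distance(X, members_a, members_b, m=3):
    """Modified set distance: average of the m smallest pairwise distances."""
    pair = sorted(np.linalg.norm(X[i] - X[j])
                  for i in members_a for j in members_b)
    return float(np.mean(pair[:m]))

def should_merge(X, a, b, neighbor_lists, beta=2.0):
    """Hypothetical merging test standing in for the criteria of (3.10)."""
    da = avg_cluster_distance(X, a, neighbor_lists)
    db = avg_cluster_distance(X, b, neighbor_lists)
    close = cluster_distance(X, a, b) <= beta * max(da, db)
    similar = min(da, db) > 0 and max(da, db) / min(da, db) <= beta
    return close and similar
```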

3.2.3. Top-Down Procedure

Identifying noisy data points and large-scale structure features of a data set is difficult for a clustering algorithm and proceeds in parallel with it. Here the identification process embeds top-down procedures, because the agglomerating process is only a bottom-up procedure without any large-scale feature information. In this paper, the identification of noisy data points, connectedness, and direction features is considered.

Identification of Noisy Data Points or Outliers
Many noisy data points or outliers may appear during the local-feature agglomeration in the filtering process. These data points strongly depend on local similarity and may not be true noise; whether they are noisy or not depends on the large-scale structure features after agglomeration.
Some outliers already appear when seeking the neighbors of each data point, and some of them may be agglomerated into clusters during the filtering process. To avoid this, each outlier should be checked to decide whether it is noise or whether it should be assigned to a nearby cluster according to statistical properties (such as the mean and variance of a Gaussian). An example demonstrating the identification procedure is given in Section 3.2.4.

Identification of Connectedness and Direction Feature
During the whole agglomeration procedure, fine clusters merge into coarse clusters and the number of clusters decreases. The coarse clusters are made of small fine clusters, each with its own local principal direction. The fine clusters are regions of various shapes, and the region of a fine cluster $C_i$ can be measured with the covariance matrix $S_i = \frac{1}{|C_i|}\sum_{x_j \in C_i} (x_j - \mu_i)(x_j - \mu_i)^T$, where $\mu_i$ is the mean of the local cluster. The principal direction of $S_i$ is the local direction of the cluster. When agglomerating into coarse clusters, if the directions of the subagglomerates are almost the same, the coarse cluster has a direction feature.
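
The direction check during the top-down process can be illustrated as follows: compute each fine cluster's covariance, take the principal eigenvector as its local direction, and declare a direction feature when the subclusters' directions are nearly parallel. The cosine threshold used below is an illustrative assumption.

```python
import numpy as np

def principal_direction(X, members):
    """Principal direction of a fine cluster from its covariance matrix."""
    pts = X[members]
    C = np.cov(pts, rowvar=False) if len(members) > 1 else np.eye(X.shape[1])
    vals, vecs = np.linalg.eigh(C)
    return vecs[:, np.argmax(vals)]

def has_direction_feature(X, subclusters, cos_thresh=0.95):
    """A coarse cluster has a direction feature if the principal directions
    of its subclusters are almost the same (up to sign)."""
    dirs = [principal_direction(X, m) for m in subclusters]
    ref = dirs[0]
    return all(abs(float(ref @ d)) >= cos_thresh for d in dirs)
```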

3.2.4. Algorithm Outline

The clustering algorithm includes three procedures: local structure feature extraction with the filtering process, integration of large-scale features, and a top-down process to refine the clustering results. The algorithm is summarized by the outline in Algorithm 1.

Input: Data set X, in which each row is a pattern; parameters: k, alpha, beta.
Output: Label vector, in which each element is a cluster label.
Step  (1): Find the k initial nearest neighbors of each data point according to
      Euclidean distance.
Step  (2): Select the proper nearest neighbors of each data point according to
      Definition 3.2.
Step  (3): Construct the filtering matrix W based on (3.2) and (3.8).
Step  (4): Run the filtering process (3.7) to obtain elementary clusters, which are
      the elements with various kinds of structure features. A top-down
      process is included to identify connectedness and direction.
Step  (5): Integrate elementary clusters with the same structure features into
      meaningful clusters according to (3.9) and (3.10).
Step  (6): Apply the top-down process to identify noisy data points and outliers.
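
Combining the pieces above, the overall algorithm can be outlined in code as a short driver. The helper functions (icc_neighbors, local_scale_matrix, filter_weight, filtering_process, should_merge) are the hypothetical sketches given earlier, not the authors' implementation, and the top-down outlier pass of step (6) is omitted.

```python
import numpy as np

def density_consistency_clustering(X, k=6, alpha=2.0, beta=2.0):
    """End-to-end sketch of Algorithm 1 using the helper sketches above."""
    n = X.shape[0]

    # Steps (1)-(2): initial candidates and ICC-filtered neighbors.
    neighbor_lists = [icc_neighbors(X, i, k=k, alpha=alpha)[0] for i in range(n)]

    # Step (3): local scale matrices and filter responses.
    scales = [local_scale_matrix(X, i, neighbor_lists[i]) if neighbor_lists[i]
              else np.eye(X.shape[1]) for i in range(n)]
    weight_fn = lambda i, j: filter_weight(X[i], X[j], scales[i])

    # Step (4): filtering process yielding elementary clusters.
    labels, _ = filtering_process(X, neighbor_lists, weight_fn)

    # Step (5): merge elementary clusters that satisfy the merging test.
    clusters = [list(np.flatnonzero(labels == c)) for c in np.unique(labels)]
    merged = True
    while merged and len(clusters) > 1:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if should_merge(X, clusters[a], clusters[b], neighbor_lists, beta):
                    clusters[a] += clusters.pop(b)
                    merged = True
                    break
            if merged:
                break

    # Step (6): a top-down pass over noisy points would follow here (omitted).
    out = np.empty(n, dtype=int)
    for c, members in enumerate(clusters):
        out[members] = c
    return out
```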

An example is shown in Figure 3, where two Gaussian data sets have different means and variances. From left to right and from top to bottom, the subfigures show the agglomerating procedure of the clustering. The second subfigure is the result of local feature integration under the filtering procedure, the third subfigure is the result after the integration of large-scale structure features, and the last is the result of the top-down process, where some noisy data points are attributed to the two final clusters according to the variances of the two main clusters.

3.2.5. Complexity of the Algorithm

A clustering algorithm with high complexity has difficulty dealing with large data sets, and preprocessing the original data set is also expensive; within the scope of this paper, we do not attempt to make the preprocessing efficient. The k-NN procedure is used in step (1) of the new algorithm. In low dimensions, k-NN search has complexity $O(n \log n)$, while in high dimensions the complexity is $O(n^2)$, where $n$ is the number of data points.

Apart from the neighbor search, the complexity of our clustering algorithm is low. In the filtering process, the filtering matrix is sparse: the number of nonzero elements in each row is at most $k$ (the number of neighbors), and with $d$ the dimension of each pattern, each filtering step costs on the order of $nkd$ operations. In the integration procedure, the complexity depends on the number of elementary clusters, which in practice is much smaller than $n$. The total complexity of the algorithm is therefore linear in the size of the data set, which is much lower than that of CSSF [20] and VClust [21].

4. Experiment

We provide a series of experiments and applications in this section to demonstrate the effectiveness of the proposed clustering algorithm. The test data sets include synthetic data sets and a set of real-world data sets from public domains that are extensively used in testing the performance of classical clustering algorithms. The synthetic data sets are specially generated to exhibit complex structure features, such as unbalanced clusters and complex manifold shapes. The real-world data sets are derived from image segmentation tasks and a runway detection task on a remote sensing image.

We have applied the new algorithm to the above data sets and compared it with several well-known clustering algorithms that are representative of the existing approaches: clustering by scale space filtering (CSSF) [20] from the vision-simulation-based approach, Chameleon [9] from the graph-based approach, the spectral-Ng method [22] from the spectrum-based approach, the Mean Shift algorithm [15] from the density-based approach, and the recent LEGClust algorithm [23] from the information-entropy-based approach.

In the experiments, we use the normalized mutual information (NMI) [24] as the criterion for measuring the performance of a clustering algorithm whenever the data sets are of high dimension. There are three parameters in the algorithm: the initial neighbor parameter $k$ and the agglomerating criterion parameters $\alpha$ and $\beta$. These parameters are very stable for the algorithm: $k$ is only the initial number of neighbors for each data point and the number of accepted neighbors is smaller than $k$; likewise, the other two parameters are very stable. Fixed values of $k$, $\alpha$, and $\beta$ are used throughout this paper.
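
As a reference for how the NMI criterion can be computed, the following minimal sketch uses scikit-learn; the choice of library and the toy labels are assumptions, since the paper does not name its implementation.

```python
# Minimal sketch of the NMI evaluation used to score clustering results.
from sklearn.metrics import normalized_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]   # ground-truth cluster labels (toy example)
pred_labels = [1, 1, 0, 0, 2, 2]   # labels produced by a clustering algorithm

nmi = normalized_mutual_info_score(true_labels, pred_labels)
print(f"NMI = {nmi:.3f}")  # 1.0 here, since the partitions agree up to relabeling
```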

The three toy data sets and the results of the different clustering algorithms are shown in Figure 4, where each row is a synthetic data set and each column contains the results of the three data sets with the same algorithm. We asked 50 people who are familiar with the clustering problem to evaluate these results from the figures, and the most acceptable results are labeled with red boxes. Comparing the results of the algorithms (columns), the new algorithm and VClust are better than the others; between these two, the advantage of the new algorithm is its lower computational load compared with VClust.

We further compare the new algorithm with other clustering algorithms on real-world problems. The real data sets are from UCI and USPS, to which noisy data points following a uniform distribution are added: 30 noisy data points in the Iris data set and 100 in USPS-01. Owing to the uniformly unsatisfactory performance of the density-based algorithms in the simulations above, we compare the new algorithm only with the graph-based algorithms, whose results are better. Thus, as representatives, Chameleon, spectral-Ng, noise-robust spectral clustering (NRSC) [24], and self-tuning spectral clustering (STSC) [25] are selected for comparison with the new algorithm. All the algorithms are applied to the four real data sets, with results shown in Table 1.

The clustering algorithms are also applied to two natural images, shown in Figure 5. The images are composed of regions with different textures and gradually changing colors. Two well-known algorithms that are very successful in image segmentation are normalized cut [26] and mean shift [15]. We apply the new algorithm and CSSF to the chosen images and compare the results with normalized cut and mean shift. The resulting segmentations are shown in Figure 5. Note that the second column shows one kind of acceptable segmentation obtained by manual labeling, and that the result of mean shift is similar to that of normalized cut.

We also apply the new algorithm to an identification problem: detecting and locating the airport runway in the remotely sensed image shown in Figure 6(a), which comes from LANDSAT Thematic Mapper (TM) data acquired over a suburban area of Hangzhou, China. The region contains the runway and parking apron of a civilian aerodrome. The image consists of a finite rectangular lattice of pixels. Similar to [27], we selected Band 5 as the feature variable. A feature subset of the data, shown in Figure 6(b), can then be extracted by a simple technique that selects a pixel when its gray-level value is above a given threshold (e.g., 250).

In [27], the regression class mixture decomposition (RCMD) method was proposed to mine line objects in remotely sensed images. Using the feature subset of the data in Figure 6(b), the RCMD method successfully detected two runways, which can be viewed as two regression classes; the two line equations identified by RCMD describe the detected runways depicted in Figure 6(c). Here we treat this detection problem as a clustering problem and apply the new clustering algorithm. The obtained clustering results are shown in Figure 6(d), where each cluster corresponds to a runway. In the clustering procedure, two principal directions have been obtained by finding the mean principal directions of the anisotropy matrices in each cluster. The two runways detected by the new algorithm can thereby be described by two line equations, as shown in Figure 6(e). The new clustering algorithm yields a result very similar to that of RCMD, while providing a much simpler approach to the identification problem.

5. Conclusion

We have proposed a new clustering algorithm that identifies structure features based on density consistency. By viewing a data set as an (imaginary) image and treating data clustering as a cognition problem, it solves the problem by mimicking the mechanism of visual information processing. First, for low-dimensional data sets with complex structure, the results of the new algorithm are the closest to human perception. Second, it can deal with various structures of subspace clusters and is not restricted by cluster shape or high dimensionality. Third, almost no parameters have to be tuned, so it is easy to use. Fourth, its computational load is much lower than that of CSSF and VClust.