Abstract

Constructing a rational affinity matrix is crucial for spectral clustering. In this paper, a novel spectral clustering method via a local projection distance measure (LPDM) is proposed. In this method, the Local-Projection-Neighborhood (LPN) is defined, which is a region between a pair of data points; the other data points in the LPN are projected onto the straight line connecting the pair. Utilizing the Euclidean distances between the projective points, the local spatial structure of the data can be detected and used to measure the similarity of objects. The affinity matrix is then obtained through a new similarity measurement, which squeezes or widens the projective distances according to the local spatial structure of the data. Experimental results show that the LPDM algorithm obtains desirable, high-performance results on synthetic datasets, real-world datasets, and images.

1. Introduction

As an unsupervised classification technique, clustering has been successfully applied to exploratory data analysis, such as image segmentation [1–3], data mining [4, 5], signal analysis [6], gene expression analysis [7], sports activity analysis [8], and other subjects [9–11]. During the last decade, a large number of clustering algorithms have been developed. Nonetheless, many of these algorithms are not effective when applied to nonconvex data spaces. Compared with classical clustering methods, spectral clustering (SC) [12] has been successfully used to identify irregularly shaped datasets and is supported by linear algebra theory [13].

SC can be regarded as a type of partition problem for an undirected graph [14]. In this formulation, a set of data points is represented by a similarity graph: each data point is a vertex, and the similarity between a pair of data points is the weight of the edge connecting them. Through a partition of the graph, the data points are clustered into different subgraphs such that the edges between different subgraphs have relatively low weights compared with those within the same subgraph. It is well known that early graph partition methods, such as min-cut [15], tend to generate unbalanced solutions and are extraordinarily sensitive to noise [16]. In order to overcome these drawbacks, many spectral clustering algorithms have been proposed, such as normalized cut [17], ratio cut [18], min-max cut [19], and Ng-Jordan-Weiss (NJW) [20], which employ diverse criteria to optimize the quality of the graph partition. In these methods, as we know, the Gaussian kernel function is chosen as the similarity function, but it cannot capture the local spatial structure of the dataset.

To further improve the performance of spectral clustering algorithms, Chen and Feng [16] presented a semisupervised SC based on the Near Strangers or Distant Relatives model, which is a generalization of the SC algorithm. In [21], Li and Guo proposed a novel affinity matrix generation approach that adaptively adjusts the similarity measure of data points based on the spatial structure of the dataset. To avoid a manually tuned scaling parameter, Zelnik-Manor and Perona [22] developed a self-tuning SC algorithm that estimates the scaling parameter from the data. Via M-estimation statistics, Chang and Yeung [23] proposed a robust path-based SC algorithm. In [24], the local density is employed to adjust the scaling parameter; nevertheless, the method needs the parameter to be set empirically [13]. To solve this problem, Yang et al. [25] proposed a density sensitive function that can either elongate or shorten the similarity measure in regions of different density.

From a graph-cut perspective, SC partitions an undirected weighted graph into disjoint components by minimizing the sum of the edge weights between data points in different components. The information on the adjacency relations between data points is contained in the affinity matrix $W$. Most existing methods exploit the local structure of the dataset to construct a rational affinity matrix, which is one of the key issues of SC and greatly affects the partition results.

As we know, data points with high similarity should have uniform density and consistent spatial characteristics [26]. Therefore, the key to estimating whether a pair of data points belongs to a specific cluster is how to use the information carried by the data between them. The local projection distance measure (LPDM) presented in this paper reflects the local spatial structure of the dataset in greater depth, by which rational partitions of synthetic datasets, images, and most real-world datasets can be achieved.

The main contributions of the current paper are threefold. (1) The concept of the Local-Projection-Neighborhood is introduced, which is a spatial area between data points and an important source for obtaining the local spatial structure of datasets. (2) A measure for the local projection distance is presented, which facilitates embodying an accurate local structure of the dataset. (3) A novel similarity measure is defined that can adaptively adjust the measure of similarity based on the local spatial structure of datasets and is insensitive to its parameters on the UCI datasets.

The outline of the rest of this paper is as follows. In Section 2, spectral clustering algorithm is briefly discussed as preliminary. Section 3 introduces the LPDM algorithm. The performance of the presented approach is evaluated in Section 4. Section 5 is the conclusion.

2. Spectral Clustering

SC algorithms can be regarded as solving graph-cut problems which are extensively applied to exploratory data analysis. In this section, as a preliminary, we will briefly review spectral clustering, which is closely related to LPDM algorithm.

In this paper, SC is considered as an undirected weighted graph-cut problem. For a dataset $X = \{x_1, x_2, \ldots, x_n\}$, the weights of the graph can be constructed by the adjacency matrix $W$. Specifically, the element of the adjacency matrix is formulated as

$$W_{ij} = \begin{cases} \exp\left(-\dfrac{d^2(x_i, x_j)}{2\sigma^2}\right), & i \neq j, \\ 0, & i = j, \end{cases} \qquad (1)$$

where $\sigma$ is the scaling parameter determining the neighborhood and $d(x_i, x_j)$ is the distance between points $x_i$ and $x_j$. If an element $W_{ij} = 0$, this means that there is no link between the corresponding points. The diagonal degree matrix $D$ is constructed as $D_{ii} = \sum_{j=1}^{n} W_{ij}$. SC uses this similarity information to group the data points into a predefined number of clusters.

The steps of the NJW approach are listed as follows.
(1) Calculate the similarity of data points by (1) to construct the affinity matrix $W$ and the degree matrix $D$.
(2) Compute the normalized affinity matrix $L = D^{-1/2} W D^{-1/2}$.
(3) Calculate the first $k$ largest eigenvalues of $L$ and the corresponding eigenvectors $v_1, v_2, \ldots, v_k$, and construct the matrix $V = [v_1\ v_2\ \cdots\ v_k]$ with the eigenvectors as column vectors.
(4) Normalize each row of the matrix $V$ to unit length (i.e., $U_{ij} = V_{ij}/(\sum_{j} V_{ij}^2)^{1/2}$) to construct the matrix $U$.
(5) Group the data points by the $k$-means method in the new space spanned by the rows of the matrix $U$.
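To make these steps concrete, the following minimal Python sketch (our own illustration, not the authors' code; the function name njw_spectral_clustering is ours) implements the NJW pipeline with the Gaussian affinity of (1):

```python
import numpy as np
from sklearn.cluster import KMeans

def njw_spectral_clustering(X, k, sigma=0.2):
    """Minimal NJW sketch; X is an (n, d) array, sigma the fixed scale."""
    # (1) Gaussian affinity, eq. (1), with a zero diagonal.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)                     # degree: D_ii = sum_j W_ij
    # (2) Normalized affinity L = D^{-1/2} W D^{-1/2}.
    d_is = 1.0 / np.sqrt(d + 1e-12)
    L = d_is[:, None] * W * d_is[None, :]
    # (3) Eigenvectors of the k largest eigenvalues as columns of V.
    vals, vecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    V = vecs[:, -k:]
    # (4) Row-normalize V to obtain U.
    U = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
    # (5) k-means on the rows of U.
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```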

Remark 1. Via spectral clustering, the data points are mapped into a convex dataset in another space, and then $k$-means clustering can be used to group the mapped points in the new space. Among SC methods, NJW is widely applied to data analysis; thus, it is adopted in this paper.
The affinity matrix of classical spectral clustering is usually constructed by the Gaussian kernel function, but this cannot represent the spatial structure of datasets well and leads to irrational results. To address this problem, the LPDM algorithm is devised in the next section.

3. Affinity Matrix Construction for Spectral Clustering through Local Projection Distance Measure

This section is the most important part of the paper. In its first part, some general problems of three similarity measurement algorithms are briefly discussed. In the second part, the novel LPDM algorithm is introduced to overcome these problems.

3.1. Similarity Function Analysis

As we know, the Gaussian kernel function is employed in most existing SC methods. In most cases, since the scaling parameter $\sigma$ is fixed and has to be set manually, the Gaussian kernel function cannot objectively reflect the local spatial structure of datasets or reasonably determine the similarity between data points, especially when the similarity function is applied to complex datasets. Figure 1 illustrates the high impact of $\sigma$ on the clustering. It is evident that the results of the NJW algorithm are greatly affected by the scaling parameter $\sigma$ in the Gaussian kernel function.

Unlike setting a fixed scaling parameter, the parameter $\sigma_i$ in self-tuning spectral clustering (SC-ST) [22] is calculated based on the neighborhood of point $x_i$ as

$$\sigma_i = d(x_i, x_K), \qquad (2)$$

where $x_K$ is the $K$th nearest neighbor of point $x_i$. Unfortunately, the affinity matrix in SC-ST is still constructed by the Gaussian kernel function, which is less valid in many cases [24]. As Figure 2 shows, the method fails to classify the Three-Spiral-Arms dataset.
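For concreteness, here is a small sketch of this construction, assuming the usual SC-ST affinity $W_{ij} = \exp(-d^2(x_i, x_j)/(\sigma_i \sigma_j))$ of [22]; the function name is ours:

```python
import numpy as np

def self_tuning_affinity(X, K=7):
    """SC-ST sketch: sigma_i is the distance from x_i to its K-th neighbor."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Column 0 of each sorted row is the zero self-distance, so index K
    # picks the K-th nearest neighbor, eq. (2): sigma_i = d(x_i, x_K).
    sigma = np.sort(dists, axis=1)[:, K]
    W = np.exp(-dists ** 2 / (sigma[:, None] * sigma[None, :]))
    np.fill_diagonal(W, 0.0)
    return W
```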

Zhang et al. [24] addressed the Common-Near-Neighbor (CNN) based spectral clustering (SC-DA) and defined a novel similarity function as

$$W_{ij} = \exp\left(-\dfrac{d^2(x_i, x_j)}{2\sigma^2\left(\mathrm{CNN}(i, j) + 1\right)}\right), \qquad (3)$$

where $\mathrm{CNN}(i, j)$ represents the local density of the overlapped area in the data space, and the region is determined by the data points $x_i$ and $x_j$ with radius $r$. The result of the SC-DA algorithm on the Three-Spiral-Arms dataset is shown in Figure 3. It is evident that the algorithm produces the correct clustering result.
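The sketch below illustrates one plausible reading of (3), in which CNN(i, j) is taken as the number of points lying inside both balls of radius r around x_i and x_j; the details of [24] may differ, and the names are ours:

```python
import numpy as np

def cnn_affinity(X, r, sigma=0.01):
    """SC-DA sketch: density-adjusted Gaussian similarity, eq. (3)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    in_ball = (dists <= r).astype(int)    # in_ball[i, k]: x_k within r of x_i
    cnn = in_ball @ in_ball.T             # points common to both r-balls
    W = np.exp(-dists ** 2 / (2.0 * sigma ** 2 * (cnn + 1)))
    np.fill_diagonal(W, 0.0)
    return W
```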

Utilizing CNN, the scale parameter can be adjusted adaptively. However, the approach still requires setting the parameter manually to obtain the correct clustering [13]. Figure 4 shows that SC-DA fails to classify this dataset, so the hidden structures in the dataset cannot be discovered. This reveals that, in some cases, the similarity of data points cannot be properly reflected by the Euclidean distance.

It is generally considered that if data points fall into the same cluster, the distribution of the points should have similar patterns and consistent density. Nevertheless, in some cases, CNN cannot correctly estimate the local density of complex datasets. Consider the synthetic dataset in Figure 5, where it is easy to see that the four marked data points belong to the same cluster.

For three pairs of these data points, the CNN value, the Euclidean distance, the similarity computed by (1), and the novel similarity computed by (3) are calculated, and the results are summarized in Table 1.

From Table 1, we can find that the Euclidean distances between one of the data points and each of the others are approximately the same. That is to say, the similarities between each of the three pairs of points are similar when reflected by the distance alone. The CNN value reflects the local density among data points according to SC-DA, and it can be used to estimate the similarity of point pairs. However, as can be seen from Table 1, the CNN value and the corresponding similarity of one pair are much larger than those of the others, which implies that the points of the remaining pairs would appear not to belong to the same cluster. Apparently, however, all four data points are located in the same cluster. CNN can adaptively adjust the scaling parameter in the Gaussian kernel function and reduce the impact of a fixed $\sigma$ to some extent. Nonetheless, CNN merely reflects the local density around the geometric center between two data points. Therefore, the local structure of the dataset cannot be fully described by CNN.

Combining the analyses of the three SC methods in this subsection, it is clear that, in some sense, the similarity of correlated points cannot always be correctly reflected by measures based on the Gaussian kernel function. How to obtain the local spatial structure of datasets and construct an appropriate affinity matrix is addressed in the next subsection.

3.2. Local Projection Distance Measure

The motivation of the LPDM algorithm originated from the idea that, in order to construct an appropriate affinity matrix, we should know as much as possible about the spatial structure of the neighborhoods of the correlated points. Therefore, in this subsection, we define the Local-Projection-Neighborhood (LPN) and propose a novel density sensitive similarity measure.

Given a dataset in $\mathbb{R}^d$, the $\mathrm{LPN}(x_i, x_j)$ of the pair $x_i$ and $x_j$ is the overlapped region, with a specified Euclidean radius $r$, of the two sphere regions around the center points $c_1$ and $c_2$. The center points of the region can be calculated from

$$d(c, x_i) = d(c, x_j) = d(x_i, x_j) = r, \qquad (4)$$

where $c$ is a center point of a sphere region and $r$ is the radius. Here, the three points $x_i$, $x_j$, and $c$ form an equilateral triangle with side length $d(x_i, x_j)$. Therefore, the two center points $c_1$ and $c_2$ can be obtained by solving (4).

The idea of the LPDM algorithm is to discover the hidden configuration patterns of the local dataset from the spatial structure of the data points in the LPN. Thus, how to obtain the points in the LPN is the first problem. Because the data points in the LPN are located in the overlapped area between the two sphere regions, they can be obtained as

$$x_k \in \mathrm{LPN}(x_i, x_j) \quad \text{iff} \quad d(x_k, c_1) \le r \ \text{and} \ d(x_k, c_2) \le r. \qquad (5)$$
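A minimal two-dimensional sketch of (4) and (5) follows; in higher dimensions the pair of centers is no longer unique, so only the planar case is shown, and the function name is ours:

```python
import numpy as np

def lpn_members(X, i, j):
    """Return indices of the points of X inside LPN(x_i, x_j), 2-D case."""
    xi, xj = X[i], X[j]
    r = np.linalg.norm(xj - xi)           # eq. (4): side length = radius
    mid = (xi + xj) / 2.0
    perp = np.array([-(xj - xi)[1], (xj - xi)[0]]) / r   # unit normal
    c1 = mid + (np.sqrt(3.0) / 2.0) * r * perp           # apex of one triangle
    c2 = mid - (np.sqrt(3.0) / 2.0) * r * perp           # apex of the other
    # Eq. (5): x_k is in the LPN iff it lies in both balls of radius r.
    mask = (np.linalg.norm(X - c1, axis=1) <= r) & \
           (np.linalg.norm(X - c2, axis=1) <= r)
    mask[[i, j]] = False                  # exclude the pair itself
    return np.where(mask)[0]
```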

Now consider the data points dispersedly located in the LPN. How to derive a similarity measurement from the spatial structure of these points is the key to the LPDM algorithm. In this study, each point $x_k$ in the LPN is projected onto the straight line connecting the points $x_i$ and $x_j$, and the projective point is denoted by $p_k$. The $m$th coordinate of the point $p_k$ can be calculated as

$$p_k^{(m)} = x_i^{(m)} + \frac{\langle x_k - x_i,\, x_j - x_i \rangle}{\lVert x_j - x_i \rVert^2}\left(x_j^{(m)} - x_i^{(m)}\right), \qquad (6)$$

where $x_i^{(m)}$ is the $m$th coordinate of the point $x_i$.
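Equation (6) is the standard orthogonal projection written coordinate-wise; a one-function sketch (the name is ours):

```python
import numpy as np

def project_onto_line(xk, xi, xj):
    """Project x_k onto the straight line through x_i and x_j, eq. (6)."""
    direction = xj - xi
    t = np.dot(xk - xi, direction) / np.dot(direction, direction)
    return xi + t * direction             # coordinate-wise form of eq. (6)
```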

Evidently, the straight line segment connecting the points $x_i$ and $x_j$ is divided into several shorter segments by these projective points. As we know, data points that are close in space and of uniform density tend to belong to the same cluster. The Euclidean distances between the projective points in the LPN represent the local similarity among the data points: if the data points are located in the same cluster, consistent projection distances exist in the LPN. The structural information of the local dataset is thus reflected by the number and the lengths of these line segments.

Remark 2. We know that this method stresses the local spatial structure and can avoid seeking the shortest path that connects any two nodes in an undirected graph.
That the local structure of datasets is reflected by the projection distances in the LPN may seem obvious. How to turn the projective distances into a meaningful similarity measure among data points, however, is crucially important in the LPDM algorithm. Here, a novel similarity measure is introduced, motivated by the discussions in [25].

The new adjustable projection distance is defined as

$$l(p_a, p_b) = \beta^{d(p_a, p_b)} - 1, \qquad (7)$$

where $d(p_a, p_b)$ is the Euclidean distance between the points $p_a$ and $p_b$ and $\beta > 1$ is the flexing factor. The Euclidean distance can be enlarged or shortened by the nonlinear function (7).

As we know, the similarity of a pair of points can be reflected by the distance between them. Nevertheless, a pair of points separated by a longer distance might still belong to the same cluster if a large number of points are uniformly distributed between them. Therefore, the lengths of the line segments connecting consecutive projective points in the LPN are adjusted by the nonlinear function (7). According to the spatial structure of the local dataset, a new distance between the pair of points can be obtained by summing the adjusted lengths of these line segments. The new distance between the two points $x_i$ and $x_j$ is

$$D(x_i, x_j) = \sum_{m=0}^{n} l(p_m, p_{m+1}), \qquad (8)$$

where $n$ is the number of projective points in the LPN. Notice that the points $p_0$ and $p_{n+1}$ are $x_i$ and $x_j$, respectively.

The similarity of a pair of points is inversely related to their distance. Therefore, the similarity can be computed as

$$W_{ij} = \frac{1}{D(x_i, x_j) + 1}. \qquad (9)$$

The novel similarity metric can highlight the diversity of the local structure of datasets and avoids seeking the shortest path in the graph.
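Putting (7)-(9) together, the sketch below computes the LPDM similarity of one pair, reusing lpn_members from the earlier sketch; the inverse form 1/(D + 1) in the last line is our reading of (9):

```python
import numpy as np

def lpdm_similarity(X, i, j, beta=12.0):
    """Flexed segment-sum distance, eqs. (7)-(8), and similarity, eq. (9)."""
    xi, xj = X[i], X[j]
    direction = xj - xi
    denom = np.dot(direction, direction)
    # Positions t of the projective points along the line, with the
    # endpoints p_0 = x_i (t = 0) and p_{n+1} = x_j (t = 1).
    ts = [np.dot(X[k] - xi, direction) / denom for k in lpn_members(X, i, j)]
    ts = np.sort(np.concatenate([[0.0, 1.0], ts]))
    seg = np.diff(ts) * np.linalg.norm(direction)   # consecutive segment lengths
    D = np.sum(beta ** seg - 1.0)         # eqs. (7)-(8): flex, then sum
    return 1.0 / (D + 1.0)                # eq. (9), assumed inverse form
```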

Here, the following example illustrates the process of the LPDM algorithm. Consider the four data points shown in Figure 6.

The similarity of a chosen pair of these points, denoted $x_i$ and $x_j$, is calculated by the following steps.

Step 1. Compute the center points of the pair $x_i$ and $x_j$ with the specified Euclidean radius by (4), which yields the two centers $c_1$ and $c_2$.

Step 2. Find the data points in the LPN by (5). According to the distances, only one of the remaining points belongs to the LPN.

Step 3. Compute the projective point of this point by (6).

Step 4. Calculate the distance between the points $x_i$ and $x_j$ by (7) and (8), where the parameter is set to 1.

Step 5. Estimate the similarity of the pair of points $x_i$ and $x_j$ by (9).

It is worth mentioning that the $k$-nearest-neighbor strategy is adopted to construct the affinity matrix in LPDM. Employing the structural information about the neighborhoods of the correlated points and the novel density sensitive similarity measure, the LPDM algorithm achieves high spectral clustering accuracy. The clustering results achieved on the synthetic datasets via the LPDM algorithm are shown in Figure 7; the algorithm obtains the desired clustering results for these datasets.
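As a usage illustration, the sketches above can be run on four hypothetical 2-D points (the coordinates below are illustrative only and are not the values of Figure 6):

```python
import numpy as np

# Hypothetical stand-ins for the four points of Figure 6.
X = np.array([[0.0, 0.0], [3.0, 0.0], [1.5, 0.2], [5.0, 4.0]])
print(lpn_members(X, 0, 1))       # -> [2]: only the third point is in the LPN
print(lpdm_similarity(X, 0, 1))   # similarity of the first pair by eq. (9)
```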

In order to reduce the computational complexity, the $k$-nearest-neighbor strategy is used to construct the affinity matrix [27]. According to the neighbor propagation principle [21], it is unnecessary to obtain all pairwise affinity relationships among the data points, because neighbor propagation can fully describe the structure of the dataset.
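A sketch of such a sparsification, keeping an edge whenever either endpoint ranks it among its k largest similarities (this symmetrization rule is an assumption):

```python
import numpy as np

def knn_sparsify(W, k):
    """Zero out all but the k largest similarities per row, symmetrized."""
    n = W.shape[0]
    keep = np.zeros_like(W, dtype=bool)
    idx = np.argsort(-W, axis=1)[:, :k]   # k most similar neighbors per row
    keep[np.arange(n)[:, None], idx] = True
    keep = keep | keep.T                  # keep edge if either end keeps it
    return np.where(keep, W, 0.0)
```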

In this subsection, the LPDM algorithm has been presented. In contrast to the classical similarity measures based on the Gaussian kernel function, the similarity among data points in the LPDM algorithm is constructed directly, and its value better reflects the local spatial structure of datasets. The performance of the algorithm is further illustrated in Section 4.

4. Experimental Results and Analysis

In this section, a number of experiments are conducted to evaluate the performance of the LPDM algorithm, and the sensitivity of its parameters is further analyzed. The experimental results clearly demonstrate the advantages of the LPDM algorithm. The experiments are organized as follows. Firstly, four SC algorithms are applied to several synthetic datasets and real-world datasets, and the clustering accuracies of the different algorithms are examined on two small datasets. Then, the LPDM algorithm is executed on larger datasets to evaluate the performance of our method. All experiments are implemented in the Matlab 7.12 environment on a PC with a 1.6 GHz Intel CPU and 4 GB of memory.

In our clustering experiments, clustering accuracy (Acc) [28] and the Rand Index (RI) [29] are used to assess the performance of the LPDM algorithm. The Acc is defined as

$$\mathrm{Acc} = \frac{\sum_{i} n\!\left(t_i, \mathrm{map}(c_i)\right)}{n},$$

where $t_i$ and $c_i$ are the true clustering result and the experimental clustering result of the original data, respectively, $n(t_i, \mathrm{map}(c_i))$ is the quantity of data points that belong to both the true cluster $t_i$ and the practical cluster $c_i$, and $\mathrm{map}(\cdot)$ is a function that maps the experimental cluster labels to the corresponding true labels.
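One common way to realize map(·) is a Hungarian assignment over the contingency table, as in the following sketch (our choice; the paper does not specify the implementation):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, pred_labels):
    """Acc: best one-to-one label mapping, then the fraction matched."""
    t, p = np.asarray(true_labels), np.asarray(pred_labels)
    t_ids, p_ids = np.unique(t), np.unique(p)
    # Contingency table: n(t_i, c_j) = points in true cluster i, predicted j.
    table = np.array([[np.sum((t == a) & (p == b)) for b in p_ids]
                      for a in t_ids])
    rows, cols = linear_sum_assignment(-table)   # maximize matched counts
    return table[rows, cols].sum() / len(t)
```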

It is a known fact that there exist $N(N-1)/2$ potential pairwise decisions for estimating whether each pair of data points belongs to the same cluster, where $N$ is the size of the dataset. RI is used to evaluate clustering accuracy, and a higher value indicates better clustering performance. It is defined as

$$\mathrm{RI} = \frac{\mathrm{CD}}{\mathrm{TD}},$$

where CD denotes the quantity of correct decisions and TD denotes the quantity of total decisions.
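A direct sketch of this definition, enumerating all N(N-1)/2 pairwise decisions (quadratic in N, which is adequate for dataset sizes like those used here):

```python
import numpy as np
from itertools import combinations

def rand_index(true_labels, pred_labels):
    """RI = CD / TD over all pairwise same-cluster decisions."""
    t, p = np.asarray(true_labels), np.asarray(pred_labels)
    decisions = [(t[a] == t[b]) == (p[a] == p[b])
                 for a, b in combinations(range(len(t)), 2)]
    return sum(decisions) / len(decisions)   # correct / total decisions
```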

4.1. Parameter Selection

For the NJW, SC-DA, SC-ST, and LPDM algorithms, the parameters of each SC algorithm need to be set for the experiments above. In order to obtain a reasonable scale parameter, the Iris dataset from the UCI repository is used to evaluate the quality of the scale parameter. In the NJW and SC-DA algorithms, the scale parameter $\sigma$ needs to be set. From Figures 8 and 9, we can find that NJW and SC-DA achieve better performance when the scale parameters are set to 0.2 and 0.01, respectively. The neighborhood parameter $K$ is set to 7 in all our experiments, in accordance with SC-ST in [22]. In LPDM, the flexing factor $\beta$ is set to 12, and 7% of the size of the dataset is adopted as the neighborhood size $k$ when LPDM is implemented on each dataset.

4.2. Synthetic Data Experiments

In this subsection, four synthetic datasets of arbitrary shape and various densities are used to test the accuracy of the four SC algorithms. Experimental results are presented in Figure 10.

The first row of synthetic data in Figure 10 is the Two-Moon dataset. It is evident that the data points of each moon should belong to the same cluster. Since the dataset consists of nonconvex clusters and the two "moons" are very close, classifying this dataset is a difficult task for SC algorithms. Figure 10 shows that SC-ST cannot rationally classify the Two-Moon dataset, whereas the other SC approaches correctly identify the genuine clusters. The second row of synthetic data in Figure 10 contains clusters of diverse densities, which poses a challenging clustering problem. From the results, we can find that SC and SC-DA cannot classify these data effectively, indicating that both SC and SC-DA are less suitable for multiscale clustering problems. For the remaining two synthetic datasets, all four SC algorithms obtain the expected clusters. In conclusion, rational classifications of all these synthetic datasets are obtained by applying LPDM. Thus, the algorithm can handle different clustering problems well.

4.3. Real Datasets Experiments

As we know, the UCI [30] datasets and the MNIST handwritten digits database [31] have been widely used for testing SC algorithms in the clustering problem.

In this subsection, both collections are used in our experiments to evaluate the performance of the proposed approach. From the UCI databases, we perform experiments on five datasets: Wilt, Wine Quality, Ionosphere, Zoo, and Abalone. The dimension of the data is the number of attributes, which varies from 6 to 34. Table 2 describes the characteristics of these datasets. Unlike the synthetic data, the dimension of the MNIST database is much higher. Each image of a handwritten digit has been normalized and centered in a 28 × 28 gray-level image. In this experiment, four subsets {6, 9}, {1, 6}, {1, 2, 3}, and {0, 1, 3, 4} are selected to test LPDM, and 200 examples for each digit are randomly chosen from the MNIST training dataset. The basic characteristics of these datasets are summarized in Table 3.

For the UCI datasets, the clustering results are summarized in Figure 11, from which we can find that LPDM outperforms the others in terms of both Acc and RI. Taking the Wine Quality dataset as an example, LPDM obtains an accuracy of 0.8000 in terms of Acc, while the others obtain 0.3833, 0.5000, and 0.3500, respectively. For the Abalone dataset, the clustering accuracy of LPDM is lower than on the other datasets, but the performance of LPDM is still superior to that of the other methods.

The experimental results on the MNIST datasets are summarized in Figure 12. As can be seen in the figure, the accuracy of LPDM is higher than those of SC, SC-ST, and SC-DA. For the subset {0, 1, 3, 4}, the accuracies of the four methods are 0.6300, 0.6825, 0.5425, and 0.7125 by Acc, respectively. Across the four subsets, although the accuracies of SC-ST and LPDM are similar, the accuracy of SC-ST is slightly lower than that of LPDM. This demonstrates that a more reasonable affinity matrix can be constructed by LPDM.

4.4. Image Segmentation Experiments

Image segmentation is one of the applications of SC. An SC algorithm can easily be evaluated by its image segmentation results: we can see whether the results "look good," whether the algorithm works only on small datasets, and so on. Here, the LPDM algorithm is applied to image segmentation, and its ability can be evaluated intuitively by observation. In Figure 13, two original images, (a) and (d), in JPEG format are used in this experiment; they are chosen from [20]. To reduce the cost of computation and memory, image (a) is resized, after which the two images (a) and (d) contain 12288 pixels and 3072 pixels, respectively. As we know, it is difficult for SC to segment salient objects from a complicated background, especially for images with a large number of pixels. In contrast, as can be seen from Figure 13, the child and the fire hydrant are partitioned successfully from the backgrounds of images (a) and (d).

4.5. Parameter Sensitiveness

In the last part of the experiments, the parameter sensitiveness of the LPDM approach is studied. The stability of the algorithm depends on its two parameters, the flexing factor $\beta$ and the neighborhood size $k$, which must be adjusted for clustering; their setting is the crucial problem of LPDM. Here, the Wilt, Wine Quality, Ionosphere, Zoo, and Abalone datasets are used to evaluate the sensitiveness of the two parameters.

For the flexing factor $\beta$, the algorithm is evaluated over a range of values, and likewise for the neighborhood size $k$. Figures 14(a) and 14(b) show the Acc rate and the RI rate of LPDM on the five datasets. We can see that changes of $\beta$ within the tested intervals have little impact on the Acc rate and the RI rate. Apparently, the algorithm works well for $\beta$ in the recommended interval. Figures 14(c) and 14(d) show that LPDM is insensitive to $k$, except on the Wine Quality dataset. Hence, it is advisable to adopt a value of $k$ in the recommended interval. Experimental results show that, in most cases, LPDM is insensitive to the parameters $\beta$ and $k$ within the parameter intervals recommended in this subsection.

5. Conclusion

A local projection distance measure for spectral clustering is proposed in this paper, which utilizes the projective data points in the LPN to detect the local spatial structure of the distribution of datasets. Employing a novel density sensitive similarity measure, the local spatial structural information of datasets can be exploited and converted into the similarity measure of a pair of data points. Meanwhile, a $k$-nearest-neighbor sparsification strategy is adopted to reduce both the computational cost and the memory consumption. The numerical results show that the local projection distance measure approach is able to correctly cluster many synthetic datasets, UCI datasets, the MNIST handwritten digits database, and images, and it is less sensitive to parameters than other classical SC approaches.

Many problems still await solutions. For instance, how to automatically and effectively set the specific parameters of our algorithm will be dealt with in future work.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant 81360229), the Research Fund for the Doctoral Program of Higher Education (Grant 20116201110002), the Open Project Program of the National Laboratory of Pattern Recognition (Grant 201407347), and the Natural Science Foundation of Gansu Province (Grants 1308RJZA225 and 145RJ2A065).