Abstract

The Wu-Huberman algorithm is a representative linear-time clustering algorithm. It models the relationships among data points as an artificial "circuit" and then applies the Kirchhoff equations to obtain the voltage value at each point of the circuit. However, the performance of the algorithm depends crucially on the selection of the pole points. In this paper, we present a novel pole point selection strategy for the Wu-Huberman algorithm (termed the PSWH algorithm), which aims at preserving the merits and increasing the robustness of the algorithm. The pole point selection strategy filters pole points by introducing a sparse rate. Experimental results demonstrate that the PSWH algorithm significantly improves clustering accuracy and efficiency compared with the original Wu-Huberman algorithm.

1. Introduction

Traditional data mining approaches fall into two categories [1]. One is supervised learning, which aims to predict the labels of new data points from observed data-label pairs; typical supervised learning methods include the support vector machine and decision trees. The other is unsupervised learning, whose goal is to organize the observed data points without labels. Typical unsupervised learning tasks include clustering [2] and dimensionality reduction [3]. In this paper, we focus on the clustering problem, which aims to divide data into groups of similar objects. From a machine learning perspective, clustering learns the hidden patterns of a dataset in an unsupervised way. From a practical perspective, clustering plays a vital role in data mining applications such as information retrieval, text mining, web analysis, marketing, and computational biology [4–7].

In the last decades, many methods [8–12] have been proposed for clustering. Recently, graph-based clustering has attracted much interest in the machine learning and data mining community [13]. The cluster assignments of the dataset are obtained by optimizing some criterion defined on a graph. For example, spectral clustering, one of the most representative graph-based clustering approaches, optimizes a cut value (e.g., [14, 15]) defined on an undirected graph. After some relaxations, these criteria can usually be optimized via eigendecomposition, and the solutions are guaranteed to be globally optimal. In this way, spectral clustering efficiently avoids the problems of the traditional $k$-means method.

Wu and Huberman proposed a clustering method based on the notion of voltage drops across a network [16]. The algorithm uses a statistical method to avoid the "poles problem" instead of solving it: it randomly picks two poles, applies the algorithm to divide the graph into two communities, repeats this procedure many times, and determines the communities by majority vote [16]. However, in our experiments we found that the choice of the pole points affects the accuracy of some of the clusterings so seriously that the majority voting result is degraded. The specific details will be presented in Section 4.1 (Figure 1).

To overcome the above disadvantages of the Wu-Huberman algorithm, in this paper we first construct a graph from the data points. Then we propose a novel strategy for pole point selection. After that, we iteratively solve the Kirchhoff equations to perform clustering and obtain the final result. In this paper, we consider only the 2-community clustering case and leave the $k$-cluster case for future research.

2. The Wu-Huberman Algorithm

The Wu-Huberman algorithm views the graph as an electric circuit. The purpose is to classify the points in the graph into two communities, that is, clusters. We denote a graph by $G = (V, E)$, where $V$ is the point set of the graph and $E$ is the edge set; the voltage of point $i$ is denoted $V_i$. Suppose points $A$ and $B$ are known to belong to different communities, with $V_A = 1$ and $V_B = 0$, respectively. By solving the Kirchhoff equations, the voltage value of each point can be obtained, which of course should lie between 0 and 1. Whether a point belongs to one community or the other can be decided by the voltage value of the point [17]. The graph is regarded as an electric circuit by associating a unit resistance with each of its edges. Two of the nodes, assumed without loss of generality to be node 1 and node 2, are given a fixed potential difference. The Wu-Huberman method is based on an approximate iterative algorithm that solves the Kirchhoff equations for the node voltages in linear time [16, 18].

The Kirchhoff equations of an $n$-point circuit can be written as

$$V_i = \frac{1}{k_i} \sum_{j} a_{ij} V_j, \quad i = 3, \ldots, n, \qquad (1)$$

where $k_i$ is the degree of point $i$ and $(a_{ij})$ is the adjacency matrix of the graph. After convergence, each community, that is, cluster, is defined as the set of nodes whose voltage values lie within a tolerance of a specific value. Without loss of generality, the points are labeled in such a way that the battery is attached to points 1 and 2, which are termed the pole points.

Because of the complexity, the algorithm does not solve the Kirchhoff equations exactly but solves them iteratively. The algorithm initially sets $V_1 = 1$ and $V_2 = 0$. In the first round, the algorithm updates from point 3 to point $n$ in the following way: when reaching the $i$th point, its voltage is replaced by the average value of its neighbors according to (1). The updating process ends when the algorithm reaches the last point $n$, at which a round is finished. After repeating the updating process for a finite number of rounds, each point reaches a voltage value that approximately satisfies the Kirchhoff equations within a certain precision. Then the algorithm obtains the community results by a threshold decision.
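To make the iteration concrete, the following is a minimal Python sketch of one possible implementation of this update scheme (function and variable names are ours, not from [16]); it performs round-by-round averaging over an unweighted adjacency matrix until the largest per-round change falls below a tolerance.

```python
import numpy as np

def wu_huberman_voltages(adj, pole_a=0, pole_b=1, max_rounds=100, tol=1e-3):
    """Approximately solve the Kirchhoff equations (1) by iterative averaging.

    adj: (n, n) symmetric 0/1 adjacency matrix.
    pole_a, pole_b: indices of the two pole points, fixed at V=1 and V=0.
    """
    n = adj.shape[0]
    v = np.zeros(n)           # free points start at 0 (initialization is arbitrary)
    v[pole_a] = 1.0           # battery: V = 1 at one pole ...
    v[pole_b] = 0.0           # ... and V = 0 at the other
    free = [i for i in range(n) if i not in (pole_a, pole_b)]
    for _ in range(max_rounds):
        max_change = 0.0
        for i in free:        # one round: sweep over all non-pole points
            nbrs = np.flatnonzero(adj[i])
            if nbrs.size == 0:
                continue      # isolated point: nothing to average
            new_v = v[nbrs].mean()          # average of neighbor voltages
            max_change = max(max_change, abs(new_v - v[i]))
            v[i] = new_v
        if max_change < tol:  # voltages have stabilized
            break
    return v
```

A point can then be assigned to a community by comparing its voltage with a threshold, e.g. `labels = (v > 0.5).astype(int)`, where 0.5 is one natural choice.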

The Wu-Huberman algorithm inherits the advantages of graph-based clustering: the final cluster solution is globally optimal and, notably, the running time of the algorithm is linear. However, the algorithm does not always work well [16]. In particular, there is one critical problem that seriously affects the accuracy and efficiency in real applications: both are greatly affected by which nodes are selected as the poles, that is, nodes 1 and 2. It is therefore important to improve the method of selecting the poles. In this paper, we present the PSWH algorithm, which improves the accuracy and effectiveness of the algorithm by introducing a pole point selection strategy.

3. The PSWH Algorithm

3.1. Graph Construction

Let $G = (V, E)$ be an undirected graph with point set $V$ and edge set $E$. The degree $d_i$ of point $i$ is defined as $d_i = \sum_j a_{ij}$, the number of edges connecting with point $i$.

Constructing a $k$ nearest neighborhood graph models the local neighborhood relationships between the data points. Given data points $x_1, \ldots, x_n$, we link $x_i$ and $x_j$ with an undirected edge if $x_i$ is among the $k$ nearest neighbors of $x_j$ or if $x_j$ is among the $k$ nearest neighbors of $x_i$. That is, $x_i$ and $x_j$ are adjacent if $x_i \in N_k(x_j)$ or $x_j \in N_k(x_i)$, where $N_k(\cdot)$ denotes the set of $k$ nearest neighbors of a point. The weight $w_{ij}$ is the similarity between $x_i$ and $x_j$, computed as $w_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$, where $\sigma$ is a dataset-dependent parameter.
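The following Python sketch shows one way to build this graph; the Gaussian form of $w_{ij}$ follows the formula given above, and the function name is ours.

```python
import numpy as np

def knn_graph(X, k=5, sigma=0.5):
    """Symmetric k-nearest-neighbor graph with Gaussian similarities.

    X: (n, m) array of data points.
    Returns the (n, n) weight matrix W, with w_ij > 0 iff x_i and x_j
    are adjacent under the "or" linking rule.
    """
    n = X.shape[0]
    # pairwise squared Euclidean distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # k nearest neighbors of each point (column 0 is the point itself)
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]
    W = np.zeros((n, n))
    for i in range(n):
        for j in nn[i]:
            w = np.exp(-d2[i, j] / (2.0 * sigma ** 2))
            W[i, j] = W[j, i] = w   # link if either point is a neighbor of the other
    return W
```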

3.2. The Pole Point Selection Strategy

The Wu-Huberman algorithm selects the pole points randomly. Based on extensive experiments, we find that the clustering results are very sensitive to the choice of pole points: wrong clustering results may be produced if inappropriate points are chosen as the poles. Figure 1 gives an intuitive illustration of this problem.

To solve this problem, we introduce the concept of "sparse points": a sparse point is one with the maximal diameter between itself and its neighborhood. The existence of sparse points biases the final clustering results. An important finding of our experiments is that if we choose sparse points as the pole points, the Wu-Huberman algorithm becomes less accurate. For this reason, sparse points should not be selected as pole points. We therefore propose the following sparse rate to discriminate the sparse points from the others. Additionally, in order to exclude the impact of the distribution of the similarities and degrees, both the average similarity of the neighbors and the similarity summation of the neighbors are taken into the sparse rate $s_i$. That is,

$$s_i = \frac{D_i}{\bar{w}_i S_i}, \qquad (2)$$

where $D_i$ is the maximum diameter between the $i$th point and its neighborhood, $D_i = \max_{j=1,\ldots,k} \sqrt{\sum_{l=1}^{m} (x_{il} - x_{ijl})^2}$; $x_{i1}, \ldots, x_{ik}$ are the neighbors of $x_i$; $m$ is the feature number of $x_i$; and $x_{ijl}$ is the $l$th attribute feature of the $j$th neighbor of $x_i$. Here $S_i$ is the similarity (weight) summation of $x_i$'s neighborhood, $S_i = \sum_{j=1}^{k} w_{ij}$, and $\bar{w}_i$ is the average weight of $x_i$'s neighborhood, $\bar{w}_i = S_i / k$.
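A Python sketch of the sparse rate computation follows; the combination of $D_i$, $S_i$, and $\bar{w}_i$ is as written in (2) above, and the function name is ours.

```python
import numpy as np

def sparse_rate(X, W):
    """Sparse rate s_i = D_i / (wbar_i * S_i) for each point, following (2);
    a large s_i flags a candidate sparse point.

    X: (n, m) data points; W: (n, n) similarity matrix of the kNN graph.
    """
    n = X.shape[0]
    s = np.zeros(n)
    for i in range(n):
        nbrs = np.flatnonzero(W[i])                    # neighborhood of x_i
        dists = np.linalg.norm(X[nbrs] - X[i], axis=1)
        D = dists.max()                                 # maximum diameter D_i
        S = W[i, nbrs].sum()                            # weight summation S_i
        wbar = S / nbrs.size                            # average weight wbar_i
        s[i] = D / (wbar * S)
    return s
```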

Figure 1(e) shows the sparse rate of each point in Figure 1. A point is identified as a sparse point when its sparse rate is significantly larger than those of most other points. Sparse points lie far from the other points, between the two clusters, so they should not be chosen as the pole points.

We define an extent to bound the allowed number of sparse points. For example, an extent of 5% in the two-moon example means that the allowed number of sparse points is (number of points) × extent = 100 × 5% = 5; that is, we choose the top 5 points by sparse rate as the sparse points. The specific experimental details are given in Section 4.1.
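A short sketch of the extent rule (function name is ours):

```python
import numpy as np

def select_sparse_points(s, extent=0.05):
    """Indices of the sparse points: the top (n * extent) points by sparse
    rate, e.g. 100 * 5% = 5 points in the two-moon example."""
    n_sparse = int(round(len(s) * extent))
    # sort descending by sparse rate and keep the first n_sparse indices
    return np.argsort(s)[::-1][:n_sparse]
```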

3.3. Iteratively Solving the Kirchhoff Equations

We illustrate the computation procedure for iteratively solving the Kirchhoff equations with an example. Suppose that, according to the results of (2), the pole points are the 1st and the $n$th points; that is, $V_1 = 1$ and $V_n = 0$. Then (1) is used to obtain the voltage value of each point excluding the pole points, whose voltage values are fixed: the value of each point is set to the similarity-weighted average of its neighbor points. The updating process ends when we have gone through points 2 to $n-1$. This process is repeated until the voltage values converge within a stable error range. In our experiments, we set 0.001 as the termination condition of the iteration.
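A Python sketch of this step, assuming the similarity-weighted reading above (on the weighted graph, the averaging in (1) uses the edge weights $w_{ij}$ as conductances; names are ours):

```python
import numpy as np

def pswh_voltages(W, pole_a, pole_b, tol=1e-3, max_rounds=100):
    """Iteratively relax the voltages on the weighted kNN graph: each
    non-pole point is replaced by the similarity-weighted average of its
    neighbors until the largest per-round change is below tol (0.001 here,
    as in the paper's experiments)."""
    n = W.shape[0]
    v = np.zeros(n)
    v[pole_a] = 1.0                      # fixed pole voltages: V = 1 ...
    v[pole_b] = 0.0                      # ... and V = 0
    free = [i for i in range(n) if i not in (pole_a, pole_b)]
    for _ in range(max_rounds):
        max_change = 0.0
        for i in free:
            nbrs = np.flatnonzero(W[i])
            if nbrs.size == 0:
                continue                 # isolated point: nothing to update
            new_v = W[i, nbrs] @ v[nbrs] / W[i, nbrs].sum()
            max_change = max(max_change, abs(new_v - v[i]))
            v[i] = new_v
        if max_change < tol:
            break
    return v
```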

3.4. The Procedure of the PSWH Algorithm

Input. Dataset $\{x_1, \ldots, x_n\}$ and the neighborhood size $k$.

Output. The cluster membership of each data point.

Procedure

Step 1. Construct the $k$ nearest neighborhood graph.

Step 2. Compute the sparse rate using (2) and apply the extent to determine the sparse points. Then exclude the sparse points from the graph and randomly choose two other points as the pole points.

Step 3. Obtain the voltage value of each data point based on (1).

Step 4. Output the cluster assignments of each data point.

4. Experimental Results

In this section, we use the well-known two-moon example to illustrate the effectiveness of the PSWH algorithm. The original dataset is a standard benchmark for machine learning algorithms [19] and is generated according to a pattern of two intertwining crescent moons. This benchmark is available online at http://www.ml.uni-saarland.de/GraphDemo/GraphDemo.html. In the experiments, Gaussian noise with mean 0 and variance 0.01 has been added, and the number of data points is set to 100 for the two moons.
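To make the setup concrete, the following sketch assembles the whole pipeline on such a dataset. It reuses the helper functions sketched in Section 3 (`knn_graph`, `sparse_rate`, `select_sparse_points`, `pswh_voltages`); `make_moons` from scikit-learn stands in for the benchmark generator (noise standard deviation 0.1 corresponds to variance 0.01), and the 0.5 voltage threshold is one natural choice.

```python
import numpy as np
from sklearn.datasets import make_moons

# Two-moon data: 100 points, Gaussian noise with variance 0.01 (std 0.1).
X, y_true = make_moons(n_samples=100, noise=0.1, random_state=0)

W = knn_graph(X, k=5, sigma=0.5)                 # Section 3.1
s = sparse_rate(X, W)                            # Section 3.2, eq. (2)
sparse = set(select_sparse_points(s, extent=0.05))

# Randomly choose two non-sparse points as the poles.
rng = np.random.default_rng(0)
candidates = [i for i in range(len(X)) if i not in sparse]
pole_a, pole_b = rng.choice(candidates, size=2, replace=False)

v = pswh_voltages(W, pole_a, pole_b, tol=1e-3, max_rounds=100)
labels = (v > 0.5).astype(int)                   # threshold decision
```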

4.1. Pole Points’ Influence on the Clustering Accuracy

In the Wu-Huberman algorithm, the choice of the pole points significantly affects the clustering results. Taking the two-moon dataset as an example, we set $\sigma = 0.5$ and $k = 5$. In Figure 1(e), the sparse points are the 3rd, 20th, 35th, 45th, and 83rd points. To improve the clustering accuracy, we do not choose the sparse points as the poles; the resulting clustering accuracy is 100%. By contrast, Figure 1(c) illustrates that with badly chosen poles the clustering accuracy is low no matter what threshold is chosen. That is to say, the choice of the poles has a great effect on the clustering results.

4.2. Pole Points' Influence on the Number of Iterations

In this experiment, we find that the choice of the pole points also has an impact on the number of iterations. The two-moon dataset is again taken as an example. All of the experiments are conducted under the same parameter settings: $\sigma = 0.5$ and $k = 5$, the iteration error is 0.001, and the maximum number of iterations is 100.

We first construct the kNN graph of the original dataset. Then the degree of each point is computed and displayed in Figure 2(b). Next, we obtain the sparse rate of each point based on the degree distribution, which is the same as Figure 1(e). Finally, we choose the poles based on the sparse rate, iterate (1) to obtain the voltage value of each point, and display the number of iterations in Figures 2(c) and 2(d), respectively, for the different choices of poles.

From Figure 2 we can conclude that the greater the degree of the poles, the more iterations are required for convergence. Therefore, in order to decrease the number of iterations of the algorithm, we should choose points with smaller degree as the poles. The clustering accuracy in Figure 2 is 100%.

4.3. Comparison with Other Algorithms

We compare the PSWH algorithm with other algorithms on datasets from the UCI repository, which is available at http://archive.ics.uci.edu/ml/.

From Table 1, we find that the PSWH algorithm does slightly better than the other algorithms on most datasets. In some conditions, however, its accuracy is lower than that of the LCLGR algorithm. On the other hand, the complexity of the PSWH algorithm is linear, which is lower than that of the LCLGR algorithm. Therefore, in general, the PSWH algorithm compares favorably with the others.

5. Conclusions and Future Work

In this paper, we propose the PSWH algorithm for enhancing the clustering accuracy and efficiency of the Wu-Huberman algorithm; it extends the applicability and increases the robustness of the original algorithm. The concept of sparse points and a selection procedure are presented to obtain suitable pole points for the algorithm. The experimental results show that the PSWH algorithm is effective and stable when applied to clustering problems. In the future, we will give a theoretical analysis of the new algorithm and apply it to more general and larger datasets. Furthermore, we will try to extend the new algorithm to textual, image, and video retrieval.

Acknowledgments

This work was supported by the key project of the National Social Science Fund (11AZD089) and Educational Commission Scientific Project of Liaoning Province (no. L2012381).