Abstract

Clustering data has a wide range of applications and has attracted considerable attention in data mining and artificial intelligence. However, it is difficult to find a set of clusters that best fits natural partitions without any class information. In this paper, a method for detecting the optimal cluster number is proposed. The optimal cluster number can be obtained by the proposed method while the data are partitioned into clusters by the FCM (fuzzy c-means) algorithm. It overcomes the drawback of the FCM algorithm, which requires the cluster number to be specified in advance. The method works by converting the fuzzy clustering result into a weighted bipartite network and then detecting the optimal cluster number with the improved bipartite modularity. Experimental results on artificial and real data sets show the validity of the proposed method.

1. Introduction

Clustering is the unsupervised classification of data points into groups or clusters, such that samples in the same group are similar to each other, while patterns in different groups are dissimilar. In past decades, a number of clustering algorithms have been proposed, attempting to classify a given data set into groups by using different similarity measures. Generally, these algorithms fall into three categories: hard (crisp) clustering, soft (fuzzy) clustering, and possibilistic clustering [13]. The classical clustering algorithms are k-means and fuzzy c-means and their various improved versions. However, these algorithms require the user to specify the number of clusters, $c$, which the user usually does not know in advance or may not want to specify. As a consequence, the clustering result often depends on the choice of this parameter. Therefore, a very challenging problem in cluster analysis, known in the literature as the cluster validity problem, consists of finding the optimal value of $c$.

In fact, clustering validation is a technique to find a set of clusters that best fits natural partitions (the number of clusters) without any class information. Based on the FCM algorithm and bipartite networks, a method for finding the optimal cluster number is suggested in this paper. A weighted bipartite network is first constructed from the membership matrix and the cluster centers obtained by the FCM algorithm. Then the bipartite modularity is extended to the case of the weighted bipartite network. Finally, the optimal cluster number is obtained from the optimal community structure. To validate the effectiveness of the proposed method, experiments were conducted on seven artificial data sets and seven real data sets. As a result, the optimal cluster number has been detected for most of the data sets.

2.1. Some Cluster Validity Indices

The purpose of a clustering algorithm is to divide the given data set $X = \{x_1, x_2, \ldots, x_n\}$ into $c$ groups. The partition matrix $U$ of size $c \times n$ may be represented as $U = [u_{ij}]$, $i = 1, \ldots, c$, $j = 1, \ldots, n$, where $u_{ij}$ is the membership of sample $x_j$ to cluster $i$. In the case of crisp partitioning, the following condition should be satisfied: $u_{ij} = 1$ if $x_j$ belongs to cluster $i$; otherwise, $u_{ij} = 0$ [4].

Formally, the goal of clustering analysis can be represented as follows:
\[
C_1 \cup C_2 \cup \cdots \cup C_c = X, \qquad C_i \cap C_j = \emptyset \ (i \neq j), \qquad C_i \neq \emptyset. \tag{1}
\]

In the case of fuzzy clustering, the membership $u_{ij}$ lies in the interval $[0, 1]$. Crisp clustering may be viewed as a special case of fuzzy clustering.
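For concreteness, the standard constraints on the partition matrix, as usually stated for FCM-type algorithms and restated here for reference, are
\[
u_{ij} \in \{0,1\} \ \text{(crisp)} \quad \text{or} \quad u_{ij} \in [0,1] \ \text{(fuzzy)}, \qquad \sum_{i=1}^{c} u_{ij} = 1 \ \ \forall j, \qquad 0 < \sum_{j=1}^{n} u_{ij} < n \ \ \forall i.
\]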

Whether we use a crisp partitioning method or a fuzzy clustering algorithm to separate the data set into groups, the generated partitions may not reflect the desired clustering of the data because of an inappropriate choice of algorithmic parameters. Thus, how to determine the cluster number $c$ is a key problem, especially for real data sets. Clustering is an unsupervised classification process and therefore uses no a priori information about the data set; that is, the user has no prior knowledge about the number of classes. For two-dimensional data or data sets of small size, it is possible to verify the validity of the results visually. However, visualization of a data set in a high-dimensional space or of huge size is difficult. Thus it is hard to partition the given data set into the clusters that best fit its natural distribution.

To find the optimal cluster number $c$, researchers have proposed validity indices based on the fuzzy clustering result obtained by the FCM algorithm. Once a partition is obtained by a clustering method, a validity function can help us to verify whether it accurately represents the structure of the data set or not. In Table 1, some popular validity indices are listed. The earliest fuzzy cluster validity functions associated with FCM are the partition coefficient (PC) and partition entropy (PE) [5, 6]. NPC and NPE, modifications of PC and PE, were proposed by Dunn and Roubens [7, 8], respectively. These indices only use the membership values and have the advantage of being easy to compute. Considering the compactness and separation of the data set, Fukuyama and Sugeno [9] and Xie and Beni [10] proposed the FS and XB indices, respectively. Kwon extended Xie and Beni's index to eliminate its tendency to decrease monotonically by introducing a punishing function into Xie and Beni's original validity index [11]. Based on the concepts of hypervolume and density, Gath and Geva [12] put forward the fuzzy hypervolume validity index. By introducing the concepts of fuzzy compactness and fuzzy separation into traditional validity indices, Zahid et al. proposed a novel index to evaluate the fitness of partitions produced by fuzzy clustering algorithms [13]. Wu and Yang combined a normalized partition coefficient and an exponential separation measure for each cluster to create the PCAES validity index [14]; they also discussed the problems that validity indices face in a noisy environment. By finding the maximum value of their index, Pakhira et al. [15] measured the goodness of clustering on different partitions of a data set.

In general, an optimal cluster number $c$ can be found by maximizing or minimizing these indices, respectively, so as to produce the best clustering performance for the data set $X$.
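As an illustration, the two earliest indices mentioned above take the following standard forms (restated from the literature, using the partition matrix notation defined above):
\[
\mathrm{PC}(c) = \frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{2}, \qquad
\mathrm{PE}(c) = -\frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij} \log u_{ij}.
\]
The optimal $c$ is the one that maximizes PC (or minimizes PE) over the candidate range of cluster numbers.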

2.2. The Bipartite Network

Networks have attracted a burst of attention in the last decade, with applications to natural, social, and technological systems. To understand the nature of a complex system, a common approach is to map its interconnected objects to a complex network and study the structure of that network. Many real-world networks display a natural bipartite structure; examples include actor-film networks and paper-scientist networks [16, 17]. Because a one-mode projection of a bipartite network has some drawbacks and affects the properties of the network, it is important to analyze the original bipartite network directly. In a bipartite network, there are two nonoverlapping sets of nodes, called top nodes and bottom nodes, and edges only connect pairs of vertices that belong to different sets.

One common characteristic of many bipartite networks is their community structure, where communities are seen as groups of nodes within which connections are dense and between which connections are sparse. Although the notion of community structure is straightforward, constructing an efficient algorithm for identifying the community structure of a bipartite network is highly nontrivial. A number of algorithms for detecting communities have been developed, each of which attempts to uncover a reasonably good partition of the network. To measure the quality of a division of a network, various modularity functions have been suggested; using a modularity, the quality of any assignment of vertices to modules can be assessed. Guimerà et al. [18] proposed a projection-based method, which transforms the bipartite network into a one-mode network and applies one-mode community detection methods. They strikingly demonstrated that the analysis of a projection can give incorrect results on a model network, since the projection affects network properties including the community structure. Barber [19] extended Newman's modularity to bipartite networks and proposed a method that searches for communities by optimizing this modularity. The algorithm is based on the idea that the modules in the two parts of the network are dependent, each part being used in turn to induce the module assignments of the vertices in the other part. To resolve communities in bipartite networks, Xu et al. [20] proposed a minimum description length (MDL) criterion for identifying a good partition of a bipartite network. Recently, a new measure for community extraction from bipartite networks was proposed as a straightforward generalization of Newman's modularity [21].

3. The Fuzzy c-Means Clustering Algorithm

The most commonly used fuzzy clustering method is the fuzzy c-means algorithm. The purpose of fuzzy clustering is to divide the data set into $c$ distinct clusters. The fuzzy c-means (FCM) algorithm was proposed by Dunn [22] and then extended by Bezdek [23]. Formally, the FCM algorithm can be expressed as follows.

Let $X = \{x_1, x_2, \ldots, x_n\}$ be a data set with $n$ points in the $s$-dimensional feature space $\mathbb{R}^s$, $x_j \in \mathbb{R}^s$. The FCM clustering algorithm partitions $X$ into $c$ fuzzy groups by minimizing the objective function $J_m$, which is the weighted sum of squared errors within groups and is defined as
\[
J_m(U, V) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \, \|x_j - v_i\|^2, \tag{2}
\]
subject to
\[
u_{ij} \in [0, 1], \qquad \sum_{i=1}^{c} u_{ij} = 1 \ \ \forall j, \qquad 0 < \sum_{j=1}^{n} u_{ij} < n \ \ \forall i, \tag{3}
\]
where $U = [u_{ij}]$ is a fuzzy partition matrix composed of the degrees of membership of data point $x_j$ to the $i$th cluster, and $V = (v_1, v_2, \ldots, v_c)$ is a vector of unknown cluster prototypes (centers), $v_i \in \mathbb{R}^s$. The norm $\|\cdot\|$ (induced by a chosen norm matrix, typically the identity, giving the Euclidean distance) defines a measure of similarity between a data point and the cluster prototypes. The parameter $m > 1$ controls the fuzziness of the membership of each datum. The cluster centroids and the respective membership functions, which are the solutions of the constrained optimization problem in (2), are given by the following equations:
\[
u_{ij} = \frac{1}{\sum_{k=1}^{c} \bigl( \|x_j - v_i\| / \|x_j - v_k\| \bigr)^{2/(m-1)}}, \tag{4}
\]
\[
v_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} x_j}{\sum_{j=1}^{n} u_{ij}^{m}}. \tag{5}
\]
Equations (4) and (5) constitute an iterative optimization procedure.

The FCM algorithm is executed in the following steps.

Step 1. Given a preselected number of cluster centroids $c$ and the fuzziness factor $m$, initialize the fuzzy partition matrix $U$.

Step 2. Calculate the fuzzy cluster centroids $v_i$ using (5).

Step 3. Use (4) to update the fuzzy membership matrix $U$.

Step 4. If the improvement in $J_m$ is less than a certain threshold $\varepsilon$, then stop; otherwise, go to Step 2.
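The update rules (4) and (5) translate directly into code. The following is a minimal NumPy sketch of the FCM iteration; the function name, default values, and the stopping test on the membership matrix are our own choices, and an established library implementation (e.g., scikit-fuzzy) would normally be preferred.

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Minimal fuzzy c-means sketch.

    X: (n, s) data matrix; returns membership matrix U (c, n) and centers V (c, s)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)                 # each column sums to 1
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)  # eq. (5): weighted cluster means
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)   # (c, n) distances
        d = np.fmax(d, 1e-12)                         # guard against zero distances
        U_new = 1.0 / ((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1))).sum(axis=1)  # eq. (4)
        if np.abs(U_new - U).max() < eps:             # stop when the update is small
            U = U_new
            break
        U = U_new
    return U, V
```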

4. The Improved Bipartite Modularity and the Proposed Method

4.1. The Bipartite Modularity

Very recently, Murata [21] introduced a bipartite modularity to assess the partitioning quality of a given bipartite network. Murata's bipartite modularity can be described as follows.

Suppose $G = (V, E)$ is an unweighted bipartite network whose vertex set $V$ is divided into two disjoint sets, $X$ and $Y$, such that every edge in the network connects a node from one set to a node from the other. Here, the bipartite network is denoted as $G = (X, Y, E)$, and the top nodes and bottom nodes are $X$ and $Y$, respectively. For convenience, we call the top nodes $X$-vertices and the bottom nodes $Y$-vertices.

Let us suppose that $M$ is the number of edges in the bipartite network. Consider a particular division of the bipartite network into $X$-vertex communities and $Y$-vertex communities, where the numbers of communities are $L_X$ and $L_Y$, respectively. $C_l^X$ and $C_m^Y$ denote individual communities belonging to the sets $X$ and $Y$, respectively. $A = (A_{ij})$ is the adjacency matrix of the network, whose element $A_{ij}$ is equal to 1 if vertices $i$ and $j$ are connected and is equal to 0 otherwise.

Under the condition that the vertices of $C_l^X$ and $C_m^Y$ are of different types (which means $C_l^X \subseteq X$ and $C_m^Y \subseteq Y$), $e_{lm}$ (the fraction of all edges that connect vertices in $C_l^X$ to vertices in $C_m^Y$), together with its row and column sums, is defined as follows:
\[
e_{lm} = \frac{1}{M} \sum_{i \in C_l^X} \sum_{j \in C_m^Y} A_{ij}, \qquad a_l = \sum_{m} e_{lm}, \qquad b_m = \sum_{l} e_{lm}. \tag{6}
\]
The bipartite modularity is defined as
\[
Q_B = \sum_{l} \bigl( e_{l,\eta(l)} - a_l \, b_{\eta(l)} \bigr), \qquad \eta(l) = \arg\max_{m} e_{lm}. \tag{7}
\]

A high value of $Q_B$ indicates strong community structure in the bipartite network.
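To make the quantities concrete, the following short sketch computes $e_{lm}$, its marginal sums, and $Q_B$ from a biadjacency matrix and community labels for the two vertex sets. It follows the reconstruction in (6) and (7) above (function and variable names are ours), so it should be read as an illustration rather than a reference implementation of [21].

```python
import numpy as np

def bipartite_modularity(A, x_labels, y_labels):
    """Bipartite modularity of a partition, following (6)-(7).

    A: (|X|, |Y|) biadjacency matrix (0/1 entries for an unweighted network);
    x_labels, y_labels: community label of each X-vertex / Y-vertex."""
    x_labels, y_labels = np.asarray(x_labels), np.asarray(y_labels)
    M = A.sum()                                       # total number of edges
    x_comms, y_comms = np.unique(x_labels), np.unique(y_labels)
    e = np.zeros((len(x_comms), len(y_comms)))        # e[l, m]: fraction of edges between communities
    for li, l in enumerate(x_comms):
        for mi, m in enumerate(y_comms):
            e[li, mi] = A[np.ix_(x_labels == l, y_labels == m)].sum() / M
    a = e.sum(axis=1)                                 # row sums a_l
    b = e.sum(axis=0)                                 # column sums b_m
    Q = 0.0
    for li in range(len(x_comms)):                    # pair each X-community with its best Y-community
        mi = int(np.argmax(e[li]))
        Q += e[li, mi] - a[li] * b[mi]
    return Q
```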

4.2. The Proposed Method

For a given data set, the FCM algorithm with a proper distance norm is employed to partition it into $c$ clusters. After the membership degree matrix has been obtained, one can construct a weighted bipartite network whose top nodes are the $c$ cluster centers and whose bottom nodes are the $n$ samples; the membership degrees are taken as the weights of the edges. Obviously, this produces a fully connected bipartite network.

Example 1. If one partitions the data set described in Figure 1(a) by the FCM algorithm with $c = 3$, a membership degree matrix is obtained. Following the idea mentioned above, a weighted bipartite network is easily constructed: the top nodes are the three cluster centers and the bottom nodes are the ten samples. The numbers in Figure 1(b) stand for the weights of the edges of the network.
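The construction itself is immediate, since the membership matrix already plays the role of a weighted biadjacency matrix. A minimal sketch using NetworkX (node names such as center_i and sample_j are ours) is:

```python
import networkx as nx

def membership_to_bipartite(U):
    """Build the weighted bipartite network of Section 4.2 from a (c, n) membership matrix U:
    top nodes = cluster centers, bottom nodes = samples, edge weight = membership degree."""
    c, n = U.shape
    G = nx.Graph()
    G.add_nodes_from((f"center_{i}" for i in range(c)), bipartite=0)
    G.add_nodes_from((f"sample_{j}" for j in range(n)), bipartite=1)
    for i in range(c):
        for j in range(n):
            G.add_edge(f"center_{i}", f"sample_{j}", weight=float(U[i, j]))
    return G
```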

In this case, the bipartite modularity defined in (7) cannot be used to measure the quality of the partitioned network, because the $e_{lm}$ in (7) is only defined for unweighted bipartite networks. Therefore, it is necessary to improve $e_{lm}$ so as to deal with the weighted bipartite network.

If we define
\[
e_{lm} = \frac{1}{W} \sum_{i \in C_l^X} \sum_{j \in C_m^Y} w_{ij}, \qquad W = \sum_{i} \sum_{j} w_{ij}, \tag{8}
\]
where $w_{ij}$ is the weight (the membership degree) of the edge between center $i$ and sample $j$, one can successfully extend the bipartite modularity to the weighted bipartite network constructed by the FCM algorithm.

In what follows, an algorithm for partitioning a given data set into ideal groups is described.

Input: data set $X$; threshold $\varepsilon$ (used in the FCM algorithm).
Output: the optimal clustering result.
Step 1: execute the FCM algorithm on the given data set.
Step 2: construct a weighted bipartite network from the matrix of membership degrees and the clustering centroids, according to (8).
Step 3: calculate the improved bipartite modularity $Q_B$.

For $c$ from 2 to $c_{\max}$, repeat the above process. Among these values, it is easy to find the maximum value of $Q_B$; the corresponding $c$ is taken as the optimal cluster number. Generally, $c_{\max}$ is unknown. Conventionally, one can take $c_{\max} = \sqrt{n}$, which is widely used in clustering analysis.
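The whole procedure can be sketched as follows in Python, reusing the illustrative fcm helper given in Section 3. How the communities of the weighted network are chosen is not spelled out in the text above, so this sketch makes a plausible assumption: each center forms its own top community and each sample joins the community of the center for which its membership is largest. The helper names and this assignment rule are ours, not the authors' code.

```python
import numpy as np

def weighted_bipartite_modularity(Uw, x_labels, y_labels):
    """Weighted analogue of (6)-(8): membership weights replace edge counts."""
    x_labels, y_labels = np.asarray(x_labels), np.asarray(y_labels)
    W = Uw.sum()
    x_comms, y_comms = np.unique(x_labels), np.unique(y_labels)
    e = np.zeros((len(x_comms), len(y_comms)))
    for li, l in enumerate(x_comms):
        for mi, m in enumerate(y_comms):
            e[li, mi] = Uw[np.ix_(x_labels == l, y_labels == m)].sum() / W
    a, b = e.sum(axis=1), e.sum(axis=0)
    return sum(e[li, int(e[li].argmax())] - a[li] * b[int(e[li].argmax())]
               for li in range(len(x_comms)))

def optimal_cluster_number(X, c_max=None, m=2.0, eps=1e-5):
    """Scan c = 2..c_max, score each FCM partition by the weighted bipartite modularity,
    and return the c with the largest score."""
    n = X.shape[0]
    c_max = c_max or int(np.ceil(np.sqrt(n)))
    scores = {}
    for c in range(2, c_max + 1):
        U, V = fcm(X, c, m=m, eps=eps)      # membership matrix (c, n), centers (c, s)
        x_labels = np.arange(c)             # assumption: one community per center
        y_labels = U.argmax(axis=0)         # assumption: each sample joins its strongest center
        scores[c] = weighted_bipartite_modularity(U, x_labels, y_labels)
    best_c = max(scores, key=scores.get)
    return best_c, scores
```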

5. Experiments

To validate the introduced method, we test it on seven artificial data sets [27–29] and seven well-known real data sets [30, 31], which are widely used to test the validity of various indices and clustering algorithms. Meanwhile, we also compare the results obtained by the proposed method with those achieved by a number of popular validation indices on these data sets. The experimental platform is Windows 7 with an Intel Core 2 Duo CPU T9550 @ 2.66 GHz and 4.00 GB of memory. The fuzziness parameter $m$ and the minimum amount of improvement in FCM are taken as 2.0 and the default value, respectively.

The names of the seven artificial data sets indicate the dimension of the samples and the actual number of clusters in each data set. For example, in the elliptical_2_10 data set, the samples lie in 2-dimensional space and belong to 10 classes. The information about the seven artificial data sets and the seven well-known real data sets is listed in Table 2. Readers can also browse the web pages http://www.isical.ac.in/~sanghami/data.html and http://archive.igbmc.fr/projets/fcm/#datasets to see them in detail.

The optimal cluster numbers detected by our method and the other four indices are listed in Table 3. As can be seen, the ideal optimal cluster number is found for 11 of the 14 data sets. The data sets spherical_2_6 and spherical_3_4 are well separated, so the actual class numbers 6 and 4 are found by all indices. The fact that there are no apparent borders among the clusters of the remaining artificial data sets leads to improper cluster numbers, especially for y_2_3 and spherical_2_5. From Figure 2, one can see that the distinct change in the values of the five indices occurs at $c = 6$, which exactly corresponds to the natural distribution of the samples in spherical_2_6. The scatter diagram of y_2_3 in Figure 3(a) shows that its three clusters overlap each other; that is, there is no clear borderline among the three classes, and it is therefore difficult to partition them into three clusters. The correct cluster number 3 is nevertheless discovered by our method and by the NPC index.

The seven selected real data sets are typical data sets that are usually employed to test the validity of clustering/classification algorithms or cluster indices. The dimensions of these data sets range from 4 to 32, so it is impossible to judge by visualization whether the obtained clustering results are good or not. Furthermore, the clusters of some data sets heavily overlap. Therefore, it is difficult to detect the optimal cluster number of these seven data sets with a validity index. More information about these data sets can be found in [31].

Figure 4 shows the comparison of the results obtained by the proposed method and the other four indices on these real data sets. It is easy to see that there is a distinct difference between the maximum value of $Q_B$ and the other values on bupa_6_2, ionosphere_32_2, wbcd_8_2, and wall-following_4_4; thus the optimal cluster numbers are clearly detected by the proposed method. This is also true for the other four indices, except on wall-following_4_4. The values on iris_4_3, glass_9_6, and pima_8_2 are close to each other for $c = 2$, $3$, and $4$, which implies that it is reasonable to partition these three data sets into 2, 3, or 4 clusters. The Iris data set is a typical example in which there are three classes, but, looking at the data without the class information, one could also argue that there are two and not three clusters. Fortunately, the optimal cluster number 3 is found for the Iris data set. These examples also show that the proposed method is superior to the other four indices.

6. Conclusions

In this paper, a method is proposed to detect the optimal cluster number for a given data set without any class information. The key idea is to convert the problem of detecting the optimal cluster number of a data set into that of finding the optimal community structure in a weighted bipartite network. One advantage of the introduced method is that it makes the unsupervised classification process automatic. The experimental results show the validity of the proposed method, and the comparison results indicate that our proposal is superior to the other four indices, which rely only on the memberships obtained by the FCM algorithm. However, nonspherical clusters and overlapping structures in a data set still affect the clustering results. Therefore, it is worthwhile to construct new methods or indices to find the optimal class structure of a data set.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the Education Fund of Liaoning Province (no. L2012381), the Liaoning Province Education Science “Twelfth Five-Year” Plan Project (no. JG12CB114), and NSFC (under Grant no. 61373127).