Abstract

A small amount of prior knowledge and randomly chosen initial cluster centers directly affect the accuracy of iterative clustering algorithms. In this paper we propose a new algorithm that computes initial cluster centers for k-means clustering and the optimal number of clusters with little prior knowledge, and optimizes the clustering result. It constructs a Euclidean distance control factor based on the sparse degree of aggregation density to select the initial cluster centers of nonuniform sparse data and obtains initial data clusters by the multidimensional diffusion density distribution. A multiobjective clustering approach based on dynamic cumulative entropy is adopted to optimize the initial data clusters and the number of clusters. The experimental results show that the newly proposed algorithm performs well in obtaining the initial cluster centers for the k-means algorithm and effectively improves the clustering accuracy of nonuniform sparse data by about 5%.

1. Introduction

Clustering is an important discovery technique of exploratory data mining and a common technique for statistical data analysis. Iterative clustering algorithms form one family of clustering algorithms, and k-means is the most popular and the fastest method among them. Because of its simplicity, the k-means algorithm is used in many fields, including machine learning, medicine, image analysis, pattern recognition, information retrieval, bioinformatics, and computer science. For example, in the medical field, cancer genomics [1], cell signaling [2], and viral genome studies [3] use k-means as a data analysis tool; in the bioinformatics field, bioanalytical chemistry [4], the vibrational spectra of biomolecules [5], and studies of the nervous system [6] use k-means to mine potential information; in the image analysis field, imaging techniques [7] use k-means to partition a given set of points into homogeneous groups; in the pattern recognition field, an automatic system for imbalance diagnosis in wind turbines [8] uses k-means to suggest the optimum number of groups. Reference [9] uses k-means to analyze network data. Reference [10] generates profiles with k-means to group together days with a similar pattern of request arrivals.

Although the k-means algorithm has been developed to solve a wide range of different problems, it has three major drawbacks:

(1) The cluster number K must be predetermined by the user. In practice, due to little prior knowledge, the value of K is generally difficult to determine.

(2) It is sensitive to the selection of the initial cluster centers; that is, k-means produces different results for different initial cluster centers. Because the initial cluster centers are chosen randomly, populations are generally composed exclusively of low-quality individuals.

(3) The performance of the k-means algorithm on nonuniform sparse data is poor.

To overcome these drawbacks, many evolutionary algorithms such as GA, TS, and SA have been introduced. Kao et al. proposed a hybrid technique built on the k-means algorithm [11]. Bahmani Firouzi et al. introduced a hybrid evolutionary algorithm combining PSO, SA, and k-means to find the optimal solution [12]. Niknam and Amiri proposed a hybrid algorithm based on a fuzzy adaptive PSO, ACO, and k-means for cluster analysis [13]. Niknam et al. proposed a novel algorithm combining two clustering algorithms: k-means and the Modified Imperialist Competitive Algorithm [14]. Evolutionary algorithms require large amounts of data to study; however, many real-world problems are like black boxes, so sufficient data about their internals is not available.

In addition, to solve the problem of selecting the initial cluster centers, Bianchi et al. proposed two density-based k-means initialization algorithms for nonmetric data clustering [15]. Tunali et al. proposed an improved clustering algorithm for text mining: multicluster spherical k-means [16]. Tvrdík and Křivý proposed a new algorithm combining differential evolution and k-means [17]. Rodriguez and Laio proposed selecting the initial cluster centers by density peaks. Khan and Ahmad [18] proposed a cluster center initialization algorithm for k-means clustering. However, the above methods are computationally laborious.

To address the situation in which a small amount of prior knowledge and randomly chosen initial cluster centers directly affect the accuracy of iterative clustering algorithms, in this paper we propose a new algorithm for nonuniform sparse data clustering based on cascade entropy increase and decrease. It designs a Euclidean distance control factor based on the sparse degree of aggregation density, determines the initial cluster centers of nonuniform sparse data, and groups the initial data clusters by the multidimensional diffusion density distribution. A multiobjective clustering approach is adopted to compensate for the clustering error of the initial data clusters. The experimental results show that the new data clustering algorithm can effectively improve the clustering accuracy of nonuniform sparse data.

2. Nonuniform Sparse Data Clustering Cascade Algorithm Based on Dynamic Cumulative Entropy

In order to obtain optimal clustering results for nonuniform sparse data, in this paper we use the multidimensional diffusion density distribution to obtain the initial data clusters, while the Euclidean distance control factor based on the sparse degree of aggregation density is put forward to solve the problem that multidimensional data are easily misjudged. The number of initial data clusters is larger than the real number of clusters, so we need to select the optimal initial cluster centers by a decision graph and then execute k-means on the complete data set based on the multiobjective clustering approach.

2.1. Initial Data Clustering Using Multidimensional Diffusion Density Distribution

In iterative clustering algorithms, choosing the initial cluster centers is extremely important, as it has a direct impact on the formation of the final clusters. It is risky to select as initial centers samples that lie far away from the normal samples. In this paper, we first define the multidimensional diffusion density distribution of samples and the comprehensive distance, according to which we obtain the initial data clusters.

2.1.1. Multidimensional Diffusion Data Normalization

Different attributes of multidimensional data have different units of measurement and value ranges, which has a serious impact on cluster formation. So, in order to avoid this situation, we first need to normalize the clustering data so that all attributes have the same weight. Let X = {x_1, x_2, ..., x_n} be the set of data elements described with m attributes, where m is the number of attributes and all attributes are numeric, and let x_{if} be the measured value of the f-th attribute belonging to the i-th data element. Normalization is done using the following formulas. Calculate the mean m_f and the average absolute deviation s_f of each attribute f:

m_f = (1/n) Σ_{i=1}^{n} x_{if},    s_f = (1/n) Σ_{i=1}^{n} |x_{if} − m_f|,

and then compute the standardized measurement

z_{if} = (x_{if} − m_f) / s_f.
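As a concrete illustration, this normalization step can be sketched in Python (a minimal sketch; the function name normalize and the use of NumPy are our own choices rather than part of the paper):

```python
import numpy as np

def normalize(X):
    """Standardize each attribute by its mean and average absolute deviation.

    X: (n, m) array of n data elements with m numeric attributes.
    Returns the (n, m) array of standardized measurements z_if.
    """
    m_f = X.mean(axis=0)                   # mean of each attribute
    s_f = np.abs(X - m_f).mean(axis=0)     # average absolute deviation s_f
    s_f[s_f == 0] = 1.0                    # guard against constant attributes
    return (X - m_f) / s_f
```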

2.1.2. Multidimensional Diffusion Density Distribution

In iterative clustering algorithms the function adopted for density is the cut-off kernel [19]:

ρ_i = Σ_{j ≠ i} χ(d_ij − d_c),  where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise,

and d_c is the cut-off distance determined by the user.

Here x_i is the i-th data element with m attributes and x_j is the j-th data element with m attributes; d_ij measures the Euclidean distance between x_i and x_j. So ρ_i is the number of data elements whose distance to the data element x_i is less than d_c.
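A minimal Python sketch of this cut-off kernel density follows (cutoff_density is our own name; Z is assumed to hold the normalized data from the previous subsection):

```python
import numpy as np

def cutoff_density(Z, d_c):
    """Cut-off kernel density of [19]: rho_i counts the points within d_c of x_i."""
    diff = Z[:, None, :] - Z[None, :, :]       # pairwise differences
    d = np.sqrt((diff ** 2).sum(axis=-1))      # Euclidean distances d_ij
    return (d < d_c).sum(axis=1) - 1           # chi(d_ij - d_c) summed, excluding the point itself
```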

The clustering results of multidimensional data have a shortcoming when the above density function is used: some multidimensional data are misjudged because the function does not consider the differences among attributes. If a few attributes of some data change a lot while the other attributes stay close to those of other data, these mutating data can be misjudged into a cluster they are not similar to, because the changing attributes are ignored. In order to solve this problem, we put forward the Euclidean distance control factor λ based on the sparse degree of aggregation density, which is defined in terms of the attribute on which x_i and x_j have the maximum distance and the attribute on which x_i and x_j have the minimum distance.

According to the above views, we propose the optimized density formula based on the control factor λ.

Formula (6) distinguishes among different attributes using the Euclidean distance control factor λ based on the sparse degree of aggregation density. The attributes with large differences are given more weight in computing density, which reduces the risk of misjudging data. The optimized density is the multidimensional diffusion density ρ. Let μ_ρ be the average of the multidimensional diffusion density, σ_ρ its standard deviation, and ρ_0 the standard value of the multidimensional diffusion density. The data whose multidimensional diffusion density is bigger than ρ_0 are put into a collection named D.

Let δ_ij be the distance between x_i and x_j, where both of them are from collection D. Let μ_δ be the average of the distances between the data in collection D, σ_δ the standard deviation of those distances, and δ_0 the standard value of distance. The data whose distance is bigger than δ_0 are put into a collection named C. Reference [19] shows that the initial cluster centers have large density and lie far from one another. So the data in collection C are likely to be initial cluster centers, and we choose them as the initial cluster centers of the initial data clusters.
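Putting the two selection steps together, the following Python sketch chooses the candidate initial centers. Two details are our own assumptions: the standard values ρ_0 and δ_0 are taken as mean plus one standard deviation, and each point's pairwise distances within D are aggregated by their average; the paper's exact standard-value formulas may differ.

```python
import numpy as np

def select_initial_centers(Z, rho):
    """Select initial centers: high density (collection D), then far apart (collection C)."""
    rho_0 = rho.mean() + rho.std()               # standard value of density (assumed form)
    D = np.where(rho > rho_0)[0]                 # collection D: high-density points
    diff = Z[D][:, None, :] - Z[D][None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))        # pairwise distances delta_ij within D
    delta = d.sum(axis=1) / max(len(D) - 1, 1)   # average distance of each point to the rest of D
    delta_0 = delta.mean() + delta.std()         # standard value of distance (assumed form)
    return D[delta > delta_0]                    # collection C: indices of candidate centers
```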

2.1.3. Obtaining Initial Data Clusters by Multidimensional Diffusion Density Distribution

In this subsection we present the execution steps of our proposed initial data clustering using the multidimensional diffusion density distribution for k-means clustering. The algorithm consists of two parts. The first part finds the initial cluster centers of the initial data; then we execute the second part of the algorithm to group the data.

Let X = {x_1, x_2, ..., x_n} be the set of data elements described with m attributes, where m is the number of attributes and all attributes are numeric. Compute the multidimensional diffusion density ρ of each data element, the mean μ_ρ, the standard deviation σ_ρ, and the standard value ρ_0. Choose the data whose ρ is bigger than ρ_0 and put them into collection D. Compute the distance δ of each piece of data in collection D, the mean μ_δ, the standard deviation σ_δ, and the standard value δ_0. Choose the data whose δ is bigger than δ_0 and put them into collection C. C is the set of the initial cluster centers of the initial data, and K_0 = |C| is the number of the initial cluster centers.

Here d(x_i, c_j) measures the Euclidean distance between a pattern x_i and its cluster center c_j, which is in collection C.

The k-means algorithm minimizes the objective function J_MSE, which is defined as [18]:

J_MSE = Σ_{j=1}^{K_0} Σ_{x_i ∈ S_j} d(x_i, c_j)^2,

where S_j is the set of patterns assigned to the j-th cluster.

The k-means algorithm groups the data iteratively as follows. Choose the data in collection C as the initial cluster centers; their number K_0 is the number of initial data clusters. Decide the membership of the patterns in one of the K_0 clusters according to the minimum distance from the cluster center. Then calculate the new centers as

c_j = (1/n_j) Σ_{x_i ∈ S_j} x_i.

Here n_j is the number of data elements belonging to the j-th cluster. Repeat the previous steps until there is no change in the cluster centers. The resulting clusters are the initial data clusters; let S = {S_1, S_2, ..., S_{K_0}} be the set of initial data clusters.
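The grouping stage is ordinary Lloyd-style k-means started from the density-chosen centers. A minimal Python sketch (kmeans_from_centers is our own name):

```python
import numpy as np

def kmeans_from_centers(Z, centers, max_iter=100):
    """Lloyd iterations of k-means from given initial centers; returns centers and labels."""
    for _ in range(max_iter):
        # assign each point to its nearest center (minimum-distance criterion)
        d = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # recompute each center c_j as the mean of its n_j members
        new_centers = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

Combined with the earlier sketches, centers, labels = kmeans_from_centers(Z, Z[select_initial_centers(Z, cutoff_density(Z, d_c))]) produces the initial data clusters S.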

2.2. Multiobjective Clustering Approach Based on Dynamic Cumulative Entropy

Obtaining the initial data clusters by the multidimensional diffusion density distribution is only the primary data processing. To get accurate clustering results we need to cluster again on this basis, with the initial data clusters as the basic elements.

2.2.1. Multiobjective Clustering Function Based on Dynamic Cumulative Entropy

In the k-means algorithm the objective function is J_MSE. But it is not suitable for optimizing the primary clusters of data or for determining K, the number of clusters, because J_MSE decreases monotonically as the number of clusters increases. When J_MSE is reduced to the global minimum, the initial cluster centers are pulled away from their data, which makes every single data point a cluster. So the proposed objective function refers to the information of each initial data cluster during the deepening of the clustering.

This paper tries to solve the above problems by studying the clusters' structure with informational entropy theory and the principle of Maximum Informational Entropy. We propose a multiobjective clustering function based on dynamic cumulative entropy to determine the final clustering results. Firstly, we define the information entropy of the clustering:

H = − Σ_{i=1}^{K} (n_i / n) log(n_i / n),

where n_i is the number of data in the i-th initial data cluster, n is the number of data to be clustered, and K is the number of clusters.

According to the principle of Maximum Informational Entropy, if clusters have no elements, the information entropy is 0: H = 0. When the clusters tend to be stable and meet the condition that the numbers of elements belonging to different clusters are very similar, the information entropy is at its maximum, H_max = log K. Based on the information entropy formula (14), we define the cluster structure equilibrium degree E = H / H_max.

E is the time-varying entropy-to-maximum-entropy ratio, which shows the degree of balance of the clusters, and 0 ≤ E ≤ 1. When E = 0, the clustering is in its most uneven condition; when E = 1, the clusters are in the ideal equilibrium state.
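A small Python sketch of the entropy and equilibrium degree (cluster_entropy and equilibrium_degree are our own names; we use the natural logarithm, so H_max = ln K):

```python
import numpy as np

def cluster_entropy(labels):
    """Information entropy H of a clustering, computed from the cluster sizes n_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                 # proportions n_i / n
    return -(p * np.log(p)).sum()

def equilibrium_degree(labels):
    """Equilibrium degree E = H / H_max in [0, 1]; E = 1 means perfectly balanced clusters."""
    k = len(np.unique(labels))
    return 1.0 if k == 1 else cluster_entropy(labels) / np.log(k)
```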

ΔE is the degree of equilibrium gain, which reflects the clusters' occupancy of the data.

Finally, we obtain the multiobjective clustering function J_DCE based on dynamic cumulative entropy, in which d̄ denotes the average distance among the clusters. The multiobjective clustering function considers not only the distance between the cluster centers and their data, which influences the clustering results, but also the amount of information. J_MSE decreases monotonically with the decreasing distance between the cluster centers and their data. On the contrary, the entropy term increases monotonically with the increasing number of global clusters. So, acting as a restricting factor, it prevents the number of clusters from exceeding the real number as J_MSE decreases. This improves the clustering accuracy and reliability when the number of clusters is not known in advance.

2.2.2. Multiobjective Clustering Approach Based on Dynamic Cumulative Entropy

The initial data clusters obtained by the multidimensional diffusion density distribution need to be clustered again. The initial cluster centers are determined again by the distance-information decision graph.

The less information a cluster has, the smaller its uncertainty is and the more likely it is to be a final cluster. At the same time, the global classification is more stable. The formula for the information of the initial data clusters is:

Calculate the distance among the initial data clusters, where d(c_i, c_j), i ≠ j, measures the distance between the centers c_i and c_j of the i-th and j-th initial data clusters.
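The decision-graph coordinates can then be sketched as follows. Two assumptions are ours: the per-cluster information is taken in the Shannon form I_i = −(n_i/n) ln(n_i/n), and each cluster's distance is summarized by the minimum distance from its center to the other cluster centers; the paper's exact formulas may differ.

```python
import numpy as np

def decision_graph(Z, labels):
    """Per-cluster coordinates (1/I_i, delta_i) for the distance-information decision graph."""
    ids, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    info = -p * np.log(p)                     # assumed Shannon form of the information I_i
    centers = np.array([Z[labels == j].mean(axis=0) for j in ids])
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return 1.0 / info, d.min(axis=1)          # inverse information, nearest-center distance
```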

By the nature of clustering, we know that a cluster's uncertainty should be small and the distance among clusters should be large. According to this, we can draw the distance-information decision graph to determine K and the initial cluster centers. Take two-dimensional data as an example, as shown in Figure 1: the multidimensional diffusion density distribution obtains eleven initial data clusters. Figure 2 is the distance-information decision graph, with the inverse of the information of the initial data clusters as the horizontal axis and the distance among the initial data clusters as the vertical axis.

From the distance-information decision graph we can see that four of the initial data clusters have comparatively low information and a large distance among clusters. Because a larger K value can expand the search scope of the solution space, which avoids ignoring the sparse clusters, the selection of the initial cluster centers can be expanded according to the decision graph. Select the respective average values of six initial data clusters as the initial cluster centers with K = 6 and minimize the multiobjective clustering function to obtain the result J(6). Select the respective average values of five initial data clusters as the initial cluster centers with K = 5 and minimize the multiobjective clustering function to obtain J(5). Select the respective average values of four initial data clusters as the initial cluster centers with K = 4 and minimize the multiobjective clustering function to obtain J(4). Then compare J(6), J(5), and J(4); the minimum gives the best choice.
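This selection procedure amounts to trying the few candidate values of K suggested by the decision graph and keeping the one with the smallest objective. A minimal Python sketch, reusing kmeans_from_centers from the sketch in Section 2.1.3; the multiobjective function is passed in as a callable J because its closed form is given by the paper's formulas:

```python
def best_k(Z, candidate_centers, J, ks=(6, 5, 4)):
    """Run k-means for each candidate K (candidate_centers is assumed ordered by the
    decision graph) and return the K and labels minimizing the multiobjective function J."""
    best = None
    for k in ks:
        centers, labels = kmeans_from_centers(Z, candidate_centers[:k])
        score = J(Z, centers, labels)
        if best is None or score < best[0]:
            best = (score, k, labels)
    return best[1], best[2]
```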

3. Experimental Results and Analysis

3.1. Experimental Objects and Related Settings

All experiments were performed on an Intel® Core™ i5 with a 3.30 GHz CPU and 4.00 GB of random access memory (RAM). All programs were coded in standard MATLAB, and the operating system was Windows 7. To show the accuracy of the proposed algorithm, it has been applied to two types of data sets: one is the AR artificial data set from experimental data and the other is UCI data. To judge the quality of the proposed algorithm we use four indicators: accuracy, the Adjusted Rand Index, MSE, and BIC. The Adjusted Rand Index is regarded as the best clustering validity evaluation criterion [20], and BIC is often used as an accuracy evaluation [21]. Because the values of MSE are large, its results are scaled down. MSE also differs from the other indicators in that smaller values are better.
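For reference, the Adjusted Rand Index and MSE indicators can be computed as in the following Python sketch (the paper's experiments were run in MATLAB; here we use scikit-learn's adjusted_rand_score purely for illustration, and labels are assumed to index the rows of centers):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def mse(Z, centers, labels):
    """Mean squared distance of each point to its assigned cluster center."""
    return np.mean(np.sum((Z - centers[labels]) ** 2, axis=1))

# ari = adjusted_rand_score(true_labels, labels)  # higher is better; for MSE, smaller is better
```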

In this paper, an experimental system is used to evaluate the effectiveness of the proposed approach. A wireless information collection system for field soil temperature and humidity is used in this work to obtain the real data sets. The measurement system is shown in Figure 3. The field soil temperature is measured by the TM-100 and the humidity by the SP40A; the TM-100 measuring range is from −20°C to 60°C, and the SP40A measuring range is from 0% to 100%. The JN5148 is used for receiving and transmitting the soil temperature and humidity data. The data is then controlled and recorded digitally by a PC. The measurements were performed in four fields, so the recorded data form four real data sets. The collecting places are shown in Figure 4. Table 1 shows the measurement data, which form the artificial data set AR. The two-dimensional distribution of the artificial data AR is shown in Figure 5.

The AR artificial data set is clustered by the algorithm proposed in this paper. The multidimensional diffusion density distribution obtains the initial data clusters: the standard value ρ_0 of AR is 37, and the standard value δ_0 of AR is 2.97. The data samples with ρ bigger than ρ_0 and δ bigger than δ_0 are put into collection C, and the elements in C are the initial cluster centers of the initial data. Using the multidimensional diffusion density distribution we obtain eleven initial data clusters. The results are shown in Figure 6.

The optimal clustering result is obtained by using the multiobjective clustering approach based on dynamic cumulative entropy. The information and the distance among the initial data clusters are obtained by the proposed formula.

The distance-information decision graph is shown in Figure 7. From the graph we can clearly see that four of the initial data clusters have low uncertainty and a large distance among them. So we first select the respective average values of five initial data clusters as the initial cluster centers with K = 5 and calculate the minimum of J(5). Then we select the respective average values of four initial data clusters as the initial cluster centers with K = 4 and calculate the minimum of J(4). From the proposed formula, J(4) < J(5), so the optimal number of clusters is 4. The optimal clustering result is shown in Figure 8.

We get the cluster centers of AR: (10.2, 26.2), (23.7, 26.4), (17.8, 9.1), and (36, 10.5). The real cluster centers are (10, 25), (24, 25), (18, 9), and (35, 10). The similarities between our cluster centers and the real cluster centers are 95.582%, 97.628%, 99.337%, and 96.968%. The average similarity of the cluster centers is 94.72%, the clustering accuracy is 92.14%, the Adjusted Rand Index is 0.7840, and the MSE (Mean Squared Error) is 3.82.

The AR artificial data set is also clustered by the algorithms of [11, 15]. The comparisons of cluster center similarity, clustering accuracy, Adjusted Rand Index, and MSE between our algorithm and those of [11, 15] on the AR data set are shown in Figures 9, 10, 11, 12, and 13.

The comparison of the similarity of the initial cluster centers computed by our algorithm and by [11, 15] on the AR data set is shown in Figure 9. An algorithm with a high value has made a good selection of cluster centers, and our algorithm achieves the highest similarity of the initial cluster centers. What is more, for the other measures used for clustering validation, namely, accuracy, Adjusted Rand Index, MSE, and BIC, our algorithm performs better than the other algorithms.

Obviously, our algorithm can find the optimal clustering number and the optimal cluster centers to get the best clustering result, which improves the accuracy of clustering.

3.2. Experimental Results and Comparative Analysis of UCI Data Sets

In order to verify the effectiveness of our algorithm, this paper selects 8 UCI data sets to carry out the experiment: Iris, Soybean-small, Segmentation, WDBC, Pima Indians Diabetes, Wine, Ionosphere, and Yeast. Clustering results are evaluated using the clustering time, clustering accuracy, the Adjusted Rand Index, and MSE. Table 2 shows the comparison of the average running time of the clustering algorithms. Figures 14, 15, 16, and 17 compare the clustering accuracy, the Adjusted Rand Index, the MSE, and the BIC of our algorithm and of [11, 15].

According to the clustering results on the UCI data, we can see that, for all the tested data sets, the proposed algorithm obtains improved and consistent clusters in comparison to [11, 15]. Its accuracy on complicated data sets is also high. However, the algorithm of [11] is quicker than the proposed algorithm.

4. Conclusion

We have presented an iterative clustering algorithm that requires only a small amount of prior knowledge for nonuniform sparse data. The procedure is based on dynamic cumulative entropy. Because outliers are more susceptible to a change in cluster membership, we propose the multidimensional diffusion density distribution for computing initial cluster centers, which generates initial data clusters that may outnumber the desired clusters. So we need to cluster again on the basis of them to get accurate clustering results, using the multiobjective clustering approach based on dynamic cumulative entropy. Experimental results show improved and consistent cluster structures. Moreover, the experimental results show that, compared with full search, our proposed method improves the clustering accuracy by about 5 percent on the nonuniform sparse data set of field soil temperature and humidity. The superiority of our method over other algorithms is more remarkable when a data set with higher dimensionality and a larger number of clusters is used. Because our method handles clustering with little prior knowledge, it has high computational complexity.

Following this paper, several interesting problems deserve to be explored, for example, how to reduce the computation and enhance the speed of the algorithm while keeping a similar approximation ratio. Generalizing our method to the setting of arbitrary partitions is also challenging but useful in some scenarios.

Competing Interests

The authors declare that they have no competing interests.