Mathematical Problems in Engineering

Volume 2016 (2016), Article ID 5707692, 10 pages

http://dx.doi.org/10.1155/2016/5707692

## Nonuniform Sparse Data Clustering Cascade Algorithm Based on Dynamic Cumulative Entropy

Beijing University of Posts and Telecommunications, Electronic Engineering School, Beijing, China

Received 15 June 2016; Revised 23 September 2016; Accepted 9 October 2016

Academic Editor: Rita Gamberini

Copyright © 2016 Ning Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

A small amount of prior knowledge and randomly chosen initial cluster centers have a direct impact on the accuracy of iterative clustering algorithms. In this paper we propose a new algorithm that computes the initial cluster centers for k-means clustering and the best number of clusters with little prior knowledge, and optimizes the clustering result. It constructs a Euclidean distance control factor based on aggregation density sparse degree to select the initial cluster centers of nonuniform sparse data and obtains initial data clusters by multidimensional diffusion density distribution. A multiobjective clustering approach based on dynamic cumulative entropy is adopted to optimize the initial data clusters and the number of clusters. The experimental results show that the newly proposed algorithm obtains good initial cluster centers for the k-means algorithm and effectively improves the clustering accuracy of nonuniform sparse data by about 5%.

#### 1. Introduction

Clustering is an important discovery technique of exploratory data mining and a common technique for statistical data analysis. Iterative clustering is one family of clustering algorithms, and k-means is the most popular and fastest iterative clustering method. Because of its simplicity, the k-means algorithm is used in many fields, including machine learning, medicine, image analysis, pattern recognition, information retrieval, bioinformatics, and computer science. For example, in the medical field, cancer genomics [1], cell signaling [2], and viral genome [3] studies use k-means as a data analysis tool; in bioinformatics, bioanalytical chemistry [4], the vibrational spectra of biomolecules [5], and nervous system studies [6] use k-means to mine latent information; in image analysis, imaging techniques [7] use k-means to partition a given set of points into homogeneous groups; in pattern recognition, an automatic system for imbalance diagnosis in wind turbines [8] uses k-means to suggest the optimum number of groups. Reference [9] uses k-means to analyze network data, and [10] generates profiles with k-means to group together days with a similar pattern of request arrivals.

Although the k-means algorithm has been applied to a wide range of problems, it has three major drawbacks:

(1) The number of clusters k must be specified by the user in advance. In practice, with little prior knowledge, a suitable k value is generally difficult to determine.

(2) It is sensitive to the selection of the initial cluster centers; k-means produces different results for different initial centers. With randomly chosen initial centers, the resulting populations are often composed exclusively of low-quality individuals.

(3) The k-means algorithm performs poorly on nonuniform sparse data.

To overcome these drawbacks, evolutionary algorithms such as GA, TS, and SA have been introduced. Kao et al. proposed a hybrid technique built on the k-means algorithm [11]. Bahmani Firouzi et al. introduced a hybrid evolutionary algorithm combining PSO, SA, and k-means to find the optimal solution [12]. Niknam and Amiri proposed a hybrid algorithm based on fuzzy adaptive PSO, ACO, and k-means for cluster analysis [13]. Niknam et al. proposed a novel algorithm combining k-means with the Modified Imperialist Competitive Algorithm [14]. Evolutionary algorithms, however, require large amounts of data to learn from, and many real-world problems are black boxes for which no sufficient data about their internals is available.

In addition, to address the selection of initial cluster centers, Bianchi et al. proposed two density-based k-means initialization algorithms for nonmetric data clustering [15]. Tunali et al. proposed an improved clustering algorithm for text mining, multicluster spherical k-means [16]. Tvrdík and Křivý proposed a new algorithm combining differential evolution and k-means [17]. Rodriguez and Laio proposed selecting initial cluster centers by density peaks. Khan and Ahmad [18] proposed a cluster center initialization algorithm for k-means clustering. However, these methods are computationally laborious.

A small amount of prior knowledge and randomly chosen initial cluster centers thus have a direct impact on the accuracy of iterative clustering algorithms. In this paper, we propose a new algorithm for nonuniform sparse data clustering based on cascaded entropy increase and decrease. It designs a Euclidean distance control factor based on aggregation density sparse degree, determines the initial cluster centers of nonuniform sparse data, and groups initial data clusters by multidimensional diffusion density distribution. A multiobjective clustering approach is adopted to compensate for the clustering error of the initial data clusters. The experimental results show that the new algorithm effectively improves the clustering accuracy of nonuniform sparse data.

#### 2. Nonuniform Sparse Data Clustering Cascade Algorithm Based on Dynamic Cumulative Entropy

To obtain optimal clustering results for nonuniform sparse data, we use the multidimensional diffusion density distribution to obtain the initial data clusters, and we put forward a Euclidean distance control factor based on aggregation density sparse degree to address the fact that multidimensional data are easily misjudged. The initial data clusters outnumber the real clusters, so we select the optimal initial cluster centers with a decision graph and then execute k-means on the complete data set based on the multiobjective clustering approach.

##### 2.1. Initial Data Clustering Using Multidimensional Diffusion Density Distribution

In iterative clustering algorithms, choosing the initial cluster centers is extremely important, as they have a direct impact on the formation of the final clusters. Selecting samples that lie far from the normal samples as initial centers is dangerous. In this paper, we first define the multidimensional diffusion density distribution of the samples and the comprehensive distance, from which we obtain the initial data clusters.

###### 2.1.1. Multidimensional Diffusion Data Normalization

Different attributes of multidimensional data have different units of measurement and value ranges, which seriously affects cluster formation. To avoid this, we first normalize the data so that all attributes carry the same weight. Let $X = \{x_1, x_2, \ldots, x_n\}$ be the set of $n$ data elements, each described by $m$ numeric attributes, and let $x_{if}$ be the measured value of the $f$-th attribute of the $i$-th data element. Normalization is done as follows. Calculate the mean absolute deviation of attribute $f$:

$$s_f = \frac{1}{n}\sum_{i=1}^{n}\left|x_{if} - m_f\right|, \qquad m_f = \frac{1}{n}\sum_{i=1}^{n} x_{if},$$

where $m_f$ is the mean of attribute $f$. The standardized measurement is then

$$z_{if} = \frac{x_{if} - m_f}{s_f}.$$
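As a minimal sketch (function and variable names are our own), the normalization above can be written with NumPy:

```python
import numpy as np

def mad_normalize(X):
    """Standardize each attribute f: z_if = (x_if - m_f) / s_f,
    where s_f is the mean absolute deviation of attribute f."""
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)                 # attribute means m_f
    s = np.abs(X - m).mean(axis=0)     # mean absolute deviations s_f
    s[s == 0] = 1.0                    # guard against constant attributes
    return (X - m) / s
```

After this step all attributes share the same scale, so no single attribute dominates the Euclidean distances used below.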

###### 2.1.2. Multidimensional Diffusion Density Distribution

In iterative clustering algorithms, the function adopted for density is the cut-off kernel [19]:

$$\rho_i = \sum_{j \neq i} \chi\left(d_{ij} - d_c\right), \qquad \chi(x) = \begin{cases} 1, & x < 0, \\ 0, & x \geq 0, \end{cases}$$

where $d_c$ is the cut-off distance determined by the user.

Here $x_i$ is the $i$-th data element with $m$ attributes and $x_j$ is the $j$-th; $d_{ij}$ measures the Euclidean distance between $x_i$ and $x_j$. Hence $\rho_i$ is the number of data elements whose distance to $x_i$ is less than $d_c$.
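A minimal sketch of the cut-off kernel density (naming is our own), counting for each point its neighbors within $d_c$:

```python
import numpy as np

def cutoff_density(X, d_c):
    """Cut-off kernel: rho_i = |{ j != i : d_ij < d_c }|."""
    X = np.asarray(X, dtype=float)
    # pairwise Euclidean distances d_ij
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return (d < d_c).sum(axis=1) - 1   # subtract 1 to exclude the point itself
```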

The above density function has shortcomings when clustering multidimensional data: some data are misjudged because the function ignores differences among attributes. If a few attributes of a data element change greatly while its other attributes remain close to those of other data, the mutated element can be misjudged into a cluster it does not resemble, because the changed attributes are ignored. To solve this problem, we put forward the Euclidean distance control factor based on aggregation density sparse degree, constructed from the attribute along which $x_i$ and $x_j$ have the maximum distance and the attribute along which they have the minimum distance.

According to the above views, we propose the optimized density formula, in which each pairwise distance is weighted by the Euclidean distance control factor before the cut-off kernel is applied.

Formula (6) distinguishes among different attributes through the Euclidean distance control factor based on aggregation density sparse degree. Attributes with large differences are given greater weight when computing the density, which reduces the risk of misjudged data. The optimized density is the multidimensional diffusion density. From its average and its standard deviation we obtain the standard value of the multidimensional diffusion density, and the data whose multidimensional diffusion density is larger than this standard value are collected into a candidate set.

Next, the distance between each pair of data elements in the candidate set is computed. From the average and the standard deviation of these distances we obtain the standard value of distance, and the data whose distance is larger than this standard value form the final candidate collection. Reference [19] shows that initial cluster centers are characterized by high density and large separation, so the data in this collection are likely to be initial cluster centers; we choose them as the initial cluster centers of the initial data clusters.
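The candidate selection can be sketched as follows; note that the "standard value" is assumed here to be the mean plus one standard deviation, and the two-stage filtering above is collapsed into a single joint test, both of which are assumptions of this sketch:

```python
import numpy as np

def candidate_centers(rho, delta):
    """Pick candidates whose density rho AND separation delta both
    exceed their standard value (assumed: mean + one std)."""
    rho = np.asarray(rho, dtype=float)
    delta = np.asarray(delta, dtype=float)
    rho0 = rho.mean() + rho.std()          # standard value of density
    delta0 = delta.mean() + delta.std()    # standard value of distance
    return np.where((rho > rho0) & (delta > delta0))[0]
```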

###### 2.1.3. Obtaining Initial Data Clusters by Multidimensional Diffusion Density Distribution

In this subsection we present the execution steps of our proposed initial data clustering using multidimensional diffusion density distribution for k-means clustering. The algorithm consists of two parts: the first part determines the initial cluster centers of the initial data clusters, and the second part groups the data.

Let $X = \{x_1, \ldots, x_n\}$ be the set of data elements, each described by $m$ numeric attributes.

(1) Compute the multidimensional diffusion density of each data element, along with its mean, its standard deviation, and the standard density value.

(2) Choose the data whose density is larger than the standard value and put them into a candidate collection.

(3) Compute the distances among the data elements in the candidate collection, along with the mean, standard deviation, and standard value of these distances.

(4) Choose the data whose distance is larger than the standard value and put them into the collection of initial cluster centers; the size of this collection is the number of initial data clusters.

$d(x_i, c_k) = \left\| x_i - c_k \right\|$ measures the Euclidean distance between a pattern $x_i$ and its cluster center $c_k$ in the collection.

The k-means algorithm minimizes the objective function [18]:

$$J_{MSE} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \left\| x_i - c_k \right\|^2.$$

The k-means algorithm then groups the data iteratively as follows. Choose the data in the collection as the initial cluster centers; their number $K$ is the number of initial data clusters. Assign each pattern to one of the $K$ clusters according to the minimum distance from the cluster centers. Then calculate the new centers as

$$c_k = \frac{1}{n_k} \sum_{x_i \in C_k} x_i,$$

where $n_k$ is the number of data elements assigned to the $k$-th cluster. Repeat the previous steps until there is no change in the cluster centers. The resulting clusters form the set of initial data clusters.
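The iteration above is the standard k-means loop; a compact sketch, starting from given initial centers:

```python
import numpy as np

def kmeans_from_centers(X, centers, max_iter=100):
    """Run the k-means iteration from the given initial cluster centers."""
    X = np.asarray(X, dtype=float)
    c = np.asarray(centers, dtype=float).copy()
    for _ in range(max_iter):
        # assign every pattern to its nearest cluster center
        labels = np.linalg.norm(X[:, None] - c[None, :], axis=-1).argmin(axis=1)
        # recompute each center as the mean of its assigned data
        new_c = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                          else c[k] for k in range(len(c))])
        if np.allclose(new_c, c):          # stop when centers no longer change
            break
        c = new_c
    return labels, c
```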

##### 2.2. Multiobjective Clustering Approach Based on Dynamic Cumulative Entropy

Obtaining the initial data clusters by multidimensional diffusion density distribution is only the primary data processing. To get accurate clustering results we must cluster again on this basis, with the initial data clusters as the basic elements.

###### 2.2.1. Multiobjective Clustering Function Based on Dynamic Cumulative Entropy

In the k-means algorithm the objective function is $J_{MSE}$. But it is not suitable for optimizing the primary data clusters or for determining $K$, the number of clusters, because $J_{MSE}$ decreases monotonically as the number of clusters increases. When $J_{MSE}$ is driven to the global minimum, the initial cluster centers are pulled apart from their data until every single data element forms its own cluster. The proposed objective function therefore incorporates the information of each initial data cluster during the deepening clustering process.

This paper addresses the above problems by studying the cluster structure with informational entropy theory and the principle of maximum informational entropy. We propose a multiobjective clustering function based on dynamic cumulative entropy to determine the final clustering results. First, we define the information entropy of the clustering:

$$H = -\sum_{i=1}^{K} \frac{n_i}{n} \ln \frac{n_i}{n},$$

where $n_i$ is the number of data elements in the $i$-th initial data cluster, $n$ is the number of data elements to be clustered, and $K$ is the number of clusters.

According to the principle of maximum informational entropy, if all data are concentrated in one cluster and the other clusters have no elements, the information entropy is $H = 0$. When the clusters are stable and the numbers of elements in the different clusters are very similar, the information entropy reaches its maximum $H_{\max} = \ln K$. Based on the information entropy formula (14), we define the cluster structure equilibrium degree:

$$E = \frac{H}{H_{\max}}.$$

$E$ is the ratio of the time-varying entropy to the maximum entropy and shows the balance degree of the clustering, with $0 \le E \le 1$. When $E = 0$, the clustering is in its most uneven condition; when $E = 1$, the clusters are in the ideal equilibrium state.

The degree of equilibrium gain reflects the clusters' occupancy of the data.
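The entropy and equilibrium degree above can be sketched as follows (assuming $K > 1$; naming is our own):

```python
import numpy as np

def equilibrium_degree(sizes):
    """E = H / H_max with H = -sum(p_i ln p_i) and H_max = ln K."""
    sizes = np.asarray(sizes, dtype=float)
    p = sizes / sizes.sum()
    p = p[p > 0]                          # empty clusters contribute nothing
    H = -(p * np.log(p)).sum()
    return H / np.log(len(sizes))         # K = number of clusters
```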

Finally, we obtain the multiobjective clustering function based on dynamic cumulative entropy, which combines $J_{MSE}$, the entropy terms defined above, and the average distance among the clusters.

The multiobjective clustering function considers not only the influence of the distances between the cluster centers and their data on the clustering results, but also the amount of information. $J_{MSE}$ decreases monotonically as these distances decrease; on the contrary, the entropy term increases monotonically as the number of clusters grows. The entropy term therefore acts as a restricting factor that prevents the number of clusters from exceeding the real one while $J_{MSE}$ decreases. This improves the clustering accuracy and reliability when the number of clusters is not known in advance.

###### 2.2.2. Multiobjective Clustering Approach Based on Dynamic Cumulative Entropy

The initial data clusters obtained by multidimensional diffusion density distribution need to be clustered again. The initial cluster centers are determined anew by the distance-information decision graph.

The less information a cluster has, the smaller its uncertainty is and the more likely it is to become a final cluster; at the same time, the global classification is more stable. The information of the $i$-th initial data cluster is its contribution to the information entropy, $-\frac{n_i}{n} \ln \frac{n_i}{n}$.

Calculate the distances among the initial data clusters as the Euclidean distances between their cluster centers.

By the nature of clustering, a final cluster's uncertainty should be small and the distance among clusters should be large. Accordingly, we can draw a distance-information decision graph to determine $K$ and the initial cluster centers. Take two-dimensional data as an example, as shown in Figure 1: the multidimensional diffusion density distribution yields eleven initial data clusters. Figure 2 is the distance-information decision graph, with the inverse of the information of each initial data cluster on the horizontal axis and the distance among initial data clusters on the vertical axis.
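The decision-graph coordinates can be sketched as follows, under the assumption (ours, following the density-peaks convention) that each cluster's distance coordinate is its minimum distance to any other initial cluster center:

```python
import numpy as np

def decision_graph_coords(info, centers):
    """x-axis: 1 / information of each initial cluster;
    y-axis: each cluster's minimum distance to another cluster center
    (taking the minimum is an assumption of this sketch)."""
    info = np.asarray(info, dtype=float)
    c = np.asarray(centers, dtype=float)
    d = np.linalg.norm(c[:, None] - c[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # ignore self-distance
    return 1.0 / info, d.min(axis=1)
```

Clusters that land in the upper-right of this graph (low information, large separation) are the natural choices for the final initial cluster centers.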