Abstract

As an important data analysis method in data mining, clustering analysis has been researched extensively and in depth. To address the sensitivity of the k-means clustering algorithm to the distribution of its initial clustering centers, the Glowworm Swarm Optimization (GSO) algorithm is introduced to solve clustering problems. Firstly, this paper introduces the basic ideas of the GSO algorithm, the k-means algorithm, and the good-point set and analyzes the feasibility of combining them for clustering optimization. Next, it designs a clustering method based on an improved GSO algorithm with a good-point set, which combines the GSO algorithm with the classical k-means algorithm: the improved GSO algorithm searches the data object space and provides initial clustering centers for the k-means algorithm, which yields better clustering results. The major improvement to the GSO algorithm is to optimize the initial distribution of the glowworm swarm by introducing the theory and method of the good-point set. Finally, the new clustering algorithm is tested on UCI data sets with different numbers of categories and samples. Comparison and analysis show the advantages of the improved clustering algorithm in terms of sum of squared errors (SSE), clustering accuracy, and robustness.

1. Introduction

As an unsupervised data analysis method, clustering analysis is widely applied in fields such as data mining, pattern recognition, machine learning, and artificial intelligence [1]. Unlike classification, a clustering algorithm groups data objects according to a similarity metric and a clustering criterion without any prior knowledge. As a branch of statistics, clustering analysis has been studied extensively. Clustering methods can be mainly classified into partition-based, hierarchical, and density-based methods. The k-means algorithm proposed by James MacQueen is a typical partition-based clustering algorithm [2]. However, its clustering result is greatly affected by the initial clustering center points and is very sensitive to outliers. Literature [3] optimizes the k-means algorithm by integrating the encoding, crossover, and mutation operations of the genetic algorithm (GA) with the local optimizing ability of k-means clustering and proposes a k-means clustering algorithm based on GA. Hierarchical clustering methods mainly include the CURE algorithm [4] and the Chameleon algorithm [5]; in CURE, one cluster is represented by multiple points, which improves the processing of nonspherical data sets. Representative density-based clustering methods include the DBSCAN algorithm [6], which can effectively identify clusters of any shape but is very sensitive to manually set parameters (e.g., the radius). Rodriguez and Laio put forward a new density-based algorithm, density peaks clustering (DPC) [7], in 2014. In this algorithm, density peaks (i.e., clustering centers) are first selected manually through a "decision diagram", and the remaining data points are then allocated to the clustering centers to obtain the clustering result.
It is noteworthy that, in recent years, some scholars have started introducing heuristic swarm optimization algorithms into clustering analysis in different fields, improving the clustering effect by virtue of their global searching ability. Literature [8] proposes a clustering analysis method that combines PSO and k-means, exploiting the global searching ability of the particle swarm algorithm. In addition, the cuckoo algorithm, the artificial bee colony algorithm, the artificial fish swarm algorithm, and so forth [9–12] have also been introduced into clustering research.

The GSO algorithm [13] proposed by Krishnanand and Ghose is a swarm intelligence optimization algorithm that is more efficient in solving multimodal problems than traditional swarm intelligence optimization algorithms [14]. Aljarah and Ludwig put forward a clustering-based GSO algorithm in 2013, in which the GSO algorithm is adapted to the data clustering problem to locate multiple optimal centroids [15]. A new approach to cluster analysis based on the GSO algorithm and k-means has been proposed by Onan and Korukoglu [16]. In view of the multimodal nature of multimedia data, Pushpalatha and Ananthanarayana proposed the GSO-based Multimedia Document Clustering (GSOMDC) algorithm in 2015 to group multimedia documents into topics [17]. A fuzzy clustering algorithm based on the GSO algorithm (GSO-KFCM) was proposed by Cheng and Bao in 2017; in this algorithm, the optimal solution obtained by the GSO algorithm serves as the initial clustering center of the kernelized fuzzy c-means clustering algorithm [18].

This paper introduces the GSO algorithm into clustering analysis: it regards each glowworm as a feasible solution for a clustering center in the data object space, searches the data object space through the optimization process of the glowworms, and determines the clustering centers by locating multiple extreme points. In this way, it combines the GSO algorithm with the k-means algorithm, provides initial clustering centers for the k-means algorithm by means of the GSO algorithm, addresses the sensitivity of the k-means algorithm to initial clustering centers, and thus obtains better clustering results. Meanwhile, considering the effect of the initial distribution of the glowworm swarm on the clustering center search, this paper optimizes the initial glowworm swarm distribution in the GSO algorithm by introducing the theory and method of the good-point set [19, 20], which improves the global searching performance of the GSO algorithm. The remainder of this paper consists of three main parts. Section 2 explains the relevant algorithms and theories and puts forward the optimization idea for a clustering-oriented GSO algorithm. Section 3 introduces the improved GSO algorithm based on the good-point set, combines it with the k-means algorithm, and designs the algorithm framework and implementation steps of the new clustering method (the GSOK_GP algorithm). Section 4 selects UCI data sets with different numbers of categories and samples for clustering experiments and analysis of the GSOK_GP algorithm.

2. Description of Relevant Algorithms

2.1. k-Means Clustering Algorithm
2.1.1. Basic Ideas of k-Means Clustering Algorithm

The basic idea of the k-means clustering algorithm is as follows: select k data points at random from the data objects to be clustered to act as initial clustering center points, and allocate the other data points to the clustering center points according to their similarity to them. After one round of allocation, recalculate the clustering center of each category based on the clustering result of that round, and then reallocate the data points to obtain the clustering result of the new round. Repeat this process a given number of times or until the clustering center points converge.

2.1.2. Steps of k-Means Clustering Algorithm

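(1) Problem Description. $X = \{x_1, x_2, \ldots, x_N\}$ represents a given data object, where $x_i$ represents a data vector point. Divide $X$ into several disjoint clusters $C_1, C_2, \ldots, C_k$, where $\bigcup_{i=1}^{k} C_i = X$ and $C_i \cap C_j = \emptyset$ for $i \neq j$.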

(2) Related Definitions

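Definition 1. Euclidean distance between data points $x_i$ and $x_j$ in $s$-dimensional space:

$$d(x_i, x_j) = \sqrt{\sum_{l=1}^{s} (x_{il} - x_{jl})^2}.$$

Definition 2. SSE of clustering results, taken in this paper as the sum of the Euclidean distances from all data objects to their cluster center points:

$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, c_i),$$

where $c_i$ is the cluster center of $C_i$. SSE is generally taken as an important indicator for evaluating clustering results.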

(3) Implementation Steps of -Means Algorithm

Step 1. Randomly select k samples as initial clustering centers.

Step 2. Allocate the other data points in the data object to the existing clustering centers according to a given principle (e.g., shortest Euclidean distance).

Step 3. Recalculate each clustering center as per the clustering result: $c_i' = \frac{1}{|C_i|} \sum_{x \in C_i} x$, where $x$ is a data point allocated to clustering center point $c_i$.

Step 4. If $c_i' \neq c_i$, that is, the new clustering center is different from the original one, return to Step 2 and iterate again until the clustering center points converge or the maximum number of iterations is reached.

It can be seen from the steps above that the initial clustering centers have a significant effect on the clustering result and operating efficiency of the k-means clustering algorithm. A poor choice may trap the algorithm in a premature local optimum, so that different runs produce clustering results with large differences.
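The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this paper: the function names, the `init_centers` argument, and the handling of empty clusters are assumptions made for the example.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_means(points, k, init_centers=None, max_iter=100, seed=0):
    """Plain k-means; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 1: random initial centers unless the caller supplies them.
    centers = [list(c) for c in (init_centers or rng.sample(points, k))]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: allocate every point to its nearest center.
        labels = [min(range(k), key=lambda i: euclidean(p, centers[i]))
                  for p in points]
        # Step 3: recalculate each center as the mean of its members.
        new_centers = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[i])
        # Step 4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

Passing different `init_centers` (or different `seed` values) to `k_means` on the same data is a simple way to observe the sensitivity to initial centers discussed above.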

2.2. Main Ideas and Steps of GSO Algorithm

In the GSO algorithm, each glowworm is regarded as a feasible solution of the target problem in the solution space. Glowworms gather toward brighter glowworms through mutual attraction and movement, so that multiple extreme points are found in the solution space of the target problem. In this way, the problem is solved. The main steps can be described as follows.

Step 1. Initialize the glowworm swarm. The number of glowworms $n$, step size $\mathit{step}$, initial fluorescein value $l_0$, fluorescein volatilization rate $\rho$, domain change rate $\beta$, initial decision domain value $r_0$, domain threshold $n_t$, and other related parameters need to be initialized and assigned.

Step 2. Calculate glowworm fitness based on the objective function. Calculate the fitness of each glowworm at its location $x_i(t)$ based on the specific objective function $J(x_i(t))$.

Step 3. Calculate the moving direction and step of each glowworm. Each glowworm searches for glowworms with a higher fluorescein value within its own decision radius $r_d^i(t)$ and determines the next moving direction and step based on fluorescein value and distance.

Step 4. Update glowworm locations. Update the location of each glowworm based on determined moving direction and step.

Step 5. Update the decision domain radius of glowworm.

Step 6. Judge whether the algorithm has converged or reached the maximum iterations (itmax) and determine whether to enter the next round of iteration.

It can be seen from the steps above that the search starts from the initial glowworm locations. Optimizing the initial distribution of the glowworm swarm can therefore improve execution efficiency and help the algorithm avoid premature convergence to a local optimum.

2.3. Basic Theory of Good-Point Set

Basic definition and structure of good-point set are as follows:

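(1) Assume $G_s$ is a unit cube in $s$-dimensional Euclidean space, which is expressed as

$$G_s = \{x = (x_1, x_2, \ldots, x_s) : 0 \le x_i \le 1,\ i = 1, 2, \ldots, s\}.$$

(2) Assume $P_n(k)$ is a point set with $n$ points in $G_s$, which is expressed as

$$P_n(k) = \{(x_1^{(n)}(k), x_2^{(n)}(k), \ldots, x_s^{(n)}(k)) : 1 \le k \le n\}, \quad 0 \le x_i^{(n)}(k) \le 1,$$

where the index $k$ enumerates the points of the set.

(3) Assume $r = (r_1, r_2, \ldots, r_s)$ is a given point in $G_s$ and $N_n(r)$ is the number of points in point set $P_n(k)$ satisfying the inequality below:

$$0 \le x_i^{(n)}(k) < r_i, \quad i = 1, 2, \ldots, s.$$

$$\varphi(n) = \sup_{r \in G_s} \left| \frac{N_n(r)}{n} - |r| \right|,$$

where $|r| = r_1 r_2 \cdots r_s$, and $\varphi(n)$ is known as the deviation of point set $P_n(k)$.

(4) Assume $r \in G_s$ and $\varphi(n)$ is the deviation of $P_n(k) = \{(\{r_1 k\}, \{r_2 k\}, \ldots, \{r_s k\}) : 1 \le k \le n\}$, where $\{a\}$ denotes the fractional part of $a$, and $\varphi(n)$ meets the requirement below:

$$\varphi(n) = C(r, \varepsilon)\, n^{-1+\varepsilon},$$

where $C(r, \varepsilon)$ is a constant related to $r$ and $\varepsilon$, $\varepsilon > 0$.

Then $P_n(k)$ is known as a good-point set and $r$ a good point.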

It has been proved by the relevant theorems that, with respect to approximate integration, the order of the deviation depends only on $n$ and is independent of the space dimension of the sample. Therefore, the good-point set can provide better support for calculation in high-dimensional spaces [20]. Meanwhile, for a point set object whose distribution is unknown, the deviation of points obtained from a good-point set is significantly smaller than that of points obtained by the random method. On the basis of this feature, the good-point set can provide a better initial distribution scheme for the swarm in a swarm intelligence algorithm.

3. Design of GSOK_GP Algorithm

On the basis of the analysis of the relevant algorithms above and the characteristics of clustering problems, this paper proposes an improved GSO algorithm based on the good-point set to solve clustering problems. Its main ideas can be described as follows. Firstly, optimize the initial distribution of the glowworm swarm through the good-point set, so as to optimize the GSO algorithm. Secondly, search for initial clustering centers in the data objects to be clustered with the optimized GSO algorithm, exploiting its ability to locate multiple extreme points, and obtain a clustering center point set. Thirdly, select k extreme points from the clustering center point set as per the maximum distance principle to act as the initial clustering centers of the k-means algorithm. Fourthly, execute the k-means algorithm with these initial clustering centers to obtain the clustering result. The algorithm framework is shown in Figure 1, in which one condition checks that the number of iterations is no greater than the maximum number of iterations and the other checks that the number of extreme points is greater than the number k of initial clustering centers required.

3.1. Initial Swarm Optimization Based on Good-Point Set

In essence, optimizing the initial distribution of the glowworm swarm means making the swarm represent the characteristics of the solution space more faithfully. A randomly generated glowworm swarm cannot cover all regions of the solution space in most cases. Therefore, distributing the glowworm swarm uniformly in the solution space is an effective strategy, and a more uniform distribution can be realized with the theory and method of the good-point set described above.

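Assume the number of glowworms in the initial swarm is n, so that n points in s-dimensional space are needed as glowworm locations. These points are taken from a good-point set of n points in s-dimensional space constructed with good-point set theory. There are mainly three methods for choosing the good point $r = (r_1, r_2, \ldots, r_s)$, where $\{a\}$ denotes the fractional part of $a$:

(1) Square root sequence method: $r_j = \{\sqrt{p_j}\}$, $1 \le j \le s$, where $p_j$ are different primes.

(2) Cyclotomic field method: $r_j = \{2\cos(2\pi j / p)\}$, $1 \le j \le s$, where $p$ is the smallest prime meeting $(p - 3)/2 \ge s$.

(3) Exponential sequence method: $r_j = \{e^{j}\}$, $1 \le j \le s$.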

For given values of $n$ and $s$, construct the good-point set (i.e., the initial glowworm swarm distribution) with the exponential sequence method. Figures 2 and 3 show the distribution of the data points (glowworms) under random generation and under the exponential sequence method, respectively.

The comparison between Figures 2 and 3 indicates that the data points generated by the exponential sequence method are distributed more uniformly and cover the solution space better. In the meantime, the structure of the good-point set is more stable; that is, the distribution is the same whenever $n$ is unchanged. Therefore, a better initial glowworm distribution can be obtained by applying the good-point set to the initial glowworm swarm distribution.
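The three construction methods can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names and the optional `lower`/`upper` bounds (used to scale the unit cube onto the range of a data set) are hypothetical and are not taken from this paper. The k-th point of the set is taken as $(\{r_1 k\}, \ldots, \{r_s k\})$.

```python
import math


def frac(x):
    """Fractional part {x}."""
    return x - math.floor(x)


def is_prime(m):
    return m >= 2 and all(m % d for d in range(2, math.isqrt(m) + 1))


def good_point(s, method="exponential"):
    """Return the generating point r = (r_1, ..., r_s) for the chosen method."""
    if method == "exponential":
        return [math.exp(j) for j in range(1, s + 1)]
    if method == "square_root":
        primes, cand = [], 2
        while len(primes) < s:  # first s distinct primes
            if is_prime(cand):
                primes.append(cand)
            cand += 1
        return [math.sqrt(p) for p in primes]
    if method == "cyclotomic":
        p = 2 * s + 3  # smallest prime with (p - 3) / 2 >= s
        while not is_prime(p):
            p += 1
        return [2 * math.cos(2 * math.pi * j / p) for j in range(1, s + 1)]
    raise ValueError("unknown method: " + method)


def good_point_set(n, s, method="exponential", lower=None, upper=None):
    """n points of a good-point set, optionally scaled to [lower, upper]."""
    r = good_point(s, method)
    lower = lower or [0.0] * s
    upper = upper or [1.0] * s
    return [[lower[j] + frac(r[j] * k) * (upper[j] - lower[j])
             for j in range(s)]
            for k in range(1, n + 1)]
```

Because no random numbers are involved, calling `good_point_set` twice with the same `n`, `s`, and method returns the same points, which matches the stability property noted above.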

3.2. Flow of GSOK_GP Algorithm

Glowworm individuals are regarded as feasible solutions for clustering center points when the improved glowworm algorithm is combined with the k-means algorithm to solve clustering problems. Since clustering center points are surrounded by data points of the data objects, the density at a clustering center point is an extreme value of the data point density within a domain. Therefore, the density of a glowworm individual in the data object set is taken as its fitness, and a superior initial clustering center point set is obtained by letting the glowworms search for density extreme values. The main algorithm flow is as follows.

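Step 1. Initialize the glowworm swarm based on the good-point set. For the data set $X$ to be clustered, initialize and assign the number of glowworms $n$, the initial locations of the glowworms, step size $\mathit{step}$, initial fluorescein value $l_0$, fluorescein volatilization rate $\rho$, domain change rate $\beta$, initial decision domain value $r_0$, domain threshold $n_t$, and other related parameters within the region of Euclidean space bounded by $X$.

Step 2. Calculate glowworm fitness, namely, the number of data points of data set $X$ whose distance from the glowworm lies within a given radius.

Step 3. Update glowworm fluorescein:

$$l_i(t) = (1 - \rho)\, l_i(t-1) + \gamma J(x_i(t)),$$

where $l_i(t)$ represents the fluorescein value of glowworm $i$ in round $t$ of iteration, $\gamma$ is the fluorescein update rate, and $J(x_i(t))$ is the fitness of glowworm $i$ at its location $x_i(t)$.

Step 4. Determine moving direction. Glowworm $i$ searches for glowworms with higher fluorescein in its decision domain and selects one of them through the roulette approach, which acts as the moving direction of the next step:

$$N_i(t) = \{ j : d_{ij}(t) < r_d^i(t),\ l_i(t) < l_j(t) \}, \qquad p_{ij}(t) = \frac{l_j(t) - l_i(t)}{\sum_{m \in N_i(t)} \left( l_m(t) - l_i(t) \right)},$$

where, among the glowworms in the decision domain of glowworm $i$, $N_i(t)$ represents the set of glowworms with higher fluorescein, and $p_{ij}(t)$ represents the probability of each glowworm $j \in N_i(t)$ being selected. Choose the glowworm $j$ with the maximum probability to act as the moving direction of glowworm $i$.

Step 5. Update location. Glowworm $i$ moves by the step $\mathit{step}$ towards glowworm $j$, and the location update is completed for all glowworms:

$$x_i(t+1) = x_i(t) + \mathit{step} \cdot \frac{x_j(t) - x_i(t)}{\lVert x_j(t) - x_i(t) \rVert}.$$

Step 6. Update decision domain:

$$r_d^i(t+1) = \min\{ r_s,\ \max\{ 0,\ r_d^i(t) + \beta\,(n_t - |N_i(t)|) \} \},$$

where $r_d^i(t)$ represents the decision radius of glowworm $i$ in round $t$ of iteration, $r_s$ is the maximum sensing radius, $n_t$ represents the threshold of the glowworm number in the domain, and $|N_i(t)|$ represents the glowworm number within the decision radius.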

Step 7. Judge the termination condition of glowworm search and enter iteration of the next round.

Step 8. The glowworm algorithm terminates, and the extreme points are output to act as the initial clustering center points for the k-means algorithm.

Step 9. Execute the k-means algorithm and output the clustering result.
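One round of the glowworm search (Steps 3 to 6) can be sketched as below. The sketch assumes the standard GSO update rules of Krishnanand and Ghose [13] with roulette selection among brighter neighbors; the function name `gso_step`, the default parameter values, and the sensing radius `r_s` are illustrative assumptions and are not the settings used in this paper.

```python
import math
import random


def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def gso_step(pos, luc, radius, fitness, rng,
             rho=0.4, gamma=0.6, beta=0.08, n_t=5, step=0.03, r_s=1.0):
    """One GSO iteration; returns the new (positions, fluorescein, radii)."""
    n = len(pos)
    # Step 3: fluorescein decays and is reinforced by the current fitness.
    luc = [(1 - rho) * luc[i] + gamma * fitness(pos[i]) for i in range(n)]
    new_pos, new_radius = [], []
    for i in range(n):
        # Step 4: brighter glowworms inside the decision radius of i.
        nbrs = [j for j in range(n)
                if j != i and luc[j] > luc[i]
                and dist(pos[i], pos[j]) < radius[i]]
        target = None
        if nbrs:
            # Roulette selection, weights proportional to the brightness gap.
            weights = [luc[j] - luc[i] for j in nbrs]
            target = rng.choices(nbrs, weights=weights, k=1)[0]
        # Step 5: move a fixed step toward the selected glowworm.
        if target is not None and dist(pos[i], pos[target]) > 0:
            d = dist(pos[i], pos[target])
            new_pos.append([a + step * (b - a) / d
                            for a, b in zip(pos[i], pos[target])])
        else:
            new_pos.append(list(pos[i]))
        # Step 6: widen the radius when neighbors are scarce, shrink otherwise.
        new_radius.append(
            min(r_s, max(0.0, radius[i] + beta * (n_t - len(nbrs)))))
    return new_pos, luc, new_radius
```

Calling `gso_step` repeatedly until the positions stop changing or the maximum number of iterations is reached corresponds to Step 7. The text above also mentions choosing the neighbor with the maximum probability; replacing the `rng.choices` call with a `max` over `weights` would give that deterministic variant.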

3.3. Key Strategies in GSOK_GP Algorithm
3.3.1. Density-Based Fitness Function

In the GSOK_GP algorithm, a cluster center is a glowworm location surrounded by adjacent points of lower local density; therefore, a cluster center can be interpreted as a local optimum of the fitness.
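A density-based fitness of this kind can be sketched as a neighbor count. The sketch below assumes that the fitness of a glowworm is the number of data points lying within a fixed radius of its location, as described in Step 2 of Section 3.2; the function name and the `radius` parameter are hypothetical.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def density_fitness(glowworm, data, radius, distance=euclidean):
    """Number of data points within `radius` of the glowworm location."""
    return sum(1 for x in data if distance(glowworm, x) <= radius)
```

A glowworm sitting in the middle of a dense group of data points receives a higher count than one in a sparse region, so moving toward brighter neighbors drives the swarm toward local density peaks.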

3.3.2. Weighted Euclidean Distance

Since the value ranges of the data object differ greatly between dimensions, attributes with a large value range dominate the Euclidean distance between data points if the plain Euclidean distance is applied, which in turn distorts the clustering result. Therefore, assuming that, without prior knowledge, each dimension of the data object has the same effect on the clustering result, the Euclidean distance is adjusted by assigning a weight to each dimension while the glowworms search for initial clustering centers.

Assumption 3. The value range of the data object is determined in each dimension, and a weight is assigned to each dimension on the basis of its value range. The Euclidean distance calculation is then redefined with these weights.

It should be noted that the adjusted Euclidean distance is applied only while searching for initial clustering centers in the GSO algorithm; the general Euclidean distance is adopted in algorithm evaluation, so that the results can be compared and analyzed against other algorithms.
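The idea of range-based weighting can be illustrated with a short sketch. The exact weight formula is not reproduced here, so the choice below (the inverse of the squared value range of each dimension, which is equivalent to dividing each coordinate difference by the range) is an assumption made purely for illustration, and the function names are hypothetical.

```python
import math


def dimension_weights(data):
    """Assumed weights: inverse squared value range of each dimension."""
    ranges = [max(col) - min(col) for col in zip(*data)]
    # A constant dimension carries no information, so it gets weight 0.
    return [1.0 / (r * r) if r > 0 else 0.0 for r in ranges]


def weighted_distance(p, q, weights):
    """Euclidean distance with one weight per dimension."""
    return math.sqrt(sum(w * (a - b) ** 2
                         for w, a, b in zip(weights, p, q)))
```

With this choice, a dimension measured in the hundreds and a dimension measured in fractions contribute equally to the distance once each is scaled by its own range.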

3.3.3. Selection of Extreme Point

A relatively large distance between cluster centers is necessary in a clustering algorithm. Therefore, k centers are selected from the multiple cluster centers found to constitute the initial clustering centers of the k-means algorithm; that is, the selection of k extreme points from the extreme point set follows the distance maximization principle. When the number of extreme points is greater than k, the basic steps for selecting extreme points are as follows:

(1) Firstly, select the glowworm with the highest fitness to act as the first clustering center point.

(2) Secondly, calculate the distances from the other clustering center points to the first clustering center point, and select the one with the largest distance to act as the second clustering center point.

(3) Repeat step (2): calculate the sum of the distances from each remaining clustering center point to the clustering centers already selected, and select the one with the largest sum to act as the next clustering center point, until k clustering center points are obtained.
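The selection procedure above can be sketched as a greedy loop. The function and argument names are hypothetical, and the sketch assumes that the list of extreme points contains at least k entries.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def select_initial_centers(extreme_points, fitness_values, k,
                           distance=euclidean):
    """Pick k extreme points following the distance maximization principle."""
    indices = range(len(extreme_points))
    # (1) Start from the extreme point with the highest fitness.
    chosen = [max(indices, key=lambda i: fitness_values[i])]
    while len(chosen) < k:
        remaining = [i for i in indices if i not in chosen]
        # (2)-(3) Take the point whose summed distance to all centers
        # selected so far is largest.
        nxt = max(remaining,
                  key=lambda i: sum(distance(extreme_points[i],
                                             extreme_points[c])
                                    for c in chosen))
        chosen.append(nxt)
    return [extreme_points[i] for i in chosen]
```

The returned points can be passed directly to k-means as its initial clustering centers.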

4. Experiment and Analysis

4.1. Experimental Environment

Matlab is employed to implement the GSOK_GP algorithm, and the two UCI data sets shown in Table 1 are selected to test its effectiveness. The parameters of the GSO algorithm are designed with reference to the relevant literature, and the parameters of the M-GSO algorithm are selected as follows based on the actual clustering problems: , , , , , , and , with maximum iterations: 100.

SSE, clustering accuracy, and robustness are used in this paper to evaluate the clustering effect of the algorithm. SSE is the sum of the Euclidean distances from all data objects to their cluster center points. It is calculated as follows:

$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, c_i),$$

where $c_i$ is the cluster center point of $C_i$.
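Following the description above, in which SSE is the sum of the Euclidean distances from all data objects to their cluster center points, the indicator can be computed as in this short sketch. The function name and the label-based representation of clusters are assumptions made for the example.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def sse(points, labels, centers):
    """Sum of distances from each point to the center of its cluster."""
    return sum(euclidean(p, centers[lab]) for p, lab in zip(points, labels))
```

Here `labels[i]` is the index of the cluster to which `points[i]` was assigned, so `centers[labels[i]]` is its cluster center.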

The clustering accuracy proposed by Gan et al. is taken as one of the evaluation standards of clustering effect in this paper [21]. Clustering accuracy refers to the proportion of accurately classified samples to total samples. It is defined as follows:

$$\mathrm{accuracy} = \frac{\sum_{i=1}^{k} a_i}{N},$$

where $k$ represents the number of categories of the data set, $N$ represents the total number of samples in the data set, and $a_i$ represents the number of samples accurately classified into Category $i$.
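Computing this accuracy requires matching each cluster to a category. The sketch below assumes a majority-vote matching, in which each cluster is associated with the category that occurs most often among its members; this matching rule and the function name are assumptions for illustration and may differ from the procedure used in the experiments.

```python
from collections import Counter


def clustering_accuracy(true_labels, cluster_labels):
    """Fraction of samples that fall in the majority category of their cluster."""
    correct = 0
    for c in set(cluster_labels):
        members = [t for t, lab in zip(true_labels, cluster_labels)
                   if lab == c]
        # Samples of the most common category count as accurately classified.
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(true_labels)
```

For example, if one cluster holds two samples of category "a" and another holds one "a" and three "b", five of the six samples count as accurately classified.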

In addition, the robustness indicator proposed in literature [22] is used in this paper to assess algorithm stability. It is calculated as a mean square error over the results of multiple experiments, from the optimal value of clustering accuracy and the average value of clustering accuracy obtained by running the algorithm multiple times. The smaller the indicator is, the higher the algorithm robustness will be.

4.2. Experimental Results and Analysis

Tables 2, 3, and 4 show the results of independently executing the GSOK_GP algorithm 20 times on the Iris and Glass data sets. The results of executing the k-means algorithm and the PSOK algorithm 20 times are cited from literature [9].

There are 150 sample objects in the Iris data set, each of which has 4 attributes; they can be classified into 3 categories in total. The experimental results for the Iris data set are shown in Table 2.

There are 214 sample objects in the Glass data set, each of which has 9 attributes; they can be classified into 6 categories in total. The experimental results for the Glass data set are shown in Table 3.

It can be seen from Tables 2 and 3 that the GSOK_GP algorithm is superior to the traditional k-means algorithm and the PSOK algorithm in SSE and average accuracy.

The robustness of the traditional k-means algorithm, the PSOK algorithm, and the GSOK_GP algorithm is compared in Table 4.

Table 4 indicates that the results of 20 independent runs of the GSOK_GP algorithm on the Iris data set are consistent, which demonstrates high stability. The fluctuation in the results of 20 independent runs on the Glass data set is also clearly smaller than that of the k-means algorithm and the PSOK algorithm. Therefore, the GSOK_GP algorithm has better robustness in the experiments.

5. Conclusion

The traditional k-means clustering algorithm is widely used due to its simple principle and high execution efficiency. However, it relies on the initial clustering centers, which leads to large differences between clustering results, low accuracy, and a lack of stability. In this paper, the initial clustering centers of the k-means algorithm are optimized with an improved glowworm algorithm based on the good-point set, and the clustering effect is improved.

The GSOK_GP algorithm proposed in this paper is mainly applied to data object clustering problems under unsupervised learning conditions. It differs from traditional clustering methods in that it combines the GSO algorithm with the k-means algorithm to improve the clustering effect. In particular, to reduce the effect of initial clustering centers on clustering results, this paper covers the data object space more uniformly by introducing the theory and method of the good-point set and obtains superior initial clustering center points with the searching ability of the GSO algorithm. Comparison and analysis show that the GSOK_GP algorithm has a better clustering effect and stability.

In addition, it has been noticed that calculating glowworm density becomes costly when the data object is large, which adversely affects the computing efficiency of the GSOK_GP algorithm. The convergence of the GSOK_GP algorithm therefore needs to be improved further so that it can be applied better to clustering problems with large data volumes.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The work was supported by National Natural Science Foundation of China (nos. 91546108, 71271071, 71490725, and 71521001), fund of Provincial Excellent Young Talents of Colleges and Universities of Anhui Province (no. 2013SQRW115ZD), fund of Support Program for Young Talents of Colleges and Universities of Anhui Province, fund of Natural Science of Colleges and Universities of Anhui Province (no. KJ2016A162), fund of Social Science Planning Project of Anhui Province (no. AHSKYG2017D136), and fund of Scientific Research Team of Anhui Economic Management Institute (no. YJKT1417T01).