Abstract

This paper presents a fuzzy clustering method based on a multiobjective genetic algorithm. The ADNSGA2-FCM algorithm was developed to solve the clustering problem by combining the fuzzy clustering algorithm (FCM) with the multiobjective genetic algorithm NSGA-II and introducing an adaptive mechanism. The algorithm does not need the number of clusters to be given in advance. After the number of initial clusters and the center coordinates are given randomly, the optimal solution set is found by the multiobjective evolutionary algorithm. After the optimal number of clusters is determined by the majority voting method, the $J_m$ value is continuously optimized through a combination of the Canonical Genetic Algorithm and FCM, and finally the best clustering result is obtained. The effectiveness of the method is demonstrated on standard UCI datasets through comparison with existing single-objective and multiobjective clustering algorithms.

1. Introduction

Clustering is a common unsupervised learning method in the field of machine learning. It has been widely used in data mining, pattern recognition, information retrieval, and many other fields. In these fields, most of the data are unlabeled and have multiple attributes. In such high-dimensional spaces it is very difficult to extract the required information through visual inspection. The ultimate goal of clustering is therefore to achieve unsupervised classification of such sparsely labeled or unlabeled complex data.

The fuzzy $c$-means algorithm (FCM) is a widely used clustering algorithm in the field of machine learning. It was proposed by Bezdek et al. in 1984 [1]. It differs from traditional either-A-or-B hard clustering methods. By introducing a fuzzy membership matrix, the fuzzy $c$-means algorithm allows data points to belong to multiple classes according to their fuzzy membership degrees. The class in which the current data point has the highest value in the fuzzy membership matrix is chosen as the final clustering result. This method solves the cluster-overlap problem of traditional hard clustering algorithms (such as the $k$-means method [2]) and is more in line with the actual situation in data clustering. However, FCM has two serious shortcomings. Firstly, it easily falls into local minima. Secondly, the number of clusters must be specified in advance, and the algorithm is very sensitive to the initial centers [3, 4].

To overcome the first shortcoming, some global optimization techniques have been introduced to deal with data clustering problems in past years, for example, simulated annealing- (SA-) based [5], particle swarm optimization- (PSO-) based [6–8], genetic algorithm- (GA-) based [9–11], and quantum genetic algorithm- (QGA-) based techniques [12]. In recent years, genetic algorithms have been used to automatically determine the number of clusters by using variable-length strings [13, 14]. In [15], FCM was combined with a genetic algorithm, an approach that was successfully applied to remote sensing imagery in [16].

Most optimization-based clustering algorithms are single-objective optimization algorithms, because only one cluster validity measure is optimized. Note that a single validity measure can only reflect some of the inherent partitioning attributes, for example, the compactness of clusters, the spatial separation between clusters, or the clusters' symmetry. If several classes of geometric shapes are present in the same dataset, clustering algorithms that use a single cluster validity index will not be able to process such datasets. Therefore, it is necessary to simultaneously optimize several cluster validity indices that capture different data features. Based on this consideration, data clustering should be treated as a multiobjective optimization problem.

In recent years, multiobjective optimization problems have received extensive attention. Many scholars have conducted extensive research on multiobjective evolutionary algorithms, which have found wide application in feature extraction, data classification, and clustering.

In [17], a new method of classification feature extraction is proposed. A probability-based encoding technique and an effective hybrid operator, together with the ideas of crowding distance, the external archive, and the Pareto domination relationship, are applied to PSO. These improvements enhance the search capability of the algorithm, and experimental comparison proves its effectiveness. In [18], a new multilabel feature selection algorithm is proposed that uses an improved multiobjective particle swarm optimization (PSO), with the purpose of searching for a Pareto set of nondominated solutions (feature subsets). Two new operators are proposed to improve the performance of the PSO-based algorithm. Finally, the effectiveness of the algorithm is verified by experiments.

Multiobjective evolutionary algorithms (MOEAs), which provide efficient search performance, have been proven to offer promising solutions to the problems of single-objective clustering algorithms [19]. In [20], a multiobjective clustering technique, MOCK, is proposed to recognize the appropriate partitioning from data sets that contain either hyperspherically shaped clusters or well-separated clusters. In [21], a multiobjective clustering technique called VAMOSA is proposed. The algorithm optimizes two cluster validity indices simultaneously, so that it can evolve a proper partitioning from clustering data sets of any shape, size, or convexity. In [22], a fuzzy clustering algorithm named MOmoDEFC based on improved multiobjective differential evolution was proposed. By using the XB index [23] and the FCM measure ($J_m$) as objective functions, the algorithm can optimize both the compactness and the separation of clusters simultaneously. It also improved the clustering effect.

Based on the above considerations, we developed a fuzzy clustering algorithm using the multiobjective optimization framework, combining FCM with a canonical genetic algorithm. The algorithm aims to automatically determine the number of clusters and to improve clustering performance.

The rest of this paper is arranged as follows. Section 2 introduces the related theories, including the FCM algorithm and the multiobjective genetic algorithm NSGA-II. Section 3 introduces the improved multiobjective optimization framework and the adaptive multiobjective dynamic fuzzy clustering algorithm ADNSGA2-FCM. In Section 4, experiments are carried out using some standard UCI datasets, and the experimental results are compared in detail with those of many clustering algorithms. Finally, conclusions are drawn in Section 5.

2. Theoretical Basis

2.1. Fuzzy -Means Algorithm

Suppose that fuzzy $c$-means (FCM) partitions a set of $n$ data objects $X = \{x_1, x_2, \ldots, x_n\}$ into $c$ $(2 \le c \le n)$ fuzzy clusters, where each object has $d$ attributes. Let $V = \{v_1, v_2, \ldots, v_c\}$ be a set of cluster centers. Let $U = [u_{ij}]_{c \times n}$ be a matrix of membership degrees in which $u_{ij}$ is the membership degree of the $j$th object to the $i$th cluster center. The matrix $U$ satisfies the conditions

$$u_{ij} \in [0, 1], \qquad \sum_{i=1}^{c} u_{ij} = 1 \quad \forall j, \qquad 0 < \sum_{j=1}^{n} u_{ij} < n \quad \forall i. \tag{1}$$

The FCM algorithm solves for the optimal clustering by minimizing an objective function, which is a clear difference from hard clustering algorithms. The objective function can be defined as follows:

$$J_m(U, V) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} d_{ij}^{2}. \tag{2}$$

In (2), $m > 1$ is the fuzzification coefficient, representing the fuzzy degree of the clustering. Define $d_{ij} = \|x_j - v_i\|$, the Euclidean distance between the $j$th data point and the $i$th cluster center, which represents the within-class similarity. A good clustering algorithm should ensure that the distances between similar points in the clustering result are as compact as possible. The standard FCM uses $J_m$ as the cost function to be minimized.

The minimization of $J_m$ can be achieved by the Lagrange multiplier method under the constraint $\sum_{i=1}^{c} u_{ij} = 1$, $\forall j$, while the membership degree matrix and cluster centers are updated according to

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( d_{ij} / d_{kj} \right)^{2/(m-1)}}, \tag{3}$$

$$v_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} x_j}{\sum_{j=1}^{n} u_{ij}^{m}}. \tag{4}$$

The algorithm iterates until the condition $\|V^{(t+1)} - V^{(t)}\| < \varepsilon$ is satisfied, where $\varepsilon$ is a small positive number representing the iteration termination threshold.
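To make the iteration concrete, the following is a minimal NumPy sketch of the FCM loop described above (an illustrative implementation, not the authors' code; the function name `fcm` and its defaults are our own):

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=100, rng=None):
    """Minimal FCM sketch: alternate Eq. (3) and Eq. (4) until convergence."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    V = X[rng.choice(n, size=c, replace=False)]      # random initial centers
    for _ in range(max_iter):
        # d_ij: Euclidean distance from each center i to each data point j
        dist = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        # Eq. (3): u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1))
        U = 1.0 / np.sum((dist[:, None, :] / dist[None, :, :]) ** (2.0 / (m - 1)), axis=1)
        # Eq. (4): v_i = sum_j u_ij^m x_j / sum_j u_ij^m
        Um = U ** m
        V_new = (Um @ X) / Um.sum(axis=1, keepdims=True)
        converged = np.linalg.norm(V_new - V) < eps  # termination threshold eps
        V = V_new
        if converged:
            break
    # Eq. (2), evaluated at the final U and V
    Jm = np.sum(Um * np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) ** 2)
    return U, V, Jm
```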

2.2. Multiobjective Optimization Based on Genetic Algorithm

Unlike single-objective optimization algorithms, a multiobjective optimization algorithm optimizes multiple objective functions simultaneously. Because multiple conflicting objectives must be optimized at once, it is often impossible to find a single solution that makes all the objective functions reach their optima simultaneously. For multiobjective optimization algorithms, each objective function is considered equally important when the relative importance of the goals is unknown. Therefore, the multiobjective optimization problem seeks not a single optimal solution but a set of solutions, characterized by the property that no objective function can be improved without impairing another. We call such a solution a nondominated solution or a Pareto optimal solution, defined as follows [24].

For a minimization multiobjective problem, the vector of objective components is $F(x) = (f_1(x), f_2(x), \ldots, f_k(x))$, where $x \in \Omega$ is the decision variable. A solution $x^* \in \Omega$ is Pareto optimal if and only if there is no decision variable $x \in \Omega$ dominating $x^*$, that is, no $x$ such that $f_i(x) \le f_i(x^*)$ for all $i \in \{1, \ldots, k\}$ and $f_j(x) < f_j(x^*)$ for at least one $j$.
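As a concrete illustration, the dominance test underlying this definition can be sketched as follows (a hypothetical helper, assuming a minimization problem with objective vectors given as arrays):

```python
import numpy as np

def dominates(fa, fb):
    """True if objective vector fa dominates fb (minimization):
    fa is no worse in every objective and strictly better in at least one."""
    fa, fb = np.asarray(fa), np.asarray(fb)
    return bool(np.all(fa <= fb) and np.any(fa < fb))

# Example: (0.4, 1.2) dominates (0.5, 1.3) but not (0.3, 1.5).
print(dominates([0.4, 1.2], [0.5, 1.3]))  # True
print(dominates([0.4, 1.2], [0.3, 1.5]))  # False
```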

There are different approaches to solving multiobjective optimization problems [24, 25], for example, aggregating techniques, population-based non-Pareto techniques, and Pareto-based techniques. The vector evaluated genetic algorithm (VEGA) is a population-based non-Pareto technique in which different subpopulations are used for the different objectives. Multiple objective GA (MOGA), nondominated sorting GA (NSGA), and niched Pareto GA (NPGA) constitute a number of techniques under the Pareto-based nonelitist approaches [25]. NSGA-II [26], SPEA [27], and SPEA2 [28] are some more recently developed multiobjective elitist techniques.

As a multiobjective genetic algorithm, NSGA-II is a mature multiobjective elitist selection algorithm. Compared with NSGA, NSGA-II is improved in three aspects. First, when constructing the Pareto optimal solution set, the time complexity is reduced from $O(MN^3)$ to $O(MN^2)$ (for $M$ objectives and population size $N$) by adopting a new rank-based fast nondominated sorting method. Second, an elitist reservation mechanism is introduced: after selection, the offspring of breeding individuals compete with their parents to produce the next generation. This optimal-individual reservation mechanism not only improves the performance of the multiobjective evolutionary algorithm (MOEA) but also effectively prevents the loss of optimal solutions and improves the overall evolutionary level of the population. Third, in order to calibrate the fitness values of different elements at the same level after fast nondominated sorting, and to make individuals spread across the entire Pareto front, the crowded distance comparison operator is used instead of the original fitness sharing method.

The present paper uses NSGA-II as the underlying multiobjective algorithm for developing the proposed fuzzy clustering method.

3. Dynamic Fuzzy Clustering Method Based on Adaptive NSGA-II

3.1. Chromosome Representation

In general, there are two kinds of chromosome coding schemes for solving the clustering problem with a genetic algorithm: numerical coding based on the cluster centers, and encoding based on the partition matrix [29]. Since the genetic operators in this paper work on chromosomes of variable length, the first coding scheme is adopted.

A chromosome $P$ representing $c$ cluster centers in a $d$-dimensional attribute space is encoded as

$$P = (v_{11}, \ldots, v_{1d}, v_{21}, \ldots, v_{2d}, \ldots, v_{c1}, \ldots, v_{cd}). \tag{5}$$

Figure 1 shows an example of a chromosome comprising five centers in two dimensions.

This real-valued sequence representation of the chromosome avoids the complexity of a binary encoding and displays the practical meaning of the representation more intuitively.
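For illustration, the encoding and decoding of such a chromosome might be sketched as follows; the helper names `encode` and `decode` are our own, not from the paper:

```python
import numpy as np

def encode(centers):
    """Flatten a (c, d) array of cluster centers into one flat chromosome (Eq. (5))."""
    return np.asarray(centers, dtype=float).ravel()

def decode(chromosome, d):
    """Recover the (c, d) cluster centers from a flat chromosome."""
    return np.asarray(chromosome, dtype=float).reshape(-1, d)

# Five 2-D centers, as in Figure 1: chromosome length is 5 * 2 = 10.
chrom = encode([[0.1, 0.9], [0.4, 0.2], [0.5, 0.5], [0.8, 0.3], [0.9, 0.8]])
print(decode(chrom, d=2).shape)  # (5, 2)
```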

3.2. Population Initialization

The selection of initial cluster centers has a great impact on the final clustering results. However, because the crossover operator dynamically changes the chromosome length, fixed initial cluster centers are not conducive to maintaining the diversity of the population. Therefore, this paper uses the most common method, randomly generated initial cluster centers, to initialize the population.

Note that, for the sample datasets, the ranges of the attribute values may not be the same, which can have a significant impact on the calculations of the NSGA-II algorithm. Therefore, it is necessary to standardize the sample dataset: Max-Min Normalization is first performed on the sample dataset to reduce the possible error. The Max-Min Normalization is defined as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \tag{6}$$

where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the corresponding attribute.
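A minimal sketch of this initialization, assuming cluster numbers drawn uniformly from a range $[c_{\min}, c_{\max}]$ and initial centers sampled from the normalized data points (both assumptions of ours), could look as follows:

```python
import numpy as np

def min_max_normalize(X):
    """Eq. (6): rescale every attribute to [0, 1]."""
    X = np.asarray(X, dtype=float)
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    return (X - xmin) / (xmax - xmin + 1e-12)

def init_population(X, pop_size, c_min=2, c_max=10, rng=None):
    """Create variable-length chromosomes with random cluster numbers."""
    rng = np.random.default_rng(rng)
    Xn = min_max_normalize(X)
    population = []
    for _ in range(pop_size):
        c = rng.integers(c_min, c_max + 1)       # random cluster number
        centers = Xn[rng.choice(len(Xn), c, replace=False)]
        population.append(centers.ravel())       # flat chromosome of c*d genes
    return Xn, population
```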

3.3. Selection of Fitness Function

The performance of multiobjective optimization is highly dependent on the choice of objective functions; good results can be produced by selecting them reasonably. The objective functions should be chosen such that they critically balance each other and are possibly contradictory in nature. Contradiction between the objective functions is beneficial since it guides the search toward the global optimum. It also ensures that no single clustering objective is optimized while other potentially significant objectives go unnoticed.

In this paper, two fitness functions, the DB index and the $I$ index, are used as objective functions for the NSGA-II algorithm. The two fitness functions are described in detail below.

3.3.1. Davies-Bouldin (DB) Index

The DB index [30] is a commonly used cluster validity index. It is based on the ratio of the sum of within-cluster scatter to between-cluster separation.

Define the scatter of the $i$th class as

$$S_i = \left( \frac{1}{|C_i|} \sum_{x \in C_i} \|x - v_i\|^q \right)^{1/q}, \tag{7}$$

where $x$ denotes a data point in the $i$th class $C_i$, $v_i$ denotes the center of the $i$th class, $|C_i|$ represents the number of data points in the $i$th class, and $q$ is an index value. The distance between cluster centers $v_i$ and $v_j$ is defined as

$$d_{ij} = \|v_i - v_j\|. \tag{8}$$

The similarity between the $i$th cluster and the $j$th cluster is defined as

$$R_{ij} = \frac{S_i + S_j}{d_{ij}}. \tag{9}$$

The Davies-Bouldin (DB) index is then defined as

$$DB = \frac{1}{c} \sum_{i=1}^{c} \max_{j \ne i} R_{ij}. \tag{10}$$

The objective is to minimize the DB index for achieving proper clustering.
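A sketch of the DB computation, under the assumption that each point is assigned crisply to its nearest center and every cluster is non-empty, is given below; the scatter exponent `q` follows Eq. (7):

```python
import numpy as np

def db_index(X, V, q=2):
    """DB index of Eqs. (7)-(10); smaller is better. Assumes non-empty clusters."""
    c = len(V)
    labels = np.argmin(np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2), axis=1)
    # Eq. (7): within-cluster scatter S_i of each class
    S = np.array([
        np.mean(np.linalg.norm(X[labels == i] - V[i], axis=1) ** q) ** (1.0 / q)
        for i in range(c)
    ])
    db = 0.0
    for i in range(c):
        # Eqs. (8)-(9): similarity R_ij of cluster i to every other cluster
        R = [(S[i] + S[j]) / np.linalg.norm(V[i] - V[j]) for j in range(c) if j != i]
        db += max(R)                     # Eq. (10): worst-case similarity per cluster
    return db / c
```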

3.3.2. $I$ Index

The $I$ index [31] is another commonly used cluster validity index:

$$I(c) = \left( \frac{\alpha}{c} \times \frac{E_1}{E_c} \times \beta D_c \right)^p, \tag{11}$$

where $c$ is the number of clusters. Here, $E_c$ stands for the within-cluster scatter, defined as

$$E_c = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij} \|x_j - v_i\|. \tag{12}$$

$D_c$ stands for the between-cluster separation, defined as

$$D_c = \max_{i,j=1}^{c} \|v_i - v_j\|. \tag{13}$$

$\alpha$ and $\beta$ are correlation coefficients. The power $p$ is used to control the contrast between the different cluster configurations; in general, $p \ge 1$. In this article, we have taken $p = 2$. $E_1$ is a constant for a given dataset, used for normalization to prevent the index from degenerating to its minimum value. The value of $c$ for which $I(c)$ is maximized is considered to be the correct number of clusters.

The goal in this paper is to minimize $DB$ and $1/I$ simultaneously. At the same time, attention is paid to adjusting the correlation coefficients in $I$. By adjusting these parameters, the values of $DB$ and $1/I$ are kept in the same order of magnitude, avoiding the selection error caused by one target value being too large. Moreover, when using the algorithm it can be found that, as the number of clusters $c$ increases, the value of $DB$ begins to decrease and the value of $1/I$ begins to increase, which conforms to the conflicting requirements of the two objective functions mentioned earlier.
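The following sketch computes the $I$ index as reconstructed in Eqs. (11)-(13), with the correlation coefficients defaulting to 1; minimizing `db_index(...)` from the previous sketch together with `1.0 / i_index(...)` would then form the two NSGA-II objectives (our reading of the objective pair):

```python
import numpy as np

def i_index(X, V, U, p=2, alpha=1.0, beta=1.0):
    """I index sketch; U has shape (c, n). Larger is better, so 1/I is minimized."""
    c = len(V)
    # Eq. (12): within-cluster scatter E_c
    dist = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)
    E_c = np.sum(U * dist)
    # E_1 is the same scatter for a single cluster at the grand mean (a constant)
    E_1 = np.sum(np.linalg.norm(X - X.mean(axis=0), axis=1))
    # Eq. (13): between-cluster separation D_c
    D_c = max(np.linalg.norm(V[i] - V[j]) for i in range(c) for j in range(c))
    # Eq. (11), with correlation coefficients alpha and beta
    return ((alpha / c) * (E_1 / E_c) * beta * D_c) ** p
```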

3.4. Genetic Manipulation
3.4.1. Selection

Two individuals are randomly selected to play a tournament, and the winner is selected by the crowded comparison operator. This operator takes into account two attributes: the nondomination rank and the crowding distance. If the two individuals are at different ranks, the lower rank is preferred. If both individuals are at the same rank, the solution located in the less crowded region is chosen.
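A minimal sketch of this crowded-comparison tournament (an illustrative helper, assuming precomputed rank and crowding-distance arrays):

```python
import random

def crowded_tournament(pop, rank, crowd, rng=random):
    """Binary tournament: lower nondomination rank wins;
    ties go to the larger crowding distance (less crowded region)."""
    a, b = rng.sample(range(len(pop)), 2)
    if rank[a] != rank[b]:
        return pop[a] if rank[a] < rank[b] else pop[b]
    return pop[a] if crowd[a] > crowd[b] else pop[b]
```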

3.4.2. Crossover

After selection, the selected chromosomes are placed in the mating pool. The performance of the crossover operator determines the performance of the genetic manipulation to a great extent. Because variable-length encoding is used for the chromosomes, the conventional one-point crossover approach does not apply here. In this paper, the following two crossover methods are used, each performed with the same probability.

(1) Crossover Operation Based on Nearest-Neighbor Matching. Let $P_1$ and $P_2$ denote parent solutions with $c_1$ and $c_2$ cluster centers, respectively. Assume that $c_1 \le c_2$. Select in turn each gene string representing a cluster center in $P_1$, and select the nearest (in distance) gene string from $P_2$ to match it; already paired gene strings are no longer involved in the pairing. After reordering the first $c_1$ gene strings in $P_2$ accordingly, choose a crossover point randomly from within $[1, c_1]$. For $P_1$ and $P_2$, traditional crossover operations are then used to generate the new offspring $O_1$ and $O_2$.

With this crossover method, the offspring maintain the same numbers of cluster centers as their parents, maintaining the stability of the population. Using gene rearrangement before the crossover places the most similar cluster centers of different chromosomes at the same positions, avoiding the generation of poor offspring during crossover that would lead to population degradation. The crossover operation is illustrated in Figure 2.
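One plausible reading of this operator is sketched below; the pairing, reordering, and one-point cut on whole centers follow the description above, while the helper name and details are our own (it assumes every parent has at least two centers):

```python
import numpy as np

def nn_matching_crossover(p1, p2, d, rng=None):
    """Nearest-neighbor matching crossover: offspring keep parental cluster counts."""
    rng = np.random.default_rng(rng)
    c1, c2 = p1.reshape(-1, d), p2.reshape(-1, d)
    if len(c1) > len(c2):
        c1, c2 = c2, c1                  # ensure c1 is the shorter parent
    unpaired = list(range(len(c2)))
    order = []
    for center in c1:                    # match each center of c1 ...
        j = min(unpaired, key=lambda k: np.linalg.norm(center - c2[k]))
        order.append(j)                  # ... to its nearest unpaired center of c2
        unpaired.remove(j)
    c2 = c2[order + unpaired]            # reorder c2 before crossing
    t = rng.integers(1, len(c1))         # crossover point within the overlap
    o1 = np.vstack([c1[:t], c2[t:len(c1)]])
    o2 = np.vstack([c2[:t], c1[t:], c2[len(c1):]])
    return o1.ravel(), o2.ravel()
```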

(2) Crossover Operation Based on Truncation and Stitching. Different from the first method, the crossover operation based on truncation and stitching produces offspring whose numbers of cluster centers differ from those of the parents, so as to maintain the diversity of the population. In this crossover operation, the string representing each cluster center is indivisible, and crossing can occur only between different gene strings.

The operation is described as follows: $P_1$ and $P_2$ are two parent individuals, where

$$P_1 = (v_{1}^{(1)}, v_{2}^{(1)}, \ldots, v_{c_1}^{(1)}), \qquad P_2 = (v_{1}^{(2)}, v_{2}^{(2)}, \ldots, v_{c_2}^{(2)}), \tag{14}$$

and each $v_i$ is the gene string of one cluster center.

Suppose that the crossover points of $P_1$ and $P_2$ are $t_1$ and $t_2$, respectively. The offspring $O_1$ and $O_2$ generated after crossing can be expressed as

$$O_1 = (v_{1}^{(1)}, \ldots, v_{t_1}^{(1)}, v_{t_2+1}^{(2)}, \ldots, v_{c_2}^{(2)}), \qquad O_2 = (v_{1}^{(2)}, \ldots, v_{t_2}^{(2)}, v_{t_1+1}^{(1)}, \ldots, v_{c_1}^{(1)}). \tag{15}$$

The numbers of cluster centers represented by $O_1$ and $O_2$ are $t_1 + (c_2 - t_2)$ and $t_2 + (c_1 - t_1)$, respectively. The crossover operation is illustrated in Figure 3.
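A sketch of this operator under the same flat chromosome layout, with each parent cut at its own random point between whole centers (again assuming at least two centers per parent):

```python
import numpy as np

def stitch_crossover(p1, p2, d, rng=None):
    """Truncation-and-stitching crossover of Eq. (15):
    offspring cluster counts t1 + (c2 - t2) and t2 + (c1 - t1)."""
    rng = np.random.default_rng(rng)
    c1, c2 = p1.reshape(-1, d), p2.reshape(-1, d)
    t1 = rng.integers(1, len(c1))        # cut point in parent 1
    t2 = rng.integers(1, len(c2))        # cut point in parent 2
    o1 = np.vstack([c1[:t1], c2[t2:]])   # t1 + (c2 - t2) centers
    o2 = np.vstack([c2[:t2], c1[t1:]])   # t2 + (c1 - t1) centers
    return o1.ravel(), o2.ravel()
```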

3.4.3. Mutation

Individuals are mutated at the level of gene loci, with random variation applied according to the mutation probability $p_m$. If a chromosome is selected for mutation, the location of the mutated gene is chosen randomly. After mutation, the floating point number at that gene locus is replaced by another uniform random number.
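A minimal sketch of this mutation, assuming Max-Min normalized data so that genes lie in $[0, 1]$:

```python
import numpy as np

def mutate(chromosome, pm, rng=None):
    """With probability pm, replace one randomly chosen gene
    by a fresh uniform random number in [0, 1]."""
    rng = np.random.default_rng(rng)
    child = chromosome.copy()
    if rng.random() < pm:
        child[rng.integers(len(child))] = rng.random()
    return child
```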

3.4.4. Adaptive Operation

By using an adaptive strategy for the crossover probability $p_c$ and mutation probability $p_m$, the two parameters can change automatically according to the fitness of the current population. For the whole population, when the fitness values of the population tend to be consistent or tend to a local optimum, $p_c$ and $p_m$ are increased appropriately; when the fitness values are dispersed, $p_c$ and $p_m$ are appropriately reduced. For an individual in the population, when its fitness is higher than the average fitness of the population, lower $p_c$ and $p_m$ values make it more likely to enter the next generation; when its fitness is lower than the average fitness, higher $p_c$ and $p_m$ values make it more likely to be eliminated. Thus, the adaptive strategy can provide the best $p_c$ and $p_m$ for the solution [32]. $p_c$ and $p_m$ are calculated as follows:

$$p_c = \begin{cases} \dfrac{k_1 \left( f_{\max} - f' \right)}{f_{\max} - f_{\mathrm{avg}}}, & f' \ge f_{\mathrm{avg}}, \\ k_3, & f' < f_{\mathrm{avg}}, \end{cases} \qquad p_m = \begin{cases} \dfrac{k_2 \left( f_{\max} - f \right)}{f_{\max} - f_{\mathrm{avg}}}, & f \ge f_{\mathrm{avg}}, \\ k_4, & f < f_{\mathrm{avg}}, \end{cases} \tag{16}$$

where $f'$ is the larger fitness value of the two individuals to be cross-operated, $f$ is the fitness value of the current individual, $f_{\max}$ is the maximum fitness value of the current generation, $f_{\mathrm{avg}}$ is the average fitness of the current generation, and $k_1, k_2, k_3, k_4 \le 1.0$ are constants. It should be noted that the fitness value mentioned here is the sum of the two objective function values. When an individual's fitness value is the maximum fitness value of the contemporary population, we set its $p_c$ and $p_m$ to 0.6 and 0.001, respectively.
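The following sketch implements Eq. (16) as reconstructed above; the constant values $k_1$-$k_4$ are our assumptions in the spirit of [32], and the text's convention that higher fitness is better is followed:

```python
def adaptive_pc(f_prime, f_max, f_avg, k1=0.6, k3=0.9):
    """Crossover probability of Eq. (16); f_prime is the larger
    fitness of the two mating individuals (higher fitness = better)."""
    if f_prime >= f_max:                 # current best individual: pinned value
        return 0.6
    if f_prime >= f_avg:                 # above average: lower pc
        return k1 * (f_max - f_prime) / (f_max - f_avg + 1e-12)
    return k3                            # below average: higher constant pc

def adaptive_pm(f, f_max, f_avg, k2=0.1, k4=0.1):
    """Mutation probability of Eq. (16); f is the current individual's fitness."""
    if f >= f_max:                       # current best individual: pinned value
        return 0.001
    if f >= f_avg:                       # above average: lower pm
        return k2 * (f_max - f) / (f_max - f_avg + 1e-12)
    return k4                            # below average: higher constant pm
```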

3.4.5. Selecting a Solution from the Nondominated Set

In this paper, the majority voting method is used to determine the number of clusters $c$. That is to say, if one cluster number accounts for more than 50% of the solutions in the nondominated set, and the same number appears continuously for more than 5 generations, it is taken as the optimal cluster number. If the algorithm still cannot choose the optimal cluster number within the specified maximum number of iterations, the cluster number $c$ corresponding to the best individual in the final generation is taken as the optimal cluster number.
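A sketch of this voting rule as a state-tracking helper (the function and its bookkeeping are illustrative, not from the paper):

```python
from collections import Counter

def majority_vote(cluster_counts, history, threshold=0.5, patience=5):
    """cluster_counts: cluster numbers of the current nondominated set.
    history: winners of past generations, kept by the caller across calls.
    Returns the accepted cluster number, or None to keep evolving."""
    winner, hits = Counter(cluster_counts).most_common(1)[0]
    if hits / len(cluster_counts) > threshold:
        history.append(winner)           # winner holds > 50% of the front
    else:
        history.clear()                  # no majority: reset the streak
    if len(history) > patience and len(set(history[-(patience + 1):])) == 1:
        return winner                    # same majority for > 5 generations
    return None
```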

3.4.6. Determine the Final Clustering Result

After the number of clusters $c$ is determined, all the individuals whose cluster number is equal to $c$ are selected to form a new population for clustering. The method used is a combination of the Canonical Genetic Algorithm (CGA) and the FCM algorithm. The crossover operation used here is only the nearest-neighbor matching crossover mentioned above, so it does not change the number of clusters. Combining the global optimization algorithm with FCM effectively overcomes the problem that the FCM algorithm can only obtain a locally optimal solution. Finally, the algorithm terminates once the objective function value no longer changes appreciably, and the obtained result is the optimal clustering result.

At this point, the relevant concepts of the algorithm have been described. Algorithm 1 shows the steps of the ADNSGA2-FCM algorithm.

Input: dataset $X$
(1) Initialize the parameters of FCM and NSGA-II, including population size Pop, itermax, $m$, $\varepsilon$, $p_c$, $p_m$, $T_{\max}$.
(2) Randomly select the initial number of clusters and randomly generate initial cluster centers to create an initial population $P_0$.
(3) Decode each individual to obtain the cluster centers $V$, and calculate the membership degrees $U$ using Eq. (3).
(4) Calculate the new cluster centers $V$ of each individual using Eq. (4) based on $U$ and $X$.
(5) Calculate the $J_m$ of each individual using Eq. (2) based on $U$ and $V$.
(6) Calculate the fitness values $DB$ and $I$ of each individual using Eq. (10) and (11). Calculate $f$; store $f_{\max}$ and $f_{\mathrm{avg}}$ at each iteration.
(7) Perform nondominated sorting and the crowding distance operation on the population.
(8) Select using the crowded comparison operator.
(9) Calculate $p_c$ and $p_m$ using Eq. (16).
(10) Generate offspring using the genetic operations.
(11) Recombine the current generation and the offspring to select the next generation using the elitism operation.
(12) Use the majority voting technique to determine the number of clusters $c$.
(13) If the number $c$ satisfies the selection condition, go to step (14); else go to step (3).
(14) Find all chromosomes whose cluster numbers are equal to $c$ in the population from step (11). The new population is composed of these chromosomes.
(15) Decode each individual to obtain the cluster centers $V$, and calculate the membership degrees $U$ using Eq. (3).
(16) Calculate the new cluster centers $V$ of each individual using Eq. (4) based on $U$ and $X$.
(17) Calculate the $J_m$ of each individual using Eq. (2) based on $U$ and $V$.
(18) Calculate the fitness values $DB$ and $I$ of each individual using Eq. (10) and (11). Calculate $f$; store $f_{\max}$ and $f_{\mathrm{avg}}$ at each iteration.
(19) Perform nondominated sorting and the crowding distance operation on the population.
(20) Select using the crowded comparison operator.
(21) Calculate $p_c$ and $p_m$ using Eq. (16).
(22) Generate offspring using the genetic operations.
(23) Recombine the current generation and the offspring to select the next generation using the elitism operation.
(24) If ADNSGA2-FCM has not met the stopping criterion ($t < T_{\max}$ and $|J_m^{(t+1)} - J_m^{(t)}| > \varepsilon$), set $t = t + 1$ and go to step (14); else continue.
(25) Return the best individual ($U^*$, $V^*$).
3.5. Time Complexity

The worst-case time complexity of the ADNSGA2-FCM algorithm can be analyzed as follows, where $g$ denotes the number of generations, $P$ is the population size, $n$ is the size of the data, $K$ is the maximum number of clusters, and $d$ is the data dimension.

(1) In the population initialization stage, each string contains up to $K \times d$ features, and individuals are generated until the population size is full. Therefore, this construction requires $O(PKd)$ time.

(2) In the FCM clustering for each individual, with $n$ data points in the current data set, both the membership assignment and the updating of center values take $O(nKd)$ time. For the whole population, the time complexity is $O(PnKd)$.

(3) The time complexities of the two objective functions, the DB index and the $I$ index, are both $O(nKd)$ per individual.

(4) The time complexity of each execution of the crossover and mutation operators is $O(Kd)$.

(5) The nondominated sorting in NSGA-II needs $O(MP)$ time for each solution to compare with every other solution to find whether it is dominated, where $M$ is the number of objectives and the maximum number of nondominated solutions equals the population size $P$. The comparison for all population members therefore requires $O(MP^2)$ time, where $M = 2$.

(6) In the label assignment for each nondominated solution, $O(nK)$ time is required to assign a label to every data point. Selecting the best solution from the nondominated solutions thus takes $O(PnK)$ time.

It can be seen that the worst-case complexity of each generation is $O(PnKd + MP^2)$. Assuming that the algorithm runs for $g$ generations, the overall time complexity is $O(g(PnKd + MP^2))$.

4. Experiment Study

In this paper, for the purpose of verifying the performance of the proposed method (ADNSGA2-FCM), several clustering algorithms are chosen for extensive comparative analysis. Two are soft subspace clustering algorithms, ESSC [33] and MOEASSC [34]; MOEASSC is a multiobjective method and ESSC is a single-objective one. Three are kernel-based attribute weighting algorithms, VKCM-K-LP [35], VKFCM-K-LP [36], and MOKCW [37]; MOKCW is a multiobjective method, while VKCM-K-LP and VKFCM-K-LP are single-objective methods. VKCM-K-LP is a crisp clustering method, and VKFCM-K-LP is a fuzzy clustering method. The NSGA-II-FCM method is the nonadaptive version of ADNSGA2-FCM and uses fixed parameters.

4.1. Datasets and Parameter Setting

For the purpose of comparison, two groups of data sets are used: artificial and real-life data sets. The three artificial data sets are Square 1, Square 4, and Sizes 5 from [20]. The six real-life data sets are obtained from the UCI Machine Learning Repository [38], namely, Iris, Wine, Newthyroid, Vertebral, Image, and Abalone. Table 1 briefly describes the data sets considered, where $K$ is the true number of classes and $d$ and $n$ are, respectively, the numbers of features and objects. For most SSC algorithms, the experiments are conducted on data sets standardized into the interval $[0, 1]$, which can alleviate the uneven impact of different attributes' ranges on updating the weights. Therefore, the standardization is based on the minimum and maximum values of each attribute.

The parameters of the ADNSGA2-FCM algorithm are set as shown in Table 2; the parameters of the other algorithms are set as shown in Table 3.

4.2. Experiment Result and Analysis

In the first experiment, the nine data sets above (Square 1, Square 4, Sizes 5, Iris, Wine, Newthyroid, Vertebral, Image, and Abalone) and the adjusted Rand index (ARI) [39], which is to be maximized, are used to evaluate the clustering quality of the ADNSGA2-FCM algorithm. Table 4 summarizes the results obtained by ADNSGA2-FCM. The true number of clusters is obtained by the ADNSGA2-FCM algorithm on all nine datasets. In terms of the ARI value, the performance of the ADNSGA2-FCM algorithm differs considerably across the nine data sets. This is mainly because the ADNSGA2-FCM algorithm uses the fuzzy $c$-means (FCM) method as its clustering method, and the FCM algorithm is best suited to data forming spherical clusters. Therefore, for the six datasets Square 1, Square 4, Sizes 5, Iris, Wine, and Newthyroid, the effect is better, while the effect on the other three data sets is poorer. It can also be observed from Table 4 that, on all data sets, the optimal clustering result is obtained by the algorithm.

Figures 4–6 compare, for the three synthetic data sets (Square 1, Square 4, and Sizes 5), the data partitioning obtained by ADNSGA2-FCM (with center markings) against the true data partitioning. The algorithm performs well on the well-separated structures of Square 1 (Figure 4) and Square 4 (Figure 5) as well as the unequally sized clusters of Sizes 5 (Figure 6). The overlapping and unequally sized characteristics cause more misclassification of the data points that lie on the borderline between clusters.

In order to evaluate the clustering results of the seven algorithms, three well-known external CVIs, accuracy (Acc), Rand index (RI) [40], and normalized mutual information (NMI) [33], are adopted here. They all take values in the interval $[0, 1]$, in which 1 means the best match between the result and the true partition, whereas 0 means the worst result. In this experiment, all algorithms are executed 30 times independently, and their performances are compared in terms of the best case of Acc, RI, and NMI, shown in Table 5. The best result is expressed in bold.

It can first be observed from Table 5 that, on all data sets, the optimal clustering result is obtained by a multiobjective algorithm. This result shows that multiobjective clustering algorithms have certain advantages over single-objective clustering algorithms. For most data sets, the ADNSGA2-FCM algorithm proposed in this paper obtains the best results.

For the Iris and Vertebral data sets, the kernel-based multiobjective clustering algorithm MOKCW achieves the best results. Comparing the MOKCW algorithm and the VKCM-K-LP algorithm, the two results are similar, and the result of ADNSGA2-FCM is worse. For the Vertebral data set, however, the ADNSGA2-FCM algorithm obtains the best Acc value. For the Wine data set, the MOEASSC algorithm achieves the best effect, and the ADNSGA2-FCM result is very close to it; this shows that both multiobjective clustering methods based on evolutionary computation can obtain the best global results on the Wine dataset. On the three datasets Newthyroid, Image, and Abalone, the ADNSGA2-FCM algorithm proposed in this paper has obvious advantages over the other algorithms.

From Table 5 we can also see that the ADNSGA2-FCM algorithm, which uses the adaptive mechanism, performs significantly better than the NSGA-II-FCM algorithm. Apart from two indicators on which the algorithms are equal, the other indicators of NSGA-II-FCM differ considerably from those of ADNSGA2-FCM. More strikingly, on the Image dataset, the NSGA-II-FCM algorithm cannot obtain the correct number of clusters in 30 independent executions. Through careful analysis, we conclude that the adaptive mechanism is what allows the ADNSGA2-FCM algorithm to finally find the correct number of clusters: the adaptive mechanism effectively controls the rates of crossover and mutation of the genetic algorithm. Because the NSGA-II-FCM algorithm does not adopt the adaptive mechanism, it converges prematurely to a local optimum, which makes the final cluster number wrong. Looking at the other five data sets, because the parameter values of NSGA-II-FCM are fixed, the algorithm does not make full use of the data information during optimization, and its rate of convergence is too fast. Although the correct number of clusters was eventually found, the clustering effect was poor. As the numbers of objects and attributes in the data set increase, this trend becomes even more obvious. From this set of experiments, we can see that using the adaptive mechanism does improve the clustering effect.

This result is easy to understand: because the clustering problem lacks prior knowledge of the data set, and the genetic algorithm is a random search algorithm, it is difficult to specify suitable crossover and mutation probabilities in advance. Adopting an adaptive mechanism avoids having to give fixed global parameters directly.

From the above analysis, we believe that the adoption of an adaptive mechanism is effective.

Table 6 shows the average performance rankings of all algorithms on the 6 datasets regarding Acc, RI, and NMI, computed from Table 5, which makes the comparison more evident. From Table 6, we can see that the ADNSGA2-FCM algorithm proposed in this paper ranks first on Acc and RI and second on NMI, mainly because the NMI indicator is not fully consistent with the Acc and RI indicators. On Acc, the ADNSGA2-FCM algorithm has a clear advantage over the second-ranked algorithm. On RI, the ADNSGA2-FCM algorithm performs slightly better than the second-ranked algorithm. On NMI, the ADNSGA2-FCM algorithm is worse than the MOKCW algorithm, but not by much. This shows that the ADNSGA2-FCM algorithm has certain advantages over the other 6 algorithms on the three indices Acc, RI, and NMI, and that better clustering results can be obtained.

Figure 7 shows the histogram of the mean values of the three indices for the different algorithms. As can be observed from Tables 5 and 6 and Figure 7, the proposed method has obvious advantages on the Acc index and a slight advantage on the RI index, while it is not as good as the MOKCW algorithm on the NMI index.

The final Pareto optimal fronts obtained by the ADNSGA2-FCM clustering technique on the real-life data sets Iris, Wine, Newthyroid, Vertebral, Image, and Abalone are illustrated in Figures 8–10.

5. Conclusion

This paper presents a fuzzy clustering method based on a multiobjective genetic algorithm. The ADNSGA2-FCM algorithm was developed to solve the clustering problem by combining the fuzzy clustering algorithm (FCM) with the multiobjective genetic algorithm NSGA-II and introducing an adaptive mechanism. The NSGA-II algorithm uses two cluster validity indices, the $I$ index and the DB index, as its objective functions to steer the multiobjective optimization. The algorithm does not need the number of clusters to be given in advance. After the number of initial clusters and the center coordinates are given randomly, the optimal solution set is found by the multiobjective evolutionary algorithm. After the optimal number of clusters is determined by the majority voting method, the $J_m$ value is continuously optimized through a combination of the Canonical Genetic Algorithm and FCM, and finally the best clustering result is obtained.

In addition to the basic framework of the multiobjective genetic algorithm, an appropriate choice of objective functions is also one of the success factors of the ADNSGA2-FCM algorithm. This paper does not use a single cluster evaluation index but uses two comprehensive evaluation indices that together account for both the within-cluster scatter and the between-cluster separation. The experimental results show that the multiobjective clustering method is better than the single-objective clustering methods and that better clustering results can be obtained by choosing reasonable objective functions.

Although the ADNSGA2-FCM algorithm performs well, it also has some inherent limitations. Since the algorithm adopts the NSGA-II framework, the multiobjective genetic algorithm can only compromise among multiple objective functions, so the method can only approach the true Pareto front. Moreover, because the NSGA-II algorithm is a kind of genetic algorithm with strong randomness, the random search may or may not reach the optimal solution, so the optimality of the final clustering result cannot be guaranteed.

In future work, we hope to improve the selection of the optimal clustering result and the clustering accuracy.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 71471060) and the Fundamental Research Funds for the Central Universities (Grant nos. 2017XS135 and 2018QN096).