Abstract

Ensemble clustering can improve the generalization ability of a single clustering algorithm and generate a more robust clustering result by integrating multiple base clusterings, which has made it a focus of current clustering research. Ensemble clustering aims at finding a consensus partition that agrees as much as possible with the base clusterings. The genetic algorithm is a highly parallel, stochastic, and adaptive search algorithm developed from the natural selection and evolutionary mechanisms of biology. In this paper, an improved genetic algorithm is designed by improving the coding of the chromosome. A new membrane evolutionary algorithm is then constructed by using genetic mechanisms as evolution rules and combining them with the communication mechanism of the cell-like P system. The proposed algorithm is used to optimize the base clusterings and find the optimal chromosome as the final ensemble clustering result. The global optimization ability of the genetic algorithm and the rapid convergence of the membrane system make the membrane evolutionary algorithm perform better than several state-of-the-art techniques on six real-world UCI data sets.

1. Introduction

Cluster analysis, also known as clustering, is a core technique in machine learning and artificial intelligence [1]. It is the process of dividing a set of data objects into subsets, each of which is called a cluster, such that objects in the same cluster are as similar as possible and objects in different clusters are as different as possible.

Ensemble clustering, also known as consensus clustering or cluster aggregation, reconciles clustering results coming from different clustering algorithms [2] or from the same algorithm run with different initialization parameters [3]. The purpose of ensemble clustering is to find a consensus result that is as similar as possible to multiple existing base clusterings [4]. Compared with a single clustering algorithm, an ensemble clustering algorithm is more robust and stable, and its results are insensitive to noise, isolated points, and sampling changes, so ensemble clustering has become a hotspot of clustering research in recent years. Existing ensemble clustering methods can be divided into three categories: the median partition based methods [5, 6], the pairwise similarity based methods [7–10], and the graph partitioning based methods [4, 11–13]. Among them, the median partition based methods aim to find a clustering that maximizes the similarity between itself and all of the base clusterings; this clustering can be viewed as the median point of the base clusterings [5, 6, 14].

Finding the optimal solution among many base clusterings thus becomes an optimization problem. Because the space of all possible clusterings is large, finding the optimal solution is generally infeasible, and the genetic algorithm, a classic method for solving optimization problems, has attracted our attention. The genetic algorithm is a randomized search method that simulates the laws of biological evolution [15]. It has inherent parallelism and global optimization ability. Using a probabilistic optimization method, it can automatically acquire knowledge of the search space, guide the search, and adaptively adjust the search direction [16–18]. The ensemble clustering problem is generally regarded as the median partition problem, which is NP-complete [5]. Genetic algorithms have been proposed to find an approximate solution, in which the base clusterings are represented as chromosomes [5, 19]. In those studies, a chromosome is defined by the class labels of a base clustering, so when the number of data objects is large, the evolutionary efficiency is very low. In this paper, we improve the coding of chromosomes and then combine the improved genetic algorithm with the membrane computing model for ensemble clustering.

The P system, also known as the membrane computing model, is a biological computational model inspired by the study of living cells and initiated by Păun in 1998. It aims to carry out computation by simulating the functions of living cells, tissues, and organs. Objects in this model, which has complete computing capability, evolve in a maximally parallel and distributed manner [20]. It is this maximal parallelism that allows multiple cell objects to evolve concurrently in search of the optimal solution, which is similar to the effect of multipopulation evolution and thus improves the performance of ensemble clustering. Membrane systems have the same computing power as Turing machines and can even do what Turing machines do more efficiently [21, 22]. According to their organizational structure, P systems are divided into three categories: cell-like P systems [23], tissue-like P systems [24], and neural-like P systems [25]. Among them, the cell-like P system is the first membrane model proposed, and research on it is also the most complete [26–28]. Its basic components include the membrane structure, objects, and membrane rules. In the cell-like P system, membranes divide the whole system into different regions in which objects and rules exist; the objects are usually represented by characters or strings of symbols; the rules in each region are used to process the objects in the corresponding membrane. Objects are operated on by the rules in each membrane in a highly parallel manner [29–31], which makes ensemble clustering more efficient.

In this paper, we use the three genetic operators (selection, crossover, and mutation) of the genetic mechanism to realize the evolution of the chromosomes and use the communication mechanism of the cell-like P system to share outstanding objects between the membranes, which accelerates the convergence of the algorithm. The proposed algorithm is used to optimize the base clusterings and find the optimal chromosome as the final ensemble clustering result. Section 2 gives the basic concepts of ensemble clustering, the genetic algorithm, and the cell-like P system. Section 3 describes the improved GA-based ensemble clustering algorithm. Section 4 presents the proposed algorithm. Section 5 reports the experimental results, and Section 6 summarizes the work in this paper and outlines future work.

2. Preliminaries

In this section, we introduce some basic concepts of ensemble clustering, genetic algorithm, and cell-like P system.

2.1. Ensemble Clustering

The ensemble clustering process is divided into two steps: first we generate a set of different base clusterings, and then we use a consensus function to find a consensus clustering result that agrees as much as possible with the existing base clusterings. To produce a number of diversified base clusterings, from the algorithm point of view, we can run the same clustering algorithm with different initialization parameters or use different clustering algorithms. From the data preprocessing point of view, we can choose different attributes or different sample subsets of the data set. The ensemble clustering process is shown in Figure 1.

2.2. Genetic Algorithm

Genetic algorithm is one of the intelligent optimization algorithms; it has the advantages of fast search speed, good universality, and global search ability.

The basic steps of the genetic algorithm are as follows:
(1) Select the encoding mode; set the crossover rate, the mutation rate, and the evolution generation Gen = 0.
(2) Generate the initial population P(Gen).
(3) Calculate the fitness of each chromosome in the population according to the objective function.
(4) Gen = Gen + 1.
(5) If Gen reaches the set condition, go to step (11); otherwise go to step (6).
(6) Select two chromosomes from P(Gen − 1), with a selection probability proportional to each chromosome's fitness.
(7) Perform crossover at a randomly determined point of each selected pair of chromosomes at the preset crossover rate.
(8) Randomly select a point from each selected chromosome in accordance with the preset mutation rate, and change the corresponding bit value.
(9) Select the newly generated chromosomes and those with high fitness values in P(Gen − 1) to form the next generation P(Gen).
(10) If the termination condition is not satisfied, go to step (3).
(11) The chromosome with the highest fitness in the population P(Gen) is the final result, and the algorithm stops.
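The steps above can be sketched in a few lines of Python. This is only an illustrative toy, not the implementation used in this paper: the function name `run_ga`, the integer-label chromosome, the default rates, and the small weight offset used to keep roulette-wheel weights positive are all assumptions made for the example.

```python
import random


def run_ga(fitness, chrom_len, n_labels, pop_size=20, max_gen=50,
           crossover_rate=0.8, mutation_rate=0.05, n_elite=2, seed=0):
    """Toy generational GA over integer-label chromosomes (illustrative only)."""
    rng = random.Random(seed)
    # Steps (1)-(2): choose an encoding (lists of labels) and build P(0).
    pop = [[rng.randrange(n_labels) for _ in range(chrom_len)]
           for _ in range(pop_size)]
    for _ in range(max_gen):                          # steps (4), (5), (10)
        scores = [fitness(c) for c in pop]            # step (3)
        order = sorted(range(pop_size), key=lambda i: scores[i], reverse=True)
        # Step (9): high-fitness chromosomes pass directly to the next generation.
        next_pop = [pop[i][:] for i in order[:n_elite]]
        # Shift scores so every roulette-wheel weight is strictly positive.
        low = min(scores)
        weights = [s - low + 1e-9 for s in scores]
        while len(next_pop) < pop_size:
            a, b = rng.choices(pop, weights=weights, k=2)        # step (6)
            child = a[:]
            if chrom_len > 1 and rng.random() < crossover_rate:  # step (7)
                cut = rng.randrange(1, chrom_len)
                child = a[:cut] + b[cut:]
            if rng.random() < mutation_rate:                     # step (8)
                child[rng.randrange(chrom_len)] = rng.randrange(n_labels)
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)                      # step (11)


# Toy objective: prefer chromosomes with many zero labels.
best = run_ga(lambda c: c.count(0), chrom_len=8, n_labels=3)
```

Because the random generator is seeded inside the function, repeated calls with the same arguments return the same chromosome, which makes the sketch easy to experiment with.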

2.3. Cell-Like P System

The P system is a distributed, maximally parallel, and nondeterministic computation model; numerous studies [32] have shown that many simple membrane computing models have the same computing power as Turing machines in theory and may even have the potential to go beyond the limitations of Turing machines.

The cell-like P system is the earliest membrane computing model; its three basic elements are the membrane structure, the multisets of objects, and the evolutionary rules. The data are represented by strings or characters; objects are controlled by the evolution rules inside each membrane and can pass through membranes. The P system is divided into many regions by membranes; the outermost membrane is called the skin membrane, and it contains a number of submembranes. The basic membrane structure is shown in Figure 2.

A cell-like P system of degree $m$ is defined as follows:

$$\Pi = (O, T, C, \mu, w_1, \ldots, w_m, R_1, \ldots, R_m, \rho, i_0), \tag{1}$$

where
(1) $O$ is an alphabet which includes all the objects of the system.
(2) $T \subseteq O$ is the output alphabet.
(3) $C \subseteq O - T$ is a set of catalysts whose elements do not change during evolution and do not produce new characters, but they are necessary for some evolutionary rules.
(4) $\mu$ is the membrane structure of degree $m$.
(5) $w_i$ are the multisets of objects in each membrane region $i$.
(6) $R_i$ are the evolutionary rules in membrane $i$.
(7) $\rho$ is the precedence level of the rules in $R_i$.
(8) $i_0$ is the output of this P system.

In the cell-like P system, the basic evolutionary rule is a pair $(u, v)$, which can also be expressed as $u \to v$, where $u$ is a string over $O$ and $v = v'$ or $v = v'\delta$; $v'$ is a string over $O \times \{here, out, in_j\}$. Here $here$ means the object remains in membrane $i$, $out$ means the object will be sent to the outer membrane, and $in_j$ means the object will be sent to the inner membrane $j$. If the evolutionary rule contains $\delta$, the membrane is dissolved after the rule is executed. The P system starts from the initial state (represented by the object multisets) and uses the evolutionary rules to process and transport objects to complete the calculation.

3. Improved GA-Based Ensemble Clustering Algorithm

3.1. Microcluster Based Chromosome Encoding

The genetic algorithm is one of the solutions to the clustering problem. In previous studies of genetic-based ensemble clustering algorithms, the class labels of the base clusterings are used as the chromosome encoding. When the number of data objects is large, this encoding occupies a lot of space and reduces efficiency. In addition, crossover and mutation operations may reassign data points that have already been placed in the same cluster. Specifically, if two objects are assigned to the same cluster in all the base clusterings, we consider them fully similar, and they should be treated as one object that cannot be separated by crossover and mutation operations. In this paper, we therefore improve the coding of the chromosome and propose the microcluster based chromosome encoding approach.

We introduce the concept of the microcluster for a more compact representation of the base clusterings. Let $X = \{x_1, \ldots, x_n\}$ be a data set of $n$ objects. We run the basic clustering algorithms $r$ times to partition $X$ into $r$ base clusterings $\pi_1, \ldots, \pi_r$, where $\pi_i$ is the $i$th base clustering. Let $\pi_i(x_p)$ be the cluster in $\pi_i$ that contains object $x_p$. The objects $x_p$ and $x_q$ are regarded as belonging to the same microcluster if they are assigned to the same cluster in all of the base clusterings; that is, for $i = 1, \ldots, r$, $\pi_i(x_p) = \pi_i(x_q)$.

Given multiple base clusterings, we can obtain a set of nonoverlapping microclusters, as shown in Figure 3, denoted as

$$Y = \{y_1, y_2, \ldots, y_l\}, \tag{2}$$

where $Y$ is the set of microclusters and $y_i$ represents the $i$th microcluster.

Figure 3 shows the generation process of the microclusters, using a data set with seven objects as a sample. Two base clusterings are shown in (a) and (b), which contain two clusters and three clusters, respectively; we overlap (a) and (b) to get (c); then we generate the set of microclusters in (d).

In this paper, we use microcluster-based labels instead of the labels of the original objects to code the chromosome. A microcluster contains one or more objects and is treated as a single object in chromosome coding, which reduces the length of the chromosome, decreases the errors caused by mutation and crossover, and thereby improves the accuracy of the algorithm. For example, in Figure 3 the previous approach codes each of the two base clusterings with seven class labels, one per object; in this paper, each base clustering is coded with four values, one per microcluster, where each coded value is the label of the cluster to which the microcluster belongs. This method makes the individual coding shorter and thus reduces the search space, and meanwhile the individuals considered fully similar in the base clusterings are no longer separated.
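A minimal sketch of the microcluster encoding is given below. The label vectors `pi_1` and `pi_2` are hypothetical values chosen only to match the shape of the example (seven objects, two and three clusters, four microclusters); they are not read from Figure 3, and the helper names `build_microclusters`, `encode`, and `decode` are likewise assumptions.

```python
from collections import defaultdict


def build_microclusters(base_clusterings):
    """Group objects that receive the same label in every base clustering."""
    n_objects = len(base_clusterings[0])
    groups = defaultdict(list)
    for obj in range(n_objects):
        # The tuple of labels across all base clusterings identifies a microcluster.
        signature = tuple(labels[obj] for labels in base_clusterings)
        groups[signature].append(obj)
    # dict insertion order keeps microclusters ordered by their first member.
    return list(groups.values())


def encode(labels, microclusters):
    """One gene per microcluster: the cluster label shared by its members."""
    return [labels[members[0]] for members in microclusters]


def decode(chromosome, microclusters, n_objects):
    """Map a microcluster-level chromosome back to one label per object."""
    labels = [None] * n_objects
    for gene, members in zip(chromosome, microclusters):
        for obj in members:
            labels[obj] = gene
    return labels


# Hypothetical base clusterings over seven objects (indices 0..6).
pi_1 = [1, 1, 1, 1, 2, 2, 2]
pi_2 = [1, 1, 2, 2, 2, 3, 3]
microclusters = build_microclusters([pi_1, pi_2])
chromosomes = [encode(pi, microclusters) for pi in (pi_1, pi_2)]
```

With these hypothetical inputs the chromosome length drops from seven genes to four, and decoding a chromosome restores one label per original object.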

3.2. Design of Fitness Function

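The fitness function guides the evolution direction of the population. The goal of clustering is to find a result in which objects in the same cluster are as similar as possible and objects in different clusters are as different as possible. In this paper we therefore use the clustering evaluation measure OCQ proposed in [33] as the fitness function. OCQ is defined as follows:

$$\mathrm{OCQ}(\xi) = 1 - \left[\xi \cdot \mathrm{Cmp} + (1 - \xi) \cdot \mathrm{Sep}\right], \tag{3}$$

where Cmp represents cluster compactness and Sep indicates cluster separation. $\xi \in [0, 1]$ is the balance coefficient, which weights the proportions of Cmp and Sep; different data sets use different values. Cmp is defined as follows:

$$\mathrm{Cmp} = \frac{1}{C} \sum_{i=1}^{C} \frac{\mathrm{Dev}(c_i)}{\mathrm{Dev}(X)}, \tag{4}$$

where $C$ is the number of clusters, $\mathrm{Dev}(X)$ is the variance of $X$, and $\mathrm{Dev}(c_i)$ is the variance of class $c_i$. $\mathrm{Dev}(X)$ is defined as follows:

$$\mathrm{Dev}(X) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} d^2(x_i, \bar{x})}, \tag{5}$$

where $n$ is the number of objects in data set $X$, $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$, and $d(x_i, \bar{x})$ is the distance between $x_i$ and $\bar{x}$. The smaller the value of Dev, the better the clustering result. Sep is defined as follows:

$$\mathrm{Sep} = \frac{1}{C(C-1)} \sum_{i=1}^{C} \sum_{j=1, j \ne i}^{C} \exp\left(-\frac{d^2(x_{c_i}, x_{c_j})}{2\sigma^2}\right), \tag{6}$$

where $\sigma$ is the Gaussian constant (to simplify the calculation, usually $2\sigma^2 = 1$), and $x_{c_i}$ and $x_{c_j}$ are the centers of clusters $c_i$ and $c_j$. The larger the value of OCQ, the better the clustering result.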

3.3. Elite Selection Function

In this section, we introduce an elite selection strategy to preserve the best individuals during the evolution of the population. In each generation, a certain number of high-fitness chromosomes are passed directly to the next generation in order to preserve excellent genes. The elite strategy improves the evolution efficiency and optimization ability of the proposed algorithm. In addition, the ratio of chromosomes that are passed directly to the next generation increases linearly with the number of iterations, between a minimum and a maximum selection ratio: as the evolution generation increases, the proportion of excellent genes in the population also increases, so the elite selection function lets the ratio grow with it. Experiments show that the evolution result is best when the elite size is 2%~10% of the population, so the maximum and minimum ratios are set to 0.1 and 0.02, respectively.
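One linear schedule consistent with this description is sketched below. The exact functional form is an assumption: here the ratio is interpolated between the minimum (0.02) and the maximum (0.1) over a hypothetical maximum generation `max_gen`, and the names `elite_ratio` and `elite_count` are illustrative.

```python
def elite_ratio(t, max_gen, r_min=0.02, r_max=0.1):
    """Assumed linear schedule: r_min at generation 0, r_max at max_gen."""
    if max_gen <= 0:
        raise ValueError("max_gen must be positive")
    t = min(max(t, 0), max_gen)          # clamp the generation counter
    return r_min + (r_max - r_min) * t / max_gen


def elite_count(pop_size, t, max_gen):
    """Number of chromosomes copied unchanged into the next generation."""
    return max(1, round(pop_size * elite_ratio(t, max_gen)))
```

For a population of 50 chromosomes and 100 generations, this schedule keeps one elite chromosome at the start and five at the end.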

4. The Proposed GA-Based Membrane Evolutionary Algorithm

4.1. The Evolution Rules and the Communication Rules of Cell-Like P System

In the cell-like P system, membrane rules mainly include two types: evolutionary rules and communication rules. Evolutionary rules are used to promote the evolution of the chromosomes. Communication rules are used to communicate and share information between two regions.

In this paper, the evolutionary rules contain K-means rules, AL, SL, and CL rules [14], selection rules, crossover rules, and mutation rules.

K-means rules are used to generate the base clusterings; the detailed description of the K-means rules is as follows.

Given a data set $X = \{x_1, \ldots, x_n\}$ and a set of cluster centers $\{c_1, \ldots, c_k\}$, if the distance between $x_i$ and $c_j$ is less than the distance between $x_i$ and any other center $c_l$, the object $x_i$ is reassigned to cluster $C_j$:

$$x_i \in C_j \quad \text{if } d(x_i, c_j) < d(x_i, c_l) \text{ for all } l \ne j. \tag{8}$$

When all the points are assigned to the corresponding clusters, the new center of each cluster is the average value of the points in this cluster:

$$c_j' = \frac{1}{n_j} \sum_{x_i \in C_j} x_i, \tag{9}$$

where $c_j'$ is the center of the new cluster and $n_j$ is the number of objects belonging to $C_j$.
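The two K-means rules (reassignment to the nearest center, then recomputation of each center as the mean of its members) can be sketched as follows. This is a plain illustration using squared Euclidean distance; the tie-breaking rule (lowest cluster index wins) and the choice to leave the center of an empty cluster unchanged are assumptions of the sketch.

```python
def assign(points, centers):
    """Assignment rule: each point goes to its nearest center."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        labels.append(dists.index(min(dists)))   # ties: lowest index wins
    return labels


def update(points, labels, centers):
    """Update rule: each center becomes the mean of the points assigned to it."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:                 # assumption: keep an empty cluster's center
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(m[d] for m in members) / len(members)
                                 for d in range(len(old))))
    return new_centers


def kmeans(points, centers, max_iter=100):
    """Alternate the two rules until the centers stop changing."""
    centers = [tuple(c) for c in centers]
    labels = assign(points, centers)
    for _ in range(max_iter):
        new_centers = update(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
        labels = assign(points, centers)
    return labels, centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

Running the sketch with different initial centers or different numbers of centers yields different partitions, which is how a diverse pool of base clusterings can be produced.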

For AL, SL, and CL rules, the two partitions with the highest similarity are merged into a new, bigger partition, and thus the number of partitions will finally reduce to one. The similarity of two partitions is computed by the three rules mentioned above. Let be the set of merged partition in the -step for . is the number of objects of data set. represents the number of partitions in . Each partition contains one or more microclusters. Let represent a microcluster; we write if microcluster belongs to . Let ; the similarity matrix for , AL, SL, and CL rules can be operated as follows: where is the Cosine similarity and is the number of microclusters of .

Selection rules imitate the laws of natural selection and are used to select objects from the population to evolve to the next generation. In this paper, we calculate the fitness value of each chromosome, and then the selection probability of each chromosome is obtained from its fitness value. Selected chromosomes undergo crossover and mutation to improve their fitness, and then a certain percentage of chromosomes with high fitness are chosen as the candidate set that evolves to the next generation. We use the usual roulette wheel method to define the selection rule; the selection probability formula is as follows:

$$p_i = \frac{f_i}{\sum_{j=1}^{N} f_j}, \tag{11}$$

where $N$ is the number of chromosomes and $f_i$ is the fitness value of each individual.
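Roulette wheel selection can be sketched as follows: each chromosome receives a slice of the unit interval proportional to its fitness, and a uniform random number picks the slice. The function names are illustrative, and the sketch assumes nonnegative fitness values with a positive sum.

```python
import random


def selection_probabilities(fitness_values):
    """p_i = f_i / sum_j f_j for nonnegative fitness values."""
    total = sum(fitness_values)
    if total <= 0:
        raise ValueError("fitness values must have a positive sum")
    return [f / total for f in fitness_values]


def roulette_select(population, fitness_values, rng):
    """Pick one individual with probability proportional to its fitness."""
    probs = selection_probabilities(fitness_values)
    r = rng.random()                  # uniform in [0, 1)
    cumulative = 0.0
    for individual, p in zip(population, probs):
        cumulative += p
        if r < cumulative:
            return individual
    return population[-1]             # guard against floating-point rounding


picked = roulette_select(["a", "b", "c"], [1.0, 1.0, 2.0], random.Random(0))
```

An individual with zero fitness is never chosen, while an individual holding half of the total fitness is chosen about half of the time.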

In the evolutionary process, the algorithm often falls into a local optimum, so the crossover rate and mutation rate are increased to improve the global convergence [34]. The crossover function is as follows:where , is predefined maximum crossover rate, and is the minimum crossover rate.

The mutation function is as follows:where ,, and are predefined maximum mutation rate and minimum mutation rate.

The crossover rule uses single-point crossover, in which the crossover is carried out according to the crossover probability (12). Single-point mutation is used to mutate objects and produce new individuals. Since the mutation operation has a certain degree of blindness, we set the mutation probability very small; it is calculated as in (13). If a gene is chosen as the mutation point according to the mutation function, its value is replaced by a random positive integer that is no greater than the maximum value in the individual being mutated.
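The two operators can be sketched as below. The rates are passed in as plain numbers, since the generation-dependent rate functions (12) and (13) are not reproduced here; the function names are illustrative, and the sketch assumes that cluster labels are positive integers starting at 1.

```python
import random


def single_point_crossover(a, b, rate, rng):
    """Swap the tails of two equal-length chromosomes with probability `rate`."""
    if len(a) != len(b):
        raise ValueError("chromosomes must have the same length")
    if len(a) < 2 or rng.random() >= rate:
        return a[:], b[:]
    cut = rng.randrange(1, len(a))            # cut point strictly inside
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def single_point_mutation(chromosome, rate, rng):
    """With probability `rate`, replace one gene by a random label in 1..max."""
    child = chromosome[:]
    if rng.random() < rate:
        pos = rng.randrange(len(child))
        child[pos] = rng.randint(1, max(chromosome))   # inclusive bounds
    return child


rng = random.Random(1)
parent_a, parent_b = [1, 1, 2, 2], [3, 3, 1, 1]
child_a, child_b = single_point_crossover(parent_a, parent_b, 1.0, rng)
mutant = single_point_mutation(parent_a, 1.0, rng)
```

Because genes are microcluster labels, neither operator can split objects that belong to the same microcluster.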

Communication Rules. Communication rules enable the exchange of information between two membranes, share excellent objects, and promote the evolution of the object set in each membrane. The form of the communication rule is as follows:

$$(i, u/v, j). \tag{14}$$

This communication rule means that object $u$ in membrane $i$ is exchanged with object $v$ in membrane $j$; if $v = \lambda$, which means $v$ is null, $u$ is transported to membrane $j$, and vice versa. In this paper, we define that a copy of object $u$ still remains in membrane $i$ after $u$ is transported to membrane $j$.

4.2. Description of the Proposed GA-Based Membrane Evolutionary Algorithm

In this section, we design the membrane structure for the proposed algorithm, referred to as GMEAEC, and describe the algorithm process. The membrane structure is shown in Figure 4.

This cell-like P system is defined as follows:where
(1) represents the initial objects in membrane 1; the initial objects are the data to be clustered. , are the base clusterings randomly selected from membrane 1, are the elite individuals selected from the subpopulations according to the probability (7), and is the best chromosome in each generation, preserved in membrane 0.
(2) are the evolution rules in membrane ; are the evolution rules used to generate base clusterings, including K-means rules and AL, CL, and SL rules; , include the selection rule, crossover rule, mutation rule, and communication rule in membrane , which are used to achieve the evolution of the population, while is the rule in membrane , that is, the communication rule.
(3) is the output result in membrane 0.

The description of the algorithm process is as follows:
(1) Run the base clustering algorithms repeatedly in membrane 1 to construct a pool of base clusterings and then generate the microcluster representation.
(2) Randomly select the same number of base clusterings from membrane 1 for each of the evolution membranes to construct multiple populations.
(3) Initialize the populations; each chromosome is coded from a base clustering represented by the microcluster-based labels.
(4) Calculate the fitness of the individuals according to the fitness function.
(5) Transport the elite individuals of each subpopulation to the elite membrane to construct the set of elite individuals, while the original populations keep a copy.
(6) Use the selection rules to select chromosomes according to the predefined probability, and use the crossover rule and mutation rule to promote the evolution of the chromosomes; the populations in the membranes evolve in parallel.
(7) Sort the chromosomes in the elite membrane by fitness, select the top-ranked chromosomes, and transport them to the evolution membranes to replace the low-fitness chromosomes.
(8) Transport the best chromosome to membrane 0; if its fitness value is larger than that of the present one, replace it; otherwise discard it.
(9) If the termination condition is satisfied, the algorithm ends with the highest-fitness chromosome; map the microclusters back to objects and output the objects in membrane 0. Otherwise repeat (4)–(9).
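The communication part of this process, steps (5), (7), and (8), can be sketched as a single function. This is a simplified illustration rather than the authors' implementation: membranes are modeled as plain lists of chromosomes, and the function name `communicate` and the parameters `n_send` and `n_share` (how many elites each membrane sends and how many are shared back) are hypothetical.

```python
def communicate(membranes, fitness, n_send=1, n_share=1):
    """One communication round between evolution membranes and an elite pool."""
    # Step (5): each membrane sends copies of its best chromosomes.
    elite_pool = []
    for pop in membranes:
        ranked = sorted(pop, key=fitness, reverse=True)
        elite_pool.extend(c[:] for c in ranked[:n_send])
    # Step (7): rank the elite pool and pick the chromosomes to share.
    shared = sorted(elite_pool, key=fitness, reverse=True)[:n_share]
    updated = []
    for pop in membranes:
        ranked = sorted(pop, key=fitness, reverse=True)
        keep = max(0, len(pop) - len(shared))
        # The shared elites replace the lowest-fitness chromosomes.
        survivors = [c[:] for c in ranked[:keep]]
        updated.append(survivors + [c[:] for c in shared][:len(pop)])
    # Step (8): the overall best candidate is returned for the output membrane.
    return updated, shared[0]


membranes = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
updated, best = communicate(membranes, fitness=sum)
```

In a full run this function would be called once per generation, after the selection, crossover, and mutation rules have been applied inside each membrane, and the caller would keep `best` only if it improves on the chromosome already stored in membrane 0.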

The overall process of our approach is shown in Figure 5. We first use K-means and three agglomerative methods to generate the base clustering pool, then assign the data objects to microclusters, and after that code the chromosomes with the microcluster-based labels introduced in Section 3.1. The evolutionary mechanism of the GA then finds the final ensemble result.

The membrane evolutionary algorithm takes advantage of the maximal parallelism of membrane systems and the global search ability of the genetic algorithm. In the base clustering generation step, we use four algorithms combined with different initial parameters to obtain diversified base clusterings, so that the ensemble result shares the information of many single clustering results and integrates them into a better clustering result than any one of them. In the ensemble clustering step, the result is obtained by the membrane evolutionary algorithm, which uses the improved genetic algorithm; the improved chromosome encoding regards the objects assigned to the same cluster in all base clusterings as a microcluster, so that they are not separated by crossover and mutation operations, which increases the accuracy of the clustering. In addition, the elite selection strategy and the parallelism of membrane systems allow the elite chromosomes to be generated synchronously in each membrane, and the best chromosomes among them are transported to all membranes to guide the evolution of the next generation. Together, these properties make GMEAEC perform better than the other algorithms.

4.3. Time Complexity Analysis

In this section, the worst-case time cost of GMEAEC is analyzed. In the base clustering generation step, we put the objects in membrane 1 and use K-means and three agglomerative clustering methods with different initial parameters to generate base clusterings. Let dataset have records; each record has attributes; we partition the data set to clusters; the computational complexity of K-means is , where is the number of iterations for the convergence of K-means clustering and is the number of base clusterings generated by K-means. The computational complexity of three agglomerative methods is , and is the number of base clusterings generated by each agglomerative method. After generating base clusterings pool, we can compute microclusters, and the complexity of the microclusters generation is ; the complexity of the integration step is , where MaxGen is the number of iterations for convergence of genetic algorithm. As a result, the complexity of the base clustering generation is , and the complexity of the ensemble clustering step is .

5. Experiment Analysis

5.1. Experimental Setup

Experimental Data. We use six real-world data sets from the UC Irvine Machine Learning Repository [35] in our experiment. Table 1 shows some important characteristics of these data sets.

Validation Measure. A validation measure is used to assess the accuracy of the proposed algorithm; in this paper, we use the normalized Rand index [36], since the cluster labels of all data sets are known. Its value usually ranges between 0 and 1. A higher value means higher accuracy of the clustering result.
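A minimal sketch of such a measure is given below, assuming that the normalized Rand index here is the standard chance-corrected (adjusted) Rand index computed from the contingency table of true and predicted labels. The function name and the handling of the degenerate case are assumptions of the sketch.

```python
from collections import Counter
from math import comb


def normalized_rand_index(true_labels, pred_labels):
    """Chance-corrected Rand index between two labelings of the same objects."""
    n = len(true_labels)
    if n != len(pred_labels) or n < 2:
        raise ValueError("need two labelings of the same length (at least 2)")
    cells = Counter(zip(true_labels, pred_labels))   # contingency table
    rows = Counter(true_labels)
    cols = Counter(pred_labels)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:          # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


score = normalized_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

The measure is invariant to renaming of cluster labels, so a predicted partition that matches the true classes under a different numbering still obtains the maximum score.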

Base Clusterings Generation. It has been shown that ensemble clustering is more effective when the errors of the base clusterings are different; that is, diversity among the base clusterings enhances the ensemble result. A single clustering algorithm run many times usually generates similar results, so for each dataset we use K-means and three agglomerative clustering methods, namely, average-linkage (AL), complete-linkage (CL), and single-linkage (SL), to generate the base clustering pool, with the initial number of clusters chosen randomly within a range determined by the true number of clusters and the size of the data set. By running K-means, AL, CL, and SL 50 times each, a pool of 200 base clusterings is obtained for each benchmark dataset. For each run of the proposed algorithm and of the comparison ensemble algorithms, we randomly select a subset of base clusterings from the pool for the ensemble. To rule out the factor of getting lucky occasionally, we repeat the selection many times for each experiment and report the average performance of all ensemble methods. Unless otherwise mentioned, the ensemble size is fixed at a preset value in our experiment.

Parameter Setting. The maximum number of iterations of the proposed algorithm is set according to the dataset size. The crossover rate and mutation rate are set as follows: the maximum and minimum crossover rates are 0.3 and 0.1, and the maximum and minimum mutation rates are 0.09 and 0.01. We make the crossover rate and mutation rate depend on the evolution generation to improve the global convergence of the proposed algorithm. Among the membranes, membrane 0 is used for saving the optimal solution, membrane 1 is used to generate the base clustering pool, one membrane is used for preserving the better individuals of each population, and the other membranes are used for evolving individuals in parallel; among them the top-ranked individuals with high fitness are passed directly to the next generation. The number of evolution generations varies across data sets to obtain the best result.

5.2. Comparison against Base Clusterings

The purpose of ensemble clustering is to generate a more accurate and robust clustering result than the base clustering algorithms by integrating multiple base clustering results into a consensus one; in this section, we compare the proposed algorithm GMEAEC against the base clusterings to demonstrate its effectiveness. The average score is obtained over 100 runs for each algorithm. As shown in Figure 6, the proposed GMEAEC algorithm outperforms the base clustering algorithms on all of the given data sets.

5.3. Comparison against Other Ensemble Clustering Approaches

In this section, we evaluate the effectiveness and robustness of the proposed algorithm by comparing it with five other ensemble clustering approaches: K-means based consensus clustering (KCC) [37]; a GA-based ensemble clustering algorithm [19], referred to as CEGA; and three graph partitioning algorithms, CSPA, HGPA, and MCLA [4]. KCC transforms consensus clustering into K-means clustering by means of the contingency matrix and a binary data set. CEGA is a GA-based ensemble clustering method that encodes the chromosome with the class labels of the base clusterings. CSPA is one of the most primitive ensemble clustering methods: if two objects are assigned to the same cluster in all base clusterings, they are considered completely similar; otherwise they are dissimilar, and the similarity of two objects is defined by the proportion of base clusterings in which they fall into the same cluster. Based on the above description, the entire similarity matrix can be computed in one sparse matrix multiplication, $S = \frac{1}{r} H H^{T}$, where $r$ is the number of base clusterings and $H$ is the binary membership indicator matrix of the base clusterings, so that each entry of $H H^{T}$ is the number of times two objects belong to the same cluster. The graph partitioning algorithm METIS [38] is used to partition the similarity graph (vertex = object, edge weight = similarity). HGPA is a hypergraph partitioning algorithm in which each object is regarded as a vertex with the same weight and each cluster is considered a hyperedge. Ensemble clustering is thereby converted into a hypergraph partitioning problem, which is solved by cutting the graph into partitions with the minimal cut. The idea of MCLA is to group the hyperedges, which represent clusters, and to assign each object to the group of hyperedges in which it participates most often.
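The pairwise similarity used by CSPA can be illustrated with a direct computation that counts, for every pair of objects, the fraction of base clusterings that put them in the same cluster. The sketch below uses plain nested loops instead of a sparse matrix product, and the function name `co_association` is illustrative.

```python
def co_association(base_clusterings):
    """Fraction of base clusterings in which each pair of objects co-occurs."""
    r = len(base_clusterings)
    n = len(base_clusterings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in base_clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1      # objects i and j share a cluster
    return [[c / r for c in row] for row in counts]


similarity = co_association([[0, 0, 1], [0, 1, 1]])
```

The resulting matrix is symmetric with ones on the diagonal, and it is the edge-weight matrix of the similarity graph that a graph partitioner would cut.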

We run the proposed GMEAEC algorithm and the other ensemble clustering algorithms 100 times on each data set; for each run, the base clusterings are randomly selected from the base clustering pool, and the number of base clusterings is preset. More details and the parameter settings are described in Section 5.1. Table 2 shows the maximum (max), minimum (min), average (ave), and variance (var) of the validation measure; we use two criteria, the average value and the variance, to evaluate the accuracy and the robustness of the proposed algorithm. In Table 2, the top 3 average scores and the 3 lowest variance scores are highlighted in bold. The proposed algorithm achieves the highest average and maximum scores over 100 runs on the balance, pima, and wine datasets, while its variance on the wine and magic04 datasets is the lowest. To compare the performance of these approaches clearly, Figure 7(a) shows the number of times each approach is ranked in the top 3 by average value, which indicates the accuracy of the algorithm. Figure 7(b) shows the number of times each approach is ranked among the 3 lowest by variance, which illustrates the stability and robustness of the algorithm. Overall, the proposed algorithm achieves the best performance in clustering accuracy, stability, and robustness compared to the other ensemble clustering approaches across the datasets.

5.4. Robustness to Ensemble Size

In this section, we further evaluate the robustness of GMEAEC by varying the number of base clusterings. For each dataset, we select 10, 20, 30, 40, and 50 base clusterings, respectively, for the clustering ensemble. For each ensemble size, we run GMEAEC and the other ensemble clustering algorithms 10 times and report the average scores in Figure 8. We can see from Figure 8 that GMEAEC is almost consistently the best for all ensemble sizes and performs better than the other ensemble methods on the datasets. On the balance dataset in particular, GMEAEC is clearly superior to the other methods at every ensemble size, which demonstrates the robustness of our method with respect to ensemble size.

6. Concluding Remarks

In this paper, we improve the coding of chromosomes used in previous studies; a microcluster-based chromosome encoding is designed to improve the accuracy of ensemble clustering. The improved genetic algorithm contains selection rules, crossover rules, and mutation rules. These rules are used as evolution rules and combined with the communication mechanism of the cell-like P system, yielding a novel GA-based membrane evolutionary algorithm for ensemble clustering. The global convergence of the proposed algorithm and the parallel computing ability of the cell-like P system lead to better performance on six real-world data sets. In the future, we will combine the GA with other evolutionary algorithms and other membrane systems to improve the accuracy and efficiency of ensemble clustering.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This project is supported by National Natural Science Foundation of China (61472231, 61170038, 61502283, and 61640201), Jinan City Independent Innovation Plan Project in College and Universities, China (201401202), Ministry of Education of Humanities and Social Science Research Project, China (12YJA630152), and Social Science Fund Project of Shandong Province, China (11CGLJ22 and 16BGLJ06).