Abstract

P systems are a class of distributed parallel computing models. This paper presents a novel clustering algorithm, called the membrane clustering algorithm, which is inspired by the mechanism of a tissue-like P system with a loop structure of cells. The objects in the cells represent candidate cluster centers and are evolved by the evolution rules. Based on the loop membrane structure, the communication rules realize a local neighborhood topology, which supports the coevolution of the objects and improves their diversity in the system. The tissue-like P system can effectively search for the optimal partitioning with the help of its parallel computing advantage. The proposed clustering algorithm is evaluated on four artificial data sets and six real-life data sets. Experimental results show that it is superior to or competitive with the k-means algorithm and several evolutionary clustering algorithms recently reported in the literature.

1. Introduction

Data clustering is a fundamental problem in data mining: the process of grouping data into classes or clusters such that the data in each cluster share a high degree of similarity while being very dissimilar to data from other clusters [1]. Over the past years, a large number of clustering algorithms have been proposed [2–4], which can be divided roughly into two categories: hierarchical and partitional. Hierarchical clustering proceeds successively by either merging smaller clusters into larger ones or splitting larger clusters. Partitional clustering attempts to directly decompose a data set into several disjoint clusters based on a similarity measure, for example, mean square error (MSE). Clustering algorithms have been used in a wide variety of areas, such as pattern recognition, machine learning, image processing, and web mining [5, 6]. Among these algorithms, the k-means algorithm [7, 8] has received wide attention for two reasons: (i) k-means has recently been listed among the most influential data mining algorithms [9] and (ii) it is at the same time very simple and quite scalable, as it has linear asymptotic running time with respect to any variable of the problem. However, k-means is sensitive to the initial centers and easily gets stuck at locally optimal solutions. Moreover, k-means incurs a large time cost to find the globally optimal solution when the number of data points is large.

In recent years, some evolutionary algorithms have been introduced to overcome the shortcomings of the k-means algorithm because of their global optimization capability. Several genetic algorithm (GA) based clustering algorithms have been proposed in the literature [10–14]. However, most GA-based clustering algorithms can suffer from degeneracy, which occurs when numerous chromosomes represent the same solution. Degeneracy can lead to inefficient coverage of the search space, as the same configurations of clusters are repeatedly explored. To overcome this shortcoming, clustering algorithms based on particle swarm optimization (PSO) or ant colony optimization (ACO) have been proposed. Kao et al. have proposed a hybrid technique that combines k-means and PSO for cluster analysis [15]. Shelokar et al. have introduced an evolutionary algorithm based on ACO for the clustering problem [16]. Niknam and Amiri have presented a hybrid evolutionary optimization algorithm based on the combination of PSO and ACO for solving the clustering problem [17].

The aim of membrane computing is to abstract computing ideas (data structures, operations with data, ways to control operations, computing models, etc.) from the structure and the functioning of a single cell and from complexes of cells, such as tissues and organs including the brain. Three main classes of P systems have been investigated: cell-like P systems, based on a cell-like (hence hierarchical) arrangement of membranes delimiting compartments where multisets of chemicals evolve according to given evolution rules [18]; tissue-like P systems, which, instead of a hierarchical arrangement of membranes, consider arbitrary graphs as underlying structures, with membranes placed in the nodes while edges correspond to communication channels [19]; and neural-like P systems [20]. Many variants of all these systems have been considered, for example, [21, 22] for cell-like P systems, [23, 24] for tissue-like P systems, and [25–30] for neural-like P systems. An overview of the field can be found in [31], with up-to-date information available at the membrane computing website (http://ppage.psystems.eu/). These efforts have demonstrated the parallel computing advantage of P systems as well as their effectiveness in solving a variety of difficult problems; in particular, P systems can solve a number of NP-hard problems in linear or polynomial time [32] and even solve PSPACE problems in a feasible time [33, 34]. Moreover, membrane algorithms have demonstrated powerful global optimization performance [35–37].

This paper focuses on the application of membrane computing to data clustering. Our motivation is to apply the specially designed elements and inherent mechanisms of P systems to realize a novel clustering algorithm, called the membrane clustering algorithm.

2. Data Clustering Problem

Clustering is the process of recognizing natural groups or clusters in a data set based on some similarity measure. Suppose that a data set $D$ has $n$ sample points, $x_1, x_2, \ldots, x_n$ ($x_i \in \mathbb{R}^d$), and is partitioned into $K$ clusters, $C_1, C_2, \ldots, C_K$. Denote by $z_1, z_2, \ldots, z_K$ the corresponding centers. Usually, a partitional clustering algorithm searches for the optimal centers in the solution space according to some clustering measure in order to solve the data clustering problem. A commonly used clustering measure is

$$J = \sum_{i=1}^{n} \sum_{j=1}^{K} w_{ij} \, \lVert x_i - z_j \rVert^2, \tag{1}$$

where $w_{ij}$ is the associated weight of point $x_i$ with cluster $j$, which is either 1 or 0 (if point $x_i$ is allocated to cluster $j$, $w_{ij}$ is 1, otherwise 0).

The clustering process, separating the objects into the clusters, is realized as an optimization problem. The goal of the optimization problem is to find the optimal centers by minimizing objective function (1):

$$\min_{z_1, z_2, \ldots, z_K} J.$$

In addition, the value of this clustering measure is used to evaluate objects in the proposed clustering algorithm: the smaller the value of an object, the better the object.
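The evaluation of an object can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' Matlab implementation: the function name `clustering_measure` is hypothetical, each point is assigned to its nearest center (so its weight is 1 for that cluster and 0 for all others), and squared Euclidean distance is assumed as the dissimilarity.

```python
def clustering_measure(points, centers):
    """Assign each point to its nearest center and sum the squared distances."""
    total = 0.0
    labels = []
    for x in points:
        # Squared Euclidean distance from x to every candidate center (assumption).
        d2 = [sum((xi - zi) ** 2 for xi, zi in zip(x, z)) for z in centers]
        j = min(range(len(centers)), key=d2.__getitem__)
        labels.append(j)   # weight is 1 for cluster j and 0 for the others
        total += d2[j]
    return total, labels


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = [(0.0, 0.5), (10.0, 10.5)]
value, labels = clustering_measure(points, centers)
print(value, labels)  # 1.0 [0, 0, 1, 1]
```

If plain (unsquared) distances are intended instead, only the accumulation of `d2[j]` would change to its square root.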

3. Proposed Membrane Clustering Algorithm

In this section, the proposed membrane clustering algorithm, which is inspired by the mechanism of membrane computing, is discussed in detail. A tissue-like P system with a loop structure of cells is designed as its optimization framework. The tissue-like P system can be described as the following construct:

$$\Pi = (O_1, \ldots, O_q, R_1, \ldots, R_q, R', i_o),$$

where
(1) $O_i$ ($1 \le i \le q$) is the set of objects in cell $i$;
(2) $R_i$ ($1 \le i \le q$) is the set of evolution rules in cell $i$, which contains three evolution rules: selection, crossover, and mutation rules;
(3) $R'$ is a finite set of communication rules with the following forms:
(i) antiport rule: $(i, Z/Z', j)$, $i, j = 1, 2, \ldots, q$, $i \neq j$. The rule is used to communicate the objects between a cell and its two adjacent cells;
(ii) symport rule: $(i, Z/\lambda, 0)$, $i = 1, 2, \ldots, q$. The rule is used to communicate the objects between cell $i$ and the environment;
(4) $i_o = 0$ indicates the output region of the system.

Figure 1 shows the membrane structure of the tissue-like P system, which consists of $q$ cells. The cells are labeled by $1, 2, \ldots, q$, respectively. The region labeled by 0 is the environment and is also the output region of the system. The directed lines in Figure 1 indicate the communication of objects between the cells. Moreover, the cells are arranged in a loop topology based on the communication rules described below. As usual in P systems, the cells, as parallel computing units, run independently. In addition, the environment always stores the best object found so far in the system. When the system halts, the object in the environment is regarded as the output of the whole system.

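The role of the tissue-like P system is to evolve the optimal cluster centers for a data set; thus each object in the cells represents a group of (candidate) centers. Each object is therefore considered as a ($K \times d$)-dimensional real vector, where $K$ is the number of clusters and $d$ is the dimension of the data points, of the form

$$Z = (z_{11}, z_{12}, \ldots, z_{1d}, \ldots, z_{i1}, z_{i2}, \ldots, z_{id}, \ldots, z_{K1}, z_{K2}, \ldots, z_{Kd}),$$

where $z_{i1}, z_{i2}, \ldots, z_{id}$ are the components of the $i$th cluster center $z_i$, $i = 1, 2, \ldots, K$. For simplicity, suppose that each cell has the same number of objects, which is denoted by $m$.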

Initially, the system randomly generates $m$ initial objects for each cell. When an initial object is generated, $K \times d$ random real numbers are produced to form it, subject to the constraint $A_j \le z_{ij} \le B_j$, where $A_j$ and $B_j$ are the lower bound and upper bound of the $j$th dimensional component of the data points, respectively, $j = 1, 2, \ldots, d$.
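The encoding and initialization described above can be sketched as follows. The helper names (`dimension_bounds`, `random_object`, `decode`) are hypothetical, and a uniform distribution inside the per-dimension bounds is an assumption, since the text only requires that each component lie between the lower and upper bound of its dimension.

```python
import random


def dimension_bounds(points):
    """Per-dimension lower and upper bounds of the data points."""
    dims = range(len(points[0]))
    lower = [min(p[j] for p in points) for j in dims]
    upper = [max(p[j] for p in points) for j in dims]
    return lower, upper


def random_object(num_clusters, lower, upper, rng):
    """One object: K centers flattened into a list of K*d reals within bounds."""
    obj = []
    for _ in range(num_clusters):
        obj.extend(rng.uniform(lo, hi) for lo, hi in zip(lower, upper))
    return obj


def decode(obj, dim):
    """Split a flat object back into its K centers of dimension d."""
    return [obj[k:k + dim] for k in range(0, len(obj), dim)]


rng = random.Random(0)
points = [(0.0, 5.0), (2.0, 9.0), (4.0, 7.0)]
lower, upper = dimension_bounds(points)
# One cell holding 3 objects, each encoding K = 2 centers in d = 2 dimensions.
cell = [random_object(2, lower, upper, rng) for _ in range(3)]
```

Keeping the object flat makes crossover and mutation simple list operations, while `decode` recovers the centers whenever the object has to be evaluated.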

As usual, the tissue-like P system has two mechanisms: evolution and communication mechanisms. The two mechanisms will be described as follows.

3.1. Evolution Mechanism

The role of the evolution rules is to evolve the objects in the cells to generate the new objects used in the next computing step. During the evolution, each cell maintains the same size (the number of objects). In this work, three well-known genetic operations (selection, crossover, and mutation) [38, 39] are used as the evolution rules in the cells. In a computing step, all objects in a cell (located in its object pool) and the best objects from its two adjacent cells (located in its external pool) constitute a matching pool. The objects in the external pool are the best objects communicated from the two adjacent cells in the previous computing step. The objects in the matching pool are evolved by executing the selection, crossover, and mutation operations in turn. In order to keep the number of objects in each cell constant, a truncation operation is used to form the new object pool according to the clustering measure values of the objects. The objects in the new object pool are the objects to be evolved in the next computing step. Figure 2 shows the evolution procedure of objects in a cell.
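One computing step inside a single cell can be sketched as below. The function name `evolve_cell` and the callable parameters `vary` (standing for selection, crossover, and mutation applied in turn) and `fitness` (the clustering measure) are hypothetical. Applying truncation to the evolved matching pool only, rather than to parents and offspring together, is an assumption about how the procedure in Figure 2 works.

```python
def evolve_cell(object_pool, external_pool, vary, fitness):
    """Evolve a cell's matching pool and truncate back to the cell's size."""
    # Matching pool: the cell's own objects plus the best objects received
    # from its two adjacent cells in the previous computing step.
    matching_pool = list(object_pool) + list(external_pool)
    offspring = vary(matching_pool)
    # Truncation: keep as many objects as the cell held before,
    # preferring smaller fitness values.
    return sorted(offspring, key=fitness)[:len(object_pool)]


# Toy usage with one-component objects and a stand-in variation operator.
halve = lambda pool: [[v[0] * 0.5] for v in pool]
new_pool = evolve_cell([[4.0], [2.0], [8.0]], [[1.0], [6.0]],
                       vary=halve, fitness=lambda obj: abs(obj[0]))
print(new_pool)  # [[0.5], [1.0], [2.0]]
```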

In this work, the selection operation uses the usual rotating wheel (roulette wheel) method, while the crossover operation uses single-point crossover, in which the position of the crossover point is determined according to the crossover probability $p_c$ [39]. Single-point mutation is used to realize the mutation of objects. If a component of an object is a mutation point determined according to the mutation probability $p_m$, its value is perturbed after mutating, where the signs “+” and “−” of the perturbation occur with equal probability and its size is governed by a real number generated with uniform distribution.
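The three operators can be sketched in Python under several stated assumptions, since the exact formulas are not fully recoverable from the text. The selection weights (worst value minus own value) are one common way to adapt a roulette wheel to minimization. The crossover is applied with probability `p_c` at a uniformly chosen cut point. The mutation uses a relative step of the form z ± 2δz (or ±2δ when z is zero) with δ uniform in [0, 1), a form often used in GA-based clustering and not confirmed as the authors' formula. All function names are hypothetical.

```python
import random


def roulette_select(pool, values, rng):
    """Roulette wheel selection for minimization: smaller value, larger slice."""
    worst = max(values)
    weights = [worst - v + 1e-9 for v in values]  # assumed weighting scheme
    return rng.choices(pool, weights=weights, k=len(pool))


def single_point_crossover(a, b, p_c, rng):
    """With probability p_c, swap the tails of a and b after a random cut point."""
    if len(a) > 1 and rng.random() < p_c:
        cut = rng.randint(1, len(a) - 1)
        return a[:cut] + b[cut:], b[:cut] + a[cut:]
    return a[:], b[:]


def mutate(obj, p_m, rng):
    """Perturb each component with probability p_m (assumed relative step)."""
    out = []
    for z in obj:
        if rng.random() < p_m:
            delta = rng.random()             # uniform in [0, 1)
            sign = rng.choice((1.0, -1.0))   # "+" or "-" with equal probability
            z = z + sign * 2.0 * delta * z if z != 0 else sign * 2.0 * delta
        out.append(z)
    return out


rng = random.Random(1)
child1, child2 = single_point_crossover([1, 2, 3, 4], [5, 6, 7, 8], 1.0, rng)
mutated = mutate([1.0, 0.0], 1.0, rng)
```

The operators return new lists and never modify their inputs, so objects shared between pools are not changed by accident.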

3.2. Communication Mechanism

The communication mechanism is used to exchange objects between each cell and its two adjacent cells and to update the best object found so far in the environment. The communication mechanism is realized by communication rules of two types: the antiport rule $(i, Z/Z', j)$, which indicates that object $Z$ is communicated from cell $i$ to cell $j$ and object $Z'$ is communicated from cell $j$ to cell $i$, and the symport rule $(i, Z/\lambda, 0)$, which indicates that object $Z$ is communicated from cell $i$ to the environment.

The communication rules implicitly define the connection relationship between cells. Figure 3 shows the communication relation of objects between cells in the designed tissue-like P system. From a logical point of view, the communication relation shows that the cells form a loop topology, shown in Figure 3(a). This also reflects a neighborhood structure of the communication of objects; namely, each cell only exchanges and shares objects with its two adjacent cells, shown in Figure 3(b). After the objects are evolved, each cell (such as cell $i$) transmits several of its best objects to its adjacent cells (such as cells $i-1$ and $i+1$) and retrieves several best objects from these adjacent cells by using the communication rule, constituting the matching pool of objects in the next computing step. This special logical structure can bring the following benefits.
(1) The coevolution of objects in the cells can accelerate the convergence of the proposed clustering algorithm.
(2) The object sharing mechanism of the local neighborhood structure can enhance the diversity of objects in the entire system.
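The loop neighborhood can be sketched with modular index arithmetic. The function name `exchange_best` and the parameter `n_best` (how many best objects a cell transmits) are hypothetical, cells are indexed from 0 for convenience, and at least three cells are assumed so that the two neighbors of a cell are distinct.

```python
def exchange_best(cells, fitness, n_best):
    """Return, for each cell, the best objects received from its two ring neighbors."""
    q = len(cells)
    # The n_best objects with the smallest fitness in every cell.
    best = [sorted(cell, key=fitness)[:n_best] for cell in cells]
    external = []
    for i in range(q):
        left, right = (i - 1) % q, (i + 1) % q  # the loop closes at both ends
        external.append(best[left] + best[right])
    return external


cells = [[[1], [5]], [[2], [6]], [[3], [7]], [[4], [8]]]
external = exchange_best(cells, fitness=lambda obj: obj[0], n_best=1)
print(external)  # [[[4], [2]], [[1], [3]], [[2], [4]], [[3], [1]]]
```

Each returned list plays the role of the external pool that joins the cell's own objects in the next computing step.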

The communication of objects occurs not only between cells but also between each cell and the environment. The global best object found so far in the whole system is always stored in the environment. After the objects are evolved, each cell communicates the best object it found in the current computing step to the environment to update the global best object. The update strategy is that if $f(Z_{\text{best}}) < f(Z_{\text{gbest}})$ then $Z_{\text{gbest}} = Z_{\text{best}}$; otherwise, $Z_{\text{gbest}}$ remains unchanged, where $Z_{\text{best}}$ is the current best object, $Z_{\text{gbest}}$ is the global best object, and $f$ is the fitness function (the clustering measure value).
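The update of the environment is a strict-improvement rule, which a short sketch makes concrete. The function name `update_global_best` is hypothetical, and using `None` for an empty environment is an assumption made for illustration.

```python
def update_global_best(global_best, cell_bests, fitness):
    """Replace the stored object only when a cell offers a strictly better one."""
    for candidate in cell_bests:
        if global_best is None or fitness(candidate) < fitness(global_best):
            global_best = candidate
    return global_best


fitness = lambda obj: sum(obj)
first = update_global_best(None, [[3.0], [1.0], [2.0]], fitness)
kept = update_global_best([0.5], [[3.0], [1.0]], fitness)
print(first, kept)  # [1.0] [0.5]
```

Because the stored object is never replaced by a worse one, the fitness of the object in the environment cannot increase from one computing step to the next.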

In this work, the maximum execution step number is used as the halting condition of the tissue-like P system; that is, the system continues to run until it reaches the maximum execution step number. When the system halts, the object in the environment is regarded as the output of the whole system, namely, the optimal centers found.

Based on the tissue-like P system described above, the proposed membrane clustering algorithm is summarized in Algorithm 1.

Input parameters: data set, D; the number of clusters, K; the number of cells, q; the number of objects in each cell, m;
maximum execution step number, t_max; crossover rate, p_c; and mutation rate, p_m.
Output results: the optimal centers, z_1, z_2, ..., z_K.
Step 1. Initialization
for i = 1 to q
  for j = 1 to m
    Generate the jth initial object for cell i, Z_ij;
    Partition all data points into K clusters, C_1, C_2, ..., C_K;
    Compute the clustering measure value of the object, f(Z_ij);
  end for
end for
Fill the global best object Z_gbest using the best of all initial objects;
Set computing step t = 0;
Step 2. Object evolution in cells
for each cell i (i = 1, 2, ..., q) in parallel do
  Evolve all objects in its matching pool using evolution rules;
  Use truncation operation to maintain its m best objects;
  for j = 1 to m
    Partition all data points into K clusters, C_1, C_2, ..., C_K;
    Compute the clustering measure value of the object, f(Z_ij);
  end for
end for
Step 3. Object communication between cells
for each cell i (i = 1, 2, ..., q) in parallel do
  Transmit better objects in cell i to its two adjacent cells;
  Receive better objects from its two adjacent cells into its matching pool;
  Update Z_gbest using the best object in cell i;
end for
Step 4. Halt condition judgment
if t < t_max is satisfied
  t = t + 1;
  goto Step 2;
end if
The system exports the global best object in the environment and halts;
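The four steps of Algorithm 1 can be put together in a compact, sequential Python sketch. It is an illustration under assumptions, not the authors' Matlab implementation: cells are processed one after another rather than in parallel, the clustering measure uses squared Euclidean distance, selection weights and the mutation step are assumed forms, truncation is applied to the evolved matching pool, and all names and default parameter values (`membrane_clustering`, `n_best`, `t_max=30`, and so on) are hypothetical.

```python
import random


def measure(obj, points, dim):
    """Sum of squared distances from each point to its nearest center in obj."""
    centers = [obj[k:k + dim] for k in range(0, len(obj), dim)]
    return sum(min(sum((a - b) ** 2 for a, b in zip(x, z)) for z in centers)
               for x in points)


def vary(pool, fit, p_c, p_m, rng):
    """Selection, single-point crossover, and mutation applied in turn."""
    values = [fit(o) for o in pool]
    worst = max(values)
    parents = rng.choices(pool, weights=[worst - v + 1e-9 for v in values],
                          k=len(pool))
    children = []
    for a, b in zip(parents[::2], parents[1::2]):
        if len(a) > 1 and rng.random() < p_c:
            cut = rng.randint(1, len(a) - 1)
            a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
        children += [a, b]
    if len(parents) % 2:                      # odd pool: keep the unpaired parent
        children.append(parents[-1])
    mutated = []
    for obj in children:
        new = []
        for z in obj:
            if rng.random() < p_m:
                step = rng.choice((1.0, -1.0)) * 2.0 * rng.random()
                z = z + step * z if z != 0 else step
            new.append(z)
        mutated.append(new)
    return mutated


def membrane_clustering(points, k, q=4, m=10, t_max=30, p_c=0.8, p_m=0.05,
                        n_best=2, seed=0):
    rng = random.Random(seed)
    dim = len(points[0])
    lo = [min(p[j] for p in points) for j in range(dim)]
    hi = [max(p[j] for p in points) for j in range(dim)]

    def fit(obj):
        return measure(obj, points, dim)

    # Step 1: q cells, each holding m random objects inside the data bounds.
    cells = [[[rng.uniform(lo[j % dim], hi[j % dim]) for j in range(k * dim)]
              for _ in range(m)] for _ in range(q)]
    best = min((o for cell in cells for o in cell), key=fit)
    history = [fit(best)]
    external = [[] for _ in range(q)]
    for _ in range(t_max):
        # Step 2: evolve each matching pool, then truncate to the m best objects.
        for i in range(q):
            offspring = vary(cells[i] + external[i], fit, p_c, p_m, rng)
            cells[i] = sorted(offspring, key=fit)[:m]
        # Step 3: ring communication and update of the environment.
        tops = [cell[:n_best] for cell in cells]   # cells are sorted by fitness
        external = [tops[(i - 1) % q] + tops[(i + 1) % q] for i in range(q)]
        best = min([best] + [cell[0] for cell in cells], key=fit)
        history.append(fit(best))
    # Step 4: the step limit is reached; export the object in the environment.
    return best, history


points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 5.1), (5.1, 4.8)]
best, history = membrane_clustering(points, k=2)
```

The list `history` records the measure of the object stored in the environment after each step; it never increases, because the environment only accepts strictly better objects.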

4. Simulation Experiments

The proposed membrane clustering algorithm is evaluated on ten data sets and compared with the classical k-means algorithm and several clustering algorithms based on evolutionary algorithms, including GA [10], PSO [15], and ACO [16]. In order to test the robustness of these clustering algorithms, we repeat the experiments 50 times for each data set.

In the experiments, two kinds of data sets are used to evaluate these clustering algorithms. The first kind comprises four artificially generated data sets used in the existing literature, AD_5_2, Data_9_2, Square_4, and Sym_3_22, shown in Figure 4. The second kind comprises six real-life data sets provided in UCI [40]: Iris, BreastCancer, Newthyroid, LungCancer, Wine, and LiveDisorder. The sizes of the data sets can be found in Table 1.

The proposed membrane clustering algorithm is compared with k-means and three evolutionary clustering algorithms recently reported in the literature, including GA, PSO, and ACO. These algorithms are implemented in Matlab 7.1 according to the following parameters.
(1) Tissue-like P systems. Each cell contains 100 objects and communicates its five best objects to its two adjacent cells. The maximum computing step number is chosen to be 200. In the implementation, the evolution rules use the adaptive crossover probability $p_c$ and mutation probability $p_m$. In order to study the performance of tissue-like P systems of different degrees, four cases are considered in the experiments: $q = 4, 8, 16, 20$.
(2) GA [10]. The rotating wheel method, single-point crossover, and single-point mutation are used, where the crossover and mutation probabilities, $p_c$ and $p_m$, are chosen to be 0.8 and 0.001, respectively. Let the population size be and let maximum iteration number be .
(3) PSO [15]. The PSO uses a linearly decreasing inertia weight, where and ; , the population size , and maximum iteration number is 200.
(4) ACO [16]. The best parameter values are and .

In the experiments, we realize four tissue-like P systems with degrees 4, 8, 16, and 20, respectively. The aim is to evaluate the effect of the number of cells (i.e., the degree) on clustering quality. The four tissue-like P systems are applied to find the optimal centers for each of the ten data sets. In this work, the clustering measure value is also used to measure the clustering quality of each clustering algorithm. Considering that the evolution rules in the designed tissue-like P system include a stochastic mechanism, we independently execute the tissue-like P systems of the four degrees 50 times on each data set and then compute the mean values and standard deviations over the 50 runs. The mean values illustrate the average performance of the algorithms, while the standard deviations indicate their robustness. Table 2 provides the experimental results of the tissue-like P systems of the four degrees on the ten data sets. The results of degrees 16 and 20 are better than those of the other two degrees, that is, they have lower mean values and smaller standard deviations. It can be further observed that the tissue-like P system with degree 16 obtains the smallest mean values and standard deviations on most of the data sets. The results illustrate that the tissue-like P system with degree 16 has good clustering quality and high robustness.

In order to further evaluate clustering performance, the proposed membrane clustering algorithm is compared with GA-based, PSO-based, and ACO-based clustering algorithms as well as the classical k-means algorithm. Table 3 gives the comparison results of the tissue-like P system of degree 16 with the other four clustering algorithms on the ten data sets. The comparison results show that the tissue-like P system provides the best average value and the smallest standard deviation among the compared algorithms. For instance, the results obtained on AD_5_2 show that the tissue-like P system converges to the optimum of 326.4478 in almost all runs and PSO reaches 326.44 in most runs, while ACO, GA, and k-means attain 326.45, 322.31, and 332.47, respectively. The standard deviations of the clustering measure values for the tissue-like P system, PSO, and ACO are 0.0105, 0.0128, and 0.0344, respectively, which are significantly smaller than those of the other two algorithms. For the results on Iris, the optimum value is 96.75, which is obtained in most runs of the tissue-like P system; however, the other four algorithms fail to attain this value even once within 50 runs. The results on Newthyroid also show that the tissue-like P system provides the optimum value of 1869.29, while PSO, ACO, GA, and k-means obtain 1872.51, 1872.56, 1875.11, and 1886.25, respectively. In addition, the tissue-like P system obtains the smallest standard deviation on each data set in comparison to the other four algorithms, which illustrates that it has high robustness.

Wilcoxon’s rank sum test is a nonparametric statistical significance test for independent samples. In order to determine whether the improvement is statistically significant, the test has been conducted at the 5% significance level in the experiments. For each of the ten data sets, we create five groups, corresponding to the five clustering algorithms (tissue-like P system, GA, PSO, ACO, and k-means), respectively. Each group consists of the clustering measure values produced by 50 consecutive runs of the corresponding algorithm. Table 4 gives the $p$-values provided by Wilcoxon’s rank sum test for the comparison of two groups at a time (one group corresponding to the tissue-like P system and the other corresponding to one of the other methods). The null hypothesis assumes that there is no significant difference between the median values of the two groups, whereas the alternative hypothesis assumes that there is a significant difference. It is evident from Table 4 that all $p$-values are less than 0.05 (5% significance level). This is strong evidence against the null hypothesis, establishing significant superiority of the proposed membrane clustering algorithm.
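The way such a $p$-value is obtained can be sketched with the standard library alone. The function `rank_sum_test` below is a hypothetical helper that uses the normal approximation of the rank sum statistic with average ranks for ties and no tie correction of the variance; the two samples are made-up numbers for illustration, not results from Table 3 or Table 4.

```python
import math


def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank sum test using the normal approximation."""
    pooled = sorted((v, g) for g, sample in enumerate((a, b)) for v in sample)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2.0 + 1.0   # average rank over a run of ties
        i = j + 1
    n1, n2 = len(a), len(b)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)  # rank sum of a
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided tail probability
    return z, p


# Hypothetical measure values from repeated runs of two algorithms.
runs_a = [float(v) for v in range(1, 11)]
runs_b = [float(v) for v in range(11, 21)]
z, p = rank_sum_test(runs_a, runs_b)
significant = p < 0.05
```

With 50 runs per group, as in the experiments, the normal approximation is usually considered adequate; for very small samples an exact test would be the safer choice.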

5. Conclusion

In this paper, we discuss a membrane clustering algorithm, a novel clustering algorithm in the framework of membrane computing. In contrast to existing evolutionary clustering techniques, two inherent mechanisms of membrane computing, the evolution and communication mechanisms, are exploited to realize the membrane clustering algorithm. For this purpose, a tissue-like P system consisting of $q$ cells is designed, in which each cell, as a parallel computing unit, runs in a maximally parallel way and each object of the system represents a group of candidate centers. Moreover, the communication rules implicitly realize a local neighborhood structure; namely, each cell exchanges and shares its best objects with its two adjacent cells. Under the control of the evolution and communication mechanisms, the tissue-like P system is able to search for the optimal centers for a data set to be clustered. In addition, the local neighborhood structure can guide the exploitation of the optimal object and enhance the diversity of the evolving objects.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant nos. 61170030 and 61472328), the Chunhui Project Foundation of the Education Department of China (nos. Z2012025 and Z2012031), and the Sichuan Key Technology Research and Development Program (no. 2013GZX0155), China.