Research Article  Open Access
A Novel Clustering Algorithm Inspired by Membrane Computing
Abstract
P systems are a class of distributed parallel computing models. This paper presents a novel clustering algorithm, called the membrane clustering algorithm, which is inspired by the mechanism of a tissue-like P system with a loop structure of cells. The objects of the cells express the candidate cluster centers and are evolved by the evolution rules. Based on the loop membrane structure, the communication rules realize a local neighborhood topology, which helps the coevolution of the objects and improves the diversity of objects in the system. The tissue-like P system can effectively search for the optimal partitioning with the help of its parallel computing advantage. The proposed clustering algorithm is evaluated on four artificial data sets and six real-life data sets. Experimental results show that the proposed clustering algorithm is superior to or competitive with the k-means algorithm and several evolutionary clustering algorithms recently reported in the literature.
1. Introduction
Data clustering is a fundamental problem in data mining. It describes the process of grouping data into classes or clusters such that the data in each cluster share a high degree of similarity while being very dissimilar to data from other clusters [1]. Over the past years, a large number of clustering algorithms have been proposed [2–4], which can be roughly divided into two categories: hierarchical and partitional. Hierarchical clustering proceeds successively by either merging smaller clusters into larger ones or splitting larger clusters. Partitional clustering attempts to directly decompose a data set into several disjoint clusters based on a similarity measure, for example, mean square error (MSE). Clustering algorithms have been used in a wide variety of areas, such as pattern recognition, machine learning, image processing, and web mining [5, 6]. Among them, the k-means algorithm [7, 8] has received wide attention for two reasons: (i) k-means has been listed among the top ten most influential data mining algorithms [9] and (ii) it is at the same time very simple and quite scalable, as it has linear asymptotic running time with respect to any variable of the problem. However, k-means is sensitive to the initial centers and easily gets stuck in locally optimal solutions. Moreover, k-means incurs a large time cost to find the globally optimal solution when the number of data points is large.
In recent years, some evolutionary algorithms have been introduced to overcome the shortcomings of the k-means algorithm because of their global optimization capability. Several clustering algorithms based on genetic algorithms (GA) have been proposed in the literature [10–14]. However, most GA-based clustering algorithms can suffer from degeneracy when numerous chromosomes represent the same solution. Degeneracy can lead to inefficient coverage of the search space, as the same configurations of clusters are repeatedly explored. To overcome this shortcoming, clustering algorithms based on particle swarm optimization (PSO) or ant colony optimization (ACO) have been proposed. Kao et al. have proposed a hybrid technique that combines k-means and PSO for cluster analysis [15]. Shelokar et al. have introduced an evolutionary algorithm based on ACO for the clustering problem [16]. Niknam and Amiri have presented a hybrid evolutionary optimization algorithm based on the combination of PSO and ACO for solving the clustering problem [17].
The aim of membrane computing is to abstract computing ideas (data structures, operations with data, ways to control operations, computing models, etc.) from the structure and the functioning of a single cell and from complexes of cells, such as tissues and organs including the brain. Three main classes of P systems have been investigated: cell-like P systems, based on a cell-like (hence hierarchical) arrangement of membranes delimiting compartments where multisets of chemicals evolve according to given evolution rules [18]; tissue-like P systems, which, instead of a hierarchical arrangement of membranes, consider arbitrary graphs as underlying structures, with membranes placed in the nodes while edges correspond to communication channels [19]; and neural-like P systems [20]. Many variants of all these systems have been considered, for example, [21, 22] for cell-like P systems, [23, 24] for tissue-like P systems, and [25–30] for neural-like P systems. An overview of the field can be found in [31], with up-to-date information available at the membrane computing website (http://ppage.psystems.eu/). These efforts have addressed the parallel computing advantage of P systems as well as their high effectiveness in solving a variety of difficult problems; in particular, P systems can solve a number of NP-hard problems in linear or polynomial time [32] and even solve PSPACE problems in a feasible time [33, 34]. Moreover, membrane algorithms have demonstrated powerful global optimization performance [35–37].
This paper focuses on the application of membrane computing to data clustering. Our motivation is to apply the specially designed elements and inherent mechanisms of P systems to realize a novel clustering algorithm, called the membrane clustering algorithm.
2. Data Clustering Problem
Clustering is the process of recognizing natural groups or clusters in a data set based on some similarity measure. Suppose that a data set $D$ has $n$ sample points, $X_1, X_2, \ldots, X_n$ ($X_i \in \mathbb{R}^{d}$), and is partitioned into $k$ clusters, $C_1, C_2, \ldots, C_k$. Denote by $z_1, z_2, \ldots, z_k$ the corresponding centers. Usually, a partitional clustering algorithm searches for the optimal centers in the solution space according to some clustering measure in order to solve the data clustering problem. A commonly used clustering measure is
$$J_m(z_1, \ldots, z_k) = \sum_{i=1}^{n} \sum_{j=1}^{k} w_{ij} \, \lVert X_i - z_j \rVert, \quad (1)$$
where $w_{ij}$ is the associated weight of point $X_i$ with cluster $j$, which is either 1 or 0 (if point $X_i$ is allocated to cluster $j$, $w_{ij}$ is 1; otherwise, it is 0).
The clustering process, which separates the objects into the clusters, is realized as an optimization problem. The goal of the optimization problem is to find the optimal centers by minimizing objective function (1):
$$\min_{z_1, \ldots, z_k} J_m(z_1, \ldots, z_k). \quad (2)$$
In addition, the $J_m$ value will be used to evaluate objects in the proposed clustering algorithm. The smaller the $J_m$ value of an object, the better the object.
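The clustering measure described above can be illustrated with a minimal Python sketch. It rests on two assumptions that are not stated explicitly in the text: each point is allocated to its nearest center (so the 0/1 weights pick the closest center), and the dissimilarity is the unsquared Euclidean distance. The function name `clustering_measure` and the toy data are illustrative only.

```python
import math


def clustering_measure(points, centers):
    """Sum, over all points, of the Euclidean distance to the nearest center.

    Assumption: a point is allocated to its nearest center, which is
    equivalent to setting its weight to 1 for that center and 0 elsewhere.
    """
    total = 0.0
    for x in points:
        total += min(math.dist(x, z) for z in centers)
    return total


# Toy data (hypothetical): two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good_centers = [(0.0, 1.0), (10.0, 1.0)]   # one center per pair
poor_centers = [(5.0, 0.0), (5.0, 2.0)]    # both centers in the middle

print(clustering_measure(points, good_centers))  # 4.0
print(clustering_measure(points, poor_centers))  # 20.0
```

The candidate with the smaller value is the better one, which is how objects are ranked in the rest of the algorithm.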
3. Proposed Membrane Clustering Algorithm
In this section, the proposed membrane clustering algorithm, which is inspired by the mechanism of membrane computing, is discussed in detail. A tissue-like P system with a loop structure of cells is designed as its optimization framework. The tissue-like P system with a loop structure of $q$ cells can be described as the following construct:
$$\Pi = (O_1, \ldots, O_q, R_1, \ldots, R_q, R', i_o),$$
where $O_i$ ($1 \le i \le q$) is the set of objects in cell $i$; $R_i$ ($1 \le i \le q$) is the set of evolution rules in cell $i$, which contains three evolution rules: selection, crossover, and mutation rules; $R'$ is a finite set of communication rules of the following forms:
(i) antiport rule: $(i, Z/Z', j)$, $i, j = 1, 2, \ldots, q$, $i \ne j$. The rule is used to communicate the objects between a cell and its two adjacent cells;
(ii) symport rule: $(i, Z/\lambda, 0)$, $i = 1, 2, \ldots, q$. The rule is used to communicate the objects between cell $i$ and the environment.
Finally, $i_o = 0$ indicates the output region of the system.
Figure 1 shows the membrane structure of the tissue-like P system, which consists of $q$ cells. The cells are labeled $1, 2, \ldots, q$, respectively. The region labeled 0 is the environment and is also the output region of the system. The directed lines in Figure 1 indicate the communication of objects between the cells. Moreover, the cells are arranged in a loop topology based on the communication rules described below. As usual in P systems, the cells, as parallel computing units, run independently. In addition, the environment always stores the best object found so far in the system. When the system halts, the object in the environment is regarded as the output of the whole system.
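The role of the tissue-like P system is to evolve the optimal cluster centers for a data set; thus, each object in the cells expresses a group of (candidate) centers. Each object in the cells is therefore a ($k \times d$)-dimensional real vector of the form
$$Z = (z_{11}, z_{12}, \ldots, z_{1d}, \ldots, z_{i1}, z_{i2}, \ldots, z_{id}, \ldots, z_{k1}, z_{k2}, \ldots, z_{kd}),$$
where $z_{i1}, z_{i2}, \ldots, z_{id}$ are the components of the $i$th cluster center $z_i$, $i = 1, 2, \ldots, k$. For simplicity, suppose that each cell has the same number of objects, which is denoted by $m$.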
Initially, the system randomly generates $m$ initial objects for each cell. When an initial object is generated, ($k \times d$) random real numbers are produced to form it, subject to the constraint $A_j \le z_{ij} \le B_j$, where $A_j$ and $B_j$ are the lower bound and upper bound of the $j$th dimensional component of the data points, respectively, $j = 1, 2, \ldots, d$.
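The initialization step can be sketched as follows. This is a minimal illustration, assuming that each component is drawn uniformly between the per-dimension minimum and maximum of the data and that an object is stored as a flat list in center-major order; the helper names (`data_bounds`, `random_object`, `init_cell`) are hypothetical.

```python
import random


def data_bounds(points):
    """Per-dimension lower and upper bounds of the data points."""
    dims = list(zip(*points))
    return [min(c) for c in dims], [max(c) for c in dims]


def random_object(lower, upper, k, rng):
    """One object: k centers of d components each, flattened center by center.

    Component j of every center is drawn uniformly between lower[j] and
    upper[j] (the uniform distribution is an assumption of this sketch).
    """
    return [rng.uniform(lo, hi) for _ in range(k) for lo, hi in zip(lower, upper)]


def init_cell(points, k, m, rng):
    """Generate the m initial objects of a single cell."""
    lower, upper = data_bounds(points)
    return [random_object(lower, upper, k, rng) for _ in range(m)]


rng = random.Random(0)
pts = [(0.0, 10.0), (2.0, 30.0), (1.0, 20.0)]
cell = init_cell(pts, k=3, m=5, rng=rng)
print(len(cell), len(cell[0]))  # 5 6
```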
As usual, the tissue-like P system has two mechanisms: evolution and communication. The two mechanisms are described as follows.
3.1. Evolution Mechanism
The role of the evolution rules is to evolve the objects in the cells to generate the new objects used in the next computing step. During the evolution, each cell maintains the same size (the number of objects). In this work, three well-known genetic operations (selection, crossover, and mutation) [38, 39] are used as the evolution rules in the cells. In a computing step, all objects in a cell (located in its object pool) and the best objects from its two adjacent cells (located in its external pool) constitute a matching pool. The objects in the external pool are the best objects communicated from the two adjacent cells in the previous computing step. The objects in the matching pool are evolved by executing the selection, crossover, and mutation operations in turn. In order to maintain the number of objects in each cell, a truncation operation is used to constitute the new object pool according to the $J_m$ values (clustering measure) of the objects. The objects in the new object pool are the objects to be evolved in the next computing step. Figure 2 shows the evolution procedure of objects in a cell.
In this work, the selection operation uses the usual rotating wheel method, while the crossover operation uses single-point crossover, in which the position of the crossover point is determined according to the crossover probability $p_c$ [39]. Single-point mutation is used to realize the mutation of objects. If $z$ is the component at a mutation point determined according to the mutation probability $p_m$, its value after mutation is obtained by adding or subtracting a perturbation, where the signs “+” and “−” occur with equal probability and the size of the perturbation is governed by a real number $\delta$ generated with uniform distribution over a fixed range.
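The three evolution rules and the truncation step can be sketched in Python as below. Several details are assumptions made for illustration and are not taken from the text: the wheel weights are `worst cost - cost` so that lower-cost objects get larger slices, crossover is applied with probability `p_c` at a uniformly random cut point, and a mutated component `v` becomes `v ± 2*delta*v` (or `v ± 2*delta` when `v` is 0) with `delta` uniform in [0, 1), a form often seen in GA-based clustering. All function names are hypothetical.

```python
import random


def roulette_select(objects, costs, rng):
    """Rotating wheel selection for minimization (weights are an assumption)."""
    worst = max(costs)
    weights = [worst - c + 1e-12 for c in costs]  # small offset keeps weights positive
    return rng.choices(objects, weights=weights, k=1)[0]


def single_point_crossover(a, b, p_c, rng):
    """With probability p_c, swap the tails of a and b after a random cut."""
    if len(a) < 2 or rng.random() >= p_c:
        return a[:], b[:]
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(obj, p_m, rng):
    """Perturb each component with probability p_m (assumed formula, see lead-in)."""
    out = obj[:]
    for idx, v in enumerate(out):
        if rng.random() < p_m:
            delta = rng.random()
            step = 2 * delta * v if v != 0 else 2 * delta
            out[idx] = v + step if rng.random() < 0.5 else v - step
    return out


def truncate(pool, cost, m):
    """Keep the m lowest-cost objects so the cell size stays constant."""
    return sorted(pool, key=cost)[:m]


rng = random.Random(3)
a, b = [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]
c1, c2 = single_point_crossover(a, b, 1.0, rng)
print(c1[0], c1[-1])  # 1.0 8.0 (head from a, tail from b)
```

Copies are returned everywhere so that the parents in the matching pool are never modified in place.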
3.2. Communication Mechanism
The communication mechanism is used to exchange objects between each cell and its two adjacent cells and to update the best object found so far in the environment. The communication mechanism is realized by communication rules of two types: the antiport rule $(i, Z/Z', j)$, which indicates that object $Z$ is communicated from cell $i$ to cell $j$ and object $Z'$ is communicated from cell $j$ to cell $i$, and the symport rule $(i, Z/\lambda, 0)$, which indicates that object $Z$ is communicated from cell $i$ to the environment.
The communication rules implicitly indicate the connection relationship between cells. Figure 3 shows the communication relation of objects between cells in the designed tissue-like P system. From a logical point of view, the communication relation shows that the cells form a loop topology, shown in Figure 3(a). Meanwhile, this also reflects a neighborhood structure of the communication of objects; namely, each cell only exchanges and shares objects with its two adjacent cells, shown in Figure 3(b). After the objects are evolved, each cell (such as cell $i$) transmits several of its best objects to its adjacent cells (such as cells $i-1$ and $i+1$) and retrieves several best objects from these adjacent cells by using the communication rule; the retrieved objects become part of the matching pool in the next computing step. This special logical structure brings the following benefits.
(1) The coevolution of objects in the cells can accelerate the convergence of the proposed clustering algorithm.
(2) The object sharing mechanism of the local neighborhood structure can enhance the diversity of objects in the entire system.
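The neighborhood exchange on the loop can be sketched as follows. The sketch assumes 0-based cell indices with wrap-around and represents the exchange by building, for every cell, an external pool holding the best objects of its two neighbors; the names `ring_neighbors` and `exchange_best` and the toy scalar "objects" are illustrative only.

```python
def ring_neighbors(i, q):
    """Indices of the two cells adjacent to cell i in a loop of q cells (0-based)."""
    return (i - 1) % q, (i + 1) % q


def exchange_best(cells, cost, n_best):
    """Build each cell's external pool from the n_best objects of its neighbors."""
    q = len(cells)
    best = [sorted(cell, key=cost)[:n_best] for cell in cells]
    external = []
    for i in range(q):
        left, right = ring_neighbors(i, q)
        external.append(best[left] + best[right])
    return external


# Toy example: objects are plain numbers and the cost is the number itself.
cells = [[3, 1, 2], [6, 5, 4], [9, 7, 8], [12, 10, 11]]
print(exchange_best(cells, cost=lambda z: z, n_best=1))
# [[10, 4], [1, 7], [4, 10], [7, 1]]
```

Because every cell only sees its two neighbors, a good object needs several computing steps to spread around the loop, which is the mechanism the text credits for preserving diversity.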
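Figure 3: Communication relation of objects between cells: (a) loop topology of the cells; (b) local neighborhood structure of each cell.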
The communication of objects occurs not only between cells but also between each cell and the environment. The global best object found so far in the whole system is always stored in the environment. After the objects are evolved, each cell communicates the best object found in the current computing step to the environment to update the global best object. The update strategy is as follows: if $f(Z_{best}) < f(Z_{gbest})$, then $Z_{gbest} = Z_{best}$; otherwise, $Z_{gbest}$ remains unchanged, where $Z_{best}$ is the current best object, $Z_{gbest}$ is the global best object, and $f$ is the fitness function ($J_m$ value).
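The update of the environment can be written in a few lines. This is a minimal sketch in which the environment is represented by a single variable and `cost` stands for the fitness function; the function name is hypothetical.

```python
def update_global_best(global_best, cell_bests, cost):
    """Replace the stored object whenever a cell reports a strictly better one."""
    for candidate in cell_bests:
        if global_best is None or cost(candidate) < cost(global_best):
            global_best = candidate
    return global_best


# Toy example: the cost of an object is the sum of its components.
toy_cost = sum
g = update_global_best(None, [[3, 3], [1, 1], [2, 2]], toy_cost)
print(g)  # [1, 1]
```

The strict inequality means that an equally good candidate does not displace the object already stored in the environment.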
As usual in P systems, the cells, as parallel computing units, run independently, and the environment always stores the best object found so far in the system. In this work, the maximum execution step number is used as the halting condition of the tissue-like P system; that is, the tissue-like P system continues to run until it reaches the maximum execution step number. When the system halts, the object in the environment is regarded as the output of the whole system, namely, the optimal centers found.
Based on the tissue-like P system described above, the proposed membrane clustering algorithm is summarized in Algorithm 1.
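For orientation, a sequential sketch of the overall procedure, assembled from the description in this section, might look as follows. It is not the authors' Algorithm 1: the cells are processed one after another rather than in parallel, truncation is applied to the matching pool together with the newly evolved objects, and the selection weights, crossover scheme, mutation formula, and default parameter values are all assumptions of this sketch.

```python
import math
import random


def cost(obj, points, d):
    """Clustering measure of a flat object: sum of nearest-center distances."""
    centers = [obj[i:i + d] for i in range(0, len(obj), d)]
    return sum(min(math.dist(x, z) for z in centers) for x in points)


def membrane_clustering(points, k, q=4, m=20, n_best=2, steps=50,
                        p_c=0.8, p_m=0.05, seed=0):
    rng = random.Random(seed)
    d = len(points[0])
    lo = [min(c) for c in zip(*points)]
    hi = [max(c) for c in zip(*points)]

    def f(z):
        return cost(z, points, d)

    def new_object():
        return [rng.uniform(lo[j], hi[j]) for _ in range(k) for j in range(d)]

    cells = [[new_object() for _ in range(m)] for _ in range(q)]
    external = [[] for _ in range(q)]          # best objects received from neighbors
    global_best = min((z for cell in cells for z in cell), key=f)

    for _ in range(steps):
        evolved = []
        for i in range(q):                     # sequential stand-in for parallel cells
            pool = cells[i] + external[i]      # matching pool
            costs = [f(z) for z in pool]
            worst = max(costs)
            weights = [worst - c + 1e-12 for c in costs]
            parents = rng.choices(pool, weights=weights, k=len(pool))
            children = []
            for a, b in zip(parents[0::2], parents[1::2]):
                if len(a) > 1 and rng.random() < p_c:
                    cut = rng.randrange(1, len(a))
                    a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
                children += [list(a), list(b)]  # copies, so parents stay untouched
            for child in children:
                for idx, v in enumerate(child):
                    if rng.random() < p_m:
                        step = 2 * rng.random() * (v if v != 0 else 1.0)
                        child[idx] = v + step if rng.random() < 0.5 else v - step
            evolved.append(sorted(pool + children, key=f)[:m])  # truncation
        cells = evolved
        best = [cell[:n_best] for cell in cells]
        external = [best[(i - 1) % q] + best[(i + 1) % q] for i in range(q)]
        for cell in cells:                     # symport to the environment
            if f(cell[0]) < f(global_best):
                global_best = cell[0]
    return global_best, f(global_best)


pts = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 5.1), (5.1, 4.8)]
best, value = membrane_clustering(pts, k=2, q=3, m=10, n_best=2, steps=20, seed=1)
print(len(best), value >= 0.0)  # 4 True
```

The returned object is the content of the environment when the step budget is exhausted, matching the halting condition described above.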

4. Simulation Experiments
The proposed membrane clustering algorithm is evaluated on ten data sets and compared with the classical k-means algorithm and several clustering algorithms based on evolutionary algorithms, including GA [10], PSO [15], and ACO [16]. In order to test the robustness of these clustering algorithms, we repeat the experiments 50 times for each data set.
In the experiments, two kinds of data sets are used to evaluate these clustering algorithms. The first kind comprises four manually generated data sets used in the existing literature, AD_5_2, Data_9_2, Square_4, and Sym_3_22, shown in Figure 4. The second kind comprises six real-life data sets provided in UCI [40]: Iris, BreastCancer, Newthyroid, LungCancer, Wine, and LiveDisorder. The sizes of the data sets can be found in Table 1.

Figure 4: The four manually generated data sets, shown in panels (a) to (d).
The proposed membrane clustering algorithm is compared with k-means and three evolutionary clustering algorithms recently reported in the literature, namely, GA, PSO, and ACO. These algorithms are implemented in Matlab 7.1 with the following parameters.
(1) Tissue-like P systems. Each cell contains 100 objects and communicates its five best objects to its two adjacent cells. The maximum computing step number is chosen to be 200. In the implementation, the evolution rules use an adaptive crossover probability $p_c$ and mutation probability $p_m$. In order to study the performance of tissue-like P systems of different degrees, four cases are considered in the experiments: $q = 4, 8, 16, 20$.
(2) GA [10]. The rotating wheel method, single-point crossover, and single-point mutation are used, where the crossover and mutation probabilities, $p_c$ and $p_m$, are chosen to be 0.8 and 0.001, respectively. Let the population size be and let maximum iteration number be .
(3) PSO [15]. The PSO uses a linearly decreasing inertia weight, where and ; , the population size , and the maximum iteration number is 200.
(4) ACO [16]. The best parameter values are and .
In the experiments, we realize four tissue-like P systems with degrees 4, 8, 16, and 20, respectively. The aim is to evaluate the effect of the number of cells (i.e., different degrees) on clustering quality. The four tissue-like P systems are applied to find the optimal centers for each of the ten data sets. In this work, the $J_m$ value is also used to measure the clustering quality of each clustering algorithm. Considering that the evolution rules in the designed tissue-like P system include a stochastic mechanism, we independently execute the tissue-like P systems of the four degrees 50 times on each data set and then compute the mean values and standard deviations over the 50 runs. The mean values illustrate the average performance of the algorithms, while the standard deviations indicate their robustness. Table 2 provides the experimental results of the tissue-like P systems of the four degrees on the ten data sets. The results for degrees 16 and 20 are better than those for the other two degrees, that is, they have lower mean values and smaller standard deviations. It can be further observed that the tissue-like P system with degree 16 obtains the smallest mean values and standard deviations on most of the data sets. The results illustrate that the tissue-like P system with degree 16 has good clustering quality and high robustness.

In order to further evaluate clustering performance, the proposed membrane clustering algorithm is compared with the GA-based, PSO-based, and ACO-based clustering algorithms as well as the classical k-means algorithm. Table 3 gives the comparison results of the tissue-like P system of degree 16 with the other four clustering algorithms on the ten data sets. The comparison results show that the tissue-like P system provides the optimum average value and the smallest standard deviation in comparison to those of the other algorithms. For instance, the results obtained on AD_5_2 show that the tissue-like P system converges to the optimum of 326.4478 in almost all runs and PSO reaches 326.44 in most runs, while ACO, GA, and k-means attain 326.45, 322.31, and 332.47, respectively. The standard deviations of the $J_m$ values for the tissue-like P system, PSO, and ACO are 0.0105, 0.0128, and 0.0344, respectively, which are significantly smaller than those of the other two algorithms. For the results on Iris, the optimum value is 96.75, which is obtained in most runs of the tissue-like P system; however, the other four algorithms fail to attain this value even once within 50 runs. The results on Newthyroid also show that the tissue-like P system provides the optimum value of 1869.29, while PSO, ACO, GA, and k-means obtain 1872.51, 1872.56, 1875.11, and 1886.25, respectively. In addition, the tissue-like P system obtains the smallest standard deviation on each data set in comparison to the other four algorithms, which illustrates that it has high robustness.

Wilcoxon’s rank sum test is a nonparametric statistical significance test for independent samples. In order to determine whether the improvement is statistically significant, the test has been conducted at the 5% significance level in the experiments. For each of the ten data sets, we create five groups, corresponding to the five clustering algorithms (tissue-like P system, GA, PSO, ACO, and k-means). Each group consists of the $J_m$ values produced by 50 consecutive runs of the corresponding algorithm. Table 4 gives the $p$-values provided by Wilcoxon’s rank sum test for the comparison of two groups at a time (one group corresponding to the tissue-like P system and the other corresponding to one of the other methods). The null hypothesis assumes that there is no significant difference between the mean values of the two groups, whereas the alternative hypothesis assumes that there is a significant difference. It is evident from Table 4 that all $p$-values are less than 0.05 (5% significance level). This is strong evidence against the null hypothesis, establishing the significant superiority of the proposed membrane clustering algorithm.
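To make the test concrete, the following standard-library sketch computes a two-sided rank sum test using the normal approximation. It is a simplified illustration: ties receive average ranks but no tie correction is applied to the variance, and the two samples are made-up numbers, not values from the experiments. In practice a library routine such as `scipy.stats.ranksums` would normally be used.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test (normal approximation, no tie correction)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1          # ranks are 1-based
        for t in range(i, j + 1):
            ranks[t] = average_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))        # two-sided p-value
    return z, p


# Hypothetical measure values from two algorithms (10 runs each).
runs_a = [96.7 + 0.01 * r for r in range(10)]
runs_b = [97.3 + 0.01 * r for r in range(10)]
z, p = rank_sum_test(runs_a, runs_b)
print(z < 0, p < 0.05)  # True True
```

A $p$-value below 0.05 leads to rejecting the null hypothesis at the 5% level, which is the decision rule applied to Table 4.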

5. Conclusion
In this paper, we discuss a membrane clustering algorithm, a novel clustering algorithm in the framework of membrane computing. In contrast to the existing evolutionary clustering techniques, two inherent mechanisms of membrane computing, evolution and communication, are exploited to realize the membrane clustering algorithm. For this purpose, a tissue-like P system consisting of $q$ cells is designed, in which each cell, as a parallel computing unit, runs in a maximally parallel way and each object of the system represents a group of candidate centers. Moreover, the communication rules implicitly realize a local neighborhood structure; namely, each cell exchanges and shares its best objects with its two adjacent cells. Under the control of the evolution and communication mechanisms, the tissue-like P system is able to search for the optimal centers for a data set to be clustered. In addition, the local neighborhood structure can guide the exploitation of the optimal object and enhance the diversity of the evolving objects.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (Grant nos. 61170030 and 61472328), the Chunhui Project Foundation of the Education Department of China (nos. Z2012025 and Z2012031), and the Sichuan Key Technology Research and Development Program (no. 2013GZX0155), China.
References
[1] J. A. Hartigan, Clustering Algorithms, John Wiley & Sons, 1975.
[2] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ, USA, 1988.
[3] R. Xu and D. Wunsch II, “Survey of clustering algorithms,” IEEE Transactions on Neural Networks, vol. 16, no. 3, pp. 645–678, 2005.
[4] A. K. Jain, “Data clustering: 50 years beyond K-means,” Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.
[5] B. Everitt, S. Landau, and M. Leese, Cluster Analysis, Arnold, London, UK, 2001.
[6] S. Saha and S. Bandyopadhyay, “A symmetry based multiobjective clustering technique for automatic evolution of clusters,” Pattern Recognition, vol. 43, no. 3, pp. 738–751, 2010.
[7] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, “An efficient k-means clustering algorithms: analysis and implementation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 881–892, 2002.
[8] D. Steinley, “K-means clustering: a half-century synthesis,” The British Journal of Mathematical and Statistical Psychology, vol. 59, no. 1, pp. 1–34, 2006.
[9] X. Wu, Top Ten Algorithms in Data Mining, Taylor & Francis, Boca Raton, Fla, USA, 2009.
[10] S. Bandyopadhyay and U. Maulik, “An evolutionary technique based on K-means algorithm for optimal clustering in $\mathbb{R}^{N}$,” Information Sciences, vol. 146, no. 1–4, pp. 221–237, 2002.
[11] S. Bandyopadhyay and S. Saha, “GAPS: a clustering method using a new point symmetry-based distance measure,” Pattern Recognition, vol. 40, no. 12, pp. 3430–3451, 2007.
[12] M. Laszlo and S. Mukherjee, “A genetic algorithm that exchanges neighboring centers for k-means clustering,” Pattern Recognition Letters, vol. 28, no. 16, pp. 2359–2366, 2007.
[13] D. X. Chang, X. D. Zhang, and C. W. Zheng, “A genetic algorithm with gene rearrangement for K-means clustering,” Pattern Recognition, vol. 42, no. 7, pp. 1210–1222, 2009.
[14] C. D. Nguyen and K. J. Cios, “GAKREM: a novel hybrid clustering algorithm,” Information Sciences, vol. 178, no. 22, pp. 4205–4227, 2008.
[15] Y. T. Kao, E. Zahara, and I. W. Kao, “A hybridized approach to data clustering,” Expert Systems with Applications, vol. 34, no. 3, pp. 1754–1762, 2008.
[16] P. S. Shelokar, V. K. Jayaraman, and B. D. Kulkarni, “An ant colony approach for clustering,” Analytica Chimica Acta, vol. 509, no. 2, pp. 187–195, 2004.
[17] T. Niknam and B. Amiri, “An efficient hybrid approach based on PSO, ACO and k-means for cluster analysis,” Applied Soft Computing Journal, vol. 10, no. 1, pp. 183–197, 2010.
[18] G. Păun, “Computing with membranes,” Journal of Computer and System Sciences, vol. 61, no. 1, pp. 108–143, 2000.
[19] C. Martín-Vide, G. Păun, J. Pazos, and A. Rodríguez-Patón, “Tissue P systems,” Theoretical Computer Science, vol. 296, no. 2, pp. 295–326, 2003.
[20] M. Ionescu, G. Păun, and T. Yokomori, “Spiking neural P systems,” Fundamenta Informaticae, vol. 71, no. 2-3, pp. 279–308, 2006.
[21] G. Păun, “P systems with active membranes attacking NP-complete problems,” Journal of Automata, Languages and Combinatorics, vol. 6, no. 1, pp. 75–90, 2001.
[22] L. Pan and T. Ishdorj, “P systems with active membranes and separation rules,” Journal of Universal Computer Science, vol. 10, no. 5, pp. 639–649, 2004.
[23] G. Păun, M. J. Pérez-Jiménez, and A. Riscos-Núñez, “Tissue P systems with cell division,” International Journal of Computers, Communications and Control, vol. 3, no. 3, pp. 295–303, 2008.
[24] L. Pan and M. J. Pérez-Jiménez, “Computational complexity of tissue-like P systems,” Journal of Complexity, vol. 26, no. 3, pp. 296–315, 2010.
[25] L. Pan and G. Păun, “Spiking neural P systems with anti-spikes,” International Journal of Computers, Communications and Control, vol. 4, no. 3, pp. 273–282, 2009.
[26] L. Pan, G. Păun, and M. J. Pérez-Jiménez, “Spiking neural P systems with neuron division and budding,” Science China, vol. 54, no. 8, pp. 1596–1607, 2011.
[27] J. Wang, L. Zou, H. Peng, and G. Zhang, “An extended spiking neural P system for fuzzy knowledge representation,” International Journal of Innovative Computing, Information and Control, vol. 7, no. 7, pp. 3709–3724, 2011.
[28] H. Peng, J. Wang, M. J. Pérez-Jiménez, H. Wang, J. Shao, and T. Wang, “Fuzzy reasoning spiking neural P system for fault diagnosis,” Information Sciences, vol. 235, pp. 106–116, 2013.
[29] J. Wang, P. Shi, H. Peng, M. J. Pérez-Jiménez, and T. Wang, “Weighted fuzzy spiking neural P systems,” IEEE Transactions on Fuzzy Systems, vol. 21, no. 2, pp. 209–220, 2013.
[30] J. Wang and H. Peng, “Adaptive fuzzy spiking neural P systems for fuzzy inference and learning,” International Journal of Computer Mathematics, vol. 90, no. 4, pp. 857–868, 2013.
[31] G. Păun, G. Rozenberg, and A. Salomaa, The Oxford Handbook of Membrane Computing, Oxford University Press, New York, NY, USA, 2010.
[32] G. Păun and M. J. Pérez-Jiménez, “Membrane computing: brief introduction, recent results and applications,” BioSystems, vol. 85, no. 1, pp. 11–22, 2006.
[33] A. Alhazov, C. Martín-Vide, and L. Pan, “Solving a PSPACE-complete problem by recognizing P systems with restricted active membranes,” Fundamenta Informaticae, vol. 58, no. 2, pp. 67–77, 2003.
[34] T. Ishdorj, A. Leporati, L. Pan, X. Zeng, and X. Zhang, “Deterministic solutions to QSAT and Q3SAT by spiking neural P systems with pre-computed resources,” Theoretical Computer Science, vol. 411, no. 25, pp. 2345–2358, 2010.
[35] G. Zhang, J. Cheng, M. Gheorghe, and Q. Meng, “A hybrid approach based on differential evolution and tissue membrane systems for solving constrained manufacturing parameter optimization problems,” Applied Soft Computing Journal, vol. 13, no. 3, pp. 1528–1542, 2013.
[36] H. Peng, J. Wang, M. J. Pérez-Jiménez, and P. Shi, “A novel image thresholding method based on membrane computing and fuzzy entropy,” Journal of Intelligent and Fuzzy Systems, vol. 24, no. 2, pp. 229–237, 2013.
[37] H. Peng, J. Wang, M. J. Pérez-Jiménez, and A. Riscos-Núñez, “The framework of P systems applied to solve optimal watermarking problem,” Signal Processing, vol. 101, pp. 256–265, 2014.
[38] E. Falkenauer, Genetic Algorithms and Grouping Problems, John Wiley & Sons, 1998.
[39] L. Davis, Handbook of Genetic Algorithms, Van Nostrand Reinhold, 1991.
[40] http://www.ics.uci.edu/~mlearn/MLRepository.html.
Copyright
Copyright © 2015 Hong Peng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.