Abstract

We present a multiobjective genetic clustering approach in which data points are assigned to clusters based on a new line symmetry distance. The proposed algorithm is called multiobjective line symmetry based genetic clustering (MOLGC). Two objective functions, the Davies-Bouldin (DB) index and a line symmetry distance based objective function, are used. The proposed algorithm evolves near-optimal clustering solutions using multiple clustering criteria, without a priori knowledge of the actual number of clusters. A multiple randomized K-dimensional (Kd) tree based nearest neighbor search is used to reduce the complexity of finding the closest symmetric points. Experimental results on several artificial and real data sets show that the proposed clustering algorithm obtains better clustering solutions in terms of different cluster quality measures than the existing SBKM and MOCK clustering algorithms.

1. Introduction

Clustering is one of the most common unsupervised data mining methods used to explore the hidden structures embedded in a data set [1]. Clustering gives rise to a variety of information granules whose use reveals the structure of data [2]. Clustering has been effectively applied in a variety of engineering and scientific disciplines [3]. In order to mathematically identify clusters in a data set, it is usually necessary to first define a measure of similarity or proximity which establishes a rule for assigning patterns to the domain of a particular cluster center. Symmetry is considered a preattentive feature that enhances recognition and reconstruction of shapes and objects. However, exact mathematical definitions of symmetry, such as that of Miller [4], are inadequate to describe and quantify the symmetry found in the natural world or in the visual world. Since symmetry is so common in the abstract and in nature, it is reasonable to assume that some kind of symmetry occurs in the structures of clusters. The immediate problem is how to find a way to measure symmetry. Zabrodsky et al. [5] proposed a symmetry distance to detect symmetry in a figure extracted from an image. Their basic strategy is to choose the symmetry closest to the figure as measured by an appropriate measure, for which they adopt the minimum sum of the squared distances over which the vertices must be moved to impose the assumed symmetry. It follows that we need an algorithm for effectively imposing a given symmetry with a minimum displacement [6]. A new type of nonmetric distance, based on point symmetry, was proposed and used in a K-means based clustering algorithm, referred to as the symmetry based K-means (SBKM) algorithm [7]. SBKM fails for data sets where the clusters themselves are symmetrical with respect to some intermediate point. This work was extended by Chung and Lin [8] to overcome some of the limitations of SBKM. These symmetry based clustering techniques adopt the concept of K-means for discovering clusters. The K-means [9] algorithm is one of the most widely used clustering algorithms. However, it is well known that K-means is sensitive to the initial cluster centers and easily gets stuck at local optima. A second important problem in partitional clustering is to find a partition of the given data, with a specified number of clusters, that minimizes the total within-cluster variation. Unfortunately, in many real life cases the number of clusters in a data set is not known a priori.

In order to overcome the tendency to get stuck at local optima, some attempts have been made to use genetic algorithms for clustering data sets [10-12]. To overcome the problem of automatic cluster determination, many automatic clustering techniques have recently been introduced, based on genetic algorithms and Differential Evolution (DE). A fuzzy variable string length based point symmetry genetic clustering technique is proposed in [13]. It automatically evolves the appropriate types of clusters, both convex and nonconvex, which have some symmetrical structure; it fails if the clusters do not have the symmetry property. In [14], a two-stage genetic clustering algorithm (TGCA) is proposed. It can automatically determine the proper number of clusters and the proper partition of a given data set, but it is suitable only for clustering data with compact spherical clusters. Single objective genetic clustering methods [15] fail to address the issues of cluster shape and size simultaneously. They suffer from ineffective genetic search, which in turn gets stuck at suboptimal clustering solutions [16]. To overcome the limitations of these algorithms, some attempts have been made to use multiobjective genetic algorithms. Handl and Knowles [17] proposed multiobjective clustering with automatic k-determination (MOCK) to detect the optimal number of clusters in data sets. Due to the heuristic nature of the algorithm, however, it provides only an approximation to the real (unknown) Pareto front; generation of the best clustering solution cannot be guaranteed, and results show some variation between individual runs. Saha and Bandyopadhyay [18] proposed a multiobjective clustering technique in which points are assigned to different clusters based on a point symmetry based distance. It is able to detect clusters having the point symmetry property but fails for clusters of nonsymmetrical shape. Some researchers have applied Differential Evolution (DE) to the task of automatic clustering. In [19], a DE technique built on a point symmetry based cluster validity index is presented; to find the optimal number of clusters, a modified symmetry based index is proposed. The main limitation of this algorithm is its problem-dependent dynamic control factor. Suresh et al. [20, 21] applied DE to the task of automatic fuzzy clustering, where two conflicting fuzzy validity indices are simultaneously optimized. They used a real-coded representation of the search variables to accommodate a variable number of cluster centers [20, 21]. Their method depends on cluster centroids and is thus biased towards spherical clusters. Tremendous research effort has gone into evolving clusters in complex data sets through evolutionary techniques in the past few years. Most clustering algorithms assume the number of clusters to be known a priori; the desired granularity [22] is generally determined by external, problem criteria. There seems to be no definite answer to how many clusters are present in a data set; a user-defined criterion for the resolution has to be given instead. Second, most existing clustering algorithms adopt 2-norm distance measures in the clustering. These measures fail when clusters tend to develop along principal axes.
The symmetry based clustering techniques likewise seek clusters which are symmetric with respect to their centers. Thus, these techniques fail if the clusters do not have this property.

The objective of this paper is twofold. First, it aims at the automatic determination of the optimal number of clusters in any data set. Second, it attempts to find clusters of arbitrary shapes and sizes. We show that a genetic algorithm with a new line symmetry based distance can give very promising results when applied to the automatic clustering problem. The proposed algorithm evolves near-optimal clustering solutions using multiple clustering criteria, without a priori knowledge of the actual number of clusters. A multiple randomized Kd tree based nearest neighbor search is used to reduce the complexity of finding the closest symmetrical points. We refer to this new algorithm as the multiobjective line symmetry based genetic clustering (MOLGC) algorithm. We have compared MOLGC with two other clustering techniques, SBKM and MOCK, using the following performance metrics: (1) the accuracy of the final clustering results; (2) the computation time; and (3) a statistical significance test.

This paper is organized as follows. The related work on symmetry is reviewed in Section 2. In Section 3, the proposed line symmetry measure with a multiple randomized Kd tree based nearest neighbor search approach is presented. The multiobjective line symmetry based genetic clustering technique is explained in Section 4. Section 5 contains the data description and experimental results. Finally, we conclude in Section 6.

2. Related Work on Symmetry Based Distances

In this section, the point symmetry based distance is first described in brief. Then the line symmetry based distance is discussed.

2.1. Point Symmetry Based Distance

Symmetry is considered a preattentive feature that enhances recognition and reconstruction of shapes and objects. Su and Chou [7] presented an efficient point symmetry distance (PSD) measure to help partition the data set into clusters, where each cluster has the point symmetry property. The point symmetry distance between a data point $x_i$ and a cluster centroid $c_k$, computed relative to the remaining data points $x_j$, is defined as

$$d_s(x_i, c_k) = \min_{j = 1, \dots, N,\ j \neq i} \frac{\|(x_i - c_k) + (x_j - c_k)\|}{\|x_i - c_k\| + \|x_j - c_k\|}, \quad (1)$$

where $\|\cdot\|$ denotes the 2-norm distance and $N$ is the number of data points. Note that this distance is minimized when a point symmetric to $x_i$ about $c_k$ exists in the data set. They proposed a symmetry based K-means clustering algorithm, called SBKM, which assigns a pattern to a particular cluster depending on the symmetry based distance only when that distance is smaller than some user specified threshold $\theta$; otherwise, assignment is done according to the Euclidean distance. We can demonstrate how the point symmetry distance (PSD) measure works well for the case of symmetrical intra-clusters. Assume two centroids $c_1$ and $c_2$ and three data points $x_1$, $x_2$, and $x_3$, with $x_2$ placed nearly symmetrically to $x_1$ about $c_1$. The PSD measure between the data points and the cluster centers is then computed from (1).

Because the computed value of $d_s$ is less than the specified threshold (0.18), the data point $x_2$ is said to be the most symmetrical point of $x_1$ relative to $c_1$; consequently, assigning the data point $x_1$ to the cluster of $c_1$ is a good decision. But two problems occur in the point symmetry distance (PSD) measure: (1) it lacks the distance difference symmetry property, and (2) it leads to an unsatisfactory clustering result for the case of symmetrical inter-clusters. Regarding the first problem, the PSD measure favors the far data point when there are more than two symmetrical data points, and this may degrade the symmetrical robustness. This problem is depicted in Figure 1.
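To make the measure concrete, a minimal Python sketch of (1) follows (the function name and the toy coordinates are ours, chosen only for illustration):

import numpy as np

def psd(x, c, data):
    # Point symmetry distance of Su and Chou: scan all other points for the
    # best match to the reflection of x about the centre c.
    best = np.inf
    for y in data:
        if np.array_equal(y, x):
            continue
        num = np.linalg.norm((x - c) + (y - c))            # 0 for perfect symmetry
        den = np.linalg.norm(x - c) + np.linalg.norm(y - c)
        best = min(best, num / den)
    return best

# Toy example: x2 is nearly symmetric to x1 about c1, so psd(x1, ...) is small.
c1 = np.array([0.0, 0.0])
x1, x2 = np.array([1.0, 1.0]), np.array([-1.0, -1.05])
print(psd(x1, c1, [x1, x2]))   # well below the threshold 0.18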

Let the four data points be positioned as in Figure 1, and find the most symmetrical point of a given point relative to the centroid using (1).

The farther data point is selected as the most symmetrical point relative to the centroid. This shows that (1) favors the far data point when there are more than two candidate points, and this may corrupt the symmetrical robustness.

Regarding the second problem, if two clusters are symmetrical to each other with respect to the centroid of any third cluster, then the PSD measure gives an unsatisfactory clustering result. As presented in Figure 2, two clusters are symmetrical with respect to the center of a third cluster.

Let the cluster centers and data points be positioned as in Figure 2; then, for a data point in one of the two symmetrical clusters, by (1) we obtain its point symmetry distances with respect to each cluster center.

In the above example the point symmetry distance with respect to the center of the third cluster is the smallest among all the distances, so the data point would be assigned to the third cluster, but this conflicts with our visual assessment. Due to the above two problems, Chung and Lin [8] proposed a symmetry based distance measure known as the Symmetry Similarity Level (SSL), which satisfies the distance difference symmetry property. Let $c$ denote the cluster centroid and let $x_i$ and $x_j$ denote two related data points, as shown in Figure 3.

Let $d_i = \|x_i - c\|$ and $d_j = \|x_j - c\|$; then the Distance Similarity Level (DSL) operator for measuring the distance difference symmetry between the distance $d_i$ and the distance $d_j$ is defined by (5) in [8].

If the interval used in (5) is replaced by a wider one, the number of examined symmetrical points increases and the computational gain might be degraded. They proposed a second component called the Orientation Similarity Level (OSL). By applying the projection concept, the OSL between the two vectors $(x_i - c)$ and $(c - x_j)$ is defined by (6).

By (5) and (6), they combined the effects of DSL and OSL to define a Symmetry Similarity Level (SSL) between the vectors $(x_i - c)$ and $(c - x_j)$. Since both components lie between 0 and 1, it is easy to verify that the SSL does as well. Based on the SSL operator, Chung and Lin [8] proposed a modified point symmetry based K-means (MPSK) algorithm. The time complexity of finding the symmetrical points for $N$ objects is $O(N^2)$, so this approach is not suitable for large and high dimensional data sets. To overcome the limitations of SBKM, Bandyopadhyay and Saha [23] proposed a new point symmetry based clustering algorithm known as the variable string length genetic clustering technique with point symmetry (VGAPS). Their point symmetry distance is defined as follows. The symmetrical (reflected) point of $x$ with respect to a particular center $c$ is $2c - x$, denoted by $x^*$. Let the $knear$ unique nearest neighbors of $x^*$ be at Euclidean distances $d_1, d_2, \dots, d_{knear}$. Then

$$d_{ps}(x, c) = \frac{\sum_{i=1}^{knear} d_i}{knear} \times d_e(x, c), \quad (8)$$

where $d_e(x, c)$ is the Euclidean distance between the point $x$ and the cluster center $c$. It can be seen from (8) that $knear$ cannot be chosen equal to 1, since if the reflected point $x^*$ exists in the data set then $d_1 = 0$, and hence there will be no impact of the Euclidean distance. To overcome this problem, they take the average distance between the reflected point and its first and second unique nearest neighbors. They proposed a rough guideline for the choice of $\theta$, the threshold value on the point symmetry distance, which is set equal to the maximum nearest neighbor distance in the data set. For reducing the complexity of the point symmetry distance computation, a Kd tree based data structure is used. VGAPS detects clusters which are point symmetric with respect to their centers; thus VGAPS fails if the clusters do not have this property.
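A brief sketch of (8) follows, with scipy's cKDTree standing in for the Kd tree of [23] and knear = 2 as recommended above (function names are ours):

import numpy as np
from scipy.spatial import cKDTree

def d_ps(x, c, tree, knear=2):
    # Reflected point of x with respect to the centre c.
    x_star = 2.0 * c - x
    # Distances to the knear unique nearest neighbours of the reflected point.
    dists, _ = tree.query(x_star, k=knear)
    d_sym = np.sum(dists) / knear            # (d1 + d2) / 2 for knear = 2
    return d_sym * np.linalg.norm(x - c)     # Euclidean term of (8)

data = np.random.rand(200, 2)
tree = cKDTree(data)
print(d_ps(data[0], data.mean(axis=0), tree))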

2.2. Existing Line Symmetry Based Distance

From the geometrical symmetry viewpoint, point symmetry and line symmetry are two widely discussed issues. Motivated by this, Saha and Maulik [24] proposed a new line symmetry based automatic genetic clustering technique called variable string length genetic line symmetry distance based clustering (VGALS-Clustering). To measure the amount of line symmetry of a point $x$ with respect to a particular line $i$, the following steps are followed: (1) for the data point $x$, calculate the projected point $p_i$ on the relevant symmetrical line $i$; (2) find $d_{sym}(x, i)$ as

$$d_{sym}(x, i) = \frac{\sum_{j=1}^{knear} d_j}{knear},$$

where the $knear$ nearest neighbors of the reflected point $x^* = 2p_i - x$ are at Euclidean distances $d_1, d_2, \dots, d_{knear}$. Then the amount of line symmetry of the point $x$ with respect to the symmetrical line of cluster $i$ is calculated as

$$d_{ls}(x, i) = d_{sym}(x, i) \times d_e(x, c_i),$$

where $c_i$ is the centroid of the particular cluster $i$ and $d_e(x, c_i)$ is the Euclidean distance between the point $x$ and $c_i$. The possible problem with this line symmetry measure is that it lacks the closure property. The closure property can be expressed as follows: if the data point $x$ is currently assigned to a cluster centroid in the current iteration, the determined most symmetrical point relative to $x$ must have been assigned to that centroid in the previous iteration.

3. Proposed Line Symmetry Measure

Both the point symmetry and the line symmetry distances lack the closure property, and this can result in an unsatisfactory clustering result. According to the symmetry property, consider the data point in Figure 4 which is not originally in the cluster: if its symmetry distance with respect to that cluster center is the smallest among all the symmetry distances, the data point should currently be assigned to that cluster. But its most symmetrical point relative to that centroid is a data point which has been assigned to a different centroid. Since the most symmetrical point has not been assigned to this centroid before, the closure property is violated, and an unsatisfactory clustering result follows.

Considering the above problem in existing symmetry based distances, we apply a constraint in the new line symmetry distance measure to satisfy the closure property: to compute the line symmetry distance of a data point relative to the symmetrical line of a cluster, we restrict the candidate symmetrical points to points of that cluster. For the data point $x$ relative to the symmetrical line of cluster $k$, this restriction helps us to search for a more suitable symmetrical point, because we ignore a candidate most symmetrical point which is not in cluster $k$. As depicted in Figure 5, a point may have the smallest line symmetry distance with respect to the particular line of one cluster while its symmetrical point belongs to another; due to the above constraint, the proposed line symmetry distance method assigns the point to the latter cluster, which is a reasonable assignment from our visual system. We apply a second modification in which the first and second symmetrical points $x^*_1$ and $x^*_2$ of a point $x$ are found in cluster $k$ (as shown in Figure 6) relative to the symmetrical line, not among all data points; that is, each point $x_i$, $i = 1, \dots, N$, is assigned to cluster $k$ iff $d_{ls}(x_i, k) < \theta$, where $k = \arg\min_{j = 1, \dots, K} d_{ls}(x_i, j)$ and the symmetrical points $x^*_1$ and $x^*_2$ belong to cluster $k$. The distance $d_{ls}$ is calculated as given in (22), and $\theta$ is the symmetrical threshold, equal to the maximum nearest neighbor distance in the data set.

The value of $\theta$ is kept equal to the maximum nearest neighbor distance among all the points in the data set. Point assignment based on the proposed line symmetry distance is given in Algorithm 1.

Procedure: Point assignment based on proposed line symmetry distance(X, c_1, …, c_K)
for i = 1 to N do
 for k = 1 to K do
  Find the first and the second symmetrical points x*_1 and x*_2 of x_i relative to the
  projected point p_k on the symmetrical line of cluster k /* To ensure the closure property */
  Calculate the line symmetry based distance d_ls(x_i, k)
 end for
 Find k* = arg min_{k = 1, …, K} d_ls(x_i, k)
 if d_ls(x_i, k*) < θ and the symmetrical points x*_1 and x*_2 belong to cluster k* then
  Assign the point x_i to the cluster k*
 else
  Assign the point x_i to the cluster based on the Euclidean distance measure, k* = arg min_{k = 1, …, K} d_e(x_i, c_k)
 end if
end for
Compute new cluster centers of the K clusters as follows:
                 c_k = (1 / n_k) Σ_{x_j ∈ C_k} x_j,  k = 1, …, K,
where n_k is the number of data points belonging to the cluster k and C_k is the set of data
points which have been assigned to the cluster center c_k.
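A runnable rendering of Algorithm 1 is sketched below (a minimal sketch: the line symmetry distance d_ls of (22) is assumed to be available as a function, for example the one sketched later in this section, and all helper names are ours):

import numpy as np

def assign_points(data, centers, d_ls, theta):
    # data: (N, d) array; centers: (K, d) array; d_ls(x, k) implements (22)
    # with the closure constraint on the first and second symmetrical points.
    N, K = len(data), len(centers)
    labels = np.empty(N, dtype=int)
    for i, x in enumerate(data):
        ls = np.array([d_ls(x, k) for k in range(K)])
        k_star = int(np.argmin(ls))
        if ls[k_star] < theta:                 # line symmetry based assignment
            labels[i] = k_star
        else:                                  # fall back to the Euclidean distance
            labels[i] = int(np.argmin(np.linalg.norm(centers - x, axis=1)))
    # Recompute each cluster center as the mean of its assigned points.
    new_centers = np.array([data[labels == k].mean(axis=0) if np.any(labels == k)
                            else centers[k] for k in range(K)])
    return labels, new_centers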

For computing the proposed line symmetry distance on a given data set, we find the symmetrical line of each cluster by using the central moment method [25], which is used to measure the symmetry similarity between two data points relative to the symmetrical line. Let the data set be denoted by $X = \{(x_i, y_i)\}_{i=1}^{N}$; then the $(p+q)$th order moment is defined as

$$m_{pq} = \sum_{i=1}^{N} x_i^p y_i^q.$$

The centroid of the given data set for one cluster is defined as $(\bar{x}, \bar{y}) = (m_{10}/m_{00},\ m_{01}/m_{00})$. The central moment is defined as

$$\mu_{pq} = \sum_{i=1}^{N} (x_i - \bar{x})^p (y_i - \bar{y})^q, \quad (12)$$

where $p, q = 0, 1, 2, \dots$. According to the calculated centroid and (12), the major axis of each cluster can be determined by the following two items: (a) the major axis of the cluster must pass through the centroid; (b) the angle between the major axis and the $x$-axis is equal to $\theta = \frac{1}{2}\tan^{-1}\!\left(\frac{2\mu_{11}}{\mu_{20} - \mu_{02}}\right)$.

Consequently, for one cluster, its corresponding major axis is thus expressed by

$$y - \bar{y} = \tan(\theta)\,(x - \bar{x}).$$

Let the normalized form of the data points be stored in the memory of the computer system. Now we can apply the central moment method for computing the shape of the data points. A brief mathematical calculation for finding the symmetrical line of each cluster by using the central moment method is given in Figure 7 and below.

The centroid of cluster $k$ is calculated as

$$\bar{x} = \frac{m_{10}}{m_{00}}, \qquad \bar{y} = \frac{m_{01}}{m_{00}}.$$

We can apply the centroid of the cluster for computing the central moments. The physical significance of the central moments is that they give the area and the moments of inertia. The lowest order (zeroth) central moment gives the area of the region $R$:

$$\mu_{00} = \sum_{(x, y) \in R} 1 = m_{00}.$$

The product moment involves the product of the $x$ and $y$ deviations, each raised to the power one:

$$\mu_{11} = \sum_{(x, y) \in R} (x - \bar{x})(y - \bar{y}).$$

The second order central moment along the $x$-axis is

$$\mu_{20} = \sum_{(x, y) \in R} (x - \bar{x})^2.$$

The second order central moment along the $y$-axis is

$$\mu_{02} = \sum_{(x, y) \in R} (y - \bar{y})^2.$$

The angle between the major axis and the $x$-axis is

$$\theta = \frac{1}{2}\tan^{-1}\!\left(\frac{2\mu_{11}}{\mu_{20} - \mu_{02}}\right).$$

The obtained major axis is treated as the symmetric line of the relevant cluster. This symmetrical line is used to measure the amount of line symmetry of a particular point in that cluster. In order to measure the amount of line symmetry of a point $x$ with respect to the particular line of cluster $k$, $k = 1, \dots, K$, the following steps are followed. (1) For the particular data point $x$, calculate the projected point $p_k$ on the relevant symmetrical line of cluster $k$ (as shown in Figure 8) and then find the reflected point $x^* = 2p_k - x$ relative to that symmetrical line. (2) Find $d_{sym}(x, k)$ as

$$d_{sym}(x, k) = \frac{\sum_{i=1}^{knear} d_i}{knear},$$

where the $knear$ nearest neighbors of $x^*$ are at Euclidean distances $d_1, d_2, \dots, d_{knear}$. In fact, the role of the parameter $knear$ is intuitively easy to understand, and it can be set by the user based on specific knowledge of the application. In general, a fixed value of $knear$ may have many drawbacks. For clusters with too few points, the points are likely to be scattered and the distance between two neighbors may be too large. For a very large cluster, a fixed number of neighbors may not be enough, because few neighbors would have a distance close to zero. Obviously, the parameter $knear$ is related to the expected minimum cluster size and should be much smaller than the number of objects in the data. To gain a clear idea of the distance of the neighborhood of a point, we have chosen $knear = 2$ in our implementation. The randomized Kd tree based nearest neighbor search is used to reduce the computation time of the nearest neighbor queries. The amount of line symmetry of a particular point $x$ with respect to the particular symmetrical line of cluster $k$ is calculated as

$$d_{ls}(x, k) = d_{sym}(x, k) \times d_e(x, c_k), \quad (22)$$

where $c_k$ is the centroid of the cluster and $d_e(x, c_k)$ denotes the Euclidean distance between the data point $x$ and the cluster center $c_k$.
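In the two dimensional case, the whole construction, from the central moments to (22), can be sketched as follows (cKDTree stands in for the randomized Kd forest of Section 3.1; the helper names are ours):

import numpy as np
from scipy.spatial import cKDTree

def major_axis(points):
    # Centroid and second order central moments of the cluster.
    cx, cy = points.mean(axis=0)
    dx, dy = points[:, 0] - cx, points[:, 1] - cy
    mu11, mu20, mu02 = (dx * dy).sum(), (dx ** 2).sum(), (dy ** 2).sum()
    theta = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)   # angle with the x-axis
    return np.array([cx, cy]), theta

def d_ls(x, cluster_points, tree, knear=2):
    centre, theta = major_axis(cluster_points)
    u = np.array([np.cos(theta), np.sin(theta)])        # unit vector of the major axis
    p = centre + np.dot(x - centre, u) * u              # projection of x on the line
    x_star = 2.0 * p - x                                # reflection about the line
    dists, _ = tree.query(x_star, k=knear)
    return (dists.sum() / knear) * np.linalg.norm(x - centre)   # (22)

pts = np.random.rand(100, 2)
print(d_ls(pts[0], pts, cKDTree(pts)))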

3.1. Multiple Randomized Kd Trees Based Nearest Neighbor Search

The problem of nearest neighbor search is of major importance in a variety of applications such as image recognition, data compression, pattern recognition and classification, machine learning, document retrieval systems, statistics, and data analysis. The most widely used algorithm for nearest neighbor search is the K dimensional tree (Kd tree) [26-30]. This works well for exact nearest neighbor search in low dimensional data but quickly loses its effectiveness as dimensionality increases, because in high dimensions finding the nearest neighbor may require searching a very large number of nodes. Solving this problem exactly in high dimensional spaces seems to be a very difficult task, and there is no algorithm that performs significantly better than the standard brute-force search. To address this problem, Silpa-Anan and Hartley [31] investigated the following strategies: (1) they create multiple Kd trees, each with a different structure, in such a way that searches in the different trees are (largely) independent; (2) with a limit on the number of nodes to be searched, they break the search into simultaneous searches among all the trees, so that on average an equal share of the node budget is searched in each tree; (3) principal component analysis is used to rotate the data to align its moment axes with the coordinate axes, so that the data is split up in the tree by hyperplanes perpendicular to the principal axes. They report that either by using multiple search trees or by building the Kd tree from data realigned according to its principal axes, search performance improves, and it improves even further when both techniques are used together.

To overcome the above problem, we have used an approximate nearest neighbor search approach, in which the randomized trees are built by choosing the split dimension randomly from the few dimensions on which the data has the greatest variance, and each tree is constructed independently [32]. In the proposed MOLGC algorithm, instead of always splitting on the maximally variant dimension, each tree selects randomly among the top five most variant dimensions at each level. When searching the trees, a single priority queue is maintained across all the randomized trees so that the search can be ordered by increasing distance to each bin boundary. The degree of approximation is determined by examining a fixed number of leaf nodes, at which point the search is terminated and the best candidates returned. In the multiple randomized Kd tree based nearest neighbor search technique, the data points are preprocessed into a metric space so that, given any query point $q$, the nearest (or, generally, the $k$ nearest) points to $q$ can be reported efficiently. In the proposed MOLGC algorithm, to find the line symmetry distance of a particular point $x$ with respect to the symmetrical line of cluster $k$, we have to find the nearest neighbors of the reflected point $x^* = 2p_k - x$, where $p_k$ is the projection of $x$ on the symmetrical line. Therefore the query point is set equal to $x^*$. After getting the nearest neighbors of $x^*$, the line symmetry distance of $x$ to the symmetrical line of cluster $k$ is calculated by using (22).
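A self-contained sketch of the randomized Kd forest search is given below (the leaf size, the node budget max_leaves, and the tie-breaking counter are implementation details of ours, not of [32]):

import heapq
import numpy as np

class Node:
    def __init__(self, dim=None, val=None, left=None, right=None, idx=None):
        self.dim, self.val, self.left, self.right, self.idx = dim, val, left, right, idx

def build_tree(data, idx, rng, leaf_size=8, top=5):
    if len(idx) <= leaf_size:
        return Node(idx=idx)                  # leaf: store the point indices
    var = data[idx].var(axis=0)
    # Split dimension chosen randomly among the top-5 most variant dimensions.
    cand = np.argsort(var)[::-1][:min(top, data.shape[1])]
    dim = int(rng.choice(cand))
    val = float(np.median(data[idx, dim]))
    mask = data[idx, dim] <= val
    if mask.all() or not mask.any():          # degenerate split, stop here
        return Node(idx=idx)
    return Node(dim, val,
                build_tree(data, idx[mask], rng, leaf_size, top),
                build_tree(data, idx[~mask], rng, leaf_size, top))

def knn_search(trees, data, q, k=2, max_leaves=32):
    # One shared priority queue across all randomized trees, ordered by a
    # lower bound on the distance to each subtree (the bin boundary distance).
    heap, tick = [], 0
    for t in trees:
        heapq.heappush(heap, (0.0, tick, t)); tick += 1
    best, seen, leaves = [], set(), 0         # best: max-heap of (-dist, index)
    while heap and leaves < max_leaves:
        bound, _, node = heapq.heappop(heap)
        if node.idx is not None:              # leaf node: scan its points once
            leaves += 1
            for i in node.idx:
                if i in seen:
                    continue
                seen.add(i)
                d = float(np.linalg.norm(data[i] - q))
                if len(best) < k:
                    heapq.heappush(best, (-d, i))
                elif d < -best[0][0]:
                    heapq.heapreplace(best, (-d, i))
            continue
        diff = q[node.dim] - node.val
        near, far = (node.left, node.right) if diff <= 0 else (node.right, node.left)
        heapq.heappush(heap, (bound, tick, near)); tick += 1
        heapq.heappush(heap, (max(bound, abs(diff)), tick, far)); tick += 1
    return sorted((-d, i) for d, i in best)

rng = np.random.default_rng(0)
data = rng.random((500, 10))
trees = [build_tree(data, np.arange(500), rng) for _ in range(4)]
print(knn_search(trees, data, data[0], k=2))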

4. Multiobjective Line Symmetry Based Genetic Clustering Technique

In this section, a multiobjective genetic clustering technique using the proposed line symmetry based distance is proposed. The algorithm is known as multiobjective line symmetry based genetic clustering (MOLGC). The subsections of this section are organized as follows.

4.1. Chromosome Representation

In the proposed algorithm, the numerical feature values of all cluster centers are encoded into a real coded chromosome as a clustering solution. The length of a particular chromosome is $l = d \times K$, where $d$ is the dimension of the data set and $K$ is the number of cluster centers encoded in that chromosome.

For example, a chromosome of length 12 represents three cluster centers in a four dimensional feature space. For a variable-length chromosome representation, each chromosome has an initial length $l$. The number of clusters, denoted by $K$, is randomly generated in the range $[K_{min}, K_{max}]$. Here $K_{min}$ is chosen as 2, and $K_{max}$ is chosen to be $\sqrt{N}$, where $N$ is the size of the data set. Thereafter, the K-means algorithm is executed with the set of centers encoded in each chromosome, and the resultant centers replace the centers in the corresponding chromosome. The steps of the proposed MOLGC algorithm are given in Algorithm 2.

Begin
Generate the initial population P /* Popsize = 50 */
while (the termination criterion is not satisfied)
  for i = 1 to Popsize
   Assign the points based on the line symmetry based technique:
   call Procedure: Point assignment based on proposed line symmetry distance() to compute
   the fitness of the chromosome
   Compute the objective functions for the current chromosomes
   Select the chromosomes to apply genetic operators
   Apply the crossover operation with probability p_c
   Apply the mutation operator to the selected chromosome with mutation probability p_m
   Compute the objective functions for the new offspring
  end for
end while
Select the best solution from the population
End
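The population initialization of Section 4.1 can be sketched as follows (a minimal sketch, assuming K_max = sqrt(N) and a handful of inline K-means refinement steps; all names are ours):

import numpy as np

def init_population(data, pop_size=50, kmeans_iters=3, rng=None):
    rng = rng or np.random.default_rng()
    N, d = data.shape
    k_max = max(2, int(np.sqrt(N)))
    population = []
    for _ in range(pop_size):
        K = int(rng.integers(2, k_max + 1))                # K in [K_min, K_max]
        centers = data[rng.choice(N, size=K, replace=False)].copy()
        for _ in range(kmeans_iters):                      # a few K-means steps
            labels = np.argmin(((data[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
            for k in range(K):
                if np.any(labels == k):
                    centers[k] = data[labels == k].mean(axis=0)
        population.append(centers.reshape(-1))             # real-coded chromosome of length K*d
    return population

pop = init_population(np.random.rand(300, 2))
print(len(pop), pop[0].shape)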

4.2. Fitness Computation

The fitness of an individual indicates the degree of suitability of the solution it represents. In general, the fitness of a chromosome is evaluated using the objective function of the problem. The first objective of the clustering problem considered in this paper is to maximize the similarity within each cluster and the dissimilarity among clusters. The second objective is to detect clusters based on the line symmetry distance. In this paper, two objective functions, the Davies-Bouldin (DB) index [33] and the proposed line symmetry distance, are used to evaluate the fitness of a chromosome. The DB index is used to find clusters which are compact and well separated, by minimizing the intracluster distance while maximizing the intercluster distance. The DB index is based on the ratio of the sum of within-cluster scatters to between-cluster separations. The within-cluster scatter of cluster $i$ is defined as

$$S_{i,q} = \left(\frac{1}{n_i}\sum_{x \in C_i} \|x - c_i\|^q\right)^{1/q},$$

where $c_i$ denotes the cluster center of cluster $i$, computed as

$$c_i = \frac{1}{n_i}\sum_{x \in C_i} x,$$

where $n_i$ denotes the number of objects belonging to cluster $i$. The within-cluster scatter $S_{i,q}$ denotes the $q$th root of the $q$th moment of the objects belonging to cluster $i$ with respect to their mean. The distance between clusters $i$ and $j$ is denoted as $d_{ij}$ and is defined as $d_{ij} = \|c_i - c_j\|$, where $\|\cdot\|$ stands for the Euclidean distance between the centroids $c_i$ and $c_j$ of the clusters $i$ and $j$, respectively. Then, the DB index is defined as

$$DB = \frac{1}{K}\sum_{i=1}^{K} R_i, \qquad R_i = \max_{j \neq i}\frac{S_{i,q} + S_{j,q}}{d_{ij}}.$$

Here, $K$ corresponds to the number of selected clusters. An individual cluster index $R_i$ is taken as the maximum pairwise comparison, computed as the ratio of the sum of the within-cluster dispersions of the two partitions divided by a measure of the between-cluster separation. Smaller values of the DB index correspond to good clusters. We set the first fitness value of a chromosome equal to $1/DB$, where $DB$ is the DB index of that individual. The second objective function is based on the proposed line symmetry distance. The procedure of the fitness computation is given in Algorithm 3.

Procedure: Fitness(chrom) // Fitness computation of chromosome chrom
Assign the data points based on the procedure for line symmetry based point assignment
LS = 0
for k = 1 to K do
 for each data point x_i, i = 1, …, N and x_i ∈ C_k do
   LS = LS + d_ls(x_i, k)
 end for
end for

The second fitness value of the chromosome, fit_ls, is defined as the inverse of LS; that is, $fit_{ls} = 1/LS$.
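Both fitness values can be sketched as follows (q = 2 within-cluster scatter; nonempty clusters are assumed, and the d_ls function of Section 3 is taken as given):

import numpy as np

def db_index(data, labels, centers, q=2):
    K = len(centers)
    # S[i]: q-th root of the q-th moment of cluster i about its center.
    S = np.array([(np.mean(np.linalg.norm(data[labels == k] - centers[k], axis=1) ** q)) ** (1.0 / q)
                  for k in range(K)])
    db = 0.0
    for i in range(K):
        db += max((S[i] + S[j]) / np.linalg.norm(centers[i] - centers[j])
                  for j in range(K) if j != i)
    return db / K

def fitness(data, labels, centers, d_ls):
    # Two fitness values of a chromosome: 1/DB and 1/LS (Algorithm 3).
    LS = sum(d_ls(x, labels[i]) for i, x in enumerate(data))
    return 1.0 / db_index(data, labels, centers), 1.0 / LS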

4.3. Genetic Operators

In this subsection, genetic operators used in proposed clustering algorithm are discussed. These genetic operators pass genetic information between subsequent generations of the population.

4.3.1. Selection

Pareto based selection is used to select fitter solutions in each step of the evolution. It is a stochastic selection method where the selection probability of a chromosome is proportional to the value of its fitness [34]. The fitness of a chromosome chrom, denoted by fitness(chrom), is converted from its Pareto count (or dominance count) in the whole population. The probability that the individual chrom is selected from the population is

$$p(\text{chrom}) = \frac{\text{fitness}(\text{chrom})}{\sum_{i=1}^{P}\text{fitness}(\text{chrom}_i)},$$

where $P$ is the population size. Compared with the conventional roulette wheel selection method, which is directly based on the fitness of solutions, the Pareto-dominance based selection method can lower the selection pressure and increase the chances of subspaces with low fitness being selected into the next population.
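A sketch of the dominance-count based roulette selection follows (both objectives, the DB index and the line symmetry distance, are treated as minimized; the +1 offset giving every chromosome a nonzero probability is our choice):

import numpy as np

def dominates(a, b):
    # a and b are objective vectors to be minimized (DB index, LS distance).
    return np.all(a <= b) and np.any(a < b)

def pareto_selection(objectives, rng=None):
    rng = rng or np.random.default_rng()
    P = len(objectives)
    # Fitness of a chromosome = 1 + number of solutions it dominates.
    fit = np.array([1 + sum(dominates(objectives[i], objectives[j])
                            for j in range(P) if j != i) for i in range(P)])
    prob = fit / fit.sum()                     # selection probability
    return rng.choice(P, size=P, p=prob)       # indices of the mating pool

obj = np.random.rand(50, 2)
print(pareto_selection(obj)[:10])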

4.3.2. Crossover

Crossover is a probabilistic process that exchanges information between two parent chromosomes for generating two child chromosomes [34]. For chromosomes of length $l$, a random integer, called the crossover point, is generated in the range $[1, l-1]$. The portions of the chromosomes lying to the right of the crossover point are exchanged to produce two offspring. Let parent chromosomes $P_1$ and $P_2$ encode $K_1$ and $K_2$ cluster centers, respectively. $\lambda_1$, the crossover point in $P_1$, is generated as $\lambda_1 = \text{rand}() \bmod K_1$. Let $\lambda_2$ be the crossover point in $P_2$; it may vary in between $[LB(\lambda_2), RB(\lambda_2)]$, where $LB(\lambda_2)$ and $RB(\lambda_2)$ indicate the left and right bounds of the range of $\lambda_2$, respectively, chosen so that each offspring encodes at least two clusters. $LB(\lambda_2)$ and $RB(\lambda_2)$ are given by

$$LB(\lambda_2) = \min\left[2,\ \max\left[0,\ 2 - (K_1 - \lambda_1)\right]\right], \qquad RB(\lambda_2) = K_2 - \max\left[0,\ 2 - \lambda_1\right].$$

Therefore $\lambda_2$ is given by

$$\lambda_2 = LB(\lambda_2) + \text{rand}() \bmod \left(RB(\lambda_2) - LB(\lambda_2) + 1\right).$$

As an example, let two chromosomes $P_1$ and $P_2$ encode 2 and 5 clusters, respectively. Now we can apply the crossover operation on $P_1$ and $P_2$.

The crossover point in $P_1$ is generated as $\lambda_1 = 5 \bmod 2 = 1$, where 5 is the random number generated by the rand() function. The crossover point in $P_2$ varies in between $LB(\lambda_2) = \min[2, \max[0, 2 - (2 - 1)]] = 1$ and $RB(\lambda_2) = 5 - \max[0, 2 - 1] = 4$.

The crossover point in $P_2$ is $\lambda_2 = 1 + \text{rand}() \bmod 4$.

The offspring $O_1$ and $O_2$ generated after the crossover operation exchange the cluster centers lying to the right of $\lambda_1$ and $\lambda_2$, so that $O_1$ encodes $\lambda_1 + (5 - \lambda_2)$ centers and $O_2$ encodes $\lambda_2 + (2 - \lambda_1)$ centers.

Crossover probability is selected adaptively, as in [35]. Let $f_{max}$ be the maximum fitness value of the current population, $\bar{f}$ the average fitness value of the population, and $f'$ the larger of the fitness values of the solutions to be crossed. Then the probability of crossover, $p_c$, is calculated as

$$p_c = \begin{cases} k_1 \dfrac{f_{max} - f'}{f_{max} - \bar{f}}, & f' \geq \bar{f}, \\ k_3, & f' < \bar{f}. \end{cases}$$

Here, the values of $k_1$ and $k_3$ are equal to 1.0 [35]. Clearly, when $f' = f_{max}$, then $p_c$ is equal to 0. The value of $p_c$ increases when the chromosome is quite poor; in contrast, a low $p_c$ means the chromosome is good. This prevents the proposed MOLGC algorithm from getting stuck at a local optimum.
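The variable length crossover and the adaptive probability can be sketched as follows (segments are exchanged in units of whole centers; the function names are ours):

import numpy as np

def adaptive_pc(f_dash, f_max, f_avg, k1=1.0, k3=1.0):
    # Adaptive crossover probability of [35].
    if f_dash >= f_avg and f_max > f_avg:
        return k1 * (f_max - f_dash) / (f_max - f_avg)
    return k3

def crossover(p1, p2, d, rng=None):
    # p1, p2: real-coded chromosomes of K1*d and K2*d genes.
    rng = rng or np.random.default_rng()
    K1, K2 = len(p1) // d, len(p2) // d
    lam1 = int(rng.integers(K1))                       # rand() mod K1
    lb = min(2, max(0, 2 - (K1 - lam1)))               # LB(lambda2)
    rb = K2 - max(0, 2 - lam1)                         # RB(lambda2)
    lam2 = int(rng.integers(lb, rb + 1))               # lambda2 in [LB, RB]
    o1 = np.concatenate([p1[:lam1 * d], p2[lam2 * d:]])
    o2 = np.concatenate([p2[:lam2 * d], p1[lam1 * d:]])
    return o1, o2

a, b = crossover(np.arange(8.0), np.arange(20.0), d=4)
print(len(a) // 4, len(b) // 4)   # cluster counts of the two offspring, each >= 2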

4.3.3. Mutation

Each cluster center in a chromosome is changed with a random variable generated from a Laplacian distribution [18]. This distribution is characterized by a location parameter (any real number) and a scale parameter. The probability density function of Laplace($\mu$, $b$) is

$$p(x) = \frac{1}{2b}\, e^{-\frac{|x - \mu|}{b}},$$

where the scaling parameter $b$ sets the magnitude of the perturbation, referred to as the diversity, and the parameter $\mu$ denotes the location value which is to be perturbed. We set the scaling parameter equal to 1.0 in our experiments. In this mutation operation, the old value at the mutation position of a chromosome is replaced with a newly generated random value drawn from the Laplace distribution centered at it. The mutation operation is applied to all dimensions of the data set independently. The mutation probability is selected adaptively for each chromosome, as in [35]:

$$p_m = \begin{cases} k_2 \dfrac{f_{max} - f}{f_{max} - \bar{f}}, & f \geq \bar{f}, \\ k_4, & f < \bar{f}, \end{cases}$$

where $k_2$ and $k_4$ are equal to 0.5 and $f$ is the fitness of the solution to be mutated. The adaptive mutation process assists the genetic algorithm in coming out of local optima. When $f_{max} - \bar{f}$ decreases, $p_c$ and $p_m$ both increase; as a result, the GA comes out of local optima. This also happens near the global optimum and may result in disruption of the near-optimal solutions, so the genetic algorithm may never converge to the global optimum. However, the values of the adaptive crossover probability and adaptive mutation probability are higher for low fitness solutions and lower for high fitness solutions: the high fitness solutions aid the convergence of the genetic algorithm, and the low fitness solutions prevent the genetic algorithm from getting stuck at a local optimum. For the solution with the highest fitness value, $p_c$ and $p_m$ are both 0; as a result, the best solution is transferred into the next generation without crossover and mutation. Together with the selection operator, this may lead to an exponential growth of that solution in the population and may cause premature convergence. To overcome this problem, a default mutation rate (of 0.01) is kept for every solution in the proposed MOLGC algorithm.
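A sketch of the Laplacian mutation with the adaptive probability follows (numpy's Generator.laplace draws from the density above; b = 1.0 as stated, and the names are ours):

import numpy as np

def adaptive_pm(f, f_max, f_avg, k2=0.5, k4=0.5):
    # Adaptive mutation probability of [35].
    if f >= f_avg and f_max > f_avg:
        return k2 * (f_max - f) / (f_max - f_avg)
    return k4

def mutate(chrom, pm, scale=1.0, rng=None):
    rng = rng or np.random.default_rng()
    chrom = chrom.copy()
    for g in range(len(chrom)):
        if rng.random() < max(pm, 0.01):       # default mutation rate of 0.01
            # Replace the old value with a Laplace(old value, scale) draw.
            chrom[g] = rng.laplace(loc=chrom[g], scale=scale)
    return chrom

print(mutate(np.zeros(8), pm=0.5))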

4.4. Termination Criterion

The proposed multiobjective clustering algorithm has been executed for a fixed number of generations. The fixed number is supplied by the user for terminating the algorithm. After termination, the algorithm gives the best string of the last generation that provides the solution to the clustering problem.

5. Experimental Evaluation

The experiments reported in this section were performed on a 2.0 GHz Core 2 Duo processor with 2 GB of memory. We have tested the proposed MOLGC algorithm on both real and synthetic data. The quality of the clustering results is measured by the adjusted Rand index. We compared the performance of the SBKM, MOCK, and MOLGC algorithms. The source code of SBKM is available at http://mail.tku.edu.tw/chchou/others/SBKM.rar, and the source code for the MOCK algorithm is obtained from http://personalpages.manchester.ac.uk/mbs/julia.handl/mock.html. For the purpose of comparison, the multiobjective clustering technique MOCK is executed on the above mentioned data sets with default parameter settings. In order to show the effectiveness of the proposed MOLGC clustering technique over existing symmetry based clustering techniques, the symmetry based K-means (SBKM) algorithm is executed on both real and synthetic data.

5.1. Parameter Setting

The proper setting of parameters in a genetic algorithm is crucial for its good performance; different parameter values might yield very different results. A good setting may give the best solution within a reasonable time period, whereas a poor setting might cause the algorithm to execute for a very long time before finding a good solution, or to fail to find a good solution at all. Grefenstette [36] used a genetic algorithm to investigate the optimal parameters of genetic algorithms. He reported the following best parameter values for a GA: population size = 30, number of generations = not specified, crossover rate = 0.9, and mutation rate = 0.01. However, the selection of optimal parameters in a GA is domain dependent and relies on the specific application area. Below we justify how the parameters used in MOLGC were selected.(1)Population size: Goldberg [37] theoretically analyzed that the optimal population size increases exponentially and is rather large for even moderate chromosome lengths. It has been shown in [37] that the number of schemata processed effectively is proportional to the cube of the population size. This seems to justify the selection of a large population size; however, the larger the population size, the longer the genetic algorithm takes to compute each generation. Motivated by the above discussion, we set the population size to 50 in the proposed algorithm (MOLGC).(2)Number of generations: A GA generally converges within a few generations. The pure selection convergence times are of the order of the logarithm of the population size, so a GA generally searches fairly quickly. In [37] it is mentioned that, for an adequate population size, if some linkage knowledge is incorporated into the chromosomes then mixing of good building blocks can be expected to take place before convergence. Thus it is important to detect near-uniformity of the population and terminate the GA before wasting function evaluations on an inefficient, mutation-based search. So we set the number of generations to 50 (executing MOLGC further did not improve its performance).(3)Initialization of population: It is customary to initialize a genetic algorithm with a population of random individuals, but sometimes previously known (good) solutions can be used to initialize a fraction of the population, which results in faster convergence of the GA. In the proposed MOLGC, after randomly generating the cluster centers, some iterations of the K-means algorithm are executed to separate the cluster centers as much as possible.(4)Selection of crossover and mutation probabilities: These are two basic parameters of a GA. The crossover operation is performed based on the crossover probability $p_c$. If $p_c = 0$, the offspring are exact copies of the parents; if $0 < p_c < 1$, some offspring are the result of the crossover operation on the parent chromosomes; if $p_c = 1$, all offspring are made by crossover. Crossover is performed so that parent chromosomes with good fitness values can be combined in the offspring to yield potentially improved solutions; however, it is good to leave some part of the population to survive to the next generation. The mutation probability $p_m$ determines how often parts of a chromosome are mutated. If there is no mutation, the offspring are taken after crossover (or copying) without any change. If mutation is performed (i.e., $p_m > 0$), a part of the chromosome is changed. If the mutation probability is 100%, the whole chromosome is changed; if it is 0%, nothing is changed.
Mutation is made to prevent the GA from falling into local optima, but it should not occur very often; otherwise the GA turns into random search. In MOLGC the mutation probability and crossover probability were initially kept fixed, and we obtained good results with the chosen combination of crossover and mutation probabilities.

The parameters used for the proposed MOLGC algorithm in our experimental study are given in Table 1. Apart from the maximum number of clusters, these parameters are kept constant over the entire range of data sets in our comparison study. In this comparison study, the SBKM algorithm is executed for 50 iterations, and its parameter $\theta$ is chosen equal to 0.18 for all data sets. For the MOCK algorithm the total number of generations is kept equal to 50.

In order to evaluate the performance of the proposed multiobjective genetic clustering algorithm more objectively, eight artificial data sets and four real data sets are used.

5.2. Artificial Data Sets

The artificial data set-1, data set-2, data set-3, and data set-4 are obtained from [7, 17], and the remaining data sets were generated by two data generators (http://personalpages.manchester.ac.uk/mbs/julia.handl/generators.html). These generators permit controlling the size and structure of the generated data sets through parameters such as the number of points and the dimensionality of the data set.(1)Data set-1: This data set, used in [7], contains 300 points distributed on two crossed ellipsoidal shells. This is shown in Figure 9(a).(2)Data set-2: This data set, used in [7], is a combination of ring shaped, compact, and linear clusters. The total number of points in it is 300. The dimension of this data set is two. This is shown in Figure 9(b).(3)Data set-3: This data set, used in [7], consists of 250 data points distributed over five spherically shaped clusters. This is shown in Figure 9(c).(4)Data set-4: This data set, used in [17], consists of 1000 data points distributed over four square clusters. This is shown in Figure 9(d).(5)Data set-5: This data set contains 838 ten dimensional data points distributed over four Gaussian shaped clusters.(6)Data set-6: This data set consists of 3050 ten dimensional data points distributed over ten Gaussian shaped clusters.(7)Data set-7: This is a 50 dimensional data set consisting of 351 data points distributed over four ellipsoid shaped clusters.(8)Data set-8: This data set contains 2328 fifty dimensional data points distributed over ten ellipsoid shaped clusters.

The real data sets are obtained from UCI repository (http://archive.ics.uci.edu/ml/). For experimental results four real data sets are considered.(1)Iris: Iris data set consists of 150 data points distributed over three clusters. Each cluster has 50 points. This data set represents different categories of irises characterized by four feature values. It has three classes, Setosa, Versicolor, and Virginica, among which the last two classes have a large amount of overlap while the first class is linearly separable. The sepal area is obtained by multiplying the sepal length by the sepal width and the petal area is calculated in an analogous way.(2)Cancer: Wisconsin breast cancer data set consists of 683 sample points. Each pattern has nine features corresponding to clump thickness, cell size uniformity, cell shape uniformity, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. There are two categories in the data: malignant and benign. The two classes are known to be linearly separable.(3)Wine: This is the Wine recognition data consisting of 178 instances having 13 features resulting from a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.(4)Diabetes: This is the diabetes data set consisting of 768 instances having 8 attributes.

5.3. Evaluation of Clustering Quality

To compare the performance of the three algorithms (SBKM, MOCK, and MOLGC), the adjusted Rand index [38] is used. Let $n_{ij}$ be the number of objects that are in both class $i$ and cluster $j$. Let $n_{i\cdot}$ and $n_{\cdot j}$ be the number of objects in class $i$ and in cluster $j$, respectively. Under the generalized hypergeometric model, it can be shown that

$$E\left[\sum_{i,j}\binom{n_{ij}}{2}\right] = \frac{\sum_i \binom{n_{i\cdot}}{2}\sum_j \binom{n_{\cdot j}}{2}}{\binom{n}{2}}.$$

The adjusted Rand index [38] can be simplified to

$$ARI = \frac{\sum_{i,j}\binom{n_{ij}}{2} - \left[\sum_i \binom{n_{i\cdot}}{2}\sum_j \binom{n_{\cdot j}}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_i \binom{n_{i\cdot}}{2} + \sum_j \binom{n_{\cdot j}}{2}\right] - \left[\sum_i \binom{n_{i\cdot}}{2}\sum_j \binom{n_{\cdot j}}{2}\right]\Big/\binom{n}{2}}.$$

The adjusted Rand index attains a value of 1 for a perfect clustering, and a high value of the adjusted Rand index indicates a good quality clustering result. The averages and standard deviations of the adjusted Rand index produced by 20 consecutive runs of SBKM, MOCK, and MOLGC on the data sets are depicted in Tables 2(a) and 2(b), respectively.
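The index can be computed directly from the contingency table, as in the following sketch (sklearn.metrics.adjusted_rand_score gives the same value and may be used instead):

import numpy as np

def comb2(x):
    return x * (x - 1) / 2.0

def adjusted_rand_index(true_labels, pred_labels):
    classes, class_idx = np.unique(true_labels, return_inverse=True)
    clusters, cluster_idx = np.unique(pred_labels, return_inverse=True)
    n_ij = np.zeros((len(classes), len(clusters)))
    for ci, cj in zip(class_idx, cluster_idx):
        n_ij[ci, cj] += 1                       # contingency table
    sum_ij = comb2(n_ij).sum()
    sum_i = comb2(n_ij.sum(axis=1)).sum()
    sum_j = comb2(n_ij.sum(axis=0)).sum()
    expected = sum_i * sum_j / comb2(len(true_labels))
    max_index = 0.5 * (sum_i + sum_j)
    return (sum_ij - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))   # 1.0: perfect agreement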

5.4. Results on Artificial Data Sets

(1)Data set-1: We use this data set to illustrate that the proposed algorithm incorporating the line symmetry distance can also detect ring-shaped clusters even if they are crossed. Figure 10(a) shows the clustering result achieved by the SBKM algorithm. Figure 10(b) illustrates the final result achieved by the MOCK algorithm. Figure 10(c) shows the clustering result of the MOLGC algorithm. We find that the SBKM algorithm cannot work well for this case. Both the MOLGC and MOCK clustering algorithms provide the correct number of clusters in different runs. The SBKM clustering algorithm discovers the correct number of clusters but is unable to perform the proper partitioning of this data set in different runs.(2)Data set-2: This data set is a combination of ring-shaped, compact, and linear clusters, as shown in Figure 9(b). Most clustering algorithms based on objective function minimization fail on this kind of data set because their performance depends on the dissimilarity measure used to generate a partition of the data. The clustering result achieved by the SBKM algorithm is shown in Figure 11(a). The final clustering result of the MOCK algorithm is illustrated in Figure 11(b). Figure 11(c) shows that the proposed algorithm works well for a set of clusters of different geometrical structures. Both the SBKM and MOCK clustering algorithms provide the correct number of clusters in different runs, but both are unable to perform the proper partitioning of this data set. The MOLGC clustering algorithm detects the optimal number of clusters and the proper partitioning of data set-2 in all consecutive runs.(3)Data set-3: As can be seen from Figures 12(a) to 12(c), for this data set the SBKM clustering technique is unable to detect the appropriate number of clusters. The best solution provided by MOCK is also not able to determine the appropriate number of clusters for this data set; the corresponding partitioning is shown in Figure 12(b). The MOLGC algorithm is able to detect the appropriate number of clusters in different runs; the corresponding partitioning is shown in Figure 12(c). MOCK splits the data points of one cluster into two clusters and thus overestimates the number of clusters in different runs, whereas SBKM merges data points into four clusters and underestimates the appropriate number of clusters.(4)Data set-4: Both the MOCK and MOLGC clustering algorithms are able to detect the appropriate number of clusters for this data set in different runs. The clustering result obtained by the SBKM algorithm is shown in Figure 13(a). The partitioning identified by the MOCK clustering algorithm is shown in Figure 13(b). Figure 13(c) shows that the proposed algorithm works well for this data set. SBKM again overlaps the data points of two clusters and does not discover the optimal number of clusters; it is unable to perform the proper partitioning of this data set in different runs.(5)Data set-5: The proposed MOLGC algorithm and the MOCK algorithm are able to detect the appropriate number of clusters for this data set in different runs. However, MOCK merges the data points of two clusters and is not able to detect the proper partitioning of this data set in all runs. SBKM is not able to detect the appropriate number of clusters or the proper partitioning in different runs; it again splits the data points of one cluster into two clusters.
As shown in Tables 2(a) and 2(b), the SBKM algorithm cannot work well.(6)Data set-6: From Tables 2(a) and 2(b), it is clear that the proposed MOLGC and MOCK algorithms perform much better on this data set than the SBKM algorithm. SBKM is unable to provide the appropriate number of clusters or the proper partitioning in different runs. Both the MOCK and MOLGC clustering algorithms detect the appropriate number of clusters for data set-6 in all runs, but MOCK overlaps some data points of two clusters of this data set.(7)Data set-7: As can be seen from Table 2(a), MOLGC performs the best (providing the highest adjusted Rand index value) for this data set. The performance of MOCK is also better than that of the SBKM algorithm. For this data set, SBKM splits the densest cluster into two clusters and overestimates the number of clusters. Both the MOCK and MOLGC clustering algorithms produce the proper number of clusters and partitioning of this data set in different runs, but the adjusted Rand index value corresponding to the partitioning obtained by MOLGC is higher than that of MOCK (as shown in Table 2(a)).(8)Data set-8: As shown in Table 2(a), the adjusted Rand index of MOLGC is the highest for this data set, while the performance of MOCK is second. However, the performance of the SBKM algorithm is poor. For this data set, both SBKM and MOCK detect the appropriate number of clusters, but both clustering algorithms are unable to produce the appropriate partitioning of this data set in all consecutive runs. The adjusted Rand index values reported in Tables 2(a) and 2(b) also show the poorer performance of both the SBKM and MOCK algorithms on this data set. MOLGC discovers the appropriate number of clusters and the appropriate partitioning of this data set in different runs.

5.5. Results on Real Data Sets

(1)Iris: As seen from Table 2(a), the adjusted Rand index of MOLGC is the best for Iris, while the performance of MOCK is second. However, it can be seen from Tables 2(a) and 2(b) that the performance of the SBKM algorithm is poor. SBKM, MOCK, and MOLGC all provide the appropriate number of clusters for this data set in all consecutive runs, but SBKM produces an overlapping of data points in two clusters, whereas the third cluster is well separated from these two.(2)Cancer: As can be seen from Table 2(a), MOLGC performs the best (providing the highest adjusted Rand index value) for this data set. The performance of MOCK is similar to that of MOLGC, but the performance of the SBKM algorithm is poor. All the clustering algorithms are able to provide the proper number of clusters for this data set in different consecutive runs.(3)Wine: From Tables 2(a) and 2(b), it is evident that MOLGC performs the best for this data set. Both the MOLGC and MOCK clustering algorithms are able to provide the proper number of clusters for this data set, and the adjusted Rand index value obtained by MOLGC is the maximum (refer to Table 2(a)). SBKM is not able to perform the proper partitioning of this data set.(4)Diabetes: From Tables 2(a) and 2(b), it is again clear that MOLGC performs much better than the other two algorithms (providing the highest adjusted Rand index value). The MOLGC and MOCK clustering algorithms detect the optimal number of clusters for this data set, and both are able to provide the proper partitioning of this data set in different consecutive runs. SBKM is not able to detect the appropriate number of clusters in all consecutive runs. The corresponding adjusted Rand index values are reported in Tables 2(a) and 2(b).

It can be seen from the above results that the proposed MOLGC clustering algorithm is able to detect the appropriate number of clusters for most of the data sets used in the experiments. The superiority of MOLGC is also established on the four real-life data sets, which are of different characteristics with the number of dimensions varying from 4 to 13. The results on the eight artificial and four real-life data sets establish the fact that MOLGC is well suited to detecting the number of clusters in data sets having clusters of widely varying characteristics.

The performance results reported in Tables 2(a) and 2(b) clearly demonstrate the clustering accuracy of SBKM, MOCK, and MOLGC for the artificial and real data sets. Table 3 indicates the average computing time taken by 20 consecutive runs of SBKM, MOCK, and MOLGC for clustering the above data sets. The results show that the SBKM and MOCK execution times increase linearly with increasing dimensionality of the data sets. MOLGC shows better results in terms of reduced CPU time in comparison to SBKM and MOCK.

The proposed MOLGC clustering algorithm is able to automatically identify the appropriate number of clusters in different runs. MOLGC generates the entire set of solutions, with automatic determination of the correct number of clusters, in a single run. It consistently generates the proper cluster number for the eight artificial and four real data sets in different runs.

5.6. Reconstruction Criterion

In this paper a reconstruction criterion is used to evaluate the performance of the SBKM, MOCK, and MOLGC clustering algorithms. A fuzzy C-means (FCM) based clustering platform of this kind is considered in [39]. The objective of this work is to raise awareness of the essence of the encoding and decoding processes completed in the context of fuzzy sets. The main design aspects deal with the relationships between the number of clusters and the reconstruction properties, and the resulting reconstruction error. Let $X = \{x_1, \dots, x_N\}$ be a set of points in a multidimensional experimental data set. Three sets of prototypes are generated by running the SBKM, MOCK, and MOLGC clustering algorithms separately on the experimental data. For any data point $x_i$, we obtain its membership grades to the corresponding clusters. They are denoted by $u_{ik}$ ($i = 1, \dots, N$ and $k = 1, \dots, K$) and are the result of the minimization of the following objective function:

$$Q = \sum_{i=1}^{N}\sum_{k=1}^{K} u_{ik}^m\, d^2(x_i, v_k), \quad (41)$$

where $m$ ($m > 1$) is a fuzzification coefficient and $v_1, \dots, v_K$ are the prototypes. The distance $d$ used in the objective function is viewed as the Point Symmetry Distance (PSD) in SBKM [7], nearest neighbor consistency in MOCK, and the line symmetry distance in MOLGC.

By solving (41) through the use of Lagrange multipliers, we arrive at the following expression for the membership grades, the granular representation of the numeric value:

$$u_{ik} = \frac{1}{\sum_{j=1}^{K}\left(\dfrac{d(x_i, v_k)}{d(x_i, v_j)}\right)^{2/(m-1)}}.$$

Figure 14 highlights the essence of the reconstruction criterion. Our starting point is the result of clustering expressed in terms of the prototypes and the partition matrix.

The main objective of the reconstruction process is to reconstruct the original data using the cluster prototypes and the partition matrix by minimizing the sum of distances [39]

$$V = \sum_{i=1}^{N} d^2(x_i, \hat{x}_i), \quad (43)$$

where $\hat{x}_i$ is the reconstructed version of $x_i$. We use the Point Symmetry Distance (PSD) for SBKM [7], nearest neighbor consistency for MOCK, and the line symmetry distance for MOLGC in (43); zeroing the gradient of $V$ with respect to $\hat{x}_i$, we have

$$\hat{x}_i = \frac{\sum_{k=1}^{K} u_{ik}^m v_k}{\sum_{k=1}^{K} u_{ik}^m}.$$

The performance of the reconstruction is expressed as the reconstruction error

$$E = \sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2.$$

We investigate the behavior of the clustering results quantified in terms of the reconstruction criterion for the artificial and real data sets. Table 4 presents the reconstruction error values reported over 20 consecutive runs of SBKM, MOCK, and MOLGC, respectively. In all experiments, the value of the fuzzification coefficient $m$ was set to 2.
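A sketch of the complete reconstruction criterion for one prototype set follows (m = 2; the Euclidean distance is used here as a stand-in for the three algorithm-specific distances, and the names are ours):

import numpy as np

def reconstruction_error(data, prototypes, m=2.0):
    # Membership grades from the minimization of (41).
    d = np.linalg.norm(data[:, None, :] - prototypes[None], axis=2) + 1e-12
    u = 1.0 / (d ** (2.0 / (m - 1.0)))
    u /= u.sum(axis=1, keepdims=True)
    # Reconstruction: fuzzy-weighted mean of the prototypes.
    w = u ** m
    x_hat = (w @ prototypes) / w.sum(axis=1, keepdims=True)
    return np.sum((data - x_hat) ** 2)         # reconstruction error E

data = np.random.rand(100, 2)
protos = data[np.random.choice(100, 4, replace=False)]
print(reconstruction_error(data, protos))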

5.7. Statistical Significance Test

For a more careful comparison among SBKM, MOCK, and MOLGC, a statistical significance test called the Wilcoxon rank sum test [40] for independent samples has been conducted at the 5% significance level. It is a nonparametric alternative to the paired t-test. It assumes commensurability of differences, but only qualitatively: greater differences still count more, which is probably desired, but the absolute magnitudes are ignored. From the statistical point of view, the test is safer since it does not assume normal distributions; also, outliers have less effect on the Wilcoxon test than on the t-test. The Wilcoxon test assumes continuous differences; therefore they should not be rounded to, say, one or two decimals, since this would decrease the power of the test due to a high number of ties. When the assumptions of the paired t-test are met, the Wilcoxon rank test is less powerful than the paired t-test; on the other hand, when the assumptions are violated, the Wilcoxon test can be even more powerful than the t-test. Three groups, corresponding to the three algorithms SBKM, MOCK, and MOLGC, have been created for each data set. Each group consists of the performance scores (adjusted Rand index for the artificial and real life data) produced by 20 consecutive runs of the corresponding algorithm. The median values of each group for all the data sets are shown in Table 5. The results obtained with this statistical test are shown in Table 6. To establish that this goodness is statistically significant, Table 6 reports the p values produced by Wilcoxon's rank sum test for the comparison of two groups (among SBKM, MOCK, and MOLGC) at a time. As the null hypothesis, it is assumed that there are no significant differences between the median values of the two groups; the alternative hypothesis is that there is a significant difference between the median values of the two groups. All the p values reported in the table are less than 0.05 (5% significance level).

The smaller the p value, the stronger the evidence against the null hypothesis provided by the data. The rank sum test between MOLGC and the algorithms SBKM and MOCK for the artificial and real life data provides very small p values. This is strong evidence against the null hypothesis, indicating that the better median values of the performance metrics produced by MOLGC are statistically significant and have not occurred by chance. Similar results are obtained for all the other data sets and for all the other algorithms compared to MOLGC, establishing the significant superiority of the MOLGC algorithm.
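The test itself can be reproduced with scipy, as sketched below (scipy.stats.ranksums implements the Wilcoxon rank sum test; the score arrays here are random placeholders, not the values behind Table 5):

import numpy as np
from scipy.stats import ranksums

# Adjusted Rand index scores from 20 consecutive runs of two algorithms.
rng = np.random.default_rng(1)
molgc_scores = rng.normal(0.95, 0.01, 20)
mock_scores = rng.normal(0.90, 0.02, 20)

stat, p_value = ranksums(molgc_scores, mock_scores)
print(p_value)        # p < 0.05 rejects the null hypothesis of equal medians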

6. Conclusion

In this paper, a line symmetry based multiobjective genetic clustering algorithm, MOLGC, is proposed. In the proposed algorithm, the points are assigned to different clusters based on a line symmetry based distance. In this multiobjective genetic clustering algorithm, two objective functions, one based on the new line symmetry based distance and the other based on the Euclidean distance based DB index, are used for the computation of fitness. The proposed algorithm can be used to group a given data set into a set of clusters of different geometrical structures. Compared with SBKM and MOCK, the proposed MOLGC algorithm adopts a line symmetry approach to cluster data and is therefore more flexible. Most importantly, a modified version of the line symmetry distance is proposed to overcome some limitations of the original version of the symmetry distance introduced by Chung and Lin [8]. In addition, the MOLGC algorithm outperforms the SBKM algorithm and MOCK in the comparisons of the results presented in this paper. Tables 2(a) and 2(b) indicate the quality of the best clustering results in terms of the adjusted Rand index generated by SBKM, MOCK, and MOLGC for the eight artificial data sets and four real data sets. Table 3 tabulates the comparison of the computational time of the MOLGC algorithm and the other clustering algorithms. Obviously, the proposed algorithm needs more computational resources than simpler clustering algorithms; however, it provides a possible solution for detecting clusters that are a combination of compact clusters, shell clusters, and line-shaped clusters. It should be emphasized that although the present MOLGC algorithm demonstrates to some extent the potential of detecting clusters with different geometrical structures, there still remains much room for improving the MOLGC algorithm, for example in reducing its computational time.

Finally, it is an interesting future research topic to extend the results of this work to face recognition.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.