The Scientific World Journal

Volume 2015, Article ID 180749, 9 pages

http://dx.doi.org/10.1155/2015/180749

## Convalescing Cluster Configuration Using a Superlative Framework

^{1}Department of Information Technology, Info Institute of Engineering, Coimbatore 641107, India

^{2}Department of CSE, SNS College of Technology, Coimbatore 641035, India

Received 18 June 2015; Revised 17 September 2015; Accepted 21 September 2015

Academic Editor: Patricia Melin

Copyright © 2015 R. Sabitha and S. Karthik. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Competent data mining methods are vital for discovering knowledge in the databases that result from the enormous growth of data. Various data mining techniques are applied to obtain knowledge from these databases. Data clustering is one such descriptive data mining technique; it partitions data objects into disjoint segments. The k-means algorithm is a versatile algorithm among the various approaches used in data clustering, but the algorithm and its many adaptations suffer from certain performance problems. To overcome these issues, this paper proposes a superlative algorithm for data clustering. The proposed algorithm discretizes the dataset, thereby improving the accuracy of clustering, and adopts the Binary Search initialization method to generate cluster centroids. The generated centroids are fed as input to the k-means approach, which iteratively assigns the data objects to their respective clusters. The clustered results are measured for accuracy and validity. Experiments on datasets from the UC Irvine Machine Learning Repository show that the accuracy and validity measures of the proposed approach are higher than those of the other two approaches, namely, simple k-means and the Binary Search method. The results thus indicate that the discretization process improves the efficacy of descriptive data mining tasks.

#### 1. Introduction

Conventional database analysis techniques are not well suited to extracting knowledge from enormous databases, so proficient data mining methods are vital for discovering knowledge in them. Effective data mining techniques help extract useful knowledge from raw data [1–5]. Data clustering is one such technique; it partitions data objects into disjoint segments. Market segmentation, image processing, and bioinformatics are among the popular applications of data clustering [6, 7]. Numerous algorithms are available in the literature to direct the clustering process, each with specialised features that make clustering possible in diverse ways. One of these is the versatile k-means algorithm, which is simple, robust, and easy to employ [8]. It is an iterative partitioning algorithm that divides the data into *k* clusters, where *k* is a user-specified parameter. It starts with *k* initial centroids and iteratively performs two steps: assigning each data object to the cluster whose centroid is nearest to it, and updating the centroid of each cluster [9]. The major objectives of clustering are to satisfy and to maintain suitable distance measures using Euclidean distance or Manhattan distance [10]. In spite of its simplicity and ease of use, k-means has a few disadvantages. The outcome of k-means segmentation is influenced by the initial centroid selection, and hence the partitions produced can become trapped in local optima [11].
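
The two-step iteration described above can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in this paper: the function name `kmeans`, the `max_iter` cap, the choice of Euclidean distance, and the rule of keeping the old centroid for an empty cluster are all assumptions made for the example.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd-style k-means on a list of equal-length numeric tuples.

    `centroids` holds the k initial centroids; k is implied by its length.
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each object goes to its nearest centroid (Euclidean).
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its current members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(old)
        if new_centroids == centroids:
            break  # centroids stopped moving, so the assignment is stable
        centroids = new_centroids
    return centroids, labels
```

Because the result depends on the `centroids` argument, calling the function with different starting centroids on the same data can yield different partitions, which is the sensitivity to initialization noted above.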

Many adaptations in the form of heuristic approaches have been made to the k-means algorithm to make it more flexible and robust. These approaches include Simulated Annealing, Ant Colony Optimization (ACO), Tabu Search, Genetic Algorithm, Honey-Bee Mating Optimization, Particle Swarm Optimization (PSO), a hybrid technique based on k-means, ACO, and PSO, Big Bang-Big Crunch, Artificial Bee Colony, the Gravitational Search algorithm, and the Binary Search algorithm [12–22]. Though these heuristic approaches enhance the efficacy of k-means clustering, they have several drawbacks, such as complicated structure and implementation, limited quality of results, optimization problems, and convergence problems [23]. The limited quality of the results leads to lower accuracy. To overcome these limitations and to achieve accurate results in descriptive data mining tasks, this paper proposes a superlative framework that clusters the data objects with high efficacy. The main aim of the proposed method is to enhance the accuracy of the data clustering process.

#### 2. Related Work

Various algorithms are available in the literature to guide the clustering process, each with dedicated features that make clustering possible in diverse ways. One of these is the versatile k-means algorithm, which is simple, robust, and easy to employ [24]. Many adaptations in the form of heuristic approaches have been made to the k-means algorithm to make it more flexible and robust.

Different centroid initializations produce different clustering results, since the k-means clustering algorithm tends to converge to local minima. To escape local minima, the algorithm can be executed several times with different initial centroids for a given *k*, after which the clustering with the minimum squared error is selected. No global and efficient method exists for generating the initial partitions. The final cluster centers differ across trials that start from different initial centroids.
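
The restart strategy just described can be sketched as follows. This is an illustrative sketch only: the helper names `lloyd`, `sse`, and `best_of_restarts`, the default of 10 restarts, and the seeding by random sampling of data points are assumptions for the example and are not taken from any of the cited works.

```python
import math
import random


def lloyd(points, centroids, max_iter=100):
    """Plain k-means iterations from the given initial centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Assumption: an empty cluster keeps its previous centroid.
            updated.append(
                tuple(sum(x) / len(members) for x in zip(*members)) if members else old
            )
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


def sse(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def best_of_restarts(points, k, restarts=10, seed=0):
    """Run k-means from several random seedings and keep the lowest-SSE run."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        init = rng.sample(points, k)  # k distinct data points as initial centroids
        centroids, labels = lloyd(points, init)
        err = sse(points, centroids, labels)
        if best is None or err < best[0]:
            best = (err, centroids, labels)
    return best
```

The cost of this strategy grows linearly with the number of restarts, and it still offers no guarantee of reaching the global minimum, which motivates the initialization methods reviewed in the rest of this section.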

Peña, Larrañaga, and Lozano compared random initialization with other initialization techniques, such as the one proposed by Kaufman and Rousseeuw, in terms of efficiency, convergence speed, effectiveness, and robustness [6]. The experimental results showed that the random method and Kaufman’s method perform much better than the others in terms of efficiency, effectiveness, and robustness. After further measuring the convergence speed, the authors recommended Kaufman’s technique as the more efficient one.

Bradley and Fayyad proposed an enhancement that first executes the k-means method a number of times on random subsamples of the input dataset [25]. The solutions obtained from these runs are pooled and reclustered the same number of times, each time using one of the subsample solutions as the initial guess. The initial centroids for the entire dataset are then chosen as the solution with the minimum error.

Likas et al. developed a global method involving a series of clustering trials with the number of clusters ranging from 1 to *k* [26]. The previously found centers are kept fixed, and each new center is chosen by searching over the entire dataset. The algorithm proved efficient and was independent of the initial partitions. Its drawback is computational complexity, since it executes a number of trials for every value of *k*. The repetitive procedure thus does not assure convergence of the result.

Krishna and Murty added novel methods to their hybrid scheme to attain fast convergence and global solutions [27]. They designed the enhancement based on the variance between two data points, which keeps it from becoming trapped in local optima.

k-means with an adaptive learning strategy is described by Chinrungrueng and Séquin [28]. It can be tuned without any user intervention and depends solely on the within-group variations.

Patanè and Russo proposed an enhanced technique [29] that uses a roulette method involving genetic algorithms and is not susceptible to centroid initialization problems.

Tzortzis and Likas implemented the MinMax algorithm [9], a method that eliminates the centroid initialization problem by changing the objective. The algorithm starts from arbitrarily selected centroids and minimizes the maximum intraclass distance rather than the sum of the intraclass distances. Specifically, a weight is associated with each segment such that segments with higher variance are allotted higher weights, which yields a weighted version of the objective. The method restricts the generation of high-variance clusters and produces efficient results regardless of the initialization. However, it relies on a parameter that controls how strongly high-variance clusters are penalized. This parameter must be specified prior to execution, which is considered a drawback.

Alsultan and Selim [12] proposed a Simulated Annealing (SA) approach in which the segmentation problem converges to a global solution.

Kim and Ahn [13] used a Genetic Algorithm (GA), which was effective on NP-complete global optimization problems and provided good near-optimal solutions in reasonable time.

Al-Sultan [14] adapted Tabu Search, a metaheuristic approach that proved superior to local search clustering algorithms.

Fathian et al. [15] proposed Honey-Bee Mating Optimization (HBMO), a swarm-based optimization approach.

Shelokar et al. [16] implemented Ant Colony Optimization (ACO), which uses distributed agents that imitate the way ants find a shortest path from their nest to a food source.

Chen and Ye [17] used Particle Swarm Optimization (PSO), which automatically searches for the cluster centers in an arbitrary dataset.

Niknam and Amiri [18] proposed a hybrid method based on k-means that uses both ACO and PSO and solves nonlinear clustering problems with an evolutionary approach.

Hatamlou et al. [19] incorporated the Big Bang-Big Crunch technique, which is based on one of the theories of the evolution of the universe.

Karaboga and Ozturk [20] implemented Artificial Bee Colony (ABC), which models the intelligent foraging behavior of a honey bee swarm and was successfully employed to perform multivariate clustering.

Hatamlou et al. [21] used the Gravitational Search approach, which helped the k-means algorithm not only to escape from local optima but also to converge faster.

Hatamlou [22] developed a Binary Search algorithm to discover superior clusters; the methodology converged to the same result across different runs.

#### 3. Proposed Methodology

The proposed method is a segmentation-based method that receives *k*, the number of segments, as input and partitions the dataset into *k* clusters. It is a simple and superlative method that first discretizes the dataset, then calculates the initial centroids, and finally allocates every object in the input dataset to its closest centroid. In this way the framework clusters the data objects with high efficacy.

The methodology involves discretization techniques [30], which transform continuous data into discrete data. The continuous attributes of the dataset are transformed into discrete values, after which the initial centroids are identified for the given number of clusters *k*. These centroids are then used by the k-means data clustering approach to segment the data objects of the dataset into exactly *k* clusters, thus maximizing accuracy.

The main objectives of the proposed approach are (1) to adopt simple structures in representation, (2) to develop a methodology that is easy to implement, (3) to provide a robust and trustworthy approach, (4) to produce accurate clusters, and (5) to generate clusters quickly.

Contributions of this work are as follows:

(i) *Proposing a Framework to Cluster the Input Dataset*. A superlative framework is proposed with three phases, described in Sections 3.1, 3.2, and 3.3.

(ii) *Concrete Description of Typical Discretization Process*. The discretization phase converts the continuous-valued features into discrete values, which are further quantile binned. The discretization and binning process yields reformed data objects.

(iii) *Adaptation of Binary Search Method*. The Binary Search method is adapted to generate the initial centroids: the dataset is split into equal parts based on the number of clusters required, and a split point *S* is then found and used to generate the initial centroids.

(iv) *Modified k-Means Approach*. The k-means algorithm uses the centroids generated in the previous step as its initial points (which is not the case in general k-means) and assigns the data points to the nearest centroids, followed by recomputation of the cluster centroids.

(v) *Algorithmic Representation of the Phases in Framework*. The algorithms for the three phases are given in Algorithms 1, 2, and 3.

(vi) *Demonstration of Applying the Framework on Benchmark Datasets*. The performance and effectiveness of the proposed methodology are evaluated on benchmark datasets; the results are shown in Tables 1 and 2 and discussed in Sections 4.1 and 4.2.

(vii) *Comparative Analysis of the Proposed Approach with k-Means and Binary Search Method*. Simple k-means and the Binary Search method are considered for the comparative study. To demonstrate the strength of the proposed approach, these two methods are compared against it. The metrics are computed and reported for various datasets. The results are shown in Tables 3 and 4 and discussed in Section 4.3.

(viii) *Comprehensive Assessment of Comparative Results*. The efficacy of the proposed methodology is discussed in Section 4.4.
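
To make contributions (ii) and (iii) more concrete, the following sketch shows one way quantile binning and an equal-parts seeding could look. It is an illustration under stated assumptions and does not reproduce Algorithms 1 and 2 of this paper: the function names `quantile_bin`, `discretize`, and `equal_part_seeds` are hypothetical, the bin count is a free parameter, and the seeding rule (sort the rows, cut them into *k* near-equal parts, take each part's mean) stands in for the split point *S* computation, whose details are given later in the paper.

```python
def quantile_bin(values, n_bins):
    """Map each value to a bin index so that bins hold roughly equal counts.

    Equal values always share a bin (they use the rank of their first occurrence).
    """
    first_rank = {}
    for rank, v in enumerate(sorted(values)):
        first_rank.setdefault(v, rank)
    return [first_rank[v] * n_bins // len(values) for v in values]


def discretize(rows, n_bins):
    """Quantile-bin every attribute (column) of a list of numeric tuples."""
    columns = [quantile_bin(list(col), n_bins) for col in zip(*rows)]
    return [tuple(r) for r in zip(*columns)]


def equal_part_seeds(rows, k):
    """Hypothetical seeding: sort rows, cut into k near-equal parts, average each.

    Assumes 1 <= k <= len(rows). This is a stand-in for illustration, not the
    split-point procedure of the Binary Search method.
    """
    ordered = sorted(rows)
    n = len(ordered)
    seeds = []
    for j in range(k):
        part = ordered[j * n // k:(j + 1) * n // k]
        seeds.append(tuple(sum(x) / len(part) for x in zip(*part)))
    return seeds
```

The seeds returned by `equal_part_seeds` would then serve as the initial centroids of an ordinary k-means run on the discretized rows, mirroring the order of the three phases in the framework.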