Scientific Programming

Volume 2018, Article ID 9253295, 23 pages

https://doi.org/10.1155/2018/9253295

## Novel Two-Dimensional Visualization Approaches for Multivariate Centroids of Clustering Algorithms

^{1}Department of Computer Engineering, Faculty of Engineering, Dokuz Eylul University, Izmir, Turkey; ^{2}Department of Electrical and Electronics Engineering, Faculty of Engineering, Dokuz Eylul University, Izmir, Turkey

Correspondence should be addressed to Yunus Doğan; yunus@cs.deu.edu.tr

Received 8 January 2018; Revised 23 June 2018; Accepted 9 July 2018; Published 8 August 2018

Academic Editor: José E. Labra

Copyright © 2018 Yunus Doğan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The dimensionality reduction and visualization problems associated with multivariate centroids obtained by clustering algorithms are addressed in this paper. Two approaches are used in the literature to solve such problems: the self-organizing map (SOM) approach and manually mapping two selected features (MS2Fs). In addition, principal component analysis (PCA) was evaluated as a component for solving this problem on supervised datasets. Each of these traditional approaches has drawbacks: if SOM runs with a small map size, all centroids are located contiguously rather than at their original distances according to the high-dimensional structure; MS2Fs is inefficient because it ignores the features not selected for the mapping; and lastly, PCA transforms the most valuable feature, losing its original values. In this study, five novel hybrid approaches were proposed to eliminate these drawbacks by using the quantum genetic algorithm (QGA) and four feature selection methods: Pearson's correlation, gain ratio, information gain, and relief. Experimental results demonstrate that, for 14 datasets of different sizes, the prediction accuracy of the proposed weighted clustering approaches is higher than that of the traditional K-means++ clustering approach. Furthermore, the proposed approach combining K-means++ and QGA shows the most efficient placement of the centroids on a two-dimensional map for all the test datasets.

#### 1. Introduction

Human visual perception can be insufficient for interpreting a pattern within a multivariate (or high-dimensional) structure, causing errors at the decision-making stage. In knowledge discovery processes, the same drawback is encountered in multivariate datasets because the dataset cannot be rendered on a visual interface as a two-dimensional (2D) structure. Furthermore, inefficient features in a multivariate dataset negatively impact the accuracy and running performance of data analysis tasks. Therefore, the notion of dimensionality reduction is of particular relevance in the preprocessing phase of data analysis. Many algorithms and methods have been proposed and developed for dimensionality reduction [1], of which principal component analysis (PCA) is one of the most popular [2]. Regardless of popularity, neither PCA nor the other available dimensionality reduction methods are efficient enough on their own to visualize all instances in a dataset: at the data preparation stage, before the data-mining algorithm is applied, PCA performs feature selection tests separately for different dimensions, and the optimal feature number and the optimal model are then determined with respect to the variance values of the proposed models. Presenting the dataset as a 2D map using PCA therefore results in a low variance value. Another drawback of PCA is that the most valuable feature is transformed into new values [2]. As a result, PCA is not typically successful at mapping a dataset in 2D, except in image representation and facial recognition studies [3, 4]. A further difficulty is that even if a dimensionality reduction is applied, the visualization of all instances in a large dataset causes storage and performance problems. To address this problem, clustering algorithms can be used for data summarization, with visualization methods applied afterwards.
K-means is the most widely used clustering algorithm; however, it returns only the high-dimensional centroids, without any visualization and without any dimensionality reduction operation.

The literature records three approaches to the visualization of K-means. The first approach involves mapping two selected features from a multivariate dataset (MS2Fs). Curman et al. clustered coordinate data in Vancouver using K-means and presented them on a map [5]; Cicoria et al. clustered "Titanic Passenger Data" using K-means and printed the survival information that was the goal of the study onto different 2D maps according to features [6]; a binary image was clustered onto a 2D interface using K-means for face detection in a study by Hadi and Sumedang [7]; and lastly, Fang et al. implemented software to visualize a dataset according to two selected features [8]. The second approach is visualization performed after the dataset clustered by K-means is dimensionally reduced by PCA. Nitsche used this type of visualization for document clustering [9], and Wagh et al. used it for a molecular biological dataset [10]. The third approach is the visualization of the dataset clustered by K-means in conjunction with a neighborhood-based location method like that of the self-organizing map (SOM) technique; Raabe et al. used it to implement a new approach that locates the K-means centroids on a 2D interface with a novel technique [11]. SOM is a successful mapping and clustering algorithm; nevertheless, it relies on the map-size parameter as the cluster number, and if it runs with a small map-size value, all centroids of the clusters are located contiguously, not at their original distances according to the high-dimensional structure.

In our study, new approaches are proposed to visualize the centroids of the clusters on a 2D map, preserving the original distances in the high-dimensional structure. These different approaches are implemented as hybrid algorithms using K-means++ (an improved version of K-means), SOM++ (an improved version of SOM), and the quantum genetic algorithm (QGA). QGA was selected for use in our hybrid solutions because the dimensionality reduction and visualization problem for multivariate centroids (DRV-P-MC) is also an optimization problem.

The contributions of this study are threefold. First, clustering is performed by K-means++, the centroids are mapped onto a 2D interface using QGA, and the success of this method is evaluated. A heuristic approach is proposed for DRV-P-MC, and the aim is to avoid the drawback of locating the clusters contiguously as in the traditional SOM++. Mapping the centroids onto a 2D interface using SOM++ and evaluating the success of this approach are also performed to enable comparison. Second, the four major feature selection methods mentioned in the paper by De Silva and Leong [12] are used: relief, information gain, gain ratio, and correlation. The aim is to preserve the most valuable feature and to evaluate it as the *X* axis; additionally, the *Y* axis is obtained by a weighted calculation using the coefficient values returned from these feature selection methods. This provides an alternative to PCA for generating a 2D dataset that avoids the PCA drawback of losing the most valuable feature. The datasets are then clustered by K-means++, and the centroids are mapped by these novel approaches separately onto a 2D interface. Moreover, mapping the centroids onto a 2D interface using PCA and evaluating the success of this approach are performed for comparative purposes. Third, a versatile tool is implemented with the capability to select the desired file, algorithm, normalization type, distance metric, and size of the 2D map, and to calculate success across six different metrics: "sum of the square error," "precision," "recall," "*f*-measure," "accuracy," and "difference between multivariate and 2D structures." These metrics are formulated in detail in the Problem Definition section.

In the literature, MS2Fs has generally been the preferred method for manually determining the relations between features. Our tool is not only built on MS2Fs but also contains another algorithm that maps the centroids by using information gain, for comparison with our novel approaches. It takes the *X* axis as the most valuable feature and the *Y* axis as the second most valuable, according to information gain scores.
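The weighted-axis idea can be sketched in a few lines of Python. This is an illustration only: the function name `project_2d` and the score-weighted-average form of the *Y* axis are our assumptions, not necessarily the paper's exact formulation. Given a dictionary of feature-selection scores, it keeps the most valuable feature as the *X* axis and derives the *Y* axis as a weighted combination of the remaining features:

```python
def project_2d(instance, scores):
    """Project one multivariate instance to 2D.

    X is the raw value of the highest-scoring feature; Y is the
    score-weighted average of the remaining features (a sketch of the
    weighted-axis idea; the paper's exact weighting may differ).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)  # best feature first
    best, rest = ranked[0], ranked[1:]
    x = instance[best]
    total = sum(scores[f] for f in rest)
    y = sum(instance[f] * scores[f] for f in rest) / total if total else 0.0
    return x, y

scores = {"a": 0.9, "b": 0.5, "c": 0.5}      # hypothetical gain-ratio scores
instance = {"a": 2.0, "b": 1.0, "c": 3.0}     # hypothetical instance
print(project_2d(instance, scores))
```

The same projection, applied to every centroid, yields the 2D coordinates that the tool can plot directly.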

This paper details our study in seven sections. DRV-P-MC is defined in detail in Section 2; related works providing guidance on DRV-P-MC, including traditional algorithms, hybrid approaches, and the notion of dimensionality reduction, are reviewed in Section 3; in Sections 4 and 5, the reorganized traditional approaches and the proposed algorithms are formulated and presented in detail; the experimental studies performed using 14 datasets with different characteristics and their accuracy results are given in Section 6; and finally, Section 7 presents conclusions about the proposed methods.

#### 2. Problem Definition

Considering that K-means++ is the major clustering algorithm used in this sort of problem, the significance of pattern visualization in decision-making is clear. However, the visualization requirement for the centroids of K-means++ exposes the need for DRV-P-MC, since traditional K-means++ presents the returned centroids and their member elements irrespective of any relation among the centroids. The aim in solving this problem is to map the centroids onto a 2D interface while attaining the minimum difference between the multivariate centroids and their 2D projections.

DRV-P-MC can be summarized as obtaining the 2D placement matrix *E* from the set of multivariate centroids *C*. To determine whether a solution *E* is optimal, the difference value ℓ must be measured as zero or a value close to zero. Two theorems and their proofs are described in this section, with an illustration of the measurement of ℓ.

##### 2.1. Theorem 1

In measuring the distance between two matrices like *S* and *T*, dividing the values in *S* and *T* by the minimum distances in each matrix separately avoids the numeric difference revealed by the dimensional difference between *E* and *C*; thus, a proportional balance between *S* and *T* is supplied as in (1) and (2):

$$S_{i,j} = \frac{S_{i,j}}{x}, \tag{1}$$

where $x$ is the minimum distance value in *S*, and

$$T_{i,j} = \frac{T_{i,j}}{y}, \tag{2}$$

where $y$ is the minimum distance value in *T*.

##### 2.2. Proof 1

Focusing on the optimal *M*, it can be observed that the most similar centroids must be located in the closest cells in *M*, so the smallest values must be obtained for both *S* and *T*. Moreover, in this optimal *M*, the other values in *S* and *T* must be proportional to their smallest values.

###### 2.2.1. Illustration

After clustering and placement operations for *k* = 4, *f* = 3, *r* = 6, and *c* = 6, assume that *C* = {{0.2, 0.3, 0.1}, {0.3, 0.4, 0.2}, {0.5, 0.6, 0.4}, {0.9, 0.9, 0.8}} and *E* = {{0, 0}, {0, 3}, {2, 2}, {5, 5}}; *M* is then obtained as in the following equation, where $C_1, \ldots, C_4$ denote the centroids placed at the cells given by *E* and the remaining cells are empty:

$$M = \begin{pmatrix} C_1 & \cdot & \cdot & C_2 & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & C_3 & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot & C_4 \end{pmatrix}$$

For *M*, the *S* and *T* matrices are obtained as in the following equations:

$$S_{i,j} = \frac{d(C_i, C_j)}{x}, \qquad T_{i,j} = \frac{d(E_i, E_j)}{y},$$

where $d(\cdot,\cdot)$ denotes the distance between two vectors, $x$ is the minimum nonzero distance value in *S*, and $y$ is the minimum nonzero distance value in *T*.

##### 2.3. Theorem 2

To calculate the distance between two matrices containing values that are balanced with each other, traditional matrix subtraction can be used. The subtraction operations must be performed with absolute values, as in the Manhattan distance, to obtain a distance value greater than or equal to zero. After the subtraction operations, a *Z* matrix is obtained, and the sum of its values above the diagonal (each distinct pair counted once) gives the difference between these matrices as follows:

$$Z_{i,j} = \left| S_{i,j} - T_{i,j} \right|, \qquad \ell = \sum_{i=1}^{k} \sum_{j=i+1}^{k} Z_{i,j}.$$

##### 2.4. Proof 2

To compare the two instances containing numeric values using machine learning, normalization operations must be performed to supply numeric balance among the features before the usage of any distance metric like Euclidean distance or Manhattan distance. The first theorem, essentially, claims a normalization operation for *S* and *T* matrices. The second theorem claims that, with normalized values, the distance calculation can be performed by using the traditional subtraction operation.

To illustrate, both *S* and *T* contain normalized values, with 1 as the smallest value in each. The closer *E* is to the optimal solution, the smaller the values in the subtraction matrix *Z*, and thus ℓ is obtained as a value close to zero. In this example, the nonzero entries of *Z* sum to ℓ = 7.4 + 5.5 + 1.8 + 5.9 + 2.6 = 23.2.
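The two theorems combine into a single computable measure. The Python sketch below is our illustration, not the paper's implementation; it assumes Euclidean distance, so its output need not match the worked example's numbers, which depend on the distance metric originally used. It builds the pairwise distance matrices *S* and *T*, normalizes each by its minimum nonzero entry, and sums the absolute differences over the distinct pairs:

```python
import numpy as np

def difference_metric(C, E):
    """Difference l between multivariate centroids C and their 2D
    placements E: pairwise distance matrices S and T are scaled by
    their minimum nonzero entries (Theorem 1), then compared by
    absolute subtraction over the distinct pairs (Theorem 2)."""
    C, E = np.asarray(C, float), np.asarray(E, float)
    S = np.linalg.norm(C[:, None] - C[None, :], axis=-1)  # centroid distances
    T = np.linalg.norm(E[:, None] - E[None, :], axis=-1)  # placement distances
    x, y = S[S > 0].min(), T[T > 0].min()                 # minimum nonzero values
    Z = np.abs(S / x - T / y)
    return Z[np.triu_indices(len(C), 1)].sum()            # sum above the diagonal

C = [[0.2, 0.3, 0.1], [0.3, 0.4, 0.2], [0.5, 0.6, 0.4], [0.9, 0.9, 0.8]]
E = [[0, 0], [0, 3], [2, 2], [5, 5]]
print(difference_metric(C, E))
```

The smaller the returned value, the closer the 2D placement preserves the proportions of the high-dimensional distances.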

#### 3. Related Works

In our study, some traditional machine learning algorithms were utilized to implement hybrid approaches: K-means++, SOM++, PCA, and QGA. In this section, these algorithms, hybrid logic, and dimensionality reduction are described, and reasons for preferences in our study are explained along with supporting literature.

The K-means++ algorithm is a successful clustering algorithm, inspired by K-means, that has been used in studies across broadly different domains. For example, Zhang and Hepner have clustered a geographic area in a phenology study [13]; Sharma et al. have clustered satellite images in an astronomy study [14]; Dzikrullah et al. have clustered a passenger dataset in a study of public transportation systems [15]; Nirmala and Veni have used K-means++ to obtain an efficient hybrid method in an optimization study [16]; and lastly, Wang et al. have clustered a microcomputed tomography dataset in a medical study [17].

K-means++ is consistent in that it returns the same pattern at each run. In addition, the K-means++ algorithm submits related clusters for all instances and offers the advantage of starting the cluster analysis with a good initial set of centers [18]. Conversely, it suffers from poor running performance in determining this initial set of centers because, to find good initial centroids, it must perform *k* passes over the data. It was therefore necessary to improve K-means++ for use on large datasets, leading to the development of a more efficient parallel version of K-means++ [19]. Another study [20] addressed the problem by using a sorted dataset, which is claimed to decrease the running time. The literature also includes many studies on enhancing the accuracy or the performance of K-means++, but none on visualizing a 2D map for the clusters of this successful algorithm. Therefore, in this study, a novel approach to visualizing K-means++ clusters on a 2D map is detailed.
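The initialization that gives K-means++ its good starting centers is the D² seeding step, which also explains the *k* passes over the data mentioned above. A minimal sketch (illustrative only; production implementations such as the parallel version cited above add further optimizations):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """D^2 seeding: the first center is drawn uniformly at random;
    each later center is drawn with probability proportional to the
    squared distance from a point to its nearest chosen center."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):  # one pass over the data per extra center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

Because far-away points receive high selection probability, the initial centers tend to land in distinct dense regions, which standard K-means then refines.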

SOM is both a clustering and a mapping algorithm, used as a visualization tool for exploratory data in different domains owing to its mapping ability [21]. Each cluster in SOM is illustrated as a neuron, and after the training process in an artificial neural network (ANN) structure, each neuron has *X* and *Y* values as a position on a map. In addition, all clusters in a SOM map are neighboring [22]. Nanda et al. used SOM for hydrological analysis [23]; Chu et al. used SOM for their climate study [24]; Voutilainen et al. clustered a gerontological medical dataset using SOM [25]; Kanzaki et al. used SOM in their radiation study to analyze the liver damage from radon, X-rays, or alcohol treatments in mice [26]; and Tsai et al. have clustered a dataset about water and fish species in an ecohydrological environment study [27].

Although SOM produces a map in which each neuron represents a cluster and all clusters are neighboring [21, 22], the map should not position all returned centroids adjacent to each other at 1-unit distances; in fact, some of these clusters must be located in farther cells of the map. In this study, a novel approach to visualizing SOM mappings in 2D while practically retaining the original distances among clusters is detailed, using SOM++, the fast version of SOM. In SOM++, the initialization step of K-means++ is used to find the initial centroids of SOM [28].
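The neighborhood mechanism that makes SOM clusters neighboring can be sketched as a single training step. This is a simplified illustration; the Gaussian neighborhood function and the grid shape are common choices, not necessarily those used in the study:

```python
import numpy as np

def som_update(weights, x, winner, lr, sigma):
    """One SOM training step on a grid of neurons: every neuron's
    weight vector is pulled toward input x, scaled by a Gaussian
    neighborhood of the winning neuron.
    weights has shape (rows, cols, features)."""
    rows, cols, _ = weights.shape
    for i in range(rows):
        for j in range(cols):
            d2 = (i - winner[0]) ** 2 + (j - winner[1]) ** 2  # grid distance
            h = np.exp(-d2 / (2 * sigma ** 2))                # neighborhood strength
            weights[i, j] += lr * h * (x - weights[i, j])
    return weights
```

Because every neuron moves toward each input, with strength decaying over grid distance, adjacent map cells end up representing similar clusters, which is exactly the contiguity drawback discussed above when the map size is small.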

PCA is another popular machine learning method, used for feature selection and dimensionality reduction. Owing to its versatility, this algorithm is also used across different domains: Viani et al. used PCA to analyze channel state information (CSI) for the wireless detection of passive targets [29]; Tiwari et al. analyzed solar-based organic Rankine cycles for optimization using PCA [30]; Hamill et al. used PCA to sort multivariate chemistry datasets [31]; Ishiyama et al. analyzed a cytomegalovirus dataset in a medical study [32]; and Halai et al. used PCA to analyze a neuropsychological model in a psychology study [33]. PCA is used to eliminate some features of multivariate datasets before machine learning analysis, as the dataset may be large and contain features that make analysis inefficient [34]. These unnecessary features may be identified by PCA for subsequent removal, resulting in a new dataset with new values and fewer features [2]. Essentially, this algorithm is not a clustering algorithm; however, PCA is relevant owing to its dimensionality reduction capacity, which is utilized in our study to obtain a new 2D dataset that is then clustered using the traditional K-means++ approach.
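The 2D reduction by PCA used for comparison can be sketched via a singular value decomposition of the centered data. This is a minimal illustration under standard PCA definitions, not the exact implementation used in the study:

```python
import numpy as np

def pca_2d(X):
    """Reduce a dataset to two dimensions with PCA: center the data
    and project it onto the two leading right-singular vectors
    (the directions of greatest variance)."""
    X = np.asarray(X, float)
    Xc = X - X.mean(axis=0)                       # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                          # scores on the top two components
```

Note that both output columns are linear combinations of all original features, which illustrates the drawback discussed above: the most valuable feature no longer survives with its original values.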

QGA is a heuristic optimization algorithm based on quantum computing concepts, in which the smallest unit storing information is the quantum bit (qubit). A qubit may store a superposition of the binary values "1" and "0," significantly decreasing the running time in the determination of an optimized result [35, 36]. This contemporary optimization algorithm is in widespread use. Silveira et al. used QGA to implement a novel approach for ordering optimization problems [37]; Chen et al. used QGA for a path planning problem [38]; Guan and Lin implemented a system to obtain a structural optimal design for ships using QGA [39]; Ning et al. used QGA to solve a "job shop scheduling problem" in their study [40]; and Konar et al. implemented a novel QGA as a hybrid quantum-inspired genetic algorithm to solve the problem of scheduling real-time tasks in multiprocessor systems [41]. DRV-P-MC, the focus of interest in our study, is also an optimization problem, so this algorithm's efficient run-time performance is employed to determine suitable cluster positions on a 2D map.
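The qubit encoding and the rotation-gate update at the core of a QGA can be sketched as follows. This is a simplified illustration of the general quantum-inspired scheme, not the paper's specific operator set; practical QGAs bound the rotation angle with a lookup table, which is omitted here:

```python
import math, random

def observe(chromosome):
    """Collapse a qubit-encoded chromosome to a binary solution.
    Each gene is an amplitude pair (alpha, beta) with
    alpha^2 + beta^2 = 1; measurement yields 1 with probability beta^2."""
    return [1 if random.random() < beta ** 2 else 0 for _, beta in chromosome]

def rotate(chromosome, best, angle=0.01 * math.pi):
    """Rotation-gate update: nudge each qubit's amplitudes toward the
    best observed solution, the QGA counterpart of crossover/mutation."""
    out = []
    for (alpha, beta), bit in zip(chromosome, best):
        theta = angle if bit == 1 else -angle
        out.append((alpha * math.cos(theta) - beta * math.sin(theta),
                    alpha * math.sin(theta) + beta * math.cos(theta)))
    return out
```

Repeated observe-evaluate-rotate cycles concentrate the amplitudes around good solutions, which is what makes the search converge quickly on placement problems such as DRV-P-MC.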

Hybrid approaches combine efficient parts of certain algorithms into new wholes, enhancing the accuracy and the efficiency of these algorithms, or producing novel algorithms. In literature, there are many successful hybrid data-mining approaches: Kumar et al. implemented a highly accurate optimization algorithm from the combination of a genetic algorithm with fuzzy logic and ANN [42]; Singhal and Ashraf implemented a high-performance classification algorithm from the combination of a decision tree and a genetic algorithm [43]; Hassan and Verma collected successful high-accuracy hybrid data-mining applications for the medical domain in their study [44]; Thamilselvan and Sathiaseelan reviewed hybrid data-mining algorithms for image classification [45]; Athiyaman et al. implemented a high-accuracy approach combination of association rule mining algorithms and clustering algorithms for meteorological datasets [46]; Sahay et al. proposed a high-performance hybrid data-mining approach combining apriori and K-means algorithms for cloud computing [47]; Yu et al. obtained a novel solution selection strategy using hybrid clustering algorithms [48]; Sitek and Wikarek implemented a hybrid framework for solving optimization problems and constraint satisfaction by using constraint logic programming, constraint programming, and mathematical programming [49]; Abdel-Maksoud et al. proposed a hybrid clustering technique combining K-means and fuzzy C-means algorithms to detect brain tumours with high accuracy and performance [50]; Zhu et al. implemented a novel high-performance hybrid approach containing hierarchical clustering algorithms for the structure of wireless networks [51]; Rahman and Islam combined K-means and a genetic algorithm to obtain a novel high-performance genetic algorithm [52]; and Jagtap proposed a high-accuracy technique to diagnose heart disease by combining Naïve Bayes, Multilayer Perceptron, C4.5 as a decision tree algorithm, and linear regression [53]. 
What we can infer from a detailed examination of these studies is that K-means and genetic algorithms, and their variants, can be adapted to other algorithms to implement a hybrid approach successfully. Moreover, the combination of K-means and genetic algorithms creates an extremely efficient and highly accurate algorithm.

In data analysis, unnecessary features cause two main problems in performance and accuracy. If a dataset is large or has insignificant features, a downscaling process should be performed by a dimensionality reduction operation to enable efficient use of the analysis algorithms. In literature, many techniques related to dimensionality reduction are presented. For example, Dash et al. claimed that using PCA for dimensionality reduction causes a drawback in understanding the dataset owing to the creation of new features with new values. Furthermore, they posit that the most effective attributes are damaged. Therefore, they presented a novel approach based on an entropy measure for dimensionality reduction [54]. Bingham and Mannila used a random projection (RP) method instead of PCA for the dimensionality reduction of image and text datasets, singular value decomposition (SVD), latent semantic indexing (LSI), and discrete cosine transform, claiming that RP offers simpler calculation than the other methods and has low error rates [55]. Goh and Vidal have used *k*-nearest neighbor and K-means to obtain a novel method for clustering and dimensionality reduction on Riemannian manifolds [56]; Napoleon and Pavalakodi implemented a new technique using PCA and K-means for dimensionality reduction on high-dimensional datasets [57]; Samudrala et al. implemented a parallel framework to reduce the dimensions of large-scale datasets by using PCA [58]; Cunningham and Byron used PCA, factor analysis (FA), Gaussian process factor analysis, latent linear dynamical systems, and latent nonlinear dynamical systems for the dimensional reduction of human neuronal data [59]; and Demarchi et al. reduced the dimensions of the APEX (airborne prism experiment) dataset using the auto-associative neural network approach and the BandClust algorithm [60]. Boutsidis et al. 
implemented two different dimensional reduction approaches for K-means clustering: the first based on RP and the second based on SVD [61]. Azar and Hassanien proposed a neurofuzzy classifier method based on ANN and fuzzy logic for the dimensional reduction of medical big datasets [62]; Cohen et al. implemented a method using RP and SVD for the dimensional reduction in K-means clustering [63]; Cunningham and Ghahramani discussed PCA, FA, linear regression, Fisher’s linear discriminant analysis, linear multidimensional scaling, canonical correlations analysis, slow feature analysis, undercomplete independent component analysis, sufficient dimensionality reduction, distance metric learning, and maximum autocorrelation factors in their survey article and observed that, in particular, PCA was used and evaluated in many studies as a highly accurate analysis [1]; and Zhao et al. used 2D-PCA and 2D locality preserving projection for the 2D dimensionality reduction in their study [64]. Sharifzadeh et al. improved a PCA method as sparse supervised principal component analysis (SSPCA) to adapt PCA for dimensionality reduction of supervised datasets, claiming that the addition of the target attribute made the feature selection and dimensional reduction operations more successful [65].

These studies show PCA to be the most-used method for dimensionality reduction despite reported disadvantages including the creation of new features, which may hamper the understanding of the dataset, changing the values in the most important and efficient features, and complex calculation and low performance for big datasets.

In other clustering studies, Yu et al. proposed distribution-based distance functions, used to measure the similarity between two sets of Gaussian distributions, along with distribution-based cluster structure selection; additionally, they implemented a framework to determine the unified cluster structure from multiple cluster structures in the data used in their study [66]. In another study, Yu and Wong designed a quantization-driven clustering approach to obtain classes for many instances; moreover, they proposed two different methods to improve the performance of their approach, the shrinking process and the hierarchical structure process [67]. Wang et al. proposed a local gravitation model and implemented two novel measures to discover more information among instances: a local gravitation clustering algorithm for clustering and evaluating the effectiveness of the model, and communication with local agents to attain satisfactory clustering patterns using only one parameter [68]. Yu et al. designed a framework known as the double affinity propagation driven cluster for clustering noisy instances, integrating multiple distance functions to avoid the noise involved in using a single distance function [69].

#### 4. The Reorganized Traditional Approaches

This paper considers two usages of K-means++ clustering: the traditional usage and a weighted usage. After normalization techniques are used to balance the dataset features, traditional K-means++ clustering is performed. K-means++ adds a preprocessing step to standard K-means that discovers the initial centroids; after this initialization, K-means clustering runs using the initial centroid values. K-means++ is expressed as Algorithm 1.
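The normalization step that balances the dataset features before clustering can be sketched as min-max scaling. This is one common choice for illustration; as noted earlier, the tool supports selecting among several normalization types:

```python
import numpy as np

def min_max_normalize(X):
    """Scale every feature into [0, 1] so that no single feature
    dominates the distance computations during clustering.
    Constant features are mapped to 0 to avoid division by zero."""
    X = np.asarray(X, float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard constant columns
    return (X - lo) / span
```

After this scaling, every feature contributes comparably to the Euclidean or Manhattan distances used by K-means++.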