Mathematical Problems in Engineering

Volume 2017 (2017), Article ID 3696850, 11 pages

https://doi.org/10.1155/2017/3696850

## A Clustering Method for Data in Cylindrical Coordinates

^{1}University of Electro-Communications, 1-5-1 Chofugaoka, Chofu, Tokyo 182-8585, Japan

^{2}Department of Computer and Information Engineering, National Institute of Technology, Tsuyama College, 654-1 Numa, Tsuyama, Okayama 708-8506, Japan

Correspondence should be addressed to Kazuhisa Fujita

Received 7 April 2017; Revised 18 July 2017; Accepted 16 August 2017; Published 27 September 2017

Academic Editor: Ivan Giorgio

Copyright © 2017 Kazuhisa Fujita. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

We propose a new clustering method for data in cylindrical coordinates based on the k-means. The goal of the k-means family is to maximize an optimization function, which requires a similarity measure. Thus, we need a new similarity to obtain the new clustering method for data in cylindrical coordinates. In this study, we first derive a new similarity for the new clustering method by assuming a particular probabilistic model. A data point in cylindrical coordinates has a radius, an azimuth, and a height. We assume that the azimuth is sampled from a von Mises distribution and that the radius and the height are independently generated from isotropic Gaussian distributions. We derive the new similarity from the log likelihood of the assumed probability distribution. Our experiments demonstrate that the proposed method using the new similarity can appropriately partition synthetic data defined in cylindrical coordinates. Furthermore, we apply the proposed method to color image quantization and show that the method successfully quantizes a color image with respect to the hue element.

#### 1. Introduction

Clustering is an important technique in many areas such as data analysis, data visualization, image processing, and pattern recognition. The most popular and useful clustering method is the k-means. The k-means uses the Euclidean distance as its dissimilarity measure and partitions data into k clusters. The Euclidean distance is a reasonable measure for data sampled from an isotropic Gaussian distribution. However, we cannot always obtain a good clustering result using the k-means, because not all data distributions are isotropic Gaussian distributions.

The present study focuses on data in cylindrical coordinates. Data in cylindrical coordinates have a periodic element, so clustering methods using the Euclidean distance will lead to an improper analysis of the data. Furthermore, a clustering method using the Euclidean distance may not be able to extract meaningful centroids. For example, if a distribution in cylindrical coordinates has a markedly curved crescent shape, the centroid of the distribution calculated by the k-means may not lie on the data distribution. However, there are no clustering methods optimized for data in cylindrical coordinates.

Cylindrical data are found in many fields such as image processing, meteorology, and biology. Movements of plants and animals, and wind direction paired with another environmental measure, are typical examples of cylindrical data [1]. The most popular example of data in cylindrical coordinates is color defined in the HSV color model. An HSV color has three attributes: hue (direction), saturation (radius), and value, meaning brightness (height). The HSV color model can represent hue information and has a more natural correspondence to human vision than the RGB color model [2]. A clustering method for cylindrical coordinates is therefore useful in many fields, especially image processing.

The purpose of this study is to develop a new clustering method for data in cylindrical coordinates based on the k-means. We first derive a new similarity for clustering data in cylindrical coordinates, assuming that the data are sampled from a probabilistic model that is the product of a von Mises distribution and Gaussian distributions. We propose a new clustering method with this new similarity for data in cylindrical coordinates. Using numerical experiments, we demonstrate that the proposed method can partition synthetic data. Furthermore, we evaluate the performance of the proposed method on real-world data. Finally, we apply the proposed method to color image quantization and demonstrate that it can quantize a color image according to the hue.

#### 2. Related Works

The most commonly used clustering method is the k-means [3], which is one of the top 10 most common algorithms used in data mining [4]. The k-means has been applied to various fields because it is fast, simple, and easy to understand. It uses the Euclidean distance as its clustering criterion and assumes that the data are sampled from a mixture of isotropic Gaussian distributions. Thus, we can apply the k-means to data sampled from a mixture of isotropic Gaussian distributions, but the k-means is not appropriate for data generated from other distributions. Data in cylindrical coordinates have periodic characteristics, so the k-means will be inappropriate as a clustering method for such data.

We can cluster periodic data distributed on the surface of a hypersphere using the spherical k-means (sk-means). Dhillon and Modha [5] and Banerjee et al. [6, 7] developed the sk-means for clustering high-dimensional text data. It is a k-means based method that uses the cosine similarity as the criterion for clustering. The sk-means assumes that the data are sampled from a mixture of von Mises-Fisher distributions with the same concentration parameter and the same mixture weights. However, we cannot apply the sk-means to data that have direction, radius, and height. To appropriately partition these data, we need a different nonlinear separation method.

There are many methods for achieving nonlinear separation. One method is the kernel k-means [8], which partitions the data points in a higher-dimensional feature space after they are mapped to that space using a nonlinear function. Spectral clustering [9] is another popular modern nonlinear clustering method, which uses the eigenvectors of a similarity (kernel) matrix to partition data points. Support vector clustering [10] is inspired by the support vector machine [11]. These kernel-based nonlinear clustering methods can provide reasonable clustering results for non-Gaussian data. However, they can hardly provide meaningful statistics because they perform the clustering in a feature space. This is a problem when we also want to determine features of the data, as in color image quantization. Furthermore, we must experimentally select the optimal kernel function and its parameters.

Clustering methods are frequently used for color image quantization. Color image quantization reduces the number of colors in an image and plays an important role in applications such as image segmentation [12], image compression [13], and color feature extraction [14]. A color quantization technique consists of two stages: the palette design stage and the pixel mapping stage. These stages can be regarded, respectively, as calculating the centroids and assigning a data point to a cluster. Many researchers have developed color quantization methods, including median cut [15], the k-means [16], the fuzzy c-means [17, 18], self-organizing maps [19–21], and particle swarm optimization [22]. However, color quantization is generally performed in the RGB color space; the HSV color space is rarely adopted.

#### 3. Methodology

##### 3.1. Assumed Probabilistic Distribution

A data point in cylindrical coordinates, $\mathbf{x}$, is represented by $\mathbf{x} = (r, \boldsymbol{\phi}, h)$ with $r \ge 0$ and $\|\boldsymbol{\phi}\| = 1$, where $r$, $\boldsymbol{\phi}$, and $h$ are called the radius, azimuth, and height, respectively. In this study, we represent the azimuth as a unit vector $\boldsymbol{\phi} = (\cos\theta, \sin\theta)^{\mathsf{T}}$ to simply calculate the cosine similarity. Here, each element of $\mathbf{x}$ is assumed to be independent and identically distributed. Let a data point in cylindrical coordinates, $\mathbf{x}$, be generated by a probability density function (pdf) of the form
$$p(\mathbf{x}) = V(\boldsymbol{\phi} \mid \boldsymbol{\mu}_\phi, \kappa)\, G(r \mid \mu_r, \sigma_r^2)\, G(h \mid \mu_h, \sigma_h^2),$$
where $V$ and $G$ are pdfs of a von Mises distribution and an isotropic Gaussian distribution, respectively. A pdf of a von Mises distribution has the form
$$V(\boldsymbol{\phi} \mid \boldsymbol{\mu}_\phi, \kappa) = \frac{1}{2\pi I_0(\kappa)} \exp\left(\kappa\, \boldsymbol{\mu}_\phi^{\mathsf{T}} \boldsymbol{\phi}\right),$$
where $\boldsymbol{\mu}_\phi$ is the mean of the azimuth with $\|\boldsymbol{\mu}_\phi\| = 1$, $\kappa$ is the concentration parameter, and $I_0$ is the modified Bessel function of the first kind (order 0). A pdf of an isotropic Gaussian distribution has the form
$$G(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right),$$
where $\mu$ is the mean and $\sigma^2$ is the variance. $\mu_r$ and $\mu_h$ are the means of the radius and height, respectively. $\sigma_r^2$ and $\sigma_h^2$ are the variances of the radius and height, respectively. Thus, the density can be written as
$$p(\mathbf{x}) = \frac{1}{(2\pi)^2 I_0(\kappa)\, \sigma_r \sigma_h} \exp\left(\kappa\, \boldsymbol{\mu}_\phi^{\mathsf{T}} \boldsymbol{\phi} - \frac{(r - \mu_r)^2}{2\sigma_r^2} - \frac{(h - \mu_h)^2}{2\sigma_h^2}\right).$$
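The assumed density is straightforward to evaluate numerically. The following is a minimal Python sketch, not code from the paper: the function and parameter names (`von_mises_pdf`, `mu_theta`, and so on) are our own, and the modified Bessel function is approximated by its power series.

```python
import math

def i0(kappa, terms=30):
    """Series approximation of the modified Bessel function I0(kappa)."""
    total, term = 1.0, 1.0
    for m in range(1, terms):
        term *= (kappa / (2.0 * m)) ** 2
        total += term
    return total

def von_mises_pdf(phi, mu, kappa):
    """von Mises density; phi and mu are unit vectors (cos, sin),
    so their inner product is the cosine of the angular difference."""
    dot = mu[0] * phi[0] + mu[1] * phi[1]
    return math.exp(kappa * dot) / (2.0 * math.pi * i0(kappa))

def gauss_pdf(x, mu, var):
    """One-dimensional Gaussian density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def cylindrical_pdf(r, theta, h, params):
    """Joint density: von Mises over azimuth times independent
    Gaussians over radius and height."""
    phi = (math.cos(theta), math.sin(theta))
    mu_phi = (math.cos(params["mu_theta"]), math.sin(params["mu_theta"]))
    return (von_mises_pdf(phi, mu_phi, params["kappa"])
            * gauss_pdf(r, params["mu_r"], params["var_r"])
            * gauss_pdf(h, params["mu_h"], params["var_h"]))
```

With $\kappa = 0$ the von Mises factor reduces to the uniform circular density $1/(2\pi)$, which is a convenient sanity check.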

We can estimate the parameters of density $p$ using maximum likelihood estimation. Let the data set $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ be generated from density $p$. The log likelihood function of $X$ is
$$\ln L = \sum_{i=1}^{N} \left[\kappa\, \boldsymbol{\mu}_\phi^{\mathsf{T}} \boldsymbol{\phi}_i - \frac{(r_i - \mu_r)^2}{2\sigma_r^2} - \frac{(h_i - \mu_h)^2}{2\sigma_h^2}\right] - N \ln\left((2\pi)^2 I_0(\kappa)\, \sigma_r \sigma_h\right).$$
Maximizing this equation subject to $\|\boldsymbol{\mu}_\phi\| = 1$, we find the maximum likelihood estimates $\hat{\boldsymbol{\mu}}_\phi$, $\hat{\kappa}$, $\hat{\mu}_r$, $\hat{\sigma}_r^2$, $\hat{\mu}_h$, and $\hat{\sigma}_h^2$ obtained from
$$\hat{\boldsymbol{\mu}}_\phi = \frac{\sum_{i} \boldsymbol{\phi}_i}{\left\|\sum_{i} \boldsymbol{\phi}_i\right\|}, \qquad \frac{I_1(\hat{\kappa})}{I_0(\hat{\kappa})} = \bar{R}, \qquad \hat{\mu}_r = \frac{1}{N} \sum_{i} r_i, \qquad \hat{\sigma}_r^2 = \frac{1}{N} \sum_{i} (r_i - \hat{\mu}_r)^2,$$
$$\hat{\mu}_h = \frac{1}{N} \sum_{i} h_i, \qquad \hat{\sigma}_h^2 = \frac{1}{N} \sum_{i} (h_i - \hat{\mu}_h)^2,$$
where $\bar{R}$ is the mean resultant length
$$\bar{R} = \frac{1}{N} \left\|\sum_{i=1}^{N} \boldsymbol{\phi}_i\right\|.$$
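The closed-form estimates above can be sketched in Python. This helper is our own, assuming data points are `(radius, azimuth, height)` tuples; the concentration $\kappa$ has no closed form and is handled separately.

```python
import math

def estimate_parameters(points):
    """Maximum likelihood estimates for one cluster of
    (radius, azimuth, height) tuples.  Returns the mean direction,
    the mean resultant length r_bar (used later to estimate kappa),
    and the Gaussian means and variances of radius and height."""
    n = len(points)
    # Mean direction: normalize the sum of the azimuth unit vectors.
    sx = sum(math.cos(t) for _, t, _ in points)
    sy = sum(math.sin(t) for _, t, _ in points)
    mu_theta = math.atan2(sy, sx)
    r_bar = math.hypot(sx, sy) / n          # mean resultant length in [0, 1]
    # Radius and height: ordinary Gaussian ML estimates.
    mu_r = sum(r for r, _, _ in points) / n
    mu_h = sum(h for _, _, h in points) / n
    var_r = sum((r - mu_r) ** 2 for r, _, _ in points) / n
    var_h = sum((h - mu_h) ** 2 for _, _, h in points) / n
    return {"mu_theta": mu_theta, "r_bar": r_bar,
            "mu_r": mu_r, "var_r": var_r, "mu_h": mu_h, "var_h": var_h}
```

When all azimuths coincide, `r_bar` equals 1, reflecting maximal concentration of the directions.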

It is difficult to estimate the concentration parameter $\kappa$, because an analytic solution cannot be obtained from the maximum likelihood estimate; we can only calculate the ratio of the Bessel functions, $I_1(\kappa)/I_0(\kappa)$. We approximate $\kappa$ using the numerical method proposed by Sra [23], because it produces the most accurate estimates of $\kappa$ compared to other methods.

We estimate $\hat{\kappa}$ using the recursive function
$$\kappa_{t+1} = \kappa_t - \frac{A(\kappa_t) - \bar{R}}{1 - A(\kappa_t)^2 - A(\kappa_t)/\kappa_t}, \qquad A(\kappa) = \frac{I_1(\kappa)}{I_0(\kappa)},$$
where $t$ is the iteration number. The recursive calculation terminates when $|\kappa_{t+1} - \kappa_t| < \epsilon$ for a small threshold $\epsilon$. We calculate the initial value $\kappa_0$ using the method proposed by Banerjee et al. [6]; $\kappa_0$ is
$$\kappa_0 = \frac{\bar{R}\,(2 - \bar{R}^2)}{1 - \bar{R}^2}.$$
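This estimation can be sketched as follows. The code is our own illustration, with the Bessel ratio $A(\kappa) = I_1(\kappa)/I_0(\kappa)$ computed from truncated power series; the iteration is the Newton refinement of Banerjee's closed-form starting point.

```python
import math

def bessel_ratio(kappa, terms=40):
    """A(kappa) = I1(kappa) / I0(kappa) via series expansions of the
    modified Bessel functions of the first kind."""
    i0, i1 = 1.0, kappa / 2.0
    t0, t1 = 1.0, kappa / 2.0
    for m in range(1, terms):
        t0 *= (kappa / (2.0 * m)) ** 2          # I0 term recurrence
        t1 *= kappa ** 2 / (4.0 * m * (m + 1))  # I1 term recurrence
        i0 += t0
        i1 += t1
    return i1 / i0

def estimate_kappa(r_bar, eps=1e-9, max_iter=100):
    """Estimate the concentration from the mean resultant length r_bar:
    Banerjee's closed-form initial value, then Newton iterations solving
    A(kappa) = r_bar, using A'(kappa) = 1 - A^2 - A/kappa."""
    kappa = r_bar * (2.0 - r_bar ** 2) / (1.0 - r_bar ** 2)
    for _ in range(max_iter):
        a = bessel_ratio(kappa)
        new_kappa = kappa - (a - r_bar) / (1.0 - a ** 2 - a / kappa)
        if abs(new_kappa - kappa) < eps:
            return new_kappa
        kappa = new_kappa
    return kappa
```

A quick check is that the returned $\kappa$ satisfies $A(\kappa) \approx \bar{R}$, which is the defining maximum likelihood condition.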

##### 3.2. Cylindrical k-Means

The k-means family uses a particular similarity to decide whether a data point belongs to a cluster. The Euclidean distance (a dissimilarity) is most frequently used by the k-means family and can be derived from the log likelihood of an isotropic Gaussian distribution. Therefore, the k-means using the Euclidean distance will be able to appropriately partition data sampled from isotropic Gaussian distributions, but not data from other distributions. We must develop a new similarity for data in cylindrical coordinates because the k-means family clusters by maximizing the sum of similarities between the centroid of a cluster and the data points that belong to that cluster. In this study, we obtain the optimal similarity for partitioning data in cylindrical coordinates from an assumed pdf.

First, to develop a k-means based method for data in cylindrical coordinates (cylindrical k-means; cyk-means), we obtain a new similarity measure for data in cylindrical coordinates by assuming a probability distribution. Assume that a data point $\mathbf{x}$ in a cluster that has a centroid $\mathbf{c} = (\mu_r, \boldsymbol{\mu}_\phi, \mu_h)$ is sampled from the probability distribution denoted by (4). The natural logarithm of $p(\mathbf{x})$ is
$$\ln p(\mathbf{x}) = \kappa\, \boldsymbol{\mu}_\phi^{\mathsf{T}} \boldsymbol{\phi} - \frac{(r - \mu_r)^2}{2\sigma_r^2} - \frac{(h - \mu_h)^2}{2\sigma_h^2} + C,$$
where $C$ is a normalizing constant given by
$$C = -\ln\left((2\pi)^2 I_0(\kappa)\, \sigma_r \sigma_h\right).$$
Here, we ignore the normalizing constant to obtain
$$s(\mathbf{x}, \mathbf{c}) = \kappa\, \boldsymbol{\mu}_\phi^{\mathsf{T}} \boldsymbol{\phi} - \frac{(r - \mu_r)^2}{2\sigma_r^2} - \frac{(h - \mu_h)^2}{2\sigma_h^2}.$$
In this study, this equation is used as the similarity for the cyk-means. $s(\mathbf{x}, \mathbf{c})$ denotes the similarity between the data point $\mathbf{x}$ and the centroid $\mathbf{c}$. The terms of $s$ consist of the cosine similarity and the Euclidean similarities, and the new similarity is a weighted sum of these similarities. The weights indicate the concentrations of the distributions. This similarity can also be considered a simplified log likelihood.
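The similarity is cheap to compute. A minimal sketch in Python (our own names, assuming a point tuple `(r, theta, h)` and a cluster dictionary): since the azimuths are unit vectors, the inner product $\boldsymbol{\mu}_\phi^{\mathsf{T}} \boldsymbol{\phi}$ equals $\cos(\theta - \mu_\theta)$.

```python
import math

def similarity(point, cluster):
    """Simplified log likelihood used as the cyk-means similarity:
    a kappa-weighted cosine term for the azimuth plus variance-weighted
    negative squared-error terms for the radius and height."""
    r, theta, h = point
    cos_term = math.cos(theta - cluster["mu_theta"])  # mu_phi . phi
    return (cluster["kappa"] * cos_term
            - (r - cluster["mu_r"]) ** 2 / (2.0 * cluster["var_r"])
            - (h - cluster["mu_h"]) ** 2 / (2.0 * cluster["var_h"]))
```

A point located exactly at the centroid attains the maximum value $\kappa$, since the cosine term is 1 and both squared-error terms vanish.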

The cyk-means partitions data points in cylindrical coordinates into $K$ clusters using the same procedure as the k-means. Let $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ be a set of data points in cylindrical coordinates. Let $\mathbf{c}_j$ be the centroid of the $j$th cluster. Using the similarity $s$, the objective function is
$$J = \sum_{i=1}^{N} \sum_{j=1}^{K} b_{ij}\, s(\mathbf{x}_i, \mathbf{c}_j),$$
where $b_{ij}$ is a binary indicator value. If the $i$th data point belongs to the $j$th cluster, $b_{ij} = 1$; otherwise, $b_{ij} = 0$. The aim of the cyk-means is to maximize the objective function $J$. The process to maximize the objective function is the same as that of the k-means and is described as follows.

(1) Fix $K$ and initialize the centroids.
(2) Assign each data point to the cluster that has the most similar centroid.
(3) Estimate the parameters of the clusters.
(4) Return to Step (2) if the cluster assignment of data points changes or the difference in the values of the objective function from the current and last iteration is more than a threshold $\epsilon$; otherwise, terminate the procedure.

In this study, we use $|J_t - J_{t-1}| < \epsilon$ as the termination condition, where $J_t$ is the objective function of the $t$th iteration. Algorithm 1 shows the details of the algorithm of the cyk-means. From (6), the elements of the centroid vector $\mathbf{c}_j$ of the $j$th cluster are
$$\boldsymbol{\mu}_{\phi j} = \frac{\sum_{i} b_{ij} \boldsymbol{\phi}_i}{\left\|\sum_{i} b_{ij} \boldsymbol{\phi}_i\right\|}, \qquad \mu_{rj} = \frac{1}{N_j} \sum_{i} b_{ij} r_i, \qquad \mu_{hj} = \frac{1}{N_j} \sum_{i} b_{ij} h_i,$$
where $N_j = \sum_{i} b_{ij}$ is the number of data points in the $j$th cluster. The other values used to calculate the objective function are
$$\sigma_{rj}^2 = \frac{1}{N_j} \sum_{i} b_{ij} (r_i - \mu_{rj})^2, \qquad \sigma_{hj}^2 = \frac{1}{N_j} \sum_{i} b_{ij} (h_i - \mu_{hj})^2,$$
and $\kappa_j$ is approximated by Sra's method using the mean resultant length $\bar{R}_j = \frac{1}{N_j} \left\|\sum_{i} b_{ij} \boldsymbol{\phi}_i\right\|$.
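The whole procedure above can be sketched as follows. This is our own illustrative implementation, not the paper's code: initialization is by random labels, $\kappa$ uses only Banerjee's closed-form approximation rather than the full Sra refinement, empty clusters are re-seeded, and small guards avoid degenerate variances.

```python
import math
import random

def fit_cluster(members):
    """ML estimates for one cluster of (radius, azimuth, height) tuples.
    kappa uses Banerjee's closed-form approximation only."""
    n = len(members)
    sx = sum(math.cos(t) for _, t, _ in members)
    sy = sum(math.sin(t) for _, t, _ in members)
    r_bar = min(math.hypot(sx, sy) / n, 0.999999)  # guard against r_bar == 1
    mu_r = sum(r for r, _, _ in members) / n
    mu_h = sum(h for _, _, h in members) / n
    return {"mu_theta": math.atan2(sy, sx),
            "kappa": r_bar * (2.0 - r_bar ** 2) / (1.0 - r_bar ** 2),
            "mu_r": mu_r,
            "var_r": max(sum((r - mu_r) ** 2 for r, _, _ in members) / n, 1e-6),
            "mu_h": mu_h,
            "var_h": max(sum((h - mu_h) ** 2 for _, _, h in members) / n, 1e-6)}

def similarity(point, cluster):
    """Simplified log likelihood: weighted cosine plus Gaussian terms."""
    r, theta, h = point
    return (cluster["kappa"] * math.cos(theta - cluster["mu_theta"])
            - (r - cluster["mu_r"]) ** 2 / (2.0 * cluster["var_r"])
            - (h - cluster["mu_h"]) ** 2 / (2.0 * cluster["var_h"]))

def cyk_means(points, k, max_iter=100, seed=0):
    """Alternate assignment and parameter estimation until the
    assignment of points to clusters stops changing."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]
    clusters = []
    for _ in range(max_iter):
        clusters = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if not members:                 # re-seed an empty cluster
                members = [rng.choice(points)]
            clusters.append(fit_cluster(members))
        new_labels = [max(range(k), key=lambda j: similarity(p, clusters[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, clusters
```

On termination, each point's label is the index of the centroid with the highest similarity, which is the fixed-point property the assignment step enforces.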