Abstract

We propose a new clustering method for data in cylindrical coordinates based on the k-means. The goal of the k-means family is to maximize an objective function, which requires a similarity. Thus, we need a new similarity to obtain a clustering method for data in cylindrical coordinates. In this study, we first derive this similarity by assuming a particular probabilistic model. A data point in cylindrical coordinates has a radius, an azimuth, and a height. We assume that the azimuth is sampled from a von Mises distribution and that the radius and the height are independently generated from isotropic Gaussian distributions. We derive the new similarity from the log likelihood of the assumed probability distribution. Our experiments demonstrate that the proposed method using the new similarity can appropriately partition synthetic data defined in cylindrical coordinates. Furthermore, we apply the proposed method to color image quantization and show that the method successfully quantizes a color image with respect to the hue element.

1. Introduction

Clustering is an important technique in many areas such as data analysis, data visualization, image processing, and pattern recognition. The most popular and useful clustering method is the k-means. The k-means uses the Euclidean distance as its dissimilarity measure and partitions data into k clusters. The Euclidean distance is a reasonable measure for data sampled from an isotropic Gaussian distribution. However, the k-means does not always yield a good clustering result because not all data follow isotropic Gaussian distributions.

The present study focuses on data in cylindrical coordinates. Data in cylindrical coordinates have a periodic element, so clustering methods using the Euclidean distance will lead to an improper analysis of the data. Furthermore, a clustering method using the Euclidean distance may not be able to extract meaningful centroids. For example, if a distribution in cylindrical coordinates has a strongly curved crescent shape, the centroid calculated by the k-means may not lie on the data distribution. However, there are no clustering methods optimized for data in cylindrical coordinates.
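The problem with Euclidean averaging of a periodic quantity can be seen in a small example. The sketch below is illustrative only; the function names and the two sample angles are assumptions, not taken from the experiments.

```python
import math

def arithmetic_mean(angles_deg):
    """Ordinary (Euclidean) mean of the angle values."""
    return sum(angles_deg) / len(angles_deg)

def circular_mean(angles_deg):
    """Mean direction: angle of the summed unit vectors, in [0, 360)."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0

# Two directions that lie 30 degrees apart, on either side of 0 degrees.
angles = [350.0, 20.0]
print(arithmetic_mean(angles))  # 185.0, which points away from both samples
print(circular_mean(angles))    # approximately 5.0, which lies between them
```

The arithmetic mean ignores that 350 and 20 degrees are neighbors on the circle, which is the kind of improper centroid described above.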

Cylindrical data are found in many fields such as image processing, meteorology, and biology. Movements of plants and animals and wind direction paired with another environmental measure are typical examples of cylindrical data [1]. The most popular example of data in cylindrical coordinates is color defined in the HSV color model. An HSV color has three attributes: hue (direction), saturation (radius), and value, which means brightness (height). The HSV color model can represent hue information and has a more natural correspondence to human vision than the RGB color model [2]. A clustering method for cylindrical coordinates is therefore useful in many fields, especially image processing.

The purpose of this study is to develop a new clustering method for data in cylindrical coordinates based on the k-means. We first derive a new similarity for clustering data in cylindrical coordinates, assuming that the data are sampled from a probabilistic model that is the product of a von Mises distribution and Gaussian distributions. We then propose a new clustering method that uses this similarity. Using numerical experiments, we demonstrate that the proposed method can partition synthetic data. Furthermore, we evaluate the performance of the proposed method on real world data. Finally, we apply the proposed method to color image quantization and demonstrate that it can quantize a color image according to the hue.

The most commonly used clustering method is the k-means [3], which is one of the top 10 most common algorithms used in data mining [4]. The k-means has been applied in various fields because it is fast, simple, and easy to understand. It uses the Euclidean distance as a clustering criterion and assumes that the data are sampled from a mixture of isotropic Gaussian distributions. Thus, the k-means suits data sampled from a mixture of isotropic Gaussian distributions but is not appropriate for data generated from other distributions. Data in cylindrical coordinates have periodic characteristics, so the k-means is inappropriate as a clustering method for such data.

We can cluster periodic data distributed on the surface of a hypersphere using the spherical k-means (sk-means). Dhillon and Modha [5] and Banerjee et al. [6, 7] have developed the sk-means for clustering high dimensional text data. It is a k-means based method that uses cosine similarity as the criterion for clustering. The sk-means assumes that the data are sampled from a mixture of von Mises-Fisher distributions with the same concentration parameters and the same mixture weights. However, we cannot apply the sk-means to data that have direction, radius, and height. To appropriately partition these data, we need a different nonlinear separation method.

There are many methods for achieving nonlinear separation. One method is the kernel k-means [8], which partitions the data points in a higher-dimensional feature space after they are mapped to the feature space using a nonlinear function. Spectral clustering [9] is another popular modern nonlinear clustering method, which uses the eigenvectors of a similarity (kernel) matrix to partition data points. Support vector clustering [10] is inspired by the support vector machine [11]. These nonlinear clustering methods based on the kernel method can provide reasonable clustering results for non-Gaussian data. However, these methods can hardly provide meaningful statistics because they perform the clustering in a feature space. This is a problem when we also want to determine some features of the data, as in color image quantization. Furthermore, we must experimentally select the kernel function and its parameters.

Clustering methods are frequently used for color image quantization. Color image quantization reduces the number of colors in an image and plays an important role in applications such as image segmentation [12], image compression [13], and color feature extraction [14]. A color quantization technique consists of two stages: the palette design stage and the pixel mapping stage. These stages can be regarded, respectively, as calculating the centroids and assigning a data point to a cluster. Many researchers have developed color quantization methods including median cut [15], the k-means [16], the fuzzy k-means [17, 18], the self-organizing maps [19–21], and particle swarm optimization [22]. However, color quantization is generally performed in the RGB color space, and the HSV color space is rarely adopted.

3. Methodology

3.1. Assumed Probabilistic Distribution

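A data point in cylindrical coordinates, $x$, is represented by $x = (r, \theta, z)$ with $r \ge 0$ and $0 \le \theta < 2\pi$, where $r$, $\theta$, and $z$ are called the radius, azimuth, and height, respectively. In this study, we represent the azimuth as a unit vector $\mathbf{u} = (\cos\theta, \sin\theta)^{\top}$ so that the cosine similarity is simple to calculate. Here, the elements of $x$ are assumed to be independent. Let a data point in cylindrical coordinates, $x$, be generated by a probability density function (pdf) of the form

$$f(x \mid \Theta) = f_{\mathrm{vM}}(\mathbf{u} \mid \boldsymbol{\mu}, \kappa)\, f_{\mathrm{G}}(r \mid m_r, \sigma_r^2)\, f_{\mathrm{G}}(z \mid m_z, \sigma_z^2), \tag{1}$$

where $\Theta = \{\boldsymbol{\mu}, \kappa, m_r, \sigma_r^2, m_z, \sigma_z^2\}$ and $f_{\mathrm{vM}}$ and $f_{\mathrm{G}}$ are pdfs of a von Mises distribution and an isotropic Gaussian distribution, respectively. A pdf of a von Mises distribution has the form

$$f_{\mathrm{vM}}(\mathbf{u} \mid \boldsymbol{\mu}, \kappa) = \frac{1}{2\pi I_0(\kappa)} \exp\left(\kappa\, \boldsymbol{\mu}^{\top} \mathbf{u}\right), \tag{2}$$

where $\boldsymbol{\mu}$ is the mean of the azimuth with $\|\boldsymbol{\mu}\| = 1$, $\kappa$ is the concentration parameter, and $I_0$ is the modified Bessel function of the first kind (order 0). A pdf of an isotropic Gaussian distribution has the form

$$f_{\mathrm{G}}(y \mid m, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - m)^2}{2\sigma^2}\right), \tag{3}$$

where $m$ is the mean and $\sigma^2$ is the variance. $m_r$ and $m_z$ are the means of the radius and height, respectively. $\sigma_r^2$ and $\sigma_z^2$ are the variances of the radius and height, respectively. Thus, the density can be written as

$$f(x \mid \Theta) = \frac{1}{(2\pi)^2 I_0(\kappa)\, \sigma_r \sigma_z} \exp\left(\kappa\, \boldsymbol{\mu}^{\top} \mathbf{u} - \frac{(r - m_r)^2}{2\sigma_r^2} - \frac{(z - m_z)^2}{2\sigma_z^2}\right). \tag{4}$$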
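As a rough illustration of the assumed model, the log density of a point can be evaluated as the sum of a von Mises term for the azimuth and two Gaussian terms for the radius and height. This is a minimal sketch; the function name, the argument names, and the parameter values are assumptions made for the example.

```python
import numpy as np

def log_density(r, theta, z, mu_theta, kappa, m_r, var_r, m_z, var_z):
    """Log pdf of the assumed model: von Mises azimuth, Gaussian radius and height."""
    # For unit vectors, the inner product of mean direction and azimuth
    # equals cos(theta - mu_theta).
    log_vm = kappa * np.cos(theta - mu_theta) - np.log(2.0 * np.pi * np.i0(kappa))
    log_g_r = -0.5 * np.log(2.0 * np.pi * var_r) - (r - m_r) ** 2 / (2.0 * var_r)
    log_g_z = -0.5 * np.log(2.0 * np.pi * var_z) - (z - m_z) ** 2 / (2.0 * var_z)
    return log_vm + log_g_r + log_g_z

# Hypothetical parameter values, for illustration only.
value = log_density(r=1.0, theta=0.2, z=0.5, mu_theta=0.0, kappa=4.0,
                    m_r=1.0, var_r=0.04, m_z=0.5, var_z=0.09)
print(value)
```

Here `np.i0` is NumPy's modified Bessel function of the first kind of order 0. For very large concentration values it overflows, so a scaled Bessel function would be the safer choice in that regime.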

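We can estimate the parameters of the density using maximum likelihood estimation. Let a data set $X = \{x_1, \ldots, x_n\}$, with $x_i = (r_i, \theta_i, z_i)$ and azimuth unit vectors $\mathbf{u}_i = (\cos\theta_i, \sin\theta_i)^{\top}$, be generated from the density $f(x \mid \Theta)$. The log likelihood function of $X$ is

$$\ln L(\Theta \mid X) = -n \ln\left((2\pi)^2 I_0(\kappa)\, \sigma_r \sigma_z\right) + \kappa\, \boldsymbol{\mu}^{\top} \sum_{i=1}^{n} \mathbf{u}_i - \sum_{i=1}^{n} \frac{(r_i - m_r)^2}{2\sigma_r^2} - \sum_{i=1}^{n} \frac{(z_i - m_z)^2}{2\sigma_z^2}. \tag{5}$$

Maximizing this equation subject to $\|\boldsymbol{\mu}\| = 1$, we find the maximum likelihood estimates $\hat{\boldsymbol{\mu}}$, $\hat{\kappa}$, $\hat{m}_r$, $\hat{m}_z$, $\hat{\sigma}_r^2$, and $\hat{\sigma}_z^2$ obtained from

$$\hat{\boldsymbol{\mu}} = \frac{\sum_{i} \mathbf{u}_i}{\left\|\sum_{i} \mathbf{u}_i\right\|}, \quad \frac{I_1(\hat{\kappa})}{I_0(\hat{\kappa})} = \bar{r}, \quad \hat{m}_r = \frac{1}{n} \sum_{i} r_i, \quad \hat{m}_z = \frac{1}{n} \sum_{i} z_i, \quad \hat{\sigma}_r^2 = \frac{1}{n} \sum_{i} (r_i - \hat{m}_r)^2, \quad \hat{\sigma}_z^2 = \frac{1}{n} \sum_{i} (z_i - \hat{m}_z)^2, \tag{6}$$

where $\bar{r}$ is

$$\bar{r} = \frac{\left\|\sum_{i=1}^{n} \mathbf{u}_i\right\|}{n}. \tag{7}$$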
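The closed-form part of these estimates (everything except the concentration parameter) can be sketched in a few lines. The function name, the dictionary keys, and the sample parameters below are assumptions chosen for the illustration.

```python
import numpy as np

def fit_cylindrical(r, theta, z):
    """Closed-form maximum likelihood estimates for one cluster of cylindrical data."""
    u = np.column_stack([np.cos(theta), np.sin(theta)])  # azimuths as unit vectors
    resultant = u.sum(axis=0)
    length = np.linalg.norm(resultant)
    return {
        "mu": resultant / length,        # mean direction (unit vector)
        "r_bar": length / len(theta),    # mean resultant length, used to estimate kappa
        "m_r": r.mean(), "m_z": z.mean(),
        "var_r": r.var(), "var_z": z.var(),  # ML variances (divide by n)
    }

# Hypothetical sample: mean azimuth 1.0 rad, concentration 5, Gaussian radius and height.
rng = np.random.default_rng(0)
theta = rng.vonmises(1.0, 5.0, size=5000)
r = rng.normal(2.0, 0.3, size=5000)
z = rng.normal(-1.0, 0.5, size=5000)
est = fit_cylindrical(r, theta, z)
mean_angle = np.arctan2(est["mu"][1], est["mu"][0])
print(mean_angle, est["r_bar"], est["m_r"], est["var_r"])
```

The mean direction is undefined when the unit vectors cancel exactly, so a practical implementation would guard against a zero resultant.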

It is difficult to estimate the concentration parameter $\kappa$ because the maximum likelihood estimate has no analytic solution; we can only calculate the ratio of the Bessel functions, $I_1(\kappa)/I_0(\kappa)$. We approximate $\kappa$ using the numerical method proposed by Sra [23], because it produces the most accurate estimates for $\kappa$ (compared to other methods).

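We estimate $\kappa$ using the recursive function

$$\kappa_{t+1} = \kappa_t - \frac{A(\kappa_t) - \bar{r}}{1 - A(\kappa_t)^2 - A(\kappa_t)/\kappa_t}, \tag{8}$$

where $A(\kappa) = I_1(\kappa)/I_0(\kappa)$ is the ratio of the Bessel functions, $\bar{r}$ is the mean resultant length of the azimuth unit vectors, and $t$ is the iteration number. The recursive calculations terminate when $|\kappa_{t+1} - \kappa_t| < \varepsilon$. In this study, . We calculate $\kappa_0$ using the method proposed by Banerjee et al. [6]. $\kappa_0$ is

$$\kappa_0 = \frac{\bar{r}\,(2 - \bar{r}^2)}{1 - \bar{r}^2}. \tag{9}$$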
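The iteration can be sketched with the standard library only. The following is a sketch under stated assumptions: it uses the standard two-dimensional forms of the closed-form starting value and the Newton-type update, it computes the Bessel ratio from the integral representation on a uniform grid, and the threshold `eps` and the function names are hypothetical.

```python
import math

def bessel_ratio(kappa, m=2048):
    """A(kappa) = I_1(kappa) / I_0(kappa) from the integral form, on a uniform grid."""
    num = den = 0.0
    for i in range(m):
        c = math.cos(2.0 * math.pi * i / m)
        w = math.exp(kappa * (c - 1.0))  # the common scaling cancels in the ratio
        num += w * c
        den += w
    return num / den

def estimate_kappa(r_bar, eps=1e-8, max_iter=100):
    """Estimate the concentration from the mean resultant length, 0 < r_bar < 1."""
    kappa = r_bar * (2.0 - r_bar ** 2) / (1.0 - r_bar ** 2)  # closed-form start
    for _ in range(max_iter):
        a = bessel_ratio(kappa)
        new = kappa - (a - r_bar) / (1.0 - a * a - a / kappa)  # Newton-type update
        if abs(new - kappa) < eps:
            return new
        kappa = new
    return kappa

# Round trip: recover a known concentration from its own Bessel ratio.
print(estimate_kappa(bessel_ratio(3.0)))  # close to 3.0
```

With SciPy available, the ratio could instead be taken from the scaled Bessel functions `scipy.special.i1e` and `scipy.special.i0e`, which avoids the explicit sum.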

3.2. Cylindrical k-Means

The k-means family uses a particular similarity to decide whether a data point belongs to a cluster. The Euclidean distance (a dissimilarity) is the measure most frequently used by the k-means family and, moreover, is derived from the log likelihood of an isotropic Gaussian distribution. Therefore, the k-means using the Euclidean distance can appropriately partition data sampled from isotropic Gaussian distributions but not from other distributions. We must develop a new similarity for data in cylindrical coordinates because the k-means family clusters by maximizing the sum of similarities between the centroid of a cluster and the data points that belong to the cluster. In this study, we obtain the optimal similarity for partitioning data in cylindrical coordinates from an assumed pdf.

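First, to develop a k-means based method for data in cylindrical coordinates (cylindrical k-means; cyk-means), we obtain a new similarity measure for data in cylindrical coordinates by assuming a probability distribution. Assume that a data point $x$ in a cluster that has a centroid $\mathbf{m}$ is sampled from the probability distribution denoted by (4), where $\mathbf{m} = (\boldsymbol{\mu}, m_r, m_z)$. With $\mathbf{u} = (\cos\theta, \sin\theta)^{\top}$ the unit vector of the azimuth $\theta$, $r$ the radius, and $z$ the height of $x$, the natural logarithm of $f(x \mid \Theta)$ is

$$\ln f(x \mid \Theta) = \kappa\, \boldsymbol{\mu}^{\top} \mathbf{u} - \frac{(r - m_r)^2}{2\sigma_r^2} - \frac{(z - m_z)^2}{2\sigma_z^2} - C, \tag{10}$$

where $C$ is a normalizing constant given by

$$C = \ln\left((2\pi)^2 I_0(\kappa)\, \sigma_r \sigma_z\right). \tag{11}$$

Here, we ignore the normalizing constant to obtain

$$S(x, \mathbf{m}) = \kappa\, \boldsymbol{\mu}^{\top} \mathbf{u} - \frac{(r - m_r)^2}{2\sigma_r^2} - \frac{(z - m_z)^2}{2\sigma_z^2}. \tag{12}$$

In this study, this equation is used as the similarity for the cyk-means. $S(x, \mathbf{m})$ denotes the similarity between the data point $x$ and the centroid $\mathbf{m}$. The terms in (12) consist of the cosine similarity and the Euclidean similarities, and the new similarity is a weighted sum of these similarities. The weights indicate the concentrations of the distributions. This similarity can also be considered as a simplified log likelihood.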
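A small sketch of this similarity shows how the cosine term handles the periodicity of the azimuth. The function name, the tuple layout (radius, azimuth in radians, height), and the weights are assumptions for the example.

```python
import numpy as np

def similarity(x, centroid, kappa, var_r, var_z):
    """Weighted sum of a cosine similarity (azimuth) and negative squared distances."""
    r, theta, z = x
    m_r, m_theta, m_z = centroid
    return (kappa * np.cos(theta - m_theta)
            - (r - m_r) ** 2 / (2.0 * var_r)
            - (z - m_z) ** 2 / (2.0 * var_z))

# A point at 350 degrees is 20 degrees from a centroid at 10 degrees,
# but 50 degrees from a centroid at 300 degrees.
x = (1.0, np.deg2rad(350.0), 0.5)
c_near = (1.0, np.deg2rad(10.0), 0.5)
c_far = (1.0, np.deg2rad(300.0), 0.5)
s_near = similarity(x, c_near, kappa=2.0, var_r=0.1, var_z=0.1)
s_far = similarity(x, c_far, kappa=2.0, var_r=0.1, var_z=0.1)
print(s_near > s_far)  # True
```

A plain difference of angle values (340 versus 50) would have preferred the wrong centroid, whereas the cosine term ranks them by angular separation.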

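The cyk-means partitions data points in cylindrical coordinates into $k$ clusters using the same procedure as the k-means. Let $X = \{x_1, \ldots, x_n\}$ be a set of data points in cylindrical coordinates. Let $\mathbf{m}_j$ be the centroid of the $j$th cluster. Using the similarity $S$, the objective function is

$$J = \sum_{i=1}^{n} \sum_{j=1}^{k} \gamma_{ij}\, S(x_i, \mathbf{m}_j), \tag{13}$$

where $\gamma_{ij}$ is a binary indicator value. If the $i$th data point belongs to the $j$th cluster, $\gamma_{ij} = 1$. Otherwise, $\gamma_{ij} = 0$. The aim of the cyk-means is to maximize the objective function $J$. The process to maximize the objective function is the same as that of the k-means and is described as follows.

(1) Fix $k$ and initialize the parameters of the clusters.
(2) Assign each data point to the cluster that has the most similar centroid.
(3) Estimate the parameters of the clusters.
(4) Return to Step 2 if the cluster assignment of data points changes or the difference between the values of the objective function in the current and last iterations is more than a threshold $\varepsilon$. Otherwise, terminate the procedure.

In this study, we use , where $J_t$ is the value of the objective function at the $t$th iteration. Algorithm 1 shows the details of the cyk-means. From (6), the elements of the centroid vector, $\mathbf{m}_j = (\boldsymbol{\mu}_j, m_{r,j}, m_{z,j})$, of the $j$th cluster are

$$\boldsymbol{\mu}_j = \frac{\sum_{i} \gamma_{ij} \mathbf{u}_i}{\left\|\sum_{i} \gamma_{ij} \mathbf{u}_i\right\|}, \quad m_{r,j} = \frac{1}{n_j} \sum_{i} \gamma_{ij} r_i, \quad m_{z,j} = \frac{1}{n_j} \sum_{i} \gamma_{ij} z_i, \tag{15}$$

where $\mathbf{u}_i$, $r_i$, and $z_i$ are the azimuth unit vector, radius, and height of $x_i$, and $n_j$ is the number of data points in the $j$th cluster (which has the form $n_j = \sum_{i} \gamma_{ij}$). The other values used to calculate the objective function are

$$\sigma_{r,j}^2 = \frac{1}{n_j} \sum_{i} \gamma_{ij} (r_i - m_{r,j})^2, \quad \sigma_{z,j}^2 = \frac{1}{n_j} \sum_{i} \gamma_{ij} (z_i - m_{z,j})^2, \tag{16}$$

and $\kappa_j$, which is approximated by Sra's method using the ratio of the Bessel functions $I_1(\kappa_j)/I_0(\kappa_j) = \bar{r}_j = \left\|\sum_{i} \gamma_{ij} \mathbf{u}_i\right\| / n_j$.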

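Algorithm 1: The cyk-means.
Input: Set $X$ of data points in cylindrical coordinates
Output: A clustering of $X$
(1) Initialize $\boldsymbol{\mu}_j$, $m_{r,j}$, $m_{z,j}$, $\kappa_j$, $\sigma_{r,j}^2$, $\sigma_{z,j}^2$ for $j = 1, \ldots, k$
(2) repeat
(3)   Set $C_j \leftarrow \emptyset$ for $j = 1, \ldots, k$
(4)   Assign data points to clusters
(5)   for $i = 1$ to $n$ do
(6)     $j^{*} \leftarrow \arg\max_{j} S(x_i, \mathbf{m}_j)$
(7)     $C_{j^{*}} \leftarrow C_{j^{*}} \cup \{x_i\}$
(8)   end for
(9)   Estimate parameters
(10)  for $j = 1$ to $k$ do
(11)    $n_j \leftarrow |C_j|$
(12)    $\boldsymbol{\mu}_j \leftarrow \sum_{x_i \in C_j} \mathbf{u}_i / \|\sum_{x_i \in C_j} \mathbf{u}_i\|$
(13)    $m_{r,j} \leftarrow \frac{1}{n_j} \sum_{x_i \in C_j} r_i$
(14)    $m_{z,j} \leftarrow \frac{1}{n_j} \sum_{x_i \in C_j} z_i$
(15)    $\sigma_{r,j}^2 \leftarrow \frac{1}{n_j} \sum_{x_i \in C_j} (r_i - m_{r,j})^2$
(16)    $\sigma_{z,j}^2 \leftarrow \frac{1}{n_j} \sum_{x_i \in C_j} (z_i - m_{z,j})^2$
(17)    $\bar{r}_j \leftarrow \|\sum_{x_i \in C_j} \mathbf{u}_i\| / n_j$
(18)    $\kappa_j \leftarrow \bar{r}_j (2 - \bar{r}_j^2) / (1 - \bar{r}_j^2)$
(19)    repeat
(20)      $\kappa_j \leftarrow \kappa_j - (A(\kappa_j) - \bar{r}_j) / (1 - A(\kappa_j)^2 - A(\kappa_j)/\kappa_j)$
(21)    until convergence
(22)  end for
(23) until convergence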
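The procedure can be sketched in NumPy as follows. This is a minimal sketch, not the implementation used in the experiments: the initialization from data points, the relative stopping threshold `tol`, the clamping of the mean resultant length, the variance floors, and the handling of near-empty clusters are all assumptions, as are the function and argument names.

```python
import numpy as np

def bessel_ratio(kappa, m=1024):
    """A(kappa) = I_1(kappa) / I_0(kappa) from the integral form on a uniform grid."""
    c = np.cos(2.0 * np.pi * np.arange(m) / m)
    w = np.exp(kappa * (c - 1.0))  # the common scaling cancels in the ratio
    return float((w * c).sum() / w.sum())

def estimate_kappa(r_bar, eps=1e-6, max_iter=50):
    """Closed-form starting value followed by Newton-type updates."""
    r_bar = min(max(r_bar, 1e-3), 0.999)  # assumption: clamp to keep updates finite
    kappa = r_bar * (2.0 - r_bar ** 2) / (1.0 - r_bar ** 2)
    for _ in range(max_iter):
        a = bessel_ratio(kappa)
        new = max(kappa - (a - r_bar) / (1.0 - a * a - a / kappa), 1e-6)
        if abs(new - kappa) < eps:
            return new
        kappa = new
    return kappa

def cyk_means(r, theta, z, k, init_idx=None, n_iter=100, tol=1e-6, seed=0):
    r, theta, z = (np.asarray(a, dtype=float) for a in (r, theta, z))
    n = len(r)
    u = np.column_stack([np.cos(theta), np.sin(theta)])  # azimuths as unit vectors
    if init_idx is None:  # assumption: seed the centroids with random data points
        init_idx = np.random.default_rng(seed).choice(n, size=k, replace=False)
    init_idx = np.asarray(init_idx)
    mu, m_r, m_z = u[init_idx].copy(), r[init_idx].copy(), z[init_idx].copy()
    kappa = np.ones(k)
    var_r = np.full(k, max(r.var(), 1e-6))
    var_z = np.full(k, max(z.var(), 1e-6))
    labels, prev_obj = np.full(n, -1), -np.inf
    for _ in range(n_iter):
        # Similarity of every point to every centroid, shape (n, k).
        sim = (kappa * (u @ mu.T)
               - (r[:, None] - m_r) ** 2 / (2.0 * var_r)
               - (z[:, None] - m_z) ** 2 / (2.0 * var_z))
        new_labels = sim.argmax(axis=1)
        obj = sim[np.arange(n), new_labels].sum()
        if np.array_equal(new_labels, labels) and abs(obj - prev_obj) <= tol * abs(obj):
            break
        labels, prev_obj = new_labels, obj
        for j in range(k):
            members = labels == j
            n_j = int(members.sum())
            resultant = u[members].sum(axis=0)
            length = np.linalg.norm(resultant)
            if n_j < 2 or length == 0.0:  # dead or degenerate unit: keep old values
                continue
            mu[j] = resultant / length
            m_r[j], m_z[j] = r[members].mean(), z[members].mean()
            var_r[j] = max(r[members].var(), 1e-6)
            var_z[j] = max(z[members].var(), 1e-6)
            kappa[j] = estimate_kappa(length / n_j)
    return labels, {"mu": mu, "m_r": m_r, "m_z": m_z,
                    "kappa": kappa, "var_r": var_r, "var_z": var_z}

# Two deterministic, well-separated toy clusters (hypothetical data).
t = np.linspace(-1.0, 1.0, 50)
theta = np.concatenate([0.3 * t, np.pi + 0.3 * t])
r = np.concatenate([1.0 + 0.1 * np.cos(7 * t), 3.0 + 0.1 * np.cos(7 * t)])
z = np.concatenate([0.1 * np.sin(5 * t), 2.0 + 0.1 * np.sin(5 * t)])
labels, params = cyk_means(r, theta, z, k=2, init_idx=[0, 50])
print(labels)
```

Passing `init_idx` makes the run reproducible; with random seeding, the result depends on the initial centroids, as with the k-means.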

The cyk-means has many parameters. The k-means for data in three-dimensional Cartesian coordinates has only $3k$ parameters, which is the product of the number of centroid vectors and the number of dimensions. In contrast, the cyk-means has $7k$ parameters, which is the product of the number of clusters and the number of parameters per cluster. The parameters of the $j$th cluster are $\kappa_j$, $\boldsymbol{\mu}_j$ (two dimensions), $m_{r,j}$, $m_{z,j}$, $\sigma_{r,j}$, and $\sigma_{z,j}$. Because the cyk-means has more degrees of freedom, the dead unit problem (i.e., empty clusters) will occur frequently if the initial parameters are not well chosen.

3.3. Fixed cyk-Means

Model based clustering methods have various problems such as the dead unit and initial value problems. One reason for this is that the log likelihood can have many local optima [9]. If a model has more parameters, these problems tend to be more frequent. In the fixed cyk-means, the concentration parameter $\kappa$ and the variances $\sigma_r^2$ and $\sigma_z^2$ are fixed at particular values. As a consequence, the fixed cyk-means has $4k$ parameters. Fixing the parameters decreases the complexity of the model and makes these problems less frequent. Algorithm 2 shows the fixed cyk-means algorithm.

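Algorithm 2: The fixed cyk-means.
Input: Set $X$ of data points in cylindrical coordinates
Output: A clustering of $X$
(1) Initialize $\boldsymbol{\mu}_j$, $m_{r,j}$, $m_{z,j}$ for $j = 1, \ldots, k$
(2) repeat
(3)   Set $C_j \leftarrow \emptyset$ for $j = 1, \ldots, k$
(4)   Assign data points to clusters
(5)   for $i = 1$ to $n$ do
(6)     $j^{*} \leftarrow \arg\max_{j} S(x_i, \mathbf{m}_j)$
(7)     $C_{j^{*}} \leftarrow C_{j^{*}} \cup \{x_i\}$
(8)   end for
(9)   Estimate parameters
(10)  for $j = 1$ to $k$ do
(11)    $n_j \leftarrow |C_j|$
(12)    $\boldsymbol{\mu}_j \leftarrow \sum_{x_i \in C_j} \mathbf{u}_i / \|\sum_{x_i \in C_j} \mathbf{u}_i\|$
(13)    $m_{r,j} \leftarrow \frac{1}{n_j} \sum_{x_i \in C_j} r_i$
(14)    $m_{z,j} \leftarrow \frac{1}{n_j} \sum_{x_i \in C_j} z_i$
(15)  end for
(16) until convergence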
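A corresponding sketch of the fixed variant is shorter because only the centroids are re-estimated. As before, this is an illustration under assumptions: the fixed values of `kappa`, `var_r`, and `var_z` below are hypothetical, and the loop stops when the assignment no longer changes (with fixed weights, an unchanged assignment also leaves the objective unchanged).

```python
import numpy as np

def fixed_cyk_means(r, theta, z, k, kappa, var_r, var_z,
                    init_idx=None, n_iter=100, seed=0):
    r, theta, z = (np.asarray(a, dtype=float) for a in (r, theta, z))
    n = len(r)
    u = np.column_stack([np.cos(theta), np.sin(theta)])  # azimuths as unit vectors
    if init_idx is None:  # assumption: seed the centroids with random data points
        init_idx = np.random.default_rng(seed).choice(n, size=k, replace=False)
    init_idx = np.asarray(init_idx)
    mu, m_r, m_z = u[init_idx].copy(), r[init_idx].copy(), z[init_idx].copy()
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # Same similarity as the cyk-means, but with shared, fixed weights.
        sim = (kappa * (u @ mu.T)
               - (r[:, None] - m_r) ** 2 / (2.0 * var_r)
               - (z[:, None] - m_z) ** 2 / (2.0 * var_z))
        new_labels = sim.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            members = labels == j
            if not members.any():  # dead unit: keep the old centroid
                continue
            resultant = u[members].sum(axis=0)
            length = np.linalg.norm(resultant)
            if length > 0.0:
                mu[j] = resultant / length
            m_r[j], m_z[j] = r[members].mean(), z[members].mean()
    return labels, mu, m_r, m_z

# Two deterministic, well-separated toy clusters (hypothetical data and weights).
t = np.linspace(-1.0, 1.0, 50)
theta = np.concatenate([0.3 * t, np.pi + 0.3 * t])
r = np.concatenate([1.0 + 0.1 * np.cos(7 * t), 3.0 + 0.1 * np.cos(7 * t)])
z = np.concatenate([0.1 * np.sin(5 * t), 2.0 + 0.1 * np.sin(5 * t)])
labels, mu, m_r, m_z = fixed_cyk_means(r, theta, z, k=2, kappa=2.0,
                                       var_r=0.5, var_z=0.5, init_idx=[0, 50])
print(labels)
```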
3.4. Computational Complexity

Assigning data points to clusters has a complexity of per iteration. We must estimate six parameters. We obtain three s, two s, and in time per iteration, where is the convergence time of . Therefore, the total computational complexity per iteration is . The complexity of the fixed cyk-means is per iteration, so the cyk-means is approximately 1.5 times as complex as the fixed cyk-means.

4. Experimental Results

In our experiments, we use Python and its libraries (NumPy, SciPy, and scikit-learn) to implement the proposed method.

4.1. Synthetic Data

In this subsection, we demonstrate that the cyk-means and the fixed cyk-means can partition synthetic data defined in cylindrical coordinates. The dataset used in this experiment has three clusters, as shown in Figure 1(a). The data points in each cluster are generated from the probability distribution denoted by (4), with the parameters shown in Table 1. Figures 1(b), 1(c), and 1(d) show the clustering results of the cyk-means, the fixed cyk-means with , , and , and the k-means, respectively. We can see that the cyk-means and the fixed cyk-means properly partition the dataset into its clusters. On the other hand, the k-means regards the two upper right clusters as one cluster and fails to partition the dataset. Table 2 shows the parameters estimated by the cyk-means, the fixed cyk-means, and the k-means. Only the cyk-means can estimate the concentration parameters and the variances, and its estimates of these values are close to the true values. The cyk-means also gives the most accurate estimate of the number of data points in each cluster. The fixed cyk-means gives the most accurate estimates of all the means, and the estimates of the cyk-means are also close. These results show that both the cyk-means and the fixed cyk-means estimate all the means with sufficient accuracy.

In the next experiment, we examine the effectiveness of the proposed methods (the cyk-means and the fixed cyk-means with , , and ) compared to the k-means and the kernel k-means with a radial basis function. The parameter of the radial basis function is . The synthetic data have clusters and are defined in cylindrical coordinates. The number of data points in each cluster is 200. The mean azimuth of the th cluster is a random number in . The concentration parameter is a random number in . The mean radius of the th cluster is a random number in . The mean height of the th cluster is a random number in . The standard deviations of and are random numbers in .
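Data of this kind can be generated with NumPy's random generator, which provides a von Mises sampler. The sketch below is illustrative: the 200 points per cluster follow the description above, but every parameter range in `make_dataset` is a hypothetical placeholder rather than the range used in the experiment, and the function names are assumptions.

```python
import numpy as np

def sample_cluster(rng, n, mu_theta, kappa, m_r, sd_r, m_z, sd_z):
    """Draw n points (radius, azimuth, height) from one cluster of the assumed model."""
    theta = rng.vonmises(mu_theta, kappa, size=n) % (2.0 * np.pi)  # wrap to [0, 2*pi)
    r = rng.normal(m_r, sd_r, size=n)
    z = rng.normal(m_z, sd_z, size=n)
    return np.column_stack([r, theta, z])

def make_dataset(k, n_per_cluster=200, seed=0):
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for j in range(k):
        # Hypothetical ranges, chosen only so that the example runs.
        params = dict(
            mu_theta=rng.uniform(0.0, 2.0 * np.pi),
            kappa=rng.uniform(5.0, 50.0),
            m_r=rng.uniform(1.0, 3.0), sd_r=rng.uniform(0.05, 0.2),
            m_z=rng.uniform(0.0, 3.0), sd_z=rng.uniform(0.05, 0.2),
        )
        points.append(sample_cluster(rng, n_per_cluster, **params))
        labels.append(np.full(n_per_cluster, j))
    return np.vstack(points), np.concatenate(labels)

X, y = make_dataset(k=4)
print(X.shape, np.bincount(y))
```

Because the radius is drawn from a Gaussian, a small mean radius combined with a large standard deviation can produce negative radii; the placeholder ranges above keep the mean well away from zero.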

Figure 2 shows the relationship between the number of clusters and the adjusted Rand index (ARI). The ARI evaluates the performance of clustering algorithms [24]. When ARI = 1, all data points belong to their true clusters. The figure shows that the cyk-means has the largest ARI in almost all cases. The fixed cyk-means performs better than the kernel k-means and the k-means. The k-means performs the worst. In conclusion, the cyk-means most accurately partitions synthetic data defined in cylindrical coordinates, and the fixed cyk-means also performs well.
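For reference, the ARI can be computed from the contingency counts of two labelings. The sketch below uses only the standard library and assumes at least two data points; in practice, `sklearn.metrics.adjusted_rand_score` from scikit-learn computes the same quantity.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index of two labelings of the same data points."""
    n = len(labels_true)
    pairs = Counter(zip(labels_true, labels_pred))  # contingency table cells
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in rows.values())
    sum_b = sum(comb(c, 2) for c in cols.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# The index is invariant to how the clusters are numbered.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
```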

4.2. Real World Data

We show the performance of the proposed methods on the iris dataset (http://mlearn.ics.uci.edu/databases/iris/) and the segmentation benchmark dataset (http://www.ntu.edu.sg/home/asjfcai/Benchmark_Website/benchmark_index.html) [25]. The iris dataset has 150 data points from three classes of irises. Each data point consists of four attributes: sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm. The segmentation benchmark dataset consists of 100 images from the Berkeley segmentation database [26] and ground truths generated by manual labeling.

Table 3 shows the ARI scores of the cyk-means, the fixed cyk-means, the k-means, and the kernel k-means for the iris dataset. The parameters of the fixed cyk-means are , , and . The parameter of the radial basis function of the kernel k-means is 0.01. In this experiment, we use only three attributes of the iris dataset because the proposed methods are specialized for 3-dimensional data. Furthermore, we transform this three-attribute dataset into a zero-mean dataset. In all cases, the performance of the cyk-means is lower than that of the other methods. Conversely, in almost all cases, the performance of the fixed cyk-means is the best. However, the differences in performance among the fixed cyk-means, the k-means, and the kernel k-means are not large.

Table 4 shows the ARI scores of the cyk-means, the fixed cyk-means, and the k-means for seven images in the segmentation benchmark dataset. The parameters of the fixed cyk-means are , , and . To evaluate the performance of the cyk-means and the fixed cyk-means, we convert images from RGB color to HSV color. When we cluster the dataset with the k-means, we use images represented in both RGB color and HSV color. In this experiment, we compare each clustering result with a ground truth using the ARI score. We set the number of clusters to the number of segments in the ground truth. In all cases, the fixed cyk-means stably shows good performance. The cyk-means performs either much better or much worse than the other methods; in other words, its performance is unstable. This instability is likely caused by the cyk-means being more easily trapped in a local optimum because it has more parameters.

4.3. Application to Color Image Quantization

We apply the cyk-means and the fixed cyk-means to color image quantization and compare the results to those of the k-means. For the proposed methods, we convert images from the RGB color space to the HSV color space before quantization, whereas an image processed by the k-means is represented in RGB. Figure 3 contains the four test images from the Berkeley segmentation database [26] and their quantization results. The original color images have sizes of or and are used as the test images to quantize into three colors. These quantization results are generated by the cyk-means, the fixed cyk-means with , , and , and the k-means. The color of a pixel in the quantized image represents the value of the centroid of the cluster that contains the pixel.
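The conversion between RGB pixels and cylindrical data points can be sketched with Python's standard `colorsys` module, using the correspondence of hue to azimuth, saturation to radius, and value to height. The function names, the 0 to 255 RGB range, and the clipping of centroid values are assumptions made for this illustration.

```python
import colorsys
import math

def rgb_to_cylindrical(rgb):
    """(R, G, B) in 0..255 -> (radius=saturation, azimuth=hue in radians, height=value)."""
    h, s, v = colorsys.rgb_to_hsv(*(c / 255.0 for c in rgb))
    return s, 2.0 * math.pi * h, v

def cylindrical_to_rgb(radius, azimuth, height):
    """Map a centroid back to an (R, G, B) palette color, clipping to the valid range."""
    h = (azimuth % (2.0 * math.pi)) / (2.0 * math.pi)
    s = min(max(radius, 0.0), 1.0)
    v = min(max(height, 0.0), 1.0)
    return tuple(round(255 * c) for c in colorsys.hsv_to_rgb(h, s, v))

def quantize(labels, centroids):
    """Pixel mapping: replace each pixel by the color of its cluster centroid."""
    palette = [cylindrical_to_rgb(*c) for c in centroids]
    return [palette[j] for j in labels]

print(rgb_to_cylindrical((255, 0, 0)))      # (1.0, 0.0, 1.0)
print(cylindrical_to_rgb(1.0, 0.0, 1.0))    # (255, 0, 0)
```

If the mean azimuth of a centroid is stored as a unit vector, its angle can be recovered with `math.atan2` before calling `cylindrical_to_rgb`.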

For image 118035 in Figure 3, the colors of the background, the wall, and the roof are obviously different from each other. The cyk-means and the fixed cyk-means successfully segment this image, whereas the k-means extracts the shade from the wall and cannot merge the wall into one color. Furthermore, the quantization results of the cyk-means and the fixed cyk-means are very similar.

Image 26098 consists of red and green peppers on a display table. The cyk-means merges the red peppers and the planks of the display table and divides the dark area into two colors. The fixed cyk-means successfully extracts the red peppers. The k-means assigns red to the planks and part of the green peppers.

Image 299091 consists of a cloudy sky, an ocher pyramid, and ocher ground. The cyk-means groups the ocher pyramid and the white cloud into the same color, whereas the fixed cyk-means correctly segments the pyramid and the sky. The k-means is unsuccessful; it divides the pyramid into three regions (an ocher region, a highlight region, and a shade region).

The cyk-means did not perform well for image 295087. It segments the image into two colors even though we set the number of clusters to three. Thus, the cyk-means produces a dead unit. This happens because the concentration parameter becomes small and the variances become large when the data distribution visually consists of only a few clusters. In that case, a few clusters absorb all data points and dead units (empty clusters) appear, even if we set the number of clusters to a large number. In contrast, the fixed cyk-means (which has fixed concentration and variance values) appropriately partitions the ground and the blue and deep blue regions of the sky. The k-means extracts shaded regions from the ground; that is, it cannot group the ground into one region.

Furthermore, the fixed parameters of the fixed cyk-means, the concentration parameter $\kappa$ and the variances, can control the quantization results. Figure 4 shows the quantization results generated by the fixed cyk-means using different parameters. The original image in Figure 4 consists of two objects: the red fish and the arms of an anemone. The fixed cyk-means with , , and cannot extract the red fish shown in the middle image of Figure 4. However, the fixed cyk-means with , , and extracts the red fish in the left image of Figure 4. This is because a large $\kappa$ and/or large variances increase the relative weight of the cosine similarity term of (12), and consequently the clustering focuses more on the hue element.

In conclusion, the fixed cyk-means is a more suitable method for color image quantization than the cyk-means. The fixed cyk-means quantizes color images with respect to the hue. The quantization results of the fixed cyk-means differ from those generated by the k-means because the Euclidean metric cannot take the hue into account.

5. Conclusion and Discussion

In this study, we develop the cyk-means and the fixed cyk-means, which are new clustering methods for data in cylindrical coordinates. We derive a new similarity for the cyk-means from a probability distribution that is the product of a von Mises distribution and two Gaussian distributions (see (4)), because the Euclidean distance cannot properly represent dissimilarities between data points on periodic axes. Our experiments demonstrate that the cyk-means and the fixed cyk-means can properly partition synthetic data in cylindrical coordinates. Furthermore, the experimental results using real world data show that the fixed cyk-means has equal or better performance than the k-means and the kernel k-means. In the final experiment, the proposed methods are applied to color image quantization and successfully quantize a color image with respect to the hue element.

The experiments on synthetic data demonstrate the effectiveness of the cyk-means. In the first experiment, the cyk-means produces good parameter estimates and a good clustering of the data. The results of the second experiment show that the cyk-means performs the best when clustering synthetic data. However, in the experiment using real world data, we find that the cyk-means does not provide good clustering results. Furthermore, the results of the color image quantization suggest that the flexibility of the cyk-means often produces dead units or a small cluster containing few data points. Thus, the cyk-means may not be appropriate for actual applications.

The fixed cyk-means will be an effective method for actual applications. The fixed cyk-means is stable and performs well when we apply it to the clustering of synthetic data and real world data and to color image quantization. Furthermore, the fixed cyk-means rarely produces dead units because it has fewer parameters than the cyk-means. The fixed cyk-means also requires less computational time than the cyk-means while giving similar results.

In future work, we will improve the performance of the proposed methods. Like the k-means, the proposed methods are exposed to the ill-initialization problem and/or the dead unit problem caused by an incorrect initialization. The k-means++ method proposed by Arthur and Vassilvitskii [27] solves the ill-initialization problem of the k-means and improves the clustering performance by obtaining an initial set of cluster centers that is close to the optimal solution. The conscience mechanism improves the performance of competitive learning and clustering algorithms [28–30]. It inserts a bias into the competition process so that each unit can win the competition with equal probability. Xu et al. [31] proposed an algorithm based on competitive learning called rival penalized competitive learning [2, 32], which determines the appropriate number of clusters and solves the dead unit problem. The strategy of rival penalized competitive learning is to adapt the weights of the winning unit to the input and to unlearn the weights of the second winner. By incorporating the approaches of these algorithms into the proposed methods, we will improve the performance and reduce the effect of the intrinsic problems.

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The author would like to thank Toya Teramoto of the University of Electro-Communications for testing their algorithm.