Scientific Programming

Volume 2018, Article ID 2764016, 7 pages

https://doi.org/10.1155/2018/2764016

## Clustering for Probability Density Functions by New $k$-Medoids Method

^{1}Division of Computational Mathematics and Engineering, Institute for Computational Science, Ton Duc Thang University, Ho Chi Minh City, Vietnam

^{2}Faculty of Mathematics and Statistics, Ton Duc Thang University, Ho Chi Minh City, Vietnam

^{3}Natural Science College, Can Tho University, Can Tho City, Vietnam

Correspondence should be addressed to T. Nguyen-Trang; nguyentrangthao@tdt.edu.vn

Received 24 November 2017; Revised 21 March 2018; Accepted 3 April 2018; Published 9 May 2018

Academic Editor: Emiliano Tramontana

Copyright © 2018 D. Ho-Kieu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

This paper proposes a novel and efficient clustering algorithm for probability density functions based on $k$-medoids. A scheme for selecting good initial medoids is also suggested, which reduces the computational time significantly. In addition, a general proof of convergence of the proposed algorithm is presented. The effectiveness and feasibility of the proposed algorithm are verified and compared with various existing algorithms on both artificial and real datasets in terms of adjusted Rand index, computational time, and iteration number. The numerical results reveal an outstanding performance of the proposed algorithm as well as its potential applications in real life.

#### 1. Introduction

Clustering plays a pivotal role in exploring the intrinsic structure of data, especially in data mining. Its main idea is to separate an initial group into subgroups such that the objects in each subgroup are as similar as possible. It therefore aims to minimize intracluster variation and to maximize intercluster variation [1]. Cluster analysis is divided into two kinds: hard (crisp) clustering and soft (fuzzy) clustering [2]. For crisp clustering, the $k$-means and $k$-medoids algorithms are the typical ones [1].

The primary difference between these algorithms is how they define the center of a cluster. At each iteration, $k$-means updates each center to the center of mass of its cluster, called the centroid. With this approach, however, $k$-means is well known to be sensitive to outliers, despite its computational efficiency. $k$-medoids clustering (KMC) is a good way to overcome this shortcoming, because it uses an object from the initial input as the reference point instead of the center of mass [2]. That is why its centers are called medoids. Among the numerous KMC algorithms, partitioning around medoids (PAM), first proposed in [3], is known to be the most powerful. However, computational time is still a drawback of PAM when it is applied to large problems [4]. Therefore, in this paper, a robust but straightforward scheme is employed to address this difficulty. The scheme, which is inspired by [5], seeks the most centrally located objects to serve as the initial medoids.

Historically, the usual objects of clustering have been discrete elements, and a lot of work has been done on them, for example [6–11]. Nevertheless, given how data fluctuate nowadays, it seems more appropriate to represent the data by series of numbers or functions rather than by single points. This leads to considering probability density functions (pdfs) as another object of clustering besides discrete elements [12]. So far, some of the state-of-the-art works on clustering for pdfs are as follows. Chen and Hung proposed a simple but effective automatic clustering algorithm for pdfs based on an ad hoc technique [13]. Nguyentrang and Vovan considered many approaches to the clustering problem, both hierarchical and nonhierarchical [12]. Among them, a remarkable method related to $k$-means for pdfs, called the nonhierarchical method, was proposed. Furthermore, Tai et al. applied an evolutionary technique to optimize the clustering solution [14].

Nevertheless, an overview of the related work on clustering for pdfs shows that no research has studied KMC for pdfs. Also, for a massive amount of data such as pdfs, computational time should be taken into consideration. Therefore, this paper proposes a KMC algorithm for pdfs (KMCF) for the first time and also establishes the convergence of the KMCF algorithm. Many numerical examples are performed to evaluate the robustness and effectiveness of the proposed method. The numerical results of the KMCF algorithm are compared with those of existing algorithms in the literature. All results show the dominance of the proposed method in terms of both accuracy and computational time.

The remainder of the paper is organized as follows. Section 2 presents some related theory and proposes an algorithm for clustering pdfs based on the $k$-medoids method. Section 3 proves the convergence of the proposed algorithm. Section 4 discusses the numerical results of the proposed algorithm and existing ones. Section 5 concludes the paper.

#### 2. Related Theory and the Proposed Algorithm

##### 2.1. Definitions

Let a set of $n$ probability density functions (pdfs) be divided into $k$ partitions. A feasible partition of all given pdfs into clusters, denoted by $W$, should maintain the following properties:

(i) The minimum number of objects in one cluster is 1.

(ii) Each object must belong to a cluster.

(iii) There is no common object between two clusters.

According to [15], the clustering problem is NP-hard when the number of clusters exceeds 3. In the KMCF problem, the representatives, here called medoids, are objects from the initial input. The set of representing pdfs is therefore a subset of the given pdfs. An example is given below for more detail.

Suppose that we have 4 pdfs estimated from initial dataset. These pdfs are partitioned into 2 clusters, and . By some techniques, the clustering result is and , where and are, respectively, the medoids of and . Then, the partition matrix is presented as follows:

Therefore the set of the medoids is

##### 2.2. $L^1$-Distance

Solving a clustering problem requires determining the similarity between elements or pdfs before grouping. This can be handled by criteria such as distance, density, or shape [16]. In clustering for pdfs, the $L^1$-distance, first proposed by Pham-Gia et al. [17], is one of the most common criteria used to evaluate the similarity between pdfs. This distance is based primarily on the maximum function to assess the level of proximity or separation between pdfs, which has many advantages, as discussed in [18]. The $L^1$-distance is defined as follows.

*Definition 1.* Let be a set of pdfs , and ; then the $L^1$-distance is defined by

For ,

From (2), it is easy to show that is a nondecreasing function in with . From (3), we obtain
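As a numerical illustration of a distance built on the maximum function, the sketch below assumes the form $\int \max_i f_i(x)\,dx - 1$ for a set of pdfs. This form is an assumption about the definition, not a quotation of it, and the helper names `normal_pdf` and `l1_distance` are hypothetical. Under this assumption, identical pdfs give 0, two well-separated pdfs give a value close to 1, and for two pdfs the value equals half of $\int |f_1 - f_2|\,dx$.

```python
import math


def normal_pdf(mu, sigma):
    """Return the density function of a univariate normal N(mu, sigma^2)."""
    def pdf(x):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return pdf


def l1_distance(pdfs, lo, hi, steps=20000):
    """Approximate the integral of max_i f_i(x) over [lo, hi], minus 1.

    Uses the composite trapezoidal rule; [lo, hi] is assumed to cover
    essentially all of the probability mass of every pdf.
    """
    h = (hi - lo) / steps
    total = 0.0
    for s in range(steps + 1):
        x = lo + s * h
        weight = 0.5 if s in (0, steps) else 1.0  # end points count half
        total += weight * max(f(x) for f in pdfs)
    return total * h - 1.0


if __name__ == "__main__":
    f = normal_pdf(0.0, 1.0)
    g = normal_pdf(1.0, 1.0)
    print(l1_distance([f, g], -10.0, 11.0))
```

A grid-based integral is used here only to keep the example dependency-free; for multivariate pdfs a different quadrature would be needed.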

##### 2.3. The Proposed Algorithm

*Problem.* Given pdfs which are clustered into partitions (), the mathematical program is considered as follows:

: minimize

subject to , where

1. and are defined in Section 2.1,
2. , is a measure for similarity between and . In this paper, the $L^1$-distance is chosen to calculate [18].
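For orientation, a program of this kind usually takes the standard $k$-medoids form shown below. The symbols are assumptions introduced for illustration and are not recovered from the text: $f_1,\dots,f_n$ are the given pdfs, $w_{ij}$ are the entries of the partition matrix $W$ (equal to 1 when $f_j$ is assigned to cluster $i$), $v_i$ is the medoid of cluster $i$, and $d$ is the chosen distance.

$$
\begin{aligned}
\min_{W,\,V}\quad & f(W,V)=\sum_{i=1}^{k}\sum_{j=1}^{n} w_{ij}\, d(f_j, v_i) \\
\text{subject to}\quad & \sum_{i=1}^{k} w_{ij}=1, \qquad j=1,\dots,n, \\
& w_{ij}\in\{0,1\}, \qquad i=1,\dots,k,\; j=1,\dots,n, \\
& v_i\in\{f_1,\dots,f_n\}, \qquad i=1,\dots,k.
\end{aligned}
$$

The first two constraints say that every pdf is assigned to exactly one cluster; the last one says that each medoid is one of the input pdfs.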

We see that the problem is a nonconvex program, so a local minimum need not be a global minimum. Based on the above notation, the proposed $k$-medoids clustering algorithm for pdfs (KMCF) is presented as follows.

*Step 1 (choose the initial medoids).*

1. Calculate the distance between every pair of objects based on the $L^1$-distance, denoting .
2. Compute for object as follows:
3. Sort in ascending order. Select the first $k$ objects having the smallest values as the initial medoids. Then we have initial medoids ( is the th cluster center at the th iteration).
4. Assign each object , to the nearest medoid, which is equivalent to fixing the values of . Set
5. Compute the sum of distances from all objects to their medoids .

*Step 2 (update medoids).*

1. In each established cluster, find a new medoid, which is the one minimizing . Set
2. Update the current medoids in each cluster by replacing them with the new medoids

*Step 3 (assign objects to their medoids).* Assign each object , to the nearest center, which is equivalent to fixing the values of . Compute the sum of distances from all objects to their new medoids If , or , then the algorithm stops. Otherwise, it goes to Step 2.

With the scheme proposed in Step 1, the distance matrix is computed only once. Moreover, the method tends to select the most centrally located objects as the initial medoids. As a result, the computational time improves significantly.
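Steps 1 to 3 can be sketched on a precomputed distance matrix as follows. This is a minimal illustration rather than the authors' implementation. The function name `kmcf` is hypothetical, the centrality score in Step 1 is assumed to be a sum of row-normalized distances (the exact formula is not legible in the text), and the stopping rule is assumed to be "the total distance no longer decreases". Off-diagonal distances are assumed to be strictly positive and $k$ at most the number of objects.

```python
def kmcf(dist, k, max_iter=100):
    """k-medoids on a symmetric distance matrix given as a list of lists."""
    n = len(dist)

    # Step 1: score each object (assumed normalized form); a smaller score
    # means the object is more central. The k most central become medoids.
    row_sums = [sum(row) for row in dist]
    scores = [sum(dist[i][j] / row_sums[i] for i in range(n)) for j in range(n)]
    medoids = sorted(range(n), key=scores.__getitem__)[:k]

    def assign(meds):
        # Attach every object to its nearest medoid and total the distances.
        labels = [min(range(k), key=lambda c: dist[j][meds[c]]) for j in range(n)]
        cost = sum(dist[j][meds[labels[j]]] for j in range(n))
        return labels, cost

    labels, cost = assign(medoids)
    for _ in range(max_iter):
        # Step 2: in each cluster, the new medoid is the member with the
        # smallest summed distance to all members of that cluster.
        new_medoids = []
        for c in range(k):
            members = [j for j in range(n) if labels[j] == c]
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        # Step 3: reassign objects; stop when the total distance does not drop.
        new_labels, new_cost = assign(new_medoids)
        if new_cost >= cost:
            break
        medoids, labels, cost = new_medoids, new_labels, new_cost
    return medoids, labels, cost


if __name__ == "__main__":
    xs = [0, 1, 2, 10, 11, 12]  # toy points standing in for pdfs
    dist = [[abs(a - b) for b in xs] for a in xs]
    print(kmcf(dist, 2))
```

On the toy input, the two groups of three points are separated and the medoids end up at indices 1 and 4. The distance matrix is built once and only indexed afterwards, which is the property the text credits for the reduced computational time.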

#### 3. Convergence of the Proposed Algorithm

##### 3.1. The Properties of Problem

First, we define the reduced objective function of the problem as follows:

and **W** is any matrix.

Lemma 2. *The reduced objective function is a concave function.*

*Proof.* Consider two points and and let be any scalar so that ; then

Therefore, is concave.

Next, we show an important property of the constraint set (5).

Lemma 3. *Consider a set given by*

*The extreme points of satisfy constraint (5).*

*Proof.* To illustrate the proof of Lemma 3, we suppose that and the probability of belonging to 3 clusters is 0.8, 0.1, and 0.1, respectively. Then, the pdf will be assigned to the first cluster due to the highest probability. Thus, 0.9 is one of the extremes of corresponding to pdf . Moreover, this extreme point will establish a basis as . Also, it is an identity matrix. Each basic variable will receive value 1 and value 0 and vice versa. This completes the proof.

Therefore, we have the following definition.

*Definition 4. *The reduced problem of the problem is given as follows:

minimize subject to .

As the function is concave, there exists an extreme solution of the problem which in turn satisfies the constraint set (5). Therefore, the following statement follows immediately.

Lemma 5. *Problems and are equivalent.*
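The reasoning of this subsection can be written out compactly under assumed notation, none of which is legible in the text above: $f(W,V)$ is the objective, which is linear in $W$ for fixed $V$; $F(W)=\min_V f(W,V)$ is the reduced objective; and $S=\{W : \sum_{i} w_{ij}=1,\ w_{ij}\ge 0\}$ is the relaxed assignment set. For $0\le\gamma\le 1$, the concavity in Lemma 2 then follows from

$$
\begin{aligned}
F(\gamma W_1+(1-\gamma)W_2) &= \min_{V}\, f(\gamma W_1+(1-\gamma)W_2,\,V) \\
&= \min_{V}\,\bigl[\gamma f(W_1,V)+(1-\gamma) f(W_2,V)\bigr] \\
&\ge \gamma \min_{V} f(W_1,V)+(1-\gamma)\min_{V} f(W_2,V) \\
&= \gamma F(W_1)+(1-\gamma)F(W_2).
\end{aligned}
$$

A concave function attains its minimum over a polytope at an extreme point, and the extreme points of a set like $S$ are the 0-1 assignment matrices, which is why relaxing the binary constraint does not change the optimal value.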

##### 3.2. The Convergence of KMCF Algorithm

A point is called the partial optimal solution of problem if it satisfies [19]1..2..

Thus, the following two problems are defined in order to obtain the partial optimal solution.

*Problem **.* Given , minimize subject to .

*Problem *. Given , minimize subject to .
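In assumed notation (objective $f(W,V)$, relaxed assignment set $S$, and medoid sets $V$ drawn from the input pdfs; these symbols are illustrative and not recovered from the text), partial optimality and the two subproblems are commonly written as:

$$
\begin{aligned}
& f(W^{*},V^{*}) \le f(W,V^{*}) \quad \text{for all } W\in S, \\
& f(W^{*},V^{*}) \le f(W^{*},V) \quad \text{for all admissible } V, \\
& (P_1):\ \text{given } \hat V,\ \min_{W\in S} f(W,\hat V), \qquad
  (P_2):\ \text{given } \hat W,\ \min_{V} f(\hat W,V).
\end{aligned}
$$

Solving the first subproblem corresponds to assigning each pdf to its nearest medoid, and solving the second corresponds to updating the medoid of each cluster.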

The algorithm below generates partial optimal solutions, so it is useful to restate the KMCF algorithm in these terms. Since the step to find for the object is unchanged, it is not shown here.

*The Restated KMCF*

1. Choose initial medoids based on values of ; we get ; solve with ; then one gets that is an optimal basic solution of problem . Set . Denote as the th cluster center at the th iteration.
2. Solve with . Let the solution be . If , stop; the optimal solution is . Otherwise, go to step .
3. Solve with ; then the basic solution will be . If , stop; the optimal solution is . Otherwise, the algorithm returns to step .

Theorem 6. *The restated algorithm converges to a partial optimal solution of problem in a finite number of iterations.*

*Proof.* First we show that an extreme point of is visited at most once by the algorithm before it stops. Assume that this is not true; that is, for some , , where . When applying step (ii), we get two optimal solutions and for and , respectively; that is,

However, the sequence generated by the algorithm is strictly decreasing, so (9) is false. Therefore, an extreme point of is visited at most once by the algorithm before it stops. Moreover, because there are a finite number of extreme points of , the algorithm reaches a partial optimal solution after a finite number of iterations. This guarantees the convergence of $k$-medoids type algorithms in general.

The expected value of the adjusted Rand index (ARI) for random partitions is zero, and the index equals 1 for perfect agreement between two partitions. Therefore, the ARI is used in this paper to evaluate the results of the clustering algorithms.
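For readers who want to reproduce the accuracy measure, the sketch below computes the ARI in its standard pair-counting form. Whether this matches the exact expression used in [21] is an assumption, and the function name `adjusted_rand_index` is hypothetical. Both label lists are assumed to have the same length, at least 2.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index of two labelings of the same objects."""
    n = len(labels_true)
    # Contingency counts: how many objects share each (true, predicted) pair.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_rows * sum_cols / comb(n, 2)  # value expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The index depends only on which objects are grouped together, so relabeling the clusters, as in the usage line, leaves the value unchanged.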

#### 4. Numerical Results

In this section, four datasets are used to evaluate the performance of the proposed algorithm. The first two are simulated data already published in [13, 18]. The third is taken from the well-known CUReT dataset, which is available at http://www1.cs.columbia.edu/CAVE//software/curet. The final one is real data extracted from a video of the traffic situation at Ton Duc Thang University in Vietnam at a fixed moment. Three other algorithms are also considered for comparison with the proposed algorithm. The first is the proposed algorithm with medoids chosen randomly, namely, the random $k$-medoids algorithm. The second is the modification of $k$-means for pdfs, called the nonhierarchical approach [20]. The last is one of the state-of-the-art algorithms for pdfs, namely, self-update, or SU for short. All the compared algorithms are given the suitable number of clusters in advance, except for SU. For the termination condition, epsilon is 10^{−3} for SU; distance-based criteria are employed for the remaining cases. Further, to test stability, each algorithm is executed over 50 independent runs on every dataset, and the average result is taken as the final result. The performance of all algorithms is evaluated on three aspects: accuracy (ARI) [21], computational time (seconds), and iteration number. All numerical results were obtained in the 2015 version of MATLAB on an Intel(R) Core(TM) i3-4005U CPU @ 1.70 GHz with 4 GB of main memory in a Windows Server 2010 environment.

*Example 1.* In this example, the dataset is simple simulated data with a “well-behaved” class structure that has also been well studied by previous algorithms in the field of clustering for pdfs. The data include seven univariate normally distributed pdfs, as presented in Figure 1. The details of the estimated parameters can be found in [18]. From Figure 1, one can obtain the appropriate partition corresponding to three clusters as