Mathematical Problems in Engineering

Volume 2015 (2015), Article ID 535932, 9 pages

http://dx.doi.org/10.1155/2015/535932

## *K*-Nearest Neighbor Intervals Based AP Clustering Algorithm for Large Incomplete Data

^{1}Department of Automation, Tsinghua University, Beijing 100084, China

^{2}Army Aviation Institute, Beijing 101123, China

Received 15 January 2015; Accepted 2 March 2015

Academic Editor: Hui Zhang

Copyright © 2015 Cheng Lu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The Affinity Propagation (AP) algorithm is an effective algorithm for clustering analysis, but it cannot be applied directly to incomplete data. In view of the prevalence of missing data and the uncertainty of missing attributes, we put forward a modified AP clustering algorithm based on *K*-nearest neighbor intervals (KNNI) for incomplete data. Based on an Improved Partial Data Strategy, the proposed algorithm estimates the KNNI representation of missing attributes by using the attribute distribution information of the available data. The similarity function is then modified to handle interval data, so that the improved AP algorithm becomes applicable to incomplete data. Experiments on several UCI datasets show that the proposed algorithm achieves impressive clustering results.

#### 1. Introduction

With the development of sensors and database technology, the Big Data issue is receiving increasing attention [1], yet such data is often difficult to analyze. Cluster analysis is one of the common methods for analyzing data; it partitions a set of objects into groups so that the data in each cluster share some common traits. Affinity Propagation (AP) is a relatively new clustering algorithm introduced by Frey and Dueck [2], which can handle large datasets in a relatively short time and obtain satisfactory results. The AP algorithm is superior to other clustering algorithms in terms of processing efficiency and clustering quality, and it requires neither a prespecified number of clusters nor initial cluster centers. Thus, the AP algorithm has attracted the attention of many scholars, and various improvements have emerged [3–5].

As a common and effective clustering algorithm, the original AP clustering algorithm, like other traditional clustering algorithms, is only applicable to complete data. However, in practice, many datasets suffer from incompleteness for various reasons, such as bad sensors, mechanical failures in data collection, illegible images due to low resolution and noise, and unanswered questions in surveys. Therefore, some strategies should be employed to make AP applicable to such incomplete datasets.

In the literature, several approaches to handling incomplete data have been proposed, including listwise deletion (LD), imputation, model-based methods, and direct analysis [6]. These methods are closely related in their concrete implementations. LD ignores samples with missing values, which may lose a lot of sample information. Imputation and model-based methods are usually based on the assumption that data attributes are missing at random. They substitute the missing values with appropriate estimates and construct a complete dataset. However, imputation is inefficient, and it usually leads to results that are far from satisfactory. For incomplete data, many methods have been proposed to reduce the impact of missing values on clustering performance in pattern recognition. An important empirically oriented study was done by Dixon [7]. The expectation-maximization (EM) algorithm [8] is a commonly used iterative algorithm based on maximum likelihood estimation in missing data analysis. Neither the statistical methods nor the machine learning methods for dealing with missing data meet current practical needs, and the various methods for handling missing data remain to be further optimized.

Though incomplete data appears everywhere, principled clustering methods for such data still deserve further research. The existing research on improved clustering models for incomplete data is mainly concentrated on the fuzzy *c*-means (FCM) clustering algorithm (Bezdek, 1981) [9]. In 1998, imputation and discarding/ignoring were proposed by Miyamoto et al. [10] for handling missing values in FCM. In 2001, Hathaway and Bezdek proposed four strategies to improve the FCM clustering of incomplete data and proved the convergence of the algorithms [11]. These strategies are the whole data strategy (WDS), the partial distance strategy (PDS), the optimal completion strategy (OCS), and the nearest prototype strategy (NPS). In addition, Hathaway and Bezdek used triangle inequality-based approximation schemes (NERFCM) to cluster incomplete relational data [12]. Li et al. [13] put forward an FCM algorithm based on nearest neighbor intervals to handle incomplete data. Zhang and Chen [14] introduced a kernel method into the standard FCM algorithm.

However, FCM algorithms are sensitive to the initial centers, which makes the clustering results unstable. In particular, when some data are missing, the selection of the initial cluster centers becomes even more important. To address this issue, we consider the AP algorithm, which requires neither initial cluster centers nor the number of clusters. Three strategies for AP clustering of incomplete datasets were proposed in our previous research [15]. These strategies are simple and easy to implement, and they deal with the incomplete dataset directly using the AP algorithm. However, the effect of dataset information on the missing attributes was not studied, which may affect clustering quality. In this paper, based on the Improved Partial Data Strategy (IPDS), a modified AP algorithm for incomplete data based on *K*-nearest neighbor intervals (KNNI-AP) is proposed. First, missing attributes are represented by KNNI on the basis of IPDS, which is robust. Second, the clustering problem is transformed into a clustering problem with interval-valued data, which may provide more accurate clustering results. Third, the AP algorithm simultaneously considers all data points as potential centers, which makes the clustering results more stable and accurate.

The remainder of this paper is organized as follows. Section 2 presents a description of AP algorithm and AP clustering algorithm for interval-valued data (IAP) based on clustering objective function minimization. The KNNI representation of missing attributes and the novel KNNI-AP algorithm are introduced in Section 3. Section 4 presents clustering results of several UCI datasets and a comparative study of our proposed algorithm with KNNI-FCM and other methods for handling missing values using AP. We conclude this work and discuss the future work in Section 5.

#### 2. AP Clustering Algorithm for Interval-Valued Data

##### 2.1. AP Clustering Algorithm

The AP algorithm and the *k*-means algorithm have similar objective functions, but the AP algorithm simultaneously considers all data points as potential centers.

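Let $X = \{x_1, x_2, \ldots, x_N\}$ be a complete dataset, where $x_i \in \mathbb{R}^s$. The goal of AP is to find an optimal exemplar set $C \subseteq X$ by minimizing the clustering error function

$$J(C) = \sum_{i=1}^{N} d^2\bigl(x_i, c(x_i)\bigr),$$

where $c(x_i)$ represents the exemplar for a given $x_i$ and $d$ denotes the Euclidean distance. Each data point corresponds to exactly one cluster, and each exemplar is an actual data point that serves as the center of its cluster.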

First, the AP algorithm takes each data point as a candidate exemplar and calculates the attractiveness information between sample points, that is, the similarity between any two sample points. The similarity can be set according to the specific application; similarity measures mainly include similarity coefficient functions and distance functions. Common distance functions are the Euclidean, Manhattan, and Mahalanobis distances. In the traditional clustering problem, similarity is usually set as the negative of the squared Euclidean distance:

$$s(i,j) = -\|x_i - x_j\|^2,$$

where $s(i,j)$ is stored in a similarity matrix and represents the suitability of sample $x_j$ as the exemplar of sample $x_i$. A value $s(k,k)$, called the “preference,” is set for each sample. The greater this value is, the more likely the corresponding point is to be selected as an exemplar. Because all samples are equally suitable as centers, the preferences should be set to a common value $p$. The number of identified exemplars is influenced by $p$, which can be varied to obtain different numbers of clusters. Frey and Dueck [2] suggested setting the preference to the median of the input similarities (resulting in a moderate number of clusters) or to their minimum (resulting in a small number of clusters). We also follow [5] in setting the preference values to obtain more accurate clustering results.
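As a minimal sketch of the similarity setup described above, the following Python snippet builds the matrix of negative squared Euclidean distances and places a shared preference on the diagonal, defaulting to the median of the off-diagonal similarities as suggested in [2]. The function name `similarity_matrix` and the toy points are illustrative assumptions, not part of the original method description.

```python
from statistics import median


def similarity_matrix(points, preference=None):
    """Return S with s(i, j) = -||x_i - x_j||^2 and s(k, k) = preference."""
    n = len(points)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                s[i][j] = -sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
    if preference is None:
        # Median of the off-diagonal similarities (moderate number of clusters).
        preference = median(s[i][j] for i in range(n) for j in range(n) if i != j)
    for k in range(n):
        s[k][k] = preference  # the same preference for every sample
    return s


# Hypothetical toy data: three points in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0)]
S = similarity_matrix(points)
```

Passing the minimum of the off-diagonal similarities as `preference` instead would correspond to the second suggestion in [2] and tends to give fewer clusters.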

To select appropriate cluster centers, the AP algorithm passes two kinds of messages: responsibility $r(i,k)$ and availability $a(i,k)$. The responsibility $r(i,k)$, sent from sample $x_i$ to sample $x_k$, reflects how well suited sample $x_k$ is to serve as the cluster center for sample $x_i$. The availability $a(i,k)$, sent from sample $x_k$ to sample $x_i$, reflects how appropriate it is for sample $x_i$ to choose sample $x_k$ as its exemplar. The message-passing procedure terminates after a fixed number of iterations or when the changes in the messages fall below a threshold.

##### 2.2. AP Clustering Algorithm for Interval-Valued Data (IAP)

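The AP algorithm must be adjusted to deal with interval data. Let $\bar{X} = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_N\}$ be an $s$-dimensional interval-valued dataset, where the $i$th sample is expressed as $\bar{x}_i = [\bar{x}_{i1}, \bar{x}_{i2}, \ldots, \bar{x}_{is}]^T$ and $\bar{x}_{ij} = [x_{ij}^-, x_{ij}^+]$, $1 \le j \le s$. To find the optimal exemplar set $\bar{C}$, we minimize the following clustering error function:

$$J(\bar{C}) = \sum_{i=1}^{N} d^2\bigl(\bar{x}_i, \bar{c}(\bar{x}_i)\bigr),$$

where $\bar{c}(\bar{x}_i)$ represents the exemplar for a given $\bar{x}_i$. The similarity is changed to

$$s(i,j) = -d^2(\bar{x}_i, \bar{x}_j).$$

The Euclidean distance between two interval-valued samples can be defined as

$$d(\bar{x}_i, \bar{x}_j) = \sqrt{\sum_{t=1}^{s}\left[\bigl(x_{it}^- - x_{jt}^-\bigr)^2 + \bigl(x_{it}^+ - x_{jt}^+\bigr)^2\right]}.$$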
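To make the interval-valued similarity concrete, here is a small Python sketch. It assumes one common definition of the Euclidean distance for interval data, in which the squared differences of the lower bounds and of the upper bounds are summed over all attributes; the function names and the `(lower, upper)` tuple representation are illustrative choices.

```python
import math


def interval_distance(x, y):
    """Euclidean distance between two interval-valued samples.

    Each sample is a sequence of (lower, upper) pairs, one per attribute.
    Assumed form: sqrt(sum of squared lower-bound and upper-bound differences).
    """
    total = 0.0
    for (x_low, x_up), (y_low, y_up) in zip(x, y):
        total += (x_low - y_low) ** 2 + (x_up - y_up) ** 2
    return math.sqrt(total)


def interval_similarity(x, y):
    """Similarity used by AP: the negative squared interval distance."""
    return -interval_distance(x, y) ** 2


# Hypothetical samples: the first attribute is observed, the second is an interval.
sample_a = [(1.0, 1.0), (2.0, 4.0)]
sample_b = [(1.0, 1.0), (5.0, 8.0)]
dist_ab = interval_distance(sample_a, sample_b)
```

Under this definition a fully observed attribute contributes twice its ordinary squared difference, because its lower and upper bounds coincide.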

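The similarity matrix $S$ of $\bar{X}$ can be calculated accordingly. Then the responsibility $r(i,k)$ and the availability $a(i,k)$, which are both zero in the initial stage, are updated alternately as follows:

$$r(i,k) \leftarrow s(i,k) - \max_{k' \neq k}\bigl\{a(i,k') + s(i,k')\bigr\},$$

$$a(i,k) \leftarrow \min\Bigl\{0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0, r(i',k)\}\Bigr\}, \quad i \neq k,$$

$$a(k,k) \leftarrow \sum_{i' \neq k} \max\{0, r(i',k)\}.$$

To avoid numerical oscillation, a damping factor $\lambda \in [0, 1)$ is introduced as follows:

$$r^{(t+1)}(i,k) = \lambda\, r^{(t)}(i,k) + (1-\lambda)\, \hat{r}^{(t+1)}(i,k),$$

$$a^{(t+1)}(i,k) = \lambda\, a^{(t)}(i,k) + (1-\lambda)\, \hat{a}^{(t+1)}(i,k),$$

where $\hat{r}^{(t+1)}$ and $\hat{a}^{(t+1)}$ denote the values given by the undamped update rules.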

The procedure of IAP can be described as follows.

Input is the similarity matrix $S$ and the preference $p$.

Output is the clustering result.

*Step 1.* Initialize the responsibilities and availabilities to zero: $r(i,k) = 0$; $a(i,k) = 0$.

*Step 2.* Update the responsibilities.

*Step 3.* Update the availabilities.

*Step 4.* Terminate the message-passing procedure after a fixed number of iterations or when the changes in the messages fall below a threshold. Otherwise, go to Step 2.
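Steps 1 to 4 can be sketched in plain Python as follows. This is an illustrative sketch that uses the standard AP message updates of Frey and Dueck [2] with damping, not the authors' implementation; the function name, the default parameter values, the stopping rule based on a stable exemplar set, and the toy similarity matrix are all assumptions made for the example.

```python
def affinity_propagation(s, damping=0.5, max_iter=200, conv_iter=15):
    """Run AP on a similarity matrix s (preferences on the diagonal).

    Returns (exemplars, labels); labels[i] is the exemplar index of sample i.
    """
    n = len(s)
    r = [[0.0] * n for _ in range(n)]  # Step 1: responsibilities start at zero
    a = [[0.0] * n for _ in range(n)]  # Step 1: availabilities start at zero
    previous, stable, exemplars = None, 0, []
    for _ in range(max_iter):
        # Step 2: damped responsibility update.
        for i in range(n):
            for k in range(n):
                rival = max(a[i][j] + s[i][j] for j in range(n) if j != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - rival)
        # Step 3: damped availability update.
        for k in range(n):
            positive = [max(0.0, r[i][k]) for i in range(n)]
            support = sum(positive) - positive[k]  # sum over i' != k
            for i in range(n):
                if i == k:
                    new = support
                else:
                    new = min(0.0, r[k][k] + support - positive[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * new
        # Step 4: stop once the exemplar set has stayed fixed for conv_iter passes.
        exemplars = [k for k in range(n) if r[k][k] + a[k][k] > 0]
        if exemplars and exemplars == previous:
            stable += 1
            if stable >= conv_iter:
                break
        else:
            previous, stable = exemplars, 0
    if not exemplars:
        # Fallback (assumption): keep the single most self-supporting point.
        exemplars = [max(range(n), key=lambda k: r[k][k] + a[k][k])]
    labels = [i if i in exemplars else max(exemplars, key=lambda e: s[i][e])
              for i in range(n)]
    return exemplars, labels


# Hypothetical toy input: two well-separated groups of three 1-D points.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
S = [[-(p - q) ** 2 for q in points] for p in points]
for k in range(len(points)):
    S[k][k] = -20.0  # common preference, chosen by hand for this example
exemplars, labels = affinity_propagation(S)
```

For interval-valued data only the construction of `S` changes; the message passing itself operates on the similarity matrix and is unaffected.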

#### 3. AP Algorithm for Incomplete Data Based on *K*-Nearest Neighbor Intervals (KNNI-AP)

##### 3.1. *K*-Nearest Neighbor Intervals of Missing Attributes

As a common method for handling missing data, neighbor imputation has been widely used in many areas [16]. Imputation is the problem of approximating the value of a function at a point where it is not given, using the values of that function at neighboring points. As a simple imputation method, the nearest neighbor algorithm takes the value of the single nearest sample and does not consider the other neighboring samples at all; it is easy to implement and commonly used. An improved method is *K*-nearest neighbor imputation [17], where a missing attribute is filled in with the mean of that attribute over the *K* nearest neighbors. Subsequently, García-Laencina et al. [18] proposed a *K*-nearest neighbor interpolation method based on a distance weighted by mutual information. Huang and Zhu [19] introduced a pseudodistance neighbor interpolation method. All the approaches mentioned above are single-value imputations, which cannot fully represent the uncertainty of missing attributes.

To produce a robust estimation, *K*-nearest neighbor intervals (KNNI) of missing attributes are proposed. Let $\tilde{X} = \{\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_N\}$ be an $s$-dimensional incomplete dataset that contains at least one incomplete sample with some (but not all) attribute values missing. For an incomplete sample $\tilde{x}_i$, its *K* nearest neighbors should be found first.

On the basis of the Improved Partial Data Strategy (IPDS), the attribute information of both complete samples and the nonmissing attributes of incomplete samples can be fully used. The distance between two samples is built from the distances between them on each individual attribute. It involves the number of feature dimensions in which neither sample is missing, the total number of feature dimensions, the maximum and minimum of the observed values of an attribute where a missing value occurs, and an indicator function that marks whether a value is missing.
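As a rough illustration of the partial-distance idea that IPDS builds on, the sketch below computes the classic partial distance of Hathaway and Bezdek [11]: squared differences are summed over the attributes observed in both samples and rescaled by the ratio of the total number of attributes to the number of shared ones. This is a stand-in and not the IPDS formula itself, which also uses the observed range of each attribute; the function name and the use of `None` for missing values are assumptions.

```python
import math


def partial_distance(x, y):
    """Classic partial distance between two samples with None for missing values."""
    shared = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not shared:
        # No attribute is observed in both samples: treat them as infinitely far apart.
        return math.inf
    squared = sum((a - b) ** 2 for a, b in shared)
    # Rescale by (all attributes) / (attributes observed in both samples).
    return math.sqrt(len(x) / len(shared) * squared)


# Hypothetical samples with four attributes; two are observed in both.
sample_a = [1.0, None, 3.0, 4.0]
sample_b = [1.0, 2.0, None, 8.0]
dist_ab = partial_distance(sample_a, sample_b)
```

The rescaling keeps distances comparable between pairs of samples that share different numbers of observed attributes.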

According to the principle of the nearest neighbor approach, a sample and its nearest neighbors share the same or similar attributes. Therefore, for an incomplete sample $\tilde{x}_i$, the missing attribute values basically lie between the minimum and maximum of the corresponding attribute values of its *K* nearest neighbors. The *K*-nearest neighbor interval of the sample can then be determined, and the dataset can be converted into an interval dataset. A missing attribute $\tilde{x}_{ij}$ is represented by its corresponding *K*-nearest neighbor interval $[x_{ij}^-, x_{ij}^+]$, and a nonmissing attribute $\tilde{x}_{ij}$ can also be rewritten in interval form $[x_{ij}^-, x_{ij}^+]$, where $x_{ij}^- = x_{ij}^+ = \tilde{x}_{ij}$; that is, the original values are unchanged. The interval dataset $\bar{X}$ is thus formed.
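The conversion to an interval dataset can be sketched as follows. The code is an illustration under stated assumptions: it uses the classic partial distance as a stand-in for the IPDS distance, it skips neighbors that are themselves missing the attribute in question, and the function names and toy data are hypothetical.

```python
import math


def partial_distance(x, y):
    """Assumed stand-in for the IPDS distance: classic partial distance."""
    shared = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not shared:
        return math.inf
    return math.sqrt(len(x) / len(shared) * sum((a - b) ** 2 for a, b in shared))


def knn_intervals(data, k):
    """Convert samples (None marks a missing value) into lists of (low, high)."""
    intervals = []
    for i, sample in enumerate(data):
        # Indices of the other samples, ordered from nearest to farthest.
        order = sorted((j for j in range(len(data)) if j != i),
                       key=lambda j: partial_distance(sample, data[j]))
        row = []
        for t, value in enumerate(sample):
            if value is not None:
                row.append((value, value))  # observed value: degenerate interval
            else:
                # Assumption: neighbors that also miss attribute t are skipped.
                seen = [data[j][t] for j in order if data[j][t] is not None][:k]
                row.append((min(seen), max(seen)))
        intervals.append(row)
    return intervals


# Hypothetical data: the third sample is missing its second attribute.
data = [[1.0, 2.0], [1.1, 2.2], [0.9, None], [5.0, 9.0], [1.2, 1.8]]
intervals = knn_intervals(data, k=2)
```

In this toy example the outlying sample `[5.0, 9.0]` is not among the two nearest neighbors of the incomplete sample, so it does not widen the interval.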

The choice of *K* is critical for the intervals to represent the missing attributes effectively. If *K* is too small, the interval may not express the missing attribute correctly, which likely leads to a biased estimation. In the extreme situation when *K* is 1, KNNI degrades into NNI. However, if *K* is too large, the interval also cannot correctly characterize the missing attribute value. In the extreme situation when *K* is as large as $N$ (the number of samples in the dataset), the missing attribute interval is the range of all samples on that attribute, which is too wide to represent the missing attribute properly. This will blur the attribute characteristics of different clusters and result in unreasonable clustering results. The appropriate *K* is related not only to the ratio of missing attributes but also to the distribution of the samples and the relevant clusters. Thus, the choice of *K* directly affects the accuracy of clustering.

We randomly selected 3 kinds of three-dimensional data to form a dataset, for example, with 900 samples and a missing rate of 15%. The KNNI-AP algorithm was run with *K* ranging from 1 to 50; the test results show that the clustering results are basically stable when *K* is more than 10 and uncertain when *K* is too small. Similarly, the values of *K* were tested for randomly missing data with different dimensions and different numbers of samples. We found that choosing *K* as the cube root of the number of samples is appropriate. Therefore, in this paper, *K* is the cube root of the number of samples rounded to the nearest integer.
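The cube-root rule is easy to check numerically. The short sketch below applies it to the 900-sample test set and to the sample counts of the datasets used later in the paper; the helper name `choose_k` and the lower bound of 1 are assumptions added for the example.

```python
def choose_k(n_samples):
    """Cube root of the number of samples, rounded to the nearest integer."""
    return max(1, round(n_samples ** (1.0 / 3.0)))


# Sample counts taken from the text: test set, Iris, Wine, WDBC, Wholesale customers.
sizes = {"test set": 900, "Iris": 150, "Wine": 178, "WDBC": 569, "Wholesale": 440}
k_values = {name: choose_k(n) for name, n in sizes.items()}
```

With this rule, datasets of a few hundred samples end up with *K* between 5 and 10.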

##### 3.2. AP Algorithm for Incomplete Data Based on KNNI (KNNI-AP)

KNNI-AP proposed here deals with the clustering problem for incomplete data by transforming the dataset into an interval-valued one. The missing attribute interval will be wide if the values of the $j$th attribute are dispersed within clusters and narrow if they are compact within clusters, so the KNNI can better represent the uncertainty of missing attributes. The lower and upper boundaries of a missing attribute interval are determined by the distribution of attributes in clusters, that is, by the geometrical structure of the clusters, which reflects to some extent the cluster shapes and the sample distribution of the dataset. The proposed KNNI-AP can validate the robustness of the clustering pattern.

For an $s$-dimensional incomplete dataset $\tilde{X}$, the procedure of KNNI-AP can be described as follows.

*Step 1.* Set *K* as the cube root of the number of samples rounded to the nearest integer.

*Step 2.* Obtain the distance between every pair of samples based on the IPDS, and construct the similarity matrix.

*Step 3.* Form the corresponding interval dataset $\bar{X}$. For each missing attribute $\tilde{x}_{ij}$, find the *K* nearest neighbors of its sample using the similarity matrix from Step 2. The missing attribute $\tilde{x}_{ij}$ is represented by $[x_{ij}^-, x_{ij}^+]$, and each nonmissing attribute is rewritten in interval form $[x_{ij}^-, x_{ij}^+]$, where $x_{ij}^- = x_{ij}^+ = \tilde{x}_{ij}$.

*Step 4.* Calculate the similarity matrix of $\bar{X}$. Choose the parameters of AP: the maximum number of iterations performed by AP (default 2000); the number of iterations, convits, for which the estimated cluster centers must stay fixed before the algorithm is considered converged (default 50); the decreasing step of preferences (default 0.01); and the damping factor (default 0.5).

*Step 5.* Apply the Preference Range algorithm to compute the range of the preference. Initialize the preference $p$ within this range, and update it by the decreasing step of preferences on each subsequent run.

*Step 6.* Apply the IAP algorithm to generate clusters. If the number of clusters is known, check whether the number of clusters obtained is equal to the given number; otherwise, calculate a series of silhouette index (Sil) values corresponding to the clustering results with different numbers of clusters.

*Step 7.* If the number of clusters is known, the algorithm terminates when the number of clusters obtained is equal to the given number; otherwise, it terminates when Sil is the largest.

#### 4. Simulation Analysis

##### 4.1. Incomplete Datasets

In order to test the proposed clustering algorithm, we use artificially generated incomplete datasets. The scheme for artificially generating an incomplete dataset is to randomly select a specified percentage of components and designate them as missing. The random selection of missing attribute values should satisfy the following [11]: (1) each original feature vector retains at least one component; (2) each attribute has at least one value present in the incomplete dataset $\tilde{X}$.
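One possible way to generate such an incomplete dataset is sketched below. It visits the components in random order and marks a component as missing only if its row and its column would each keep at least one observed value, so both conditions hold by construction. The function name, the `seed` parameter, and the greedy selection scheme are assumptions; the paper does not specify how the random selection is implemented.

```python
import random


def make_incomplete(data, missing_rate, seed=0):
    """Return a copy of data with about missing_rate of its entries set to None.

    Every row keeps at least one observed value, and so does every column.
    """
    rng = random.Random(seed)
    n, s = len(data), len(data[0])
    target = int(round(missing_rate * n * s))
    out = [list(row) for row in data]
    cells = [(i, j) for i in range(n) for j in range(s)]
    rng.shuffle(cells)
    removed = 0
    for i, j in cells:
        if removed == target:
            break
        row_present = sum(v is not None for v in out[i])
        col_present = sum(out[r][j] is not None for r in range(n))
        # Only remove a value if its row and column stay non-empty afterwards.
        if row_present > 1 and col_present > 1:
            out[i][j] = None
            removed += 1
    return out


# Hypothetical complete dataset: 10 samples with 3 attributes each.
complete = [[float(i + j) for j in range(3)] for i in range(10)]
incomplete = make_incomplete(complete, missing_rate=0.2)
```

For very high missing rates the two constraints can prevent the target count from being reached, in which case this sketch returns fewer missing values than requested.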

That is, every feature vector keeps at least one observed component (no row is empty), and every attribute keeps at least one observed value (no column is empty). In the following experiments, we test the performance of the proposed algorithm on commonly used UCI datasets: Iris, Wine, Wisconsin Diagnostic Breast Cancer (WDBC), and Wholesale customers, which are taken from the UCI Machine Learning Repository [20] and are often used as standard databases to test the performance of clustering algorithms.

The Iris dataset contains 150 four-dimensional attribute vectors. The Wine dataset used in this paper contains 178 three-dimensional attribute vectors. The WDBC dataset comprises 569 samples and, for each sample, there are 30 attributes. The Wholesale customers dataset refers to clients of a wholesale distributor containing 440 6-dimensional attribute vectors.

##### 4.2. Compared Algorithms

To test the clustering performance, we take AP based on the *K*-nearest neighbor mean (KNNM-AP), AP based on the nearest neighbor (1NN-AP), AP based on IPDS (IPDS-AP), and FCM based on the *K*-nearest neighbor interval (KNNI-FCM) as compared algorithms. IPDS-AP deals with the incomplete dataset directly using the AP algorithm, and the others are imputation algorithms that use different methods to handle missing values. For KNNM-AP, missing attributes are replaced by the *K*-nearest neighbor mean; for 1NN-AP, missing attributes are replaced by the value of the nearest neighbor; for KNNI-FCM, missing attributes are computed by KNNI, similarly to KNNM-AP.

##### 4.3. Evaluation Method

To evaluate the quality of clustering results, we use misclassification ratio and Fowlkes-Mallows index [21].

The Fowlkes-Mallows (FM) index is used to measure the clustering performance based on external criteria. In general, the larger the FM value is, the better the clustering performance is. The FM index is defined as

$$FM = \frac{a}{\sqrt{m_1 m_2}},$$

where $m_1 = a + b$ and $m_2 = a + c$.

$\mathcal{C}$ is a clustering structure of the dataset and $\mathcal{P}$ is a defined partition of the data. We refer to a pair of samples from the dataset using the following terms.

SS: if both samples belong to the same cluster of the clustering structure $\mathcal{C}$ and to the same group of partition $\mathcal{P}$.

SD: if samples belong to the same cluster of $\mathcal{C}$ and to different groups of $\mathcal{P}$.

DS: if samples belong to different clusters of $\mathcal{C}$ and to the same group of $\mathcal{P}$.

DD: if both samples belong to different clusters of $\mathcal{C}$ and to different groups of $\mathcal{P}$.

$a$, $b$, $c$, and $d$ are the numbers of SS, SD, DS, and DD pairs, respectively. Then $a + b + c + d = M$, which is the maximum number of all pairs in the dataset (that is, $M = N(N-1)/2$, where $N$ is the total number of samples in the dataset).
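The pair counts and the resulting FM value can be computed directly from two label vectors, as in the following sketch. It assumes the usual form of the index, the SS count divided by the geometric mean of (SS + SD) and (SS + DS); the function name and the toy labels are illustrative.

```python
from itertools import combinations
from math import sqrt


def fowlkes_mallows(cluster_labels, partition_labels):
    """FM index from the SS, SD, and DS pair counts of two labelings."""
    a = b = c = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_group = partition_labels[i] == partition_labels[j]
        if same_cluster and same_group:
            a += 1  # SS pair
        elif same_cluster:
            b += 1  # SD pair
        elif same_group:
            c += 1  # DS pair
    if a == 0:
        return 0.0
    return a / sqrt((a + b) * (a + c))


# Hypothetical example: one sample is placed in the wrong cluster.
clusters = [0, 0, 0, 1, 1, 1]
truth = [0, 0, 1, 1, 1, 1]
fm = fowlkes_mallows(clusters, truth)
```

Here the 15 sample pairs split into 4 SS, 2 SD, 3 DS, and 6 DD pairs, so the index is 4 divided by the square root of 42, roughly 0.62.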

The misclassification ratio is the proportion of samples allocated to an incorrect group: the number of incorrect classifications divided by the total number of samples.

##### 4.4. Experimental Results and Discussion

For the four datasets, the damping factor, the decreasing step of preferences, the maximum number of iterations, and the convergence condition are set to the same values. Because the missing data were randomly generated, different tests lead to different results, and we noticed significant variation in the results from trial to trial. To reduce this variation, Figures 1–4 and Tables 1–4 give the results averaged over 30 trials on the incomplete Iris, Wine, WDBC, and Wholesale customers datasets. The figures give an intuitive view of the effects of the algorithms, and the tables characterize the clustering results precisely. In particular, 30 trials are generated for each row in the tables, and the same incomplete dataset is used in each trial for every algorithm, so that the results can be compared fairly. In the tables, the best solution in each row is highlighted in bold, and the second-best solution is in italics.