Abstract

Clustering has been widely used in different fields of science, technology, social science, and so forth. In the real world, data objects are usually described by numeric as well as categorical features. Accordingly, many clustering methods can process datasets that are either purely numeric or purely categorical. Recently, algorithms that can handle mixed data clustering problems have been developed. The affinity propagation (AP) algorithm is an exemplar-based clustering method which has demonstrated good performance on a wide variety of datasets. However, it has limitations in processing mixed datasets. In this paper, we propose a novel similarity measure for mixed type datasets and an adaptive AP clustering algorithm for clustering such datasets. Several real world datasets are studied to evaluate the performance of the proposed algorithm. Comparisons with other clustering algorithms demonstrate that the proposed method works well not only on mixed datasets but also on pure numeric and categorical datasets.

1. Introduction

With the development of information technology and the wide use of computers and networks, the explosion of data in almost all fields provides a totally new perspective for data scientists towards knowledge discovery and future decision making. Because of the urgent need for data processing, researchers have developed new techniques that can extract useful information and knowledge from this vast amount of data. In this context, data mining is an effective and attractive approach to meet these requirements.

Clustering is one of the most commonly encountered data mining techniques, implemented to extract knowledge in many areas such as community detection [1], pattern recognition [2, 3], bioinformatics [4], and spatial database applications, for example, GIS or astronomical data [5, 6]. The general purpose of clustering is to partition a dataset consisting of points embedded in a $d$-dimensional space into clusters, such that the data points within the same cluster are more similar to each other than to data points in other clusters [7–9]. Because of their simplicity and ease of implementation in a wide variety of scenarios, distance-based clustering methods, for instance, k-means, k-medians, k-medoids, and hierarchical clustering, are widely used and deeply researched. The main problems of distance-based clustering methods are defining a proper similarity measure to discriminate the similarity or dissimilarity between different data points and aggregating the most similar elements into appropriate clusters in an unsupervised way. Thus, the problem of clustering can be reduced to the problem of finding a distance function for a given data type [10–12]. Traditional clustering methods use the Euclidean distance measure to calculate the similarity (or dissimilarity) of two data points [13, 14]. This is suitable for datasets that are purely numeric. Actually, datasets in the real world are more complicated. A large amount of data is mixed, containing both numeric attributes such as height and age and categorical attributes such as male or female and on or off. In this case, however, the Euclidean distance measure fails to judge the similarity of two data points when attributes are of categorical or mixed type.

Up to the present, researchers have developed many approaches for dealing with mixed data. Similarity-based agglomerative clustering (SBAC) [15], a hierarchical agglomerative algorithm presented by Li and Biswas based on the Goodall similarity measure [16], works well with mixed numeric and categorical attributes, but its computational cost grows rapidly when clustering large datasets, which is often unacceptable. Huang [17] proposed the k-prototype clustering method, which divides the dataset into two distinct parts, one for numeric attributes and another for categorical attributes, and handles the two components separately. Because of the information loss in dealing with the cluster center and the simple binary distance measure between two categorical attributes in Huang’s algorithm, Ahmad and Dey [18] developed a modified cost function, based on a k-means type algorithm, that alleviates the shortcomings of Huang’s cost function. In Ahmad and Dey’s algorithm, the distance computation between two values of a single categorical attribute considers not only the attribute they belong to, but also the other attributes, including the numeric ones. They also proposed an approach for computing the significance of a numeric attribute based on the attribute value distributions within the data. Ji et al. [19] improved Ahmad and Dey’s algorithm with a novel fuzzy k-prototype algorithm integrating the mean and a fuzzy centroid to represent the prototype of a cluster. Like many other fuzzy k-means type algorithms, Ji’s algorithm also requires the determination of a fuzzy coefficient value.

The affinity propagation clustering (APC) algorithm based on message passing is a more powerful approach proposed by Frey and Dueck [20] in 2007. Traditional distance-based clustering methods require metric similarities, that is, symmetry, nonnegativity, and the triangle inequality. Compared to the traditional approaches, the affinity propagation algorithm’s ability to also take general nonmetric similarities as input makes it suitable for exploratory data analysis using unusual measures of similarity [21]. For instance, AP has been used to identify key sentences and air-travel routing on the basis of nonstandard optimization criteria [20]. Furthermore, affinity propagation is a completely data-driven analysis technique that partitions the data points into different clusters and identifies exemplars among them by simultaneously considering all data points as possible exemplars and exchanging messages between data points until a good set of exemplars and clusters emerges [22]. However, the original AP method assumes that features are numeric valued, which means the algorithm cannot process features with categorical or mixed type values.

Based on the AP algorithm and Ahmad and Dey’s mixed similarity measure framework [18], this paper proposes an adaptive affinity propagation clustering method for datasets with mixed numeric and categorical attributes, using a novel similarity measure as the cost function. The key innovative points of the paper are as follows. First, this paper applies the AP algorithm to cluster datasets with mixed type attributes for the first time. Second, it proposes a novel mixed similarity measure based on Ahmad and Dey’s work. Third, it improves the original AP clustering algorithm with adaptation strategies.

The rest of the paper is organized as follows. We start in Section 2 with a brief review of the affinity propagation clustering algorithm and the distance measure for mixed type datasets. In Section 3, the novel similarity measure for mixed type data is introduced, and then the novel adaptive AP approach is described in detail. Section 4 presents the experimental methodology and results on several benchmark datasets as well as comparisons with the selected baseline algorithms. Discussions and conclusions are given in Section 5.

2. Background

2.1. Description of AP

Exemplar-based clustering, such as the popular k-centers and k-medians clustering methods, partitions the dataset by identifying a subset of representative elements (exemplars), so that the sum of distances between data points and their exemplars is minimized [23]. Traditional clustering analysis methods usually start with an initialization step in which the algorithm selects initial data centers as exemplars and allocates the other data points based on their distances to the exemplars. Obviously, different initial selections lead to different clustering results. On the contrary, AP runs on an entirely different mechanism. Firstly, all data points are considered as potential exemplars and are viewed as nodes in a network. Secondly, a number of real-valued messages are iteratively transmitted along the edges of the network until a relevant set of exemplars and corresponding clusters is identified [20]. Details of the framework are as follows [24].

AP takes as input a matrix of real-valued similarities between data points. Let $s(i,k)$, $i,k \in \{1,2,\dots,N\}$, be a set of real-valued variables, where $s(i,k)$ indicates the similarity between the two objects $i$ and $k$. AP defines $s(i,k)$ as the negative of the square of their Euclidean distance; that is, $s(i,k) = -\lVert x_i - x_k \rVert^2$, $i \neq k$. The self-similarities $s(k,k)$ are referred to as “preferences” and influence the probability of one point being an exemplar. If there is no a priori knowledge, the preferences are set to a common value so that each data point is regarded as a potential exemplar with equal probability.

As mentioned above, in the AP algorithm, data points exchange information by passing messages. Two kinds of messages are produced in the procedure and each takes into account a different kind of competition. One is called the “responsibility” $r(i,k)$, which is sent from data point $i$ to candidate exemplar point $k$. $r(i,k)$ reflects the accumulated evidence for how well suited point $k$ is to serve as the exemplar for point $i$, taking other potential exemplars for point $i$ into consideration. The responsibility is updated as follows:
$$r(i,k) \leftarrow s(i,k) - \max_{k' \neq k}\bigl\{a(i,k') + s(i,k')\bigr\}. \tag{1}$$
The other one is called the “availability” $a(i,k)$, gathering evidence from data points as to whether each candidate exemplar would make a good exemplar. It is sent from the candidate representative point $k$ to point $i$, reflecting the accumulated evidence for how appropriate it would be for point $i$ to choose point $k$ as its exemplar. Besides, the support from other points that point $k$ should be an exemplar is taken into account. The availability is updated as follows:
$$a(i,k) \leftarrow \min\Bigl\{0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0, r(i',k)\}\Bigr\}. \tag{2}$$
The “self-availability” $a(k,k)$ reflects the accumulated evidence that point $k$ is an exemplar, based on the positive responsibilities sent to candidate exemplar $k$ from the other points. $a(k,k)$ is updated differently, as follows:
$$a(k,k) \leftarrow \sum_{i' \neq k} \max\{0, r(i',k)\}. \tag{3}$$
After the iterative message passing, exemplars can be identified by finding, for each point $i$, the value of $k$ that maximizes $a(i,k) + r(i,k)$. If $k = i$, point $i$ is selected as an exemplar; otherwise, point $k$ is the exemplar of point $i$.

Furthermore, in order to avoid numerical oscillations in some circumstances when updating the messages, a damping factor $\lambda$ is introduced into the iteration process:
$$R_t \leftarrow \lambda R_{t-1} + (1-\lambda) R_t, \qquad A_t \leftarrow \lambda A_{t-1} + (1-\lambda) A_t, \tag{4}$$
where $t$ indicates the iteration number and $R$ and $A$ denote the responsibility and availability messages, respectively.
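To make the update rules (1)–(4) concrete, the following is a minimal NumPy sketch of the standard AP message-passing loop. It is only an illustration of the procedure described above, not the authors' C implementation, and the names (`affinity_propagation`, `S`, `R`, `A`, `damping`) are ours.

```python
import numpy as np

def affinity_propagation(S, damping=0.5, max_iter=200):
    """Minimal AP sketch: S is the (N, N) similarity matrix with the
    preferences on its diagonal; returns one exemplar index per point."""
    N = S.shape[0]
    R = np.zeros((N, N))  # responsibilities r(i, k)
    A = np.zeros((N, N))  # availabilities  a(i, k)

    for _ in range(max_iter):
        # Eq. (1): r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(N), idx]
        AS[np.arange(N), idx] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[np.arange(N), idx] = S[np.arange(N), idx] - second
        R = damping * R + (1 - damping) * R_new       # Eq. (4): damped update

        # Eqs. (2)-(3): availabilities from the positive responsibilities
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, np.diag(R))              # keep r(k,k) unclipped
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = np.diag(A_new).copy()                  # a(k,k) = sum_{i' != k} max{0, r(i',k)}
        A_new = np.minimum(A_new, 0)                  # a(i,k) <= 0 for i != k
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new       # Eq. (4): damped update

    # A point's exemplar is the k maximizing a(i,k) + r(i,k); k == i marks an exemplar.
    return np.argmax(A + R, axis=1)
```

Called with the diagonal of `S` set to the median of the off-diagonal similarities, this sketch reproduces the "moderate number of clusters" behaviour suggested by Frey and Dueck.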

The primary advantage of the AP algorithm is that it does not need a preassigned number of clusters, unlike k-means methods, which require the value of k to be specified. This is because AP considers each data point as a potential exemplar, and the probability of being an exemplar depends on the shared value of the preference: with a greater preference value, AP generates more clusters. Another advantage is that AP only accepts a collection of similarities as input, which eliminates the need to deal with the raw dataset directly. This is an instrumental feature with which researchers can construct the similarity input matrix using whatever distance measures are suitable for the objects being clustered. Moreover, the wide-ranging applications of AP [20] demonstrate its ability to process large datasets rapidly and effectively.

However, AP has some limitations as well. The specific value of the preference is a double-edged sword for the clustering procedure. Frey and Dueck [20] suggested setting the shared preference value to the median of the input similarities, resulting in a moderate number of clusters. Without a priori knowledge, it is difficult to determine a suitable preference value, since different values lead to completely different results. In addition, the damping factor $\lambda$ requires an appropriate value. Equation (4) shows that a larger damping factor makes it less likely to fall into oscillations but reduces the convergence rate, while a smaller one results in a fast rate of convergence but with a risk of not having converged when the message-passing procedure is terminated. Frey and Dueck [20] suggested a default damping factor of 0.5 to keep a balance between convergence and oscillation.

2.2. Distance and Significance

Based on Huang’s cost function [17], Ahmad and Dey developed a distance and significance computation framework that not only considers the distances between pairs of distinct values within an attribute, but also takes their cooccurrence with other attributes into account [18]. Two parts are introduced to generate the distance matrix of a mixed dataset.

The first step is to calculate the distance between each pair of values of the categorical attributes. For the given mixed type dataset, let $A_i$ denote a categorical attribute, of which $x$ and $y$ are two of the values. Let $A_j$ denote another categorical attribute and $w$ a subset of the values of $A_j$. Accordingly, $\sim\! w$ denotes the complementary set of $w$. The conditional probability $P_i(w \mid x)$ is the probability that a data point having value $x$ for $A_i$ has a value belonging to $w$ for $A_j$. Likewise, $P_i(\sim\! w \mid y)$ denotes the conditional probability that a data point having value $y$ for $A_i$ has a value belonging to $\sim\! w$ for $A_j$.

The distance between the pair of values $x$ and $y$ of $A_i$ with respect to the attribute $A_j$ and a particular subset $w$ is defined as follows:
$$\delta^{ij}_w(x,y) = P_i(w \mid x) + P_i(\sim\! w \mid y). \tag{5}$$
The distance between attribute values $x$ and $y$ of $A_i$ with respect to attribute $A_j$ is denoted by $\delta^{ij}(x,y)$ and is given by
$$\delta^{ij}(x,y) = P_i(W \mid x) + P_i(\sim\! W \mid y), \tag{6}$$
where $W$ is the subset of values of $A_j$ that maximizes the quantity $P_i(w \mid x) + P_i(\sim\! w \mid y)$. Considering that both $P_i(W \mid x)$ and $P_i(\sim\! W \mid y)$ lie between 0 and 1, so that their maximized sum lies between 1 and 2, in order to restrict the value of $\delta^{ij}(x,y)$ between 0 and 1, it is redefined as
$$\delta^{ij}(x,y) = P_i(W \mid x) + P_i(\sim\! W \mid y) - 1. \tag{7}$$

For a dataset with $m$ attributes, including the categorical attributes and the numeric attributes which have been discretized, the distance between two distinct values $x$ and $y$ of any categorical attribute $A_i$ is given by
$$\delta(x,y) = \frac{1}{m-1}\sum_{\substack{j=1 \\ j \neq i}}^{m} \delta^{ij}(x,y). \tag{8}$$
Using (5) to (8), it is possible to compute the distance between two distinct values of the categorical attributes and of the discretized numeric attributes.

In the second step, the significance of each numeric attribute is determined. The significance of an attribute defines the importance of that attribute in the dataset [25, 26]. To compute the significance of a numeric attribute, it must first be discretized into $S$ intervals, so that each interval can be assigned a categorical value $u_s$, $s = 1, \dots, S$. Therefore, using (5) to (8), in the same way as for categorical values, the distance $\delta(u_p, u_q)$ between every pair of discretized numeric values $u_p$ and $u_q$ can be computed. Eventually, the significance $w_r$ of a numeric attribute $A_r$ is computed as the mean of $\delta(u_p, u_q)$ over all pairs of its discretized values:
$$w_r = \frac{2}{S(S-1)} \sum_{p < q} \delta(u_p, u_q). \tag{9}$$
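As a concrete reading of (5)–(9), the sketch below computes the cooccurrence-based distance between two values of a categorical attribute and the significance of a discretized attribute. The function names and the data layout (a 2-D array of categorical codes) are our assumptions, and edge cases such as empty conditional distributions are ignored.

```python
import numpy as np
from itertools import combinations

def pair_distance(data, i, j, x, y):
    """delta^{ij}(x, y): distance between values x, y of attribute i
    with respect to attribute j (Eqs. (5)-(7))."""
    vals_j = np.unique(data[:, j])
    px = {v: np.mean(data[data[:, i] == x, j] == v) for v in vals_j}  # P_i(v | x)
    py = {v: np.mean(data[data[:, i] == y, j] == v) for v in vals_j}  # P_i(v | y)
    # The maximizing subset W of Eq. (6) keeps every value v with P(v|x) >= P(v|y).
    p_w_x = sum(px[v] for v in vals_j if px[v] >= py[v])
    p_notw_y = sum(py[v] for v in vals_j if px[v] < py[v])
    return p_w_x + p_notw_y - 1.0            # Eq. (7): restricted to [0, 1]

def value_distance(data, i, x, y):
    """delta(x, y): average of delta^{ij} over all other attributes (Eq. (8))."""
    m = data.shape[1]
    return sum(pair_distance(data, i, j, x, y) for j in range(m) if j != i) / (m - 1)

def significance(data, r):
    """Significance of (discretized) attribute r: mean distance over all value pairs (Eq. (9))."""
    values = np.unique(data[:, r])
    return float(np.mean([value_distance(data, r, u, v)
                          for u, v in combinations(values, 2)]))
```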

3. Method

The main idea of our proposed algorithm is to obtain a clustering result for mixed data by combining a similarity measure designed for mixed data with the AP clustering approach. The novel method is described in the following sections.

3.1. Improved Similarity Measure

The main advantage of Ahmad and Dey’s distance measure is that it treats the distance between a pair of values of an attribute as a function of their cooccurrence probabilities with the sets of values of the other categorical attributes. Therefore, the distance reflects the degree of difference between two categorical values rather than merely whether they are the same or different (0 or 1). On the other hand, the significance of an attribute is not user-defined as in other algorithms but is computed based on the discretization of the numeric attributes. In other words, the algorithm, by itself, decides which attribute is more important and assigns a higher weight to it.

However, the distance measure also has some limitations. In the discretization of the numeric attributes, the number of intervals ($S$) is user-defined for each problem and is the same for all numeric attributes, which can cause an inaccurate discretization because the algorithm does not take the different distributions of distinct attributes into consideration. Besides, one has to test the parameter $S$ to find a suitable number of intervals for discretizing the numeric values.

As mentioned in Section 2.1, the AP algorithm separates data objects into suitable clusters without a preassigned number of classes, since each data point is viewed as a potential exemplar. Therefore, we propose an improved similarity measure based on Ahmad and Dey’s work in which the fixed-interval discretization is replaced by an AP clustering based discretization approach. Data objects are allocated to clusters as naturally as they are distributed. Furthermore, the different numbers of intervals emphasize the distinctions between attributes, which influence the significance values.

For a given mixed dataset, let $A_r$, $r = 1, \dots, m_r$, denote a numeric attribute whose values are $x_{1r}, x_{2r}, \dots, x_{Nr}$, where $m_r$ is the number of numeric attributes and $N$ is the number of data points. The similarity between $x_{ir}$ and $x_{jr}$ is defined by
$$s(x_{ir}, x_{jr}) = -(x_{ir} - x_{jr})^2, \tag{10}$$
where $s$ can be viewed as an $N \times N$ matrix and the similarity indicates the negative squared error between points $x_{ir}$ and $x_{jr}$. The novel method for computing the significance of a numeric attribute is listed in Algorithm 1.

Set $r = 1$;
for each numeric attribute $A_r$ in dataset $D$ do
 Compute the similarity matrix based on (10) as the input;
 Calculate the median of the similarities as the shared value of the preference;
 Perform the AP algorithm using (1)–(4) to obtain a classification result;
 Discretize attribute $A_r$ into $S_r$ intervals according to the clustering result;
 $r = r + 1$;
end for
Establish a new dataset $D'$, a pure categorical dataset composed of the discretized numeric
attributes and the original categorical attributes;
for each attribute in dataset $D'$ do
 Calculate the distance between two distinct values of any categorical attribute using (5)–(8);
 Compute the significance (weight) $w_r$ of each numeric attribute using (9), in which the number of intervals $S$ is replaced by $S_r$;
end for
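A minimal sketch of the discretization step of Algorithm 1 is given below, assuming scikit-learn's AffinityPropagation as a stand-in for the message passing of Section 2.1; the function name `ap_discretize` is ours, and the adaptive refinements of Section 3.2 are omitted.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation  # stand-in for the AP core of Section 2.1

def ap_discretize(column):
    """Discretize one numeric attribute by clustering its values with AP.

    column : 1-D array of numeric values of a single attribute.
    Returns one interval label per data point; the number of intervals S_r
    is simply the number of clusters that AP finds for this attribute.
    """
    x = np.asarray(column, dtype=float).reshape(-1, 1)
    S = -(x - x.T) ** 2                                   # Eq. (10): negative squared differences
    pref = np.median(S[~np.eye(len(S), dtype=bool)])      # shared preference = median similarity
    ap = AffinityPropagation(affinity="precomputed", preference=pref,
                             damping=0.5, random_state=0)
    return ap.fit_predict(S)

# Remaining steps of Algorithm 1: replace every numeric column by its AP interval
# labels, keep the original categorical columns, and feed the resulting pure
# categorical dataset to Eqs. (5)-(9) to obtain the significances w_r.
```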

Figure 1 illustrates the performance of three different discretization techniques; the raw data and the equal width, equal frequency, and AP-based discretizations are shown in the figure. It shows that the AP-based method best reflects the distribution regularity of the data points.

Let $S$ denote the similarity matrix of the mixed dataset, and define the similarity between two data objects $d_i$ and $d_j$ as follows:
$$s(d_i, d_j) = -\left(\sum_{r=1}^{m_r}\bigl(w_r\,(x_{ir} - x_{jr})\bigr)^2 + \sum_{c=1}^{m_c}\delta(x_{ic}, x_{jc})^2\right), \tag{11}$$
where $x_{ir} - x_{jr}$ denotes the distance between objects $d_i$ and $d_j$ on the $r$th numeric attribute only, $w_r$ is the significance of the $r$th numeric attribute described in Section 3.1, and $\delta(x_{ic}, x_{jc})$ denotes the distance between data objects $d_i$ and $d_j$ on the $c$th categorical attribute only, computed with (5)–(8). The similarities are set to a negative squared error to match the input expected by the AP algorithm.
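The following sketch shows how (11) could be assembled once the significances $w_r$ and the categorical value distances $\delta$ are available; the input layout and the name `mixed_similarity` are our assumptions.

```python
import numpy as np

def mixed_similarity(num, cat, w, delta):
    """Build the AP input matrix for a mixed dataset (Eq. (11)).

    num   : (N, m_r) numeric attribute values
    cat   : (N, m_c) categorical attribute values, encoded as small integers
    w     : (m_r,) significance of each numeric attribute
    delta : list of m_c lookup tables, delta[c][x][y] = distance between values
            x and y of categorical attribute c, computed with Eqs. (5)-(8)
    """
    N = num.shape[0]
    S = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            d_num = np.sum((w * (num[i] - num[j])) ** 2)
            d_cat = sum(delta[c][cat[i, c]][cat[j, c]] ** 2 for c in range(cat.shape[1]))
            S[i, j] = -(d_num + d_cat)       # negative squared error, as AP expects
    return S
```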

3.2. Adaptive AP Algorithm

In Section 2.1, we discussed the advantages and limitations of the AP algorithm. The shared value of the preference $p$ is the key quantity that determines the clustering performance as well as the number of classes in the result. In some cases, the desired number of clusters is preassigned, yet it is hard to define an appropriate value of $p$, because there is not a one-to-one correspondence between the output number of classes and the value of $p$: a certain range of $p$ values will arrive at the same number of clusters, although with different distributions. To search for the optimal $p$ value for a given number of classes $K_o$, an adaptive strategy is proposed as follows:
$$p_{i+1} = p_i + \bigl(K_s + \beta\bigr)\, p_{\text{step}}, \tag{12}$$
where $i$ denotes the running count of the AP algorithm, $K_s$ is a function of the number of clusters $K_i$ in the $i$th running result, and $p_{\text{step}}$ denotes the searching step. Both $p_i$ and $p_{\text{step}}$ are negative values.

We name $K_s$ the coarse tuning coefficient and $\beta$ the fine tuning coefficient. When the value of $K_i$ is much greater than the target value $K_o$, a relatively greater value of $K_s$ should be employed, so that the value of $p$ decreases quickly. On the contrary, when $K_i$ is close to the target $K_o$, a smaller value of $K_s$ should be defined. The coarse tuning coefficient is therefore set as a function of the gap between $K_i$ and $K_o$, so that the algorithm is able to tune the value of $p$ dynamically according to the current cluster number $K_i$.

Since the coarse tuning strategy brings the algorithm to the right number of clusters, the subsequent fine tuning steps lead to better clustering performance. In the iteration stage in which $K_i \neq K_o$, $\beta$ is set to 0. Meanwhile, when entering the stage in which $K_i = K_o$, $\beta$ is assigned to 1. The size of the fine tuning step is important for scanning the local area so as to maximize the energy function; referring to the settings in [27], it is defined relative to $p_0$, where $p_0$ denotes the initial value of the preference $p$. The scanning stage may be terminated after the energy function decreases or after a fixed number of fine tuning iterations.
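The following is only a schematic sketch of the preference search described above, assuming scikit-learn's AffinityPropagation as the clustering core and a step size proportional to the initial preference; it does not reproduce the exact coefficients of (12) or the fine tuning scan.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation  # stand-in for the AP core of Section 2.1

def tune_preference(S, K_target, p_step=None, max_runs=50):
    """Schematic coarse tuning of the shared preference p (cf. Eq. (12)).

    S is the precomputed similarity matrix from Eq. (11); p_step is the
    (negative) searching step. The step schedule below is illustrative only.
    """
    p = np.median(S[~np.eye(len(S), dtype=bool)])     # initial preference p_0
    if p_step is None:
        p_step = 0.1 * p                              # assumed step, proportional to p_0 (negative)
    for _ in range(max_runs):
        labels = AffinityPropagation(affinity="precomputed", preference=p,
                                     random_state=0).fit_predict(S)
        K = len(np.unique(labels))
        if K == K_target:
            break                                     # the paper then fine-tunes p locally
        p += p_step * (K - K_target) / K_target       # move p towards the target cluster number
    return p
```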

On the other hand, the damping factor $\lambda$ is another parameter that controls the convergence and the speed of the algorithm. Our intention is that, provided there is no oscillation, the algorithm should acquire a faster convergence speed. An adaptive mechanism for $\lambda$ is adopted to balance the contradiction between oscillation and convergence.

Although maintaining a larger $\lambda$ close to 1 avoids numerical oscillations more easily, a corresponding decline of the updating rate for the availability and responsibility messages becomes inevitable. The algorithm then needs more iterations than with a smaller $\lambda$ to obtain a comparable result. Therefore, a $\lambda$ that changes along with the iterations of the algorithm is a better choice. According to this conception, we have designed an adaptive mechanism for $\lambda$ as follows:
$$\lambda = \lambda_{\max} - \left(\frac{t}{\text{iter}_{\max}}\right)^{n}\bigl(\lambda_{\max} - \lambda_{\min}\bigr), \tag{13}$$
where $t$ denotes the current number of iterations and $\text{iter}_{\max}$ denotes the maximum number of iterations. $\lambda_{\max}$ and $\lambda_{\min}$ denote the maximum and minimum values of $\lambda$, respectively. We introduce the coefficient $n$ to adjust the rate of descent of $\lambda$: when the value of $n$ is greater than 1, $\lambda$ declines slowly at first and then more sharply. We recommend setting $n$ greater than 1 to guarantee a smooth iteration process.
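The descending schedule (13) fits in a couple of lines; the values of $\lambda_{\max}$, $\lambda_{\min}$, and $n$ below are illustrative defaults of ours, not the paper's settings.

```python
def damping_schedule(t, max_iter, lam_max=0.9, lam_min=0.5, n=2.0):
    """Damping factor at iteration t: starts near lam_max, decays to lam_min (Eq. (13)).

    With n > 1 the decline is flat at first and sharper towards the end,
    which keeps early iterations stable and speeds up later ones.
    """
    return lam_max - (t / max_iter) ** n * (lam_max - lam_min)
```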

3.3. The Proposed Algorithm

Based on the above explanations, the pseudocode of the proposed algorithm is listed in Algorithm 2.

Calculate the similarity matrix and the preferences as the input of the AP algorithm:
for each numeric attribute $A_r$ in dataset $D$ do
 Calculate the significance $w_r$ of attribute $A_r$ using the method in Section 3.1;
end for
for each categorical attribute $A_c$ in dataset $D$ do
 Calculate the distance between any pair of values of $A_c$ based on (5)–(8);
end for
Generate the input similarity matrix of the mixed dataset using (11);
Set the preference value $p$ to the median of the similarities;
Input the target number of clusters as $K_o$;
while the termination conditions are not met do
for each run $i$ of the AP algorithm do
  if $K_i \neq K_o$ then
   Update the preference $p$ by the coarse tuning form of (12), in which $\beta = 0$;
  else if $K_i = K_o$ then
   Update the preference $p$ by the fine tuning form of (12), in which $\beta = 1$;
  end if
  Update the damping factor $\lambda$ by the adaptive strategy (13);
end for
end while

4. Experimental Evaluation

In this section, experimental results are presented for our proposed approach and other popular clustering methods on several standard datasets. To validate the effectiveness of the proposed algorithm, we have chosen four datasets of different kinds: pure numeric (Iris), pure categorical (Zoo), and mixed type (Heart Disease and Credit Approval), all taken from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.html).

Four well-performing clustering methods, including k-prototype [17], SBAC [15], Ahmad and Dey’s algorithm [18], and fuzzy k-prototype [19], are employed for the comparison. All clustering results have been obtained with random initialization where needed. The experiments were carried out on a workstation with a 3.4 GHz Intel i7-2600 CPU. All the programs were written in C, compiled with GCC, and run under Fedora Linux.

To evaluate the performance of a clustering method, the microprecision [28] is introduced. Let $a_i$ denote the number of data points that are correctly assigned to class $C_i$, let $b_i$ denote the number of data objects that are incorrectly assigned to class $C_i$, and let $c_i$ denote the number of data objects that are incorrectly rejected from class $C_i$. The precision of the $i$th class is defined as
$$p_i = \frac{a_i}{a_i + b_i},$$
and the recall is defined as
$$r_i = \frac{a_i}{a_i + c_i}.$$
The microprecision (micro-p) is then defined as
$$\text{micro-}p = \frac{1}{N}\sum_{i=1}^{K} a_i,$$
where $N$ is the number of data objects in the dataset and $K$ denotes the number of classes for a given clustering. According to this evaluation, a higher value of micro-p indicates a better clustering result.
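A small sketch of the micro-p computation is given below; mapping each cluster to the ground-truth class it overlaps most is our assumption, since the paper does not spell out the matching step.

```python
import numpy as np
from collections import Counter

def micro_precision(true_labels, cluster_labels):
    """micro-p = (number of correctly assigned objects) / N.

    Each cluster is mapped to the ground-truth class it overlaps most
    (a common convention); the size of that overlap is a_i for the cluster.
    """
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    correct = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        correct += Counter(members).most_common(1)[0][1]   # a_i for this cluster
    return correct / len(true_labels)
```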

4.1. Iris

The Iris dataset consists of 150 instances of 3 classes: Iris Setosa, Iris Versicolour, and Iris Virginica. All 4 attributes in this dataset are purely numeric. Because there are no categorical attributes in the dataset, we only need to discretize the numeric values and calculate the significance of each attribute. A comparison of the significances computed by [18] and by our proposed algorithm is given in Table 1. From this table, we can see that the two significance calculation algorithms assign different weight values to the attributes, and the gap between the largest and smallest weights is considerably wider for the proposed algorithm than for the algorithm of [18]. This more pronounced difference of weights helps in the discrimination of different attributes.

A comparison of clustering results is presented in Table 2. Ahmad and Dey’s algorithm, k-means, the original AP, and our proposed algorithm give microprecisions of 0.947, 0.88, 0.9, and 0.947, respectively. In this case, we can see that the proposed algorithm obtains a better result than the original AP algorithm due to the introduction of the significance value of each attribute and the improvements made to the original AP algorithm. Because of the simplicity of the Iris dataset, the proposed method does not show a significant advantage over Ahmad and Dey’s algorithm.

4.2. Zoo

The Zoo dataset contains 101 instances distributed over 7 classes. Since all 16 attributes are of categorical type, only those algorithms that can deal with categorical or mixed type data are applied in this section. For the proposed algorithm, only the categorical part of the similarity measure (11) remains. The fuzzy k-prototype algorithm uses the same similarity measure as Ahmad and Dey’s algorithm. Table 3 shows the microprecision values of each clustering algorithm on the Zoo dataset. K-prototype, SBAC, fuzzy k-prototype, and our proposed algorithm give microprecisions of 0.802, 0.426, 0.908, and 0.921, respectively. From the table, we can see that the proposed algorithm allocated 93 of the 101 instances to their desired clusters, while the other three algorithms give 81, 43, and 91, respectively.

4.3. Heart Diseases

The Heart Disease database contains 76 attributes, 14 of which are used in the experiments. The 14th attribute is the class index. This dataset is a mixed one with 8 categorical and 5 numeric features. The 303 instances belong to 5 classes, that is, one class for normal (164 instances) and four classes for sick (139 instances). Table 4 shows the results of the 4 clustering methods. K-prototype, SBAC, fuzzy k-prototype, and our proposed algorithm give microprecisions of 0.545, 0.545, 0.717, and 0.886, respectively. From the table, we can see that the proposed algorithm allocated 263 of the 303 instances to their desired clusters, while the other three algorithms give 162, 162, and 213, respectively. The proposed algorithm performs better than the other three algorithms.

4.4. Credit Approval

The Credit Approval dataset, from a credit card organization, is also a mixed dataset containing 9 categorical and 6 numeric attributes. Its 690 instances are divided into 2 classes: positive (307) and negative (383). Table 5 presents the results of the 5 clustering algorithms. K-prototype, SBAC, Ahmad and Dey’s algorithm, fuzzy k-prototype, and our proposed algorithm give microprecision values of 0.562, 0.555, 0.883, 0.838, and 0.920, respectively. Ahmad and Dey’s algorithm and fuzzy k-prototype use the same similarity measure, so they obtain close values in the result. The proposed algorithm clustered 635 of the 690 instances into their desired clusters, while the other four algorithms give 388, 383, 609, and 578, respectively. Our approach achieves the best result in the comparison.

5. Conclusion

Extracting knowledge and information from mixed data meets urgent needs of real world applications. Affinity propagation is a novel unsupervised clustering method presented in recent years. In this paper, we proposed a new approach for clustering mixed numeric and categorical data based on the AP method. We have made contributions in three aspects. Firstly, we extended the AP method to deal with mixed type datasets, removing its numeric data limitation, and the results have shown the feasibility of this extension. Secondly, an improved mixed similarity measure was proposed to compute distances between pairs of values of a categorical attribute and to obtain the weight coefficients of the numeric attributes. Finally, we improved the original AP by employing adaptation strategies.

Our approach works well not only for mixed data clustering but also for clustering pure numeric or categorical data, as demonstrated in the experiments by comparison with other clustering algorithms. The experimental results illustrate the efficiency of the proposed method on several real life mixed type datasets. However, like many other algorithms with parameter tuning problems, we introduce several user-defined parameters, and it is not always clear which values are best for these parameters. Our future work will focus on the further improvement of the AP algorithm and its applications in various fields.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors are grateful to the support of the National Natural Science Foundation of China (Grant nos. 61174040, 61104178, and 61205017), Shanghai Commission of Science and Technology (Grant no. 12JC1403400), and the Fundamental Research Funds for the Central Universities.