Research Article | Open Access
Fast Density Clustering Algorithm for Numerical Data and Categorical Data
Data objects with mixed numerical and categorical attributes are common in real-world applications. Most existing algorithms have limitations such as low clustering quality, difficulty in determining cluster centers, and sensitivity to initial parameters. A fast density clustering algorithm (FDCA) is put forward based on a one-time scan, with cluster centers automatically determined by a center set algorithm (CSA). A novel data similarity metric is designed for clustering data that include both numerical and categorical attributes. CSA is designed to choose cluster centers from the data objects automatically, which overcomes the difficulty of setting cluster centers in most clustering algorithms. The performance of the proposed method is verified through a series of experiments on ten mixed data sets in comparison with several other clustering algorithms in terms of clustering purity, efficiency, and time complexity.
As one of the most important techniques in data mining, clustering partitions a set of unlabeled objects into clusters, where objects that fall into the same cluster are more similar to each other than to objects in other clusters. Clustering algorithms have been developed and applied to various fields including text analysis, customer segmentation, and image recognition. They are also useful in our daily life, since massive data with mixed attributes are now emerging. Typically, these data contain both numeric and categorical attributes [2, 3]. For example, the analysis of an applicant for a credit card would involve data of age (integers), income (float), marital status (categorical), and so forth, forming a typical example of data with mixed attributes.
Up to now, most research on data clustering has focused on either numeric or categorical data rather than on both types of attributes. $k$-means, BIRCH, DBSCAN, $k$-modes, fuzzy $k$-modes, BFCM, COOLCAT, TCGA, AS′ fuzzy $k$-modes, and a $k$-means based method are classic clustering algorithms. The $k$-means clustering algorithm is a partition-based method in which the cluster centers need to be initialized by users or from experience. The number of initialized cluster centers can determine the clustering purity and efficiency. BIRCH is short for balanced iterative reducing and clustering using hierarchies. Clustering features and clustering feature trees are adopted to describe clusters. BIRCH is implemented in two stages: database scanning to build a clustering feature tree, and global clustering to improve purity and efficiency. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a classic density-based clustering algorithm that is capable of dealing with noisy data. Compared with $k$-means, DBSCAN does not need the number of clusters to be set in advance. However, DBSCAN depends on two sensitive parameters, eps and minPts. Various revised versions of DBSCAN have been proposed to improve its performance, but parameter sensitivity remains a challenge for its further application. $k$-modes is an extension of $k$-means that introduces the capability of clustering categorical attributes. Fuzzy $k$-modes is a modified $k$-modes clustering algorithm with a fuzzy mechanism that improves its robustness for various types of data sets. BFCM is short for bias-correction fuzzy clustering algorithm, which is an extension of hard clustering based on fuzzy membership partitions. COOLCAT is an entropy-based algorithm for categorical clustering, which introduced the idea of clustering on the basis of entropy: data clusters are generated according to their entropy values.
TCGA is a two-stage genetic algorithm for automatic clustering. This bioinspired clustering algorithm formulates the clustering process as an optimization problem, and a genetic algorithm is adopted to converge to the global optimum. The above-mentioned methods face difficulties when dealing with data with mixed attributes, although such data are emerging very quickly [14–23]. A fast density clustering algorithm has been put forward to solve the cluster center determination problem. However, its mixed similarity calculation method is based on the relationships among all attributes, which has high computational complexity, and its cluster center determination method depends mainly on a parameter that is difficult to set in advance.
For example, distance measure functions for numerical values cannot capture the similarity among data with mixed attributes. Moreover, the representation of a cluster with numerical values is often defined as the mean value of all data objects in the cluster, which, however, is illogical for other attributes. Algorithms have been proposed [14, 15, 17, 21, 22] to cluster hybrid data, most of which are based on partition. First, a set of disjoint clusters is obtained and then refined to minimize a predefined criterion function. The objective is to maximize the intracluster connectivity or compactness while minimizing the intercluster connectivity. However, most partition clustering algorithms are sensitive to the initial cluster centers, which are difficult to determine. They are also suited only to spherically distributed data and have no capacity for handling outliers.
The main contributions of our work include four aspects. A novel mixed data similarity metric is proposed for mixed data clustering. A clustering center self-set algorithm (CSA) is applied to determine the centers automatically. The bisection method is adopted to calculate the clustering parameter in order to overcome the parameter sensitivity problem. A fast one-time scan density clustering algorithm (FDCA) is proposed to implement fast and efficient clustering for mixed data.
The rest of this paper is organized as follows. Section 2 introduces related works on mixed data clustering. In Section 3, the similarity metric for data with mixed attributes and the working of FDCA are presented. In Section 4, extensive simulations are carried out to verify the performance of FDCA compared with other classic algorithms. Section 5 presents a practical application of FDCA to handwritten number image recognition. Finally, Section 6 concludes the paper.
2. Related Works
2.1. Mixed Data Clustering Algorithms Overview
As stated above, mixed data clustering algorithms are designed for data sets with mixed attributes, including numerical and categorical attributes. Numerical attributes of mixed data are evaluated by real values, while categorical attributes of mixed data represent the fact that those attributes are ordinal. It is still a challenge to cluster data with both numerical and categorical attributes. Many clustering algorithms have been put forward to deal with mixed data. Huang proposed the $k$-prototypes algorithm, which combines the $k$-means and $k$-modes algorithms and is especially designed for dealing with mixed data. It is a very early mixed data clustering algorithm. When the data set is uncertain, most clustering algorithms cannot achieve the expected purity and efficiency. The KL-FCM-GM algorithm, proposed by Chatzis, is an extension of $k$-prototypes. It is a fuzzy $c$-means-type algorithm for clustering data with mixed numeric and categorical attributes that employs a probabilistic dissimilarity functional. It is designed for Gauss-multinomial distributed data. When the data set is large, processing the data similarity metric costs much more time than expected, so the algorithm is not well suited to big data objects. Zheng et al. developed a new algorithm called EKP, which is an improved $k$-prototypes algorithm that overcomes some of its flaws. The EKP algorithm gains global search capability by introducing an evolutionary algorithm. Later, Li and Biswas proposed the Similarity-Based Agglomerative Clustering (SBAC) algorithm, which adopts the similarity measure defined by Goodall to evaluate the similarity. It is an unsupervised analysis method for identifying critical samples in large populations, so the efficiency of the similarity metric is not stable. Hsu and Chen proposed a clustering algorithm based on the variance and entropy (CAVE) for clustering mixed data. However, the CAVE algorithm needs to build a distance hierarchy for every categorical attribute, and determining the distance hierarchy requires domain expertise.
Besides the above-mentioned unsupervised similarity metrics for clustering, further research on mixed data similarity calculation methods has been carried out. Ahmad and Dey proposed a $k$-means type algorithm to deal with mixed data, in which the co-occurrence of categorical attribute values is used to evaluate the significance of each attribute. For mixed data attributes, Ji et al. proposed the IWKM algorithm, in which the distribution centroid is applied to represent the prototypes of clusters and the significance of different attributes is taken into account in the clustering process. Ji et al. also proposed WFK-prototypes, which introduces the fuzzy centroid to represent the cluster prototypes. In the WFK-prototypes algorithm, the significance concept proposed by Ahmad and Dey is adopted to extend the $k$-prototypes algorithm. WFK-prototypes remains a classic mixed data clustering algorithm. David and Averbuch proposed a categorical spectral clustering algorithm for numerical and nominal data, called SpectralCAT. Cheung and Jia proposed a mixed data clustering algorithm based on a unified similarity metric that does not require the number of clusters to be known. Embedded competition and penalization mechanisms are used to determine the number of clusters automatically by gradually eliminating the redundant clusters.
In a word, there are a lot of mixed data similarity metrics and clustering algorithms designed for different applications. We still want to develop a universal numerical and categorical data similarity metric and clustering algorithm that could be applied to most cases and practical data sets.
2.2. Fast Data Clustering Algorithm
Rodriguez and Laio published their paper “Clustering by Fast Search and Find of Density Peaks” in Science in June 2014. In their algorithm, clustering centers can be observed from the density-distance relationship graph. We summarize their method as follows: the cluster centers are surrounded by neighbors with lower density and they are at a relatively large distance from any points with a higher density. Noise points have comparatively larger distance and smaller density.
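The density of data point $i$ is defined as follows:
$$\rho_i = \sum_{j} \chi(d_{ij} - d_c), \tag{1}$$
$$\chi(x) = \begin{cases} 1, & x < 0, \\ 0, & \text{otherwise}, \end{cases} \tag{2}$$
where $\rho_i$ denotes the density of data point $i$, $d_{ij}$ represents the distance between data points $i$ and $j$, and $d_c$ is the threshold distance of each cluster, defined in advance. According to (2), if the distance between data points $i$ and $j$ is less than $d_c$, then data point $j$ contributes 1 to the density of data point $i$. In other words, $\rho_i$ is equal to the number of points that are closer than $d_c$ to point $i$.
$\delta_i$ is measured by computing the minimum distance between point $i$ and any other point with higher density:
$$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij}. \tag{3}$$
For the point with the highest density, we conventionally take $\delta_i = \max_j d_{ij}$. Note that $\delta_i$ is much larger than the typical nearest neighbor distance only for points that are local or global maxima in the density. Thus, cluster centers are recognized as points for which the value of $\delta_i$ is anomalously large.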
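The two quantities above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `density_and_delta` and the use of a precomputed symmetric distance matrix are choices made here, not part of the original algorithm description.

```python
import numpy as np


def density_and_delta(dist, d_c):
    """Compute rho and delta from a symmetric pairwise distance matrix.

    dist : (n, n) array with zeros on the diagonal (assumed input format).
    d_c  : threshold distance, assumed to be positive.
    """
    n = dist.shape[0]
    # rho_i counts the other points closer than d_c; subtract 1 for the point itself.
    rho = (dist < d_c).sum(axis=1) - 1
    delta = np.empty(n)
    for i in range(n):
        higher = np.flatnonzero(rho > rho[i])
        if higher.size:
            # Minimum distance to any point of strictly higher density.
            delta[i] = dist[i, higher].min()
        else:
            # No denser point exists: take the maximum distance by convention.
            delta[i] = dist[i].max()
    return rho, delta
```

For four points on a line at 0, 1, 2, and 10 with a threshold of 1.5, the point at 1 has the highest density and the largest distance value, while the isolated point at 10 has zero density and a large distance value, which matches the qualitative picture of centers and noise points.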
This observation, which is the core of the algorithm, is illustrated by the simple example in Figure 1(a). The density $\rho$ and distance $\delta$ of every point are then computed, and the distribution of $\rho$ and $\delta$ is shown in Figure 1(b).
There is a mapping between point distribution and ρ and δ distribution. For example, there are three red points A1, A2, and A3 in Figure 1(a) and they are cluster centers in original point distribution; the corresponding points A1, A2, and A3 in Figure 1(b) have larger distance and larger density than other points. In addition, there are three black points B1, B2, and B3 in Figure 1(a) and they are isolated and called the noise points. The corresponding points B1, B2, and B3 in Figure 1(b) have larger distance and smaller density than other points. Other points belong to one cluster and are called border points.
For all the data objects, we sort the density in descending order, as shown in Figure 2.
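For any data point $i$, the following qualitative relationships hold:
(1) If both its density $\rho_i$ and its distance $\delta_i$ are relatively large, the data point is a cluster center.
(2) If its density $\rho_i$ is relatively small and its distance $\delta_i$ is relatively large, the data point is a noise point.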
If the data point does not meet situations 1 and 2, then the data point is a border point. Because cluster center has relatively larger density and larger distance compared to other centers, while noise data only has relatively larger distance from cluster centers and much less density, both cluster centers number and noise amount are relatively small compared with other data objects. The average density value and distance value are mainly dependent on majority of data objects besides centers and noise. So the specific value of and for different data set could be self-determined during the finding cluster center process. For instance, if the data size is 1000, the cluster center is selected from . For one data object , if its density is , then we check if its is or a little bit less than . If so, then data object is one of the cluster centers. And if its density is more like while its distance is more like , then data object is noise data. By checking those data objects according to CSA in Section 3.2.2, we could get all those cluster centers one by one.
In summary, the only points of high $\delta$ and relatively high $\rho$ are the cluster centers. The points that have relatively high $\delta$ and low $\rho$ are isolated; they can be considered as noise points.
3. Fast Density Clustering Algorithm for Numerical Data and Categorical Data
3.1. Numerical Data and Categorical Data Unified Similarity Metric
3.1.1. Main Idea
Similarity metric is important for a meaningful cluster analysis. Table 1 lists typical similarity metrics for current clustering algorithms.
As shown in Table 1, six classic similarity metrics are listed and compared. Each algorithm develops its own distance measure, which may include a part for numerical attributes and a part for categorical attributes. For instance, the $k$-means algorithm is suitable only for data with numerical attributes, so it has no definition for measuring the categorical part of a data set. The $k$-modes algorithm is designed for data with categorical attributes and has no similarity metric for numerical attributes. The other four similarity metrics are applied to mixed data, so all of them have distance metrics for both the numerical and the categorical parts.
The Euclidean distance is adopted by the $k$-means algorithm to deal with pure numerical data. The simple matching distance is adopted by the $k$-modes algorithm to deal with pure categorical data. The $k$-prototypes algorithm integrates $k$-means and $k$-modes to deal with mixed data. The EKP and WFK-prototypes algorithms improve the $k$-prototypes algorithm by introducing a fuzzy factor or weight coefficient into the original distance measure, so that the similarity between objects can be measured more accurately. The FPC-MDACC algorithm adopts three different distance measures for mixed data depending on their type, which requires prior work to determine the type of the current mixed data and therefore incurs extra time cost and algorithm complexity.
Until now, we still need an efficient similarity metric for calculating distance of data objects of mixed data. We believe that one unified similarity metric for both numerical and categorical data is more efficient and reasonable for mixed data instead of independent calculation for each of the other attributes.
3.1.2. Unified Similarity Metric for Numerical and Categorical Data
A unified similarity metric is presented in this section for mixed data, which is applicable for any type of mixed data which has numerical attributes or categorical attributes or both.
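Definition 1. Given the data set $X = \{x_1, x_2, \ldots, x_n\}$, each data object $x_i$ has $m$ dimensions. The distance between two data objects $x_i$ and $x_j$ is defined as
$$d(x_i, x_j) = \frac{\sum_{p=1}^{m} \omega_{ij}^{(p)} d_{ij}^{(p)}}{\sum_{p=1}^{m} \omega_{ij}^{(p)}}, \tag{4}$$
where $\omega_{ij}^{(p)}$ denotes the weight of the $p$th attribute and $m$ is the number of attributes. If the value of the $p$th attribute is missing, then $\omega_{ij}^{(p)} = 0$; else $\omega_{ij}^{(p)} = 1$. $d_{ij}^{(p)}$ denotes the distance on the $p$th attribute between data objects $x_i$ and $x_j$.
If the $p$th attribute is numerical, then $d_{ij}^{(p)}$ is defined as follows:
$$d_{ij}^{(p)} = \frac{|x_{ip} - x_{jp}|}{\max_h x_{hp} - \min_h x_{hp}}, \tag{5}$$
where $h$ goes through every data object, so that the denominator is the range of the $p$th attribute.
Since a numerical attribute can take quite different values for different data objects, its contribution to the final distance has to be balanced in case the values are very large or very small. Numerical attributes are therefore normalized into $[0, 1]$.
If the $p$th attribute is categorical or binary, then $d_{ij}^{(p)}$ is defined as follows:
$$d_{ij}^{(p)} = \begin{cases} 0, & x_{ip} = x_{jp}, \\ 1, & x_{ip} \neq x_{jp}, \end{cases} \tag{6}$$
where $x_{ip}$ and $x_{jp}$ are the values of the $p$th attribute of data objects $x_i$ and $x_j$.
The categorical distance evaluates whether the data objects $x_i$ and $x_j$ are the same or not on this attribute. If they have the same attribute value, then the distance equals 0; otherwise the distance is 1.
If the $p$th attribute is ordinal, then $d_{ij}^{(p)}$ is defined as follows:
$$d_{ij}^{(p)} = \frac{|z_{ip} - z_{jp}|}{\max_h z_{hp} - \min_h z_{hp}}, \tag{7}$$
where $h$ goes through every data object. $z_{ip}$ is defined as follows:
$$z_{ip} = \frac{r_{ip} - 1}{M_p - 1}, \tag{8}$$
where $r_{ip}$ denotes the order (rank) of $x_{ip}$ and $M_p$ is the total number of values that the $p$th attribute has among all data objects.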
In this paper, ordinal attributes are defined differently from categorical attributes. Ordinal attributes are ordered by their values from large to small. For instance, suppose that three data objects take three different values on the $p$th attribute. If the $p$th attribute were categorical, the distance between the first and the second object would equal 1, and the distance between the first and the third object would equal 1 as well. However, when the $p$th attribute is ordinal, these two distances should be distinguished. We calculate their $p$th attribute distance according to (7) and (8).
In this way, similarity for all the data objects could be calculated based on (5) to (8). In order to demonstrate how these three types of attribute are defined and measured according to the above proposed methods, we take data set Heart from UCI as an example.
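As an illustration, the unified metric can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the `schema` description of each attribute, the use of `None` for missing values, and the assumption that ordinal values are already encoded as ranks 1 to `levels` (with both extreme ranks present, so the rank range normalizes to 1) are all assumptions made here.

```python
def attribute_distance(a, b, spec):
    """Distance in [0, 1] between two values of one attribute."""
    kind = spec["kind"]
    if kind == "numerical":
        # Absolute difference scaled by the attribute range over the data set.
        span = spec["max"] - spec["min"]
        return abs(a - b) / span if span > 0 else 0.0
    if kind == "categorical":
        # Simple matching: 0 if equal, 1 otherwise.
        return 0.0 if a == b else 1.0
    if kind == "ordinal":
        # Ranks r in 1..levels map to z = (r - 1) / (levels - 1),
        # so |z_a - z_b| = |a - b| / (levels - 1).
        return abs(a - b) / (spec["levels"] - 1)
    raise ValueError(f"unknown attribute kind: {kind}")


def mixed_distance(x, y, schema):
    """Weighted average of attribute distances; missing values get weight 0."""
    total, weight = 0.0, 0.0
    for a, b, spec in zip(x, y, schema):
        if a is None or b is None:
            continue  # weight 0 for a missing attribute value
        total += attribute_distance(a, b, spec)
        weight += 1.0
    return total / weight if weight else 0.0
```

With a three-level ordinal attribute, ranks 1 and 2 are at distance 0.5 while ranks 1 and 3 are at distance 1, which is the distinction that a purely categorical treatment cannot make.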
3.1.3. Illustration for Unified Similarity Metric
Having put forward the unified similarity metric in Section 3.1.2, we take the data set Heart from UCI as an example to demonstrate how it works.
Data in Heart has 13 attributes, including the following:
(1) Age
(2) Sex
(3) Chest pain type (4 values)
(4) Resting blood pressure
(5) Serum cholesterol in mg/dL
(6) Fasting blood sugar > 120 mg/dL
(7) Resting electrocardiographic results (values 0, 1, and 2)
(8) Maximum heart rate achieved
(9) Exercise induced angina
(10) Oldpeak = ST depression induced by exercise relative to rest
(11) The slope of the peak exercise ST segment
(12) Number of major vessels (0–3) colored by fluoroscopy
(13) Thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
According to their practical meanings, five attributes are defined as numerical attributes, two attributes are defined as ordinal attributes, and the remaining six attributes are defined as categorical attributes. Based on (1) to (8), the data similarity of Heart can be measured according to the attribute types. For instance, three data samples (data sample 1, data sample 2, and data sample 3) are listed in Table 2 to illustrate how the proposed similarity metric works.
According to the unified similarity metric, the distance of each data sample can be measured as in Table 3.
From Table 3, we can conclude that data sample 1 and data sample 3 are more likely to be clustered into one cluster because their distance is smaller, while data sample 2 is less similar to data sample 1 and data sample 3. The Heart data set from UCI provides the original label information of the data samples: data sample 1 and data sample 3 are labeled as the same class, while data sample 2 belongs to another class. In this illustration, the similarity results obtained with the unified metric are therefore consistent with the original labels.
3.2. Fast Density Clustering Algorithm (FDCA) for Mixed Data
3.2.1. Main Idea
Based on the analysis of Figure 3, the only points with relatively large $\rho$ and large $\delta$ are the cluster centers. The points that have relatively large $\delta$ and small $\rho$ can be considered as noise points because they are isolated. In order to realize self-determination of cluster centers, more information from all data objects in the descending order of $\rho$ and $\delta$ is explored.
First of all, all data objects are sorted in descending order of their ρ and δ values, respectively. A fast center set algorithm (CSA) is then adopted to choose the clustering centers automatically. After the cluster centers have been found, each remaining point is assigned to the same cluster as its nearest neighbor of higher density. The cluster assignment is executed through a one-time scan. Unlike partition-based clustering algorithms, FDCA can deal with clusters of arbitrary shape. As shown in Figure 3, the number indicates the level of density: the bigger the number, the larger the density. Data object “3” is a cluster center and its cluster label is CENTER-1. The cluster label of data object “4” should be the same as that of its nearest neighbor of higher density, so it takes the cluster label of data object “5,” which is CENTER-1.
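The one-time scan assignment can be sketched as follows. This is a hedged sketch rather than the authors' code: the function name `assign_clusters`, the integer labels, and the tie-handling rule (points are visited in descending density order, and each point looks only at points visited before it) are assumptions made here.

```python
import numpy as np


def assign_clusters(dist, rho, centers):
    """Assign every point to the cluster of its nearest denser neighbor.

    dist    : (n, n) symmetric distance matrix.
    rho     : (n,) densities.
    centers : list of indices chosen as cluster centers.
    """
    n = len(rho)
    labels = np.full(n, -1)
    for k, c in enumerate(centers):
        labels[c] = k  # each center starts its own cluster
    # Stable sort by descending density so that ties keep index order.
    order = np.argsort(-np.asarray(rho), kind="stable")
    for pos, i in enumerate(order):
        if labels[i] != -1:
            continue  # centers are already labeled
        earlier = order[:pos]  # points of higher (or tied) density, all labeled
        if earlier.size:
            nearest = earlier[np.argmin(dist[i, earlier])]
        else:
            # Densest point is not a center: fall back to the nearest center.
            nearest = centers[np.argmin(dist[i, centers])]
        labels[i] = labels[nearest]
    return labels
```

Because points are processed in descending density order, each point's nearest denser neighbor has already received a label, so a single pass over the sorted points is enough.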
For noise points, FDCA does not introduce a noise-signal cutoff. Instead, we first find for each cluster a border region, defined as the set of points assigned to that cluster but lying within a distance $d_c$ of data points belonging to other clusters. We then find, for each cluster, the point with the highest density within its border region. Its density is denoted by $\rho_b$, and we keep only the points of the cluster whose density is larger than or equal to $\rho_b$.
The main idea of how the CSA algorithm is applied in FDCA is shown as a flowchart in Figure 4.
3.2.2. Clustering Center Set Algorithm (CSA)
The CSA algorithm is proposed to find the clustering centers automatically, based on the descending order of ρ and δ over all data objects. The process of the CSA algorithm is shown in Algorithm 1.
3.2.3. Parameter Optimization
The CSA algorithm is sensitive only to the choice of the threshold distance $d_c$; a proper selection of $d_c$ helps CSA to find the correct clustering centers, which in turn makes FDCA highly efficient. This section focuses on how to obtain a proper value of $d_c$.
In Rodriguez and Laio's algorithm, as a rule of thumb, a proper value for $d_c$ is in the range of 1% to 2% of the number of data objects in the data set. For example, if the total number of data objects is 1000, this range corresponds to 10 to 20 data objects. Since FDCA aims to cluster mixed data, the target data sets differ from those of Rodriguez and Laio's algorithm. Therefore, mixed data sets from the UCI Machine Learning Repository are examined. Because mixed data have a more complicated similarity metric, the distances between a cluster center and its data objects are more likely to span a wider range. We can choose $d_c$ in the range of 1% to 20% of the number of data objects in the data set to cover all possibilities. For example, if the total number of data objects is 1000, this range corresponds to 10 to 200 data objects. However, in this way, we can only confirm the value range of $d_c$ and cannot obtain the optimal value.
Suppose that the data set has $n$ data samples; the range for $d_c$ is then bounded by 1% and 20% of $n$. For one $d_c$ from this range, the density $\rho$ and distance $\delta$ of all data objects can be calculated. From the corresponding relationship graph of density and distance for each data object, the CSA algorithm is adopted to determine the cluster centers. After the cluster centers have been found, each remaining data object is assigned to the same cluster as its nearest neighbor of higher density, according to FDCA (described in Section 3.2). This whole process is called one iteration for one $d_c$. Because clustering is an unsupervised method, whether $d_c$ is an optimal value of the distance threshold cannot be evaluated by the original class labels of the data samples. Another performance evaluation index is therefore designed.
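Suppose that there are $k$ clusters; each cluster center could be represented as $c_i$ ($i = 1, \ldots, k$). Data objects clustered into $c_i$ are denoted as $x_{i1}, x_{i2}, \ldots, x_{in_i}$, where $n_i$ represents the number of data objects belonging to cluster $c_i$. Then the performance evaluation index for each $d_c$ is defined as
$$E(d_c) = \sum_{i=1}^{k} \sum_{j=1}^{n_i} d(x_{ij}, c_i), \tag{9}$$
where $d(x_{ij}, c_i)$ is short for the distance between data object $x_{ij}$ and its cluster center $c_i$.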
The value of this index reflects the closeness of the clusters, so we select the threshold distance $d_c$ that minimizes it. Finding a proper value of $d_c$ can therefore be formulated as an optimization problem. Optimization algorithms such as PSO have been applied to select optimal parameters for clustering algorithms, and the PSO-based parameter self-adaptive method has been proven useful by comprehensive simulations. However, PSO is a bioinspired optimization algorithm based on iterations, which results in high algorithm complexity and time complexity. In order to realize fast data clustering, the dichotomy (bisection) method is adopted instead of bioinspired algorithms to search for the optimal $d_c$.
According to this rule, for each data set, we can get an initial range for the threshold distance $d_c$. The only problem is how to get the optimal value of $d_c$. We already know that a proper $d_c$ lets CSA find the optimal clustering centers, so we have to understand how $d_c$ influences the clustering efficiency. We take the Iris data set as an example. $d_c$ is set from 0.1 to 0.9 with 0.05 as a step. CSA is adopted to get the number of clustering centers, as in Table 4.
From Table 4, we can conclude that scanning the value of $d_c$ from the minimum value of 0.1 to the maximum value of 0.95 in steps of 0.05 gives the optimal value of $d_c$ as 0.2, 0.25, or 0.3, for which the number of clustering centers is 3. So we could use a fast searching algorithm to find the best value of $d_c$ and thus the optimal clustering centers.
We apply the self-adaptive strategy for the value of $d_c$ to the Iris data, as shown in Figure 5, to verify its effect on the number of clustering centers.
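The dichotomy algorithm is applied to search for the optimal value of the threshold distance $d_c$ for the clustering algorithm. We define the value range of $d_c$ as $[d_{\min}, d_{\max}]$, which lies within 1% to 20% of the total number of data samples. The dichotomy algorithm uses a function $f(d_c)$ to find the approximate zero by the following steps.
Step 1. Fix the value range $[d_{\min}, d_{\max}]$, verify that $f(d_{\min}) \cdot f(d_{\max}) < 0$, and set the precision $\varepsilon$.
Step 2. Calculate the midpoint of the range $[d_{\min}, d_{\max}]$, which is denoted as $d_{mid} = (d_{\min} + d_{\max})/2$.
Step 3. Calculate $f(d_{mid})$ according to CSA to get the specific number of clustering centers based on $d_{mid}$.
Step 4. If $f(d_{mid}) = 0$, then $d_{mid}$ is the optimal value of $d_c$. Else if $f(d_{\min}) \cdot f(d_{mid}) < 0$, then $d_{\max} = d_{mid}$. Else if $f(d_{mid}) \cdot f(d_{\max}) < 0$, then $d_{\min} = d_{mid}$.
Step 5. If the precision is achieved, in other words, $|d_{\min} - d_{\max}| < \varepsilon$, then the optimal value of $d_c$ is the current $d_{\min}$ or $d_{\max}$; end the algorithm; else go to Step 2.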
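The steps above follow the standard bisection scheme, which can be sketched as below. The sketch is generic: the objective `f` is passed in as a callable, and how it is built from the CSA output is not specified by the text. One possible choice, used here purely as an assumption, is the difference between the number of centers found for a given threshold and a target number of clusters. The names `bisect_threshold`, `eps`, and `max_iter` are hypothetical.

```python
def bisect_threshold(f, lo, hi, eps=1e-3, max_iter=100):
    """Bisection search for a threshold where f changes sign or equals zero.

    f      : callable mapping a threshold to a signed score (assumed form).
    lo, hi : initial range; f(lo) and f(hi) must have opposite signs.
    eps    : precision at which the search stops.
    """
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0:
        return lo
    if f_hi == 0:
        return hi
    if f_lo * f_hi > 0:
        raise ValueError("f(lo) and f(hi) must have opposite signs")
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0          # Step 2: midpoint of the range
        f_mid = f(mid)                 # Step 3: evaluate at the midpoint
        if f_mid == 0:                 # Step 4: exact zero found
            return mid
        if f_lo * f_mid < 0:
            hi, f_hi = mid, f_mid      # zero lies in the lower half
        else:
            lo, f_lo = mid, f_mid      # zero lies in the upper half
        if abs(hi - lo) < eps:         # Step 5: precision reached
            break
    return lo
```

Each iteration halves the search range, so the number of CSA evaluations grows only logarithmically with the ratio of the initial range to the precision, which is the main reason for preferring it over a population-based search.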
Therefore, the initial threshold distance $d_c$ is selected randomly within its initial range, and the CSA algorithm is executed to determine the cluster centers automatically. According to the current clustering result, the performance evaluation index defined in (9) is computed to evaluate whether the current $d_c$ is good enough for clustering. If it is, we fix the current $d_c$ as the optimum and calculate the purity and efficiency of FDCA. Otherwise, the dichotomy searching algorithm is applied to find another $d_c$, and CSA and FDCA are repeated. The proposed self-adaptive algorithm is faster than PSO-based or other bioinspired optimization algorithms. From Table 4, we can conclude that a proper value of $d_c$ helps CSA find the correct cluster centers. However, with a slight difference in the value of $d_c$ from 0.1 to 0.5, the number of clusters is the same, which means that we only have to find a proper range of $d_c$ within its initial range instead of finding the optimal value. The dichotomy searching algorithm is a fast searching algorithm for finding the proper half of the range for $d_c$. In Section 4, extensive simulations and a real-life application verify its efficiency in finding a proper $d_c$.
4. Simulations and Analysis
4.1. Data Settings
Ten data sets from UCI Machine Learning Repository are used for clustering algorithm simulations, as shown in Table 5.
| is the number of numerical attributes, is the number of categorical attributes, “” is for unknown parameters, and is the number of clusters.|
4.2. Performance Analysis
In clustering analysis, the clustering accuracy is one of the most commonly used criteria to evaluate the quality of clustering results, defined as follows:
$$AC = \frac{\sum_{i=1}^{k} a_i}{n},$$
where $a_i$ is the number of data objects occurring in both the $i$th cluster and its corresponding true class, $k$ is the number of clusters, and $n$ is the number of data objects in the data set. According to this measure, the larger $AC$ is, the better the clustering results are, and for perfect clustering $AC = 1$.
Another clustering quality measure is the average purity of clusters, defined as follows:
$$P = \frac{1}{k} \sum_{i=1}^{k} \frac{|C_i^d|}{|C_i|},$$
where $k$ denotes the number of clusters, $|C_i^d|$ denotes the number of points with the dominant class label in cluster $i$, and $|C_i|$ denotes the number of points in cluster $i$. Intuitively, the purity measures how pure the clusters are with respect to the true cluster (class) labels that are known for our data sets.
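Both measures can be computed from the true labels and the predicted cluster labels, as in the following sketch. It assumes that the "corresponding true class" of a cluster is its dominant (most frequent) class, which is one common reading; the function name `accuracy_and_purity` is hypothetical.

```python
from collections import Counter, defaultdict


def accuracy_and_purity(labels_true, labels_pred):
    """Return (clustering accuracy, average cluster purity).

    Assumes each cluster is matched to its dominant true class.
    """
    members = defaultdict(list)
    for t, p in zip(labels_true, labels_pred):
        members[p].append(t)  # collect the true labels found in each cluster
    # Size of the dominant class and total size, per cluster.
    dominant = [Counter(m).most_common(1)[0][1] for m in members.values()]
    sizes = [len(m) for m in members.values()]
    n = sum(sizes)
    accuracy = sum(dominant) / n
    purity = sum(d / s for d, s in zip(dominant, sizes)) / len(sizes)
    return accuracy, purity
```

The two measures differ in weighting: accuracy weights every data object equally, while the average purity weights every cluster equally, so small impure clusters affect purity more than accuracy.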
4.3. Result Analysis
4.3.1. Clustering Efficiency
There are four 2-dimensional data sets (Aggregation, Jain, Spiral, and Flame) with various shapes of clusters (circular, elongated, spiral, etc.). The results are presented in Figure 6.
The results in Table 6 show that the algorithm is capable of clustering arbitrary shape, variable density clusters and has a good clustering quality.
The performance of FDCA is compared with the $k$-prototypes, SBAC, KL-FCM-GM, IWKM, DBSCAN, BIRCH, SpectralCAT, TGCA, and FPC-MDACC algorithms. The experimental results on different data sets show that FDCA is able to find the optimal solution after a small number of iterations. The following reasons contribute to the better performance of our proposed algorithm. FDCA analyses the density and distance of each point, and the dichotomy technique is then adopted to fit the functional relationship between them. Afterwards, the cluster centers are found automatically by analysing the distribution of the residuals. This conforms with the original distribution of the mixed data, which leads to a good clustering result.
4.3.2. Clustering Algorithm Time Complexity
Figure 7 lists the average execution time of our proposed algorithm and other algorithms on the eight data sets.
Because the number of data records in the Iris, Soybean, Zoo, and Acute data sets is small, the execution is fast. The KDD CUP sample data sets and the Breast data set have a relatively large number of data records, and thus the execution time is longer. Because balanced data sets such as Heart and Credit adopt the probability and statistics method in the pretreatment stage, they need more time than the others.
4.3.3. Complexity Analysis
Assume that the data set has data objects; the time complexity of FDCA algorithm mainly consists of the computation of the distance and density of each data object, and the computational costs are and . After the cluster centers are found, the cluster assignment is performed in a single step, and the corresponding computational cost is , where denotes the number of cluster centers.
The time complexity of partition-based clustering algorithms and hierarchical clustering algorithms is and . So the time complexity of our proposed algorithm is higher than the partition-based clustering algorithms and hierarchical clustering algorithms. The advantages of our proposed algorithm are that the algorithm can determine the cluster centers automatically, can deal with arbitrary shape clusters, and is not sensitive to parameters.
5. Unsupervised Number Image Recognition Based on FDCA
5.1. Problem Description
Unsupervised number image recognition is defined as recognizing numbers automatically from images without any label information in advance. Currently, there are three types of number image recognition methods: statistics, logic decision, and syntax analysis. Most recognition algorithms are based on template matching and geometric feature extraction and are suitable for printed image recognition. For handwritten number images, most unsupervised recognition algorithms suffer from a low recognition rate because of the variety of handwriting styles. For supervised recognition methods, the most important premise is that there are enough labeled examples to train the classifiers, which is not always the case in practice. To address these problems, an unsupervised number image recognition method based on FDCA is proposed to improve the recognition rate of handwritten number images without any labeled samples in advance. First, the number images are clustered by FDCA. Then a strict filter is designed to extract cluster centers and typical cluster members automatically for the classifier, which guarantees that the training samples have pure cluster features. Finally, a traditional classifier, the BP artificial neural network (ANN), is adopted to classify the number images, using the selected cluster centers and typical images as training samples instead of images with known labels, so that the method remains unsupervised. The MNIST data set is used to verify the designed unsupervised image recognition method based on FDCA.
5.2. Handwriting Images Clustering Based on FDCA
5.2.1. Similarity Metric for Handwriting Number Images
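CW-SSIM (Complex Wavelet Structural Similarity) is applied to evaluate the similarity of handwritten number images. Assume that the mother wavelet of a symmetric complex wavelet is
$$w(u) = g(u)\, e^{j\omega_c u},$$
where $\omega_c$ is the center frequency of the modulated band-pass filter and $g(u)$ is a slowly varying, symmetric function. After stretching and shifting, we obtain the corresponding wavelet family:
$$w_{s,p}(u) = \frac{1}{\sqrt{s}}\, w\!\left(\frac{u-p}{s}\right) = \frac{1}{\sqrt{s}}\, g\!\left(\frac{u-p}{s}\right) e^{j\omega_c (u-p)/s},$$
where $s$ is the stretch factor and $p$ is the shift factor. The continuous complex wavelet transform of a real signal $x(u)$ is
$$X(s,p) = \frac{1}{2\pi}\int_{-\infty}^{\infty} X(\omega)\,\sqrt{s}\,G(s\omega-\omega_c)\,e^{j\omega p}\,d\omega,$$
where $X(\omega)$ and $G(\omega)$, respectively, represent the Fourier transforms of $x(u)$ and $g(u)$.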
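In the complex wavelet transform domain, assume that $c_x = \{c_{x,i} \mid i = 1, \ldots, N\}$ and $c_y = \{c_{y,i} \mid i = 1, \ldots, N\}$, respectively, represent the two coefficient sets of the images to be compared, extracted from the same wavelet subband and the same spatial location. The CW-SSIM index is
$$\tilde{S}(c_x, c_y) = \frac{2\left|\sum_{i=1}^{N} c_{x,i}\, c_{y,i}^{*}\right| + K}{\sum_{i=1}^{N} |c_{x,i}|^{2} + \sum_{i=1}^{N} |c_{y,i}|^{2} + K},$$
where $c^{*}$ denotes the complex conjugate of $c$ and $K$ is a small positive constant, which is used to improve the robustness of $\tilde{S}$ at low signal-to-noise ratio.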
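In order to better understand CW-SSIM, the right-hand side of its definition is multiplied by an equivalent factor whose value is 1 (here $c_{x,i}$ and $c_{y,i}$ are the complex wavelet coefficients of the two images, $c^{*}$ is the complex conjugate of $c$, and $K$ is the small positive stabilizing constant):
$$\tilde{S}(c_x, c_y) = \frac{2\sum_{i=1}^{N} |c_{x,i}|\,|c_{y,i}| + K}{\sum_{i=1}^{N} |c_{x,i}|^{2} + \sum_{i=1}^{N} |c_{y,i}|^{2} + K} \cdot \frac{2\left|\sum_{i=1}^{N} c_{x,i}\, c_{y,i}^{*}\right| + K}{2\sum_{i=1}^{N} \left|c_{x,i}\, c_{y,i}^{*}\right| + K}.$$
In the first part of the right-hand side, every term is either a constant or the modulus of a complex wavelet coefficient, so for two given images it is fully determined by the coefficient magnitudes. If $|c_{x,i}| = |c_{y,i}|$ for all $i$, the first part reaches its maximum value of 1. The value of the second part depends on the phase change between $c_x$ and $c_y$. If the phase change between $c_{x,i}$ and $c_{y,i}$ is constant for all $i$, the second part reaches its maximum value of 1. This part is taken as the image structural similarity index for two reasons:
(1) The structural information of local image features is contained in the relative phase pattern of the wavelet coefficients.
(2) A constant phase change of all coefficients does not change the structure of the local image features.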
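The phase-invariance property described above can be checked numerically. The following is a minimal sketch, assuming the standard CW-SSIM definition (twice the modulus of the summed conjugate products plus a constant, divided by the summed squared moduli plus the same constant); the function name `cw_ssim_index`, the default value of `K`, and the random test coefficients are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def cw_ssim_index(cx, cy, K=0.01):
    """Local CW-SSIM index of two equally sized sets of complex coefficients.

    K is a small positive stabilizing constant (the value here is an assumption).
    """
    cx = np.asarray(cx, dtype=complex).ravel()
    cy = np.asarray(cy, dtype=complex).ravel()
    # Numerator: modulus of the summed products c_x,i * conj(c_y,i).
    numerator = 2.0 * np.abs(np.sum(cx * np.conj(cy))) + K
    # Denominator: total energy of both coefficient sets.
    denominator = np.sum(np.abs(cx) ** 2) + np.sum(np.abs(cy) ** 2) + K
    return float(numerator / denominator)

rng = np.random.default_rng(0)
cx = rng.normal(size=64) + 1j * rng.normal(size=64)

# Same magnitudes, constant phase shift for every coefficient: structure unchanged.
same_structure = cw_ssim_index(cx, cx * np.exp(1j * 0.7))

# Unrelated coefficients: phases are inconsistent, so the index drops.
other = rng.normal(size=64) + 1j * rng.normal(size=64)
different_structure = cw_ssim_index(cx, other)
```

A constant phase shift leaves both factors of the decomposition at their maximum, so `same_structure` equals 1 up to rounding, while `different_structure` falls strictly below 1.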
Because the dual-tree complex wavelet transform offers shift invariance and good directional selectivity, the CW-SSIM index is computed on the basis of this transform. First, the image is decomposed into 6 levels through dual-tree complex wavelet decomposition, which avoids serrated subbands. Then the local CW-SSIM index of each wavelet subband is calculated by moving a sliding window over the subband, whose size is . In the experiment, we found that the performance of CW-SSIM is not obviously affected by slight perturbations of the small positive constant $K$ in the CW-SSIM index, so we take . However, the value of $K$ must be adjusted to obtain the best results in a noisy environment. Finally, the CW-SSIM of the whole image is obtained as the weighted sum over the subbands. The weight function is a Gaussian distribution whose standard deviation is a quarter of the image size at the best layer of the steerable pyramid.
The range of CW-SSIM is $[0, 1]$. The larger the value, the higher the image similarity.
5.2.2. Strict Filter Design
In order to guarantee the purity of each cluster, a strict filter is designed to remove the members that lie on the edge of the cluster. After the cluster centers are determined and the remaining points are assigned to their clusters, a boundary region is defined for each cluster. Data points in the boundary region have the following characteristic: they belong to the cluster, but objects belonging to another cluster lie within an adjustable distance threshold of them. From the objects in the boundary region, we determine a local average density of the cluster. An object whose density is larger than this local density is retained in the cluster, whereas all other objects are rejected, which ensures the cluster's purity. The implementation process is shown in Algorithm 2.
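The filtering rule described above can be sketched as follows. This is an illustrative reading of the prose, not a reproduction of Algorithm 2: the function name `strict_filter`, the parameter name `d_border`, and the neighbor-count density used in the toy example are all assumptions.

```python
import numpy as np

def strict_filter(dist, labels, rho, d_border):
    """Return a boolean mask of the members kept by the border-based filter.

    dist     : (n, n) pairwise distance matrix
    labels   : cluster label of every object
    rho      : density of every object
    d_border : adjustable distance that defines the boundary region (assumed name)
    """
    labels = np.asarray(labels)
    rho = np.asarray(rho, dtype=float)
    keep = np.ones(len(labels), dtype=bool)
    for c in np.unique(labels):
        members = labels == c
        # Boundary region: members of c with an object of another cluster nearby.
        near_other = (dist[:, ~members] <= d_border).any(axis=1)
        border = members & near_other
        if not border.any():
            continue  # no contact with other clusters: nothing to reject
        local_density = rho[border].mean()
        # Only members denser than the boundary's average density are kept.
        keep[members & (rho <= local_density)] = False
    return keep

# Toy example on a line: two clusters that touch around x = 1.
x = np.array([0.0, 0.1, 0.2, 0.9, 1.1, 1.8, 1.9, 2.0])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
dist = np.abs(x[:, None] - x[None, :])
rho = (dist < 0.25).sum(axis=1) - 1  # neighbor count, excluding the point itself
kept = strict_filter(dist, labels, rho, d_border=0.3)
```

In this example the two points that face each other across the cluster border (at 0.9 and 1.1) are rejected, and the six interior points are kept.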
5.3. Unsupervised Number Image Recognition Based on FDCA
5.3.1. Data Set and Evaluation Index
The MNIST data set, which consists of 60000 handwritten number images, is used to test the performance of the image recognition method based on FDCA. The numbers from 0 to 9 are all included, and the images are stored in binary files. Each image is $28 \times 28$ pixels, as shown in Figure 8.
In this paper, we adopt the consistent indicators and recognition rate to evaluate the results as follows.
Specific equation of recognition rate is defined as follows:(2) represents the fraction of pair of images of the same subject correctly associated with the same cluster. represents the fraction of pair of images of different subjects erroneously assigned to the same cluster. We define them as follows:where represents the number of objects in data sets, represents the pair number of the data sets, represents the same type of objects assigned to the same cluster, represents objects of different classes assigned to different clusters, and represents objects of different classes assigned to the same cluster.
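The pair-based fractions described above can be illustrated with a short sketch. It assumes that the first fraction is normalized by the number of same-class pairs and the second by the number of different-class pairs, and that a cluster label of -1 marks an image that belongs to no cluster; the function name `pairwise_rates` and both conventions are assumptions, so the normalization may differ from the paper's own equations.

```python
from itertools import combinations

def pairwise_rates(true_labels, cluster_labels, unassigned=-1):
    """Return (true_rate, false_rate) computed over all pairs of objects.

    true_rate  : fraction of same-class pairs placed in the same cluster
    false_rate : fraction of different-class pairs placed in the same cluster
    """
    same_class = same_class_together = 0
    diff_class = diff_class_together = 0
    for i, j in combinations(range(len(true_labels)), 2):
        # Unassigned objects never count as sharing a cluster.
        together = (cluster_labels[i] != unassigned
                    and cluster_labels[i] == cluster_labels[j])
        if true_labels[i] == true_labels[j]:
            same_class += 1
            same_class_together += together
        else:
            diff_class += 1
            diff_class_together += together
    true_rate = same_class_together / same_class if same_class else 0.0
    false_rate = diff_class_together / diff_class if diff_class else 0.0
    return true_rate, false_rate

# Six images of two classes; the last image is left unassigned.
true_labels = [0, 0, 0, 1, 1, 1]
cluster_labels = [0, 0, 1, 1, 1, -1]
r_true, r_false = pairwise_rates(true_labels, cluster_labels)
```

Here 2 of the 6 same-class pairs share a cluster and 2 of the 9 different-class pairs do, so the sketch returns 1/3 and 2/9.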
5.3.2. Application Results and Analysis
The recognition algorithm is processed as follows.
Step 1. Original images are input to calculate their similarity based on CW-SSIM.
Step 2. FDCA is applied to cluster images to get training samples for BP ANN. Those cluster centers and typical members are selected by strict filters.
Step 3. Train the BP ANN on the selected images, using their cluster labels as the training targets.
Step 4. Recognition process is carried out based on BP ANN. In our method, BP ANN is adopted according to paper .
First of all, we select 600 images for clustering to obtain the cluster centers and other typical images for training the classifier. These images contain the numbers from 0 to 9, with 60 images per number. Figure 9(a) shows the cluster center self-determination process of FDCA as the distribution of the density and distance values of the images; different colors denote different cluster centers. Figure 9(b) shows the result of clustering all 600 number images with FDCA.
(a) Cluster center self-determination process based on FDCA
(b) Numbers of the same color belong to the same cluster, while grey images do not belong to any cluster. In each cluster, the image marked with a tiny black circle is the cluster center
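The density and distance values plotted in Figure 9(a) can be illustrated with a small sketch. It assumes they are the two quantities of the density-peaks approach of Rodriguez and Laio (listed in the references): a neighbor-count density and the distance to the nearest denser object. The cutoff `d_c`, the function name, and the selection of centers by the largest product of the two values are assumptions; the product rule is only a simple stand-in for CSA, which is not reproduced here.

```python
import numpy as np

def density_and_distance(dist, d_c):
    """Return (rho, delta) for every object of a pairwise distance matrix.

    rho   : number of other objects closer than the cutoff d_c
    delta : distance to the nearest object of strictly higher density
    """
    n = dist.shape[0]
    rho = (dist < d_c).sum(axis=1) - 1  # exclude the object itself
    delta = np.zeros(n)
    for i in range(n):
        denser = rho > rho[i]
        if denser.any():
            delta[i] = dist[i, denser].min()
        else:
            # Global density maxima have no denser neighbor:
            # they take their largest distance to any object.
            delta[i] = dist[i].max()
    return rho, delta

# Toy data on a line: two well separated groups of three points.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
dist = np.abs(x[:, None] - x[None, :])
rho, delta = density_and_distance(dist, d_c=0.15)

# Stand-in for center selection: the two largest products rho * delta.
centers = sorted(np.argsort(rho * delta)[-2:].tolist())
```

Objects with both a high density and a large distance stand apart from the rest, which is what makes the centers visible in a plot such as Figure 9(a). In the toy data the middle point of each group (indices 1 and 4) is selected.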
Before the strict filter is applied, each clustering result falls into one of two cases. Suppose image x is assigned to cluster A, its true label is X_label, and the label of cluster A is A_label. If A_label = X_label, image x has been clustered correctly; otherwise, image x has been clustered wrongly. To ensure that the training samples for the classifier are clustered as correctly as possible, we adopt the strict filter, which keeps the clusters pure by deleting cluster edge members.
As shown in Table 7, the radius parameter of the strict filter is measured as the distance from the cluster center. In other words, if the distance between a cluster member and its cluster center is larger than the filter radius, the member is removed from the cluster to guarantee the purity of the cluster. Different filter radii yield different clustering results, as Table 7 shows by reporting the clustering accuracy and the error rate for each radius. We can conclude from Table 7 that, without filters, recognition based on FDCA could achieve % and %. The accuracy and the error rate rise and fall together. The reason is that the stricter the filter is, the more cluster members are excluded and the purer the cluster becomes, so the error rate is low, but the accuracy is lower at the same time.
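The radius rule described above can be sketched in a few lines. The function name `radius_filter`, the dictionary that maps each cluster label to the index of its center, and the toy data are assumptions made for illustration.

```python
import numpy as np

def radius_filter(dist, labels, centers, radius):
    """Keep only the members that lie within `radius` of their cluster center.

    dist    : (n, n) pairwise distance matrix
    labels  : cluster label of every object
    centers : dict mapping a cluster label to the index of its center
    """
    labels = np.asarray(labels)
    keep = np.zeros(len(labels), dtype=bool)
    for cluster, center_idx in centers.items():
        members = labels == cluster
        keep |= members & (dist[:, center_idx] <= radius)
    return keep

# Toy data on a line: two clusters with centers at indices 0 and 3.
x = np.array([0.0, 0.1, 0.6, 5.0, 5.2, 5.9])
labels = np.array([0, 0, 0, 1, 1, 1])
dist = np.abs(x[:, None] - x[None, :])
centers = {0: 0, 1: 3}

loose = radius_filter(dist, labels, centers, radius=0.5)
strict = radius_filter(dist, labels, centers, radius=0.15)
```

Shrinking the radius from 0.5 to 0.15 reduces the kept members from four to three, which mirrors the trade-off discussed above: a stricter filter leaves purer but smaller clusters.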
A novel fast density clustering algorithm (FDCA) for mixed data, which determines cluster centers automatically, is proposed. A unified mixed data similarity metric is defined to calculate data distances. Moreover, CSA is used to fit the relationship between the density and distance of every data object, and residual analysis is used to determine the centers automatically, in conformity with the original mixed data distribution. Finally, dichotomy analysis is adopted to eliminate the parameter sensitivity problem. The experiments validate the feasibility and effectiveness of the proposed algorithm. Furthermore, FDCA is applied to number image recognition as an unsupervised method, and the MNIST data set is used to demonstrate the high recognition rate and low false rate of the FDCA-based method as a typical application. Future research will focus on clustering data streams with high clustering quality based on this work.
The authors declare that they have no competing interests.
This work was supported by a grant from the National Natural Science Foundation of China (no. 61502423), Zhejiang Provincial Natural Science Foundation (Y14F020092), and Zhejiang Natural Science Foundation (LY17F040004).
- J. Han and M. Kamber, Data Mining Concepts and Techniques, Morgan Kaufmann, San Francisco, Calif, USA, 2001.
- C.-C. Hsu, C.-L. Chen, and Y.-W. Su, “Hierarchical clustering of mixed data based on distance hierarchy,” Information Sciences, vol. 177, no. 20, pp. 4474–4492, 2007.
- C.-C. Hsu and Y.-P. Huang, “Incremental clustering of mixed data based on distance hierarchy,” Expert Systems with Applications, vol. 35, no. 3, pp. 1177–1185, 2008.
- S. P. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.
- T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: an efficient data clustering method for very large databases,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 103–114, ACM, Montreal, Canada, June 1996.
- M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD '96), Portland, Ore, USA, August 1996.
- Z. Huang, “A fast clustering algorithm to cluster very large categorical data sets in data mining,” in Research Issues on Data Mining and Knowledge Discovery, pp. 1–8, ACM Press, Tucson, Ariz, USA, 1997.
- Z. Huang and M. K. Ng, “A fuzzy k-modes algorithm for clustering categorical data,” IEEE Transactions on Fuzzy Systems, vol. 7, no. 4, pp. 446–452, 1999.
- M.-S. Yang and Y.-C. Tian, “Bias-correction fuzzy clustering algorithms,” Information Sciences, vol. 309, pp. 138–162, 2015.
- D. Barbara, J. Couto, and Y. Li, “COOLCAT: an entropy-based algorithm for categorical clustering,” in Proceedings of the 11th International Conference on Information and Knowledge Management, pp. 582–589, ACM Press, McLean, Va, USA, November 2002.
- H. He and Y. Tan, “A two-stage genetic algorithm for automatic clustering,” Neurocomputing, vol. 81, no. 1, pp. 49–59, 2012.
- A. Saha and S. Das, “Categorical fuzzy k-modes clustering with automated feature weight learning,” Neurocomputing, vol. 166, pp. 422–435, 2015.
- S. Zahra, M. A. Ghazanfar, A. Khalid, M. A. Azam, U. Naeem, and A. Prugel-Bennett, “Novel centroid selection approaches for KMeans-clustering based recommender systems,” Information Sciences, vol. 320, pp. 156–189, 2015.
- Z. Huang, “Clustering large data sets with mixed numeric and categorical values,” in Proceedings of the 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 21–34, World Scientific Publishing, Singapore, 1997.
- S. P. Chatzis, “A fuzzy c-means-type algorithm for clustering of data with mixed numeric and categorical attributes employing a probabilistic dissimilarity functional,” Expert Systems with Applications, vol. 38, no. 7, pp. 8684–8689, 2011.
- I. Gath and A. B. Geva, “Unsupervised optimal fuzzy clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 7, pp. 773–780, 1989.
- Z. Zheng, M. Gong, J. Ma, L. Jiao, and Q. Wu, “Unsupervised evolutionary clustering algorithm for mixed type data,” in Proceedings of the IEEE Congress on Evolutionary Computation, pp. 1–8, Barcelona, Spain, 2010.
- C. Li and G. Biswas, “Unsupervised learning with mixed numeric and nominal data,” IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 4, pp. 673–690, 2002.
- D. W. Goodall, “A new similarity index based on probability,” Biometrics, vol. 22, no. 4, pp. 882–907, 1966.
- C.-C. Hsu and Y.-C. Chen, “Mining of mixed data with application to catalog marketing,” Expert Systems with Applications, vol. 32, no. 1, pp. 12–23, 2007.
- A. Ahmad and L. Dey, “A k-mean clustering algorithm for mixed numeric and categorical data,” Data & Knowledge Engineering, vol. 63, no. 2, pp. 503–527, 2007.
- J. Ji, T. Bai, C. Zhou, C. Ma, and Z. Wang, “An improved k-prototypes clustering algorithm for mixed numeric and categorical data,” Neurocomputing, vol. 120, pp. 590–596, 2013.
- J. Ji, W. Pang, C. Zhou, X. Han, and Z. Wang, “A fuzzy k-prototype clustering algorithm for mixed numeric and categorical data,” Knowledge-Based Systems, vol. 30, no. 1, pp. 129–135, 2012.
- J.-Y. Chen and H.-H. He, “A fast density-based data stream clustering algorithm with cluster centers self-determined for mixed data,” Information Sciences, vol. 345, no. 1, pp. 271–293, 2016.
- B. Everitt, S. Landau, and M. Leese, Cluster Analysis, Arnold, London, UK, 2001.
- G. David and A. Averbuch, “SpectralCAT: categorical spectral clustering of numerical and nominal data,” Pattern Recognition, vol. 45, no. 1, pp. 416–433, 2012.
- Y.-M. Cheung and H. Jia, “Categorical-and-numerical-attribute data clustering based on a unified similarity metric without knowing cluster number,” Pattern Recognition, vol. 46, no. 8, pp. 2228–2238, 2013.
- A. Rodriguez and A. Laio, “Clustering by fast search and find of density peaks,” Science, vol. 344, no. 6191, pp. 1492–1496, 2014.
- J.-Y. Chen and H.-H. He, “Research on density-based clustering algorithm for mixed data with determine cluster centers automatically,” Acta Automatica Sinica, vol. 41, no. 10, pp. 1798–1813, 2015.
- I. H. Witten, E. Frank, and M. A. Hall, Data Mining, Morgan Kaufmann, 2011.
- Z. Xiao, S.-J. Ye, B. Zhong, and C.-X. Sun, “BP neural network with rough set for short term load forecasting,” Expert Systems with Applications, vol. 36, no. 1, pp. 273–279, 2009.
Copyright © 2017 Chen Jinyin et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.