Abstract

With the advent of the k-modes algorithm, the toolbox for clustering categorical data gained an efficient tool that scales linearly in the number of data items. However, random initialization of cluster centers in k-modes makes it hard to reach a good clustering without resorting to many trials. Recently proposed initialization methods are deterministic and reduce the clustering cost considerably; they differ in the heuristics used to choose the set of initial centers. In this paper, we address the clustering problem for categorical data from the perspective of community detection. Instead of initializing modes and running several iterations, our scheme, CD-Clustering, builds an unweighted graph and detects highly cohesive groups of nodes using a fast community detection technique. The top-k detected communities by size define the modes. Evaluation on ten real categorical datasets shows that our method outperforms the existing initialization methods for k-modes in terms of accuracy, precision, and recall in most of the cases.

1. Introduction

Clustering is a form of unsupervised learning that aims at finding underlying structures in unlabeled data. Objects are partitioned into homogeneous groups or clusters so that items within a cluster are highly similar to each other but very dissimilar to objects in other clusters. Many clustering methods have been proposed and developed over the decades (for a recent survey, see [1]). Hierarchical clustering and partitional clustering are the two main types of clustering algorithms. While hierarchical clustering produces a hierarchy of partitions (i.e., a dendrogram) over the dataset by applying agglomerative or divisive strategies, partitional clustering usually assumes a fixed number of clusters and tries to maximize the homogeneity within the clusters.

For numerical data, the k-means algorithm is a well-known and widely used method in practice due to its simplicity and efficiency. k-means finds a set of k cluster centers for a dataset such that the sum of squared distances of each point to its nearest cluster center is minimized. Lloyd’s algorithm [2] begins with k arbitrary centers, typically chosen uniformly at random from the data points. Each point is then assigned to the nearest center, and each center is recomputed as the center of mass of all points assigned to it. These two steps are repeated until the process stabilizes. To remove the numeric-only limitation of the k-means algorithm, Huang [3] developed the k-modes algorithm, which extends k-means by using (1) a simple matching dissimilarity measure for categorical attributes, (2) modes in place of means for clustering, and (3) a frequency-related strategy to update modes to minimize the clustering cost. The algorithm is shown to achieve convergence with linear time complexity with respect to the number of data items.

However, the k-modes algorithm is also very sensitive to the choice of initial cluster centers, and an improper choice may result in highly undesirable cluster structures. The same phenomenon occurs in k-means, which led to better seeding solutions such as k-means++ [4] and its derivatives [5, 6]. To better initialize cluster centers in k-modes, a number of methods have been developed [7–10]. The common point of [7, 8, 10] is to use the density of each data point together with the distance information to determine the initial cluster centers sequentially. Khan and Ahmad [9] proposed to use multiple clusterings of the data based on the attribute values in different attributes.

In this paper, we develop a new clustering method for categorical data based on community detection techniques [11]. Considering each data point as a node, we build a simple graph G in which an edge connects any two nodes if the Hamming distance between them is no larger than a threshold R. The threshold is simply estimated via the number of data points n, the number of clusters k, and the pairwise Hamming distance distribution. Given the graph G, we run the Louvain algorithm [12] to detect nonoverlapping cohesive communities within G. The top-k communities by size are retained as the core clusters, each of which is represented by a mode. The remaining data points (if any) are assigned to the nearest mode. Note that our algorithm is not an initialization technique as in [3, 7–10] because it produces clusters directly.

Compared to prior work, our scheme highlights the following features:
(i) We propose a novel clustering method called CD-Clustering for categorical data using community detection techniques. Our scheme uses a simple heuristic to determine the distance threshold for graph construction. It is also deterministic, as opposed to the traditional k-modes with random initialization of cluster centers.
(ii) We evaluate our scheme on ten real categorical datasets and compare it against random initialization and two other initialization methods. The results show that our technique performs better than the competitors in terms of accuracy in most of the cases.

The remainder of the paper is organized as follows. Section 2 briefly reviews related work in k-modes clustering and community detection. Section 3 discusses several key concepts used in this paper via some illustrative examples. In Section 4, we describe a simple estimation of the distance threshold and our main algorithm. The evaluation and comparison are shown in Section 5. Finally, Section 6 concludes the paper with pointers to future work.

2. Related Work

2.1. k-Modes and Initialization Techniques

As in k-means, the random initialization method has been widely used in k-modes clustering for its simplicity. However, the random method does not guarantee a unique clustering result, and very poor clustering results may occur. To obtain desirable clustering results with low distortion, the k-modes algorithm must be executed many times.

In [3], Huang proposed two simple initialization methods for k-modes: the first method selects the first k objects from the dataset as initial cluster centers, and the second method assigns the most frequent categories equally to the k initial cluster centers. However, the first method works only if the first k objects come from k disjoint clusters, while the second method lacks a uniform criterion for selecting the initial clusters.

Wu et al. [10] proposed a density-based initialization method for k-modes. Cao et al. [8] presented a method to select initial cluster centers by considering both the distance between objects and the density of each object. Bai et al. [7] proposed an initialization method that is similar to [8] but tries to avoid selecting boundary objects among clusters as the first cluster center. However, the evaluation of results in [7] has some problems: for several datasets, the accuracy, precision, and recall values are computed incorrectly, as reported by Khan and Ahmad [9]. In [9], Khan and Ahmad presented an initialization algorithm for k-modes that performs multiple clusterings of the data based on the attribute values present in different attributes.

Along with the heuristics for cluster initialization discussed above, there are many ideas on improving the dissimilarity measures for the standard k-modes algorithm [13–16]. Ng et al. [15] gave a rigorous proof that the object cluster membership assignment method and the mode updating formula under the dissimilarity measure proposed in [14] indeed minimize the objective function. Cao et al. [13] proposed a new dissimilarity measure that takes into account the distribution of attribute values over the whole universe. In [16], Zhou et al. took a step further by defining the Global-Relationship dissimilarity (GRD) measure.

2.2. Community Detection in Graphs

There is a vast literature on community detection in graphs. For a recent comprehensive survey, we refer to [11]. In this section, we discuss several classes of techniques.

Newman and Girvan [17] propose modularity as a quality measure of network clustering. It is based on the idea that a random graph is not expected to have a modular structure; the possible existence of clusters is therefore revealed by comparing the actual density of edges in a subgraph with the density one would expect if the nodes of the graph were connected randomly (the null model).

Many methods for optimizing the modularity have been proposed over the last ten years, such as agglomerative greedy [18], simulated annealing [19], spectral optimization [20], and the Louvain method [12], just to name a few. Other methods include random walks [21], statistical mechanics [22], label propagation [23], and InfoMap [24]. The multilevel approach by Blondel et al. [12], also called the Louvain method, is among the top-performing schemes and scales very well to graphs with hundreds of millions of nodes/edges.

3. Preliminaries

In this section, we review several key concepts in the k-modes algorithm and community detection techniques. We also discuss how the clustering problem for categorical data can be solved from the perspective of community detection.

The Notations section at the end of the paper summarizes the notation used throughout.

3.1. Clustering Categorical Data

Let $D$ be a categorical dataset with $n$ data points $x_1, \ldots, x_n$. Each data point has $d$ categorical attributes from the set $A = \{A_1, A_2, \ldots, A_d\}$. In other words, the dataset can be represented by a table with $n$ rows and $d$ columns in which $x_{ij}$ indicates the $j$th attribute value of the data point $x_i$.

The k-modes clustering algorithm [3] is an extension of the k-means algorithm for clustering categorical data by using a simple dissimilarity measure. It also adopts a frequency-related strategy to update modes in the clustering to minimize the clustering cost. The simple matching dissimilarity measure between two data points $x$ and $y$ is defined by the Hamming distance
\[
d_H(x, y) = \sum_{j=1}^{d} \delta(x_j, y_j),
\tag{1}
\]
where $x_j$ denotes the $j$th attribute value of $x$ and $\delta(x_j, y_j) = 0$ if $x_j = y_j$ and $1$ otherwise. Obviously, the Hamming distance between any two data points lies in the set $\{0, 1, \ldots, d\}$.
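For concreteness, here is a minimal C++ sketch of this dissimilarity measure (our own illustration; it assumes categorical values have been integer-encoded, which the paper does not prescribe):

#include <cstddef>
#include <vector>

// Hamming distance between two categorical records encoded as integers:
// counts the attributes on which the two records disagree.
std::size_t hamming_distance(const std::vector<int>& x, const std::vector<int>& y) {
    std::size_t dist = 0;
    for (std::size_t j = 0; j < x.size(); ++j) {
        if (x[j] != y[j]) ++dist;   // delta(x_j, y_j) = 1 when the values differ
    }
    return dist;
}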

Given a set of data points $C \subseteq D$, a mode of $C$ is an object $\mu = (\mu_1, \mu_2, \ldots, \mu_d)$ with $\mu_j \in A_j$ that minimizes the sum $\sum_{x \in C} d_H(x, \mu)$. In other words, $\mu_j$ is the most frequent value in $C$ with respect to the $j$th attribute [3]. Note that $\mu$ is not necessarily an object of $C$. When a mode is not an object of the set, it can be regarded as a virtual object.

The original k-modes algorithm [3] tries to minimize the following cost function:
\[
P(C_1, \ldots, C_k; \mu_1, \ldots, \mu_k) = \sum_{l=1}^{k} \sum_{x \in C_l} d_H(x, \mu_l),
\tag{2}
\]
where $C_1, \ldots, C_k$ are the clusters and $\mu_l$ is the mode of cluster $C_l$. The k-modes algorithm [3] runs the following steps:
(1) Select $k$ initial modes, one for each cluster.
(2) Allocate an object to the cluster whose mode is the nearest to it. Update the mode of the cluster after each allocation using the most frequent attribute values.
(3) After all objects have been allocated to clusters, retest the dissimilarity of objects against the current modes. If an object is found such that its nearest mode belongs to another cluster rather than its current one, reallocate the object to that cluster and update the modes of both clusters.
(4) Repeat (3) until no object has changed clusters after a full cycle test of the whole dataset.
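To make the assignment and mode-update steps concrete, the following C++ sketch shows one simplified batch-style pass (our own illustration, not the implementation of [3], which updates a mode after every single allocation). It reuses the hamming_distance helper from the sketch above; the cardinality array holding the number of values of each attribute is an assumption of ours.

#include <cstddef>
#include <vector>

// One simplified (batch-style) k-modes pass: assign every point to its nearest
// mode under the Hamming distance, then recompute each mode attribute-wise as
// the most frequent value in its cluster.
void kmodes_pass(const std::vector<std::vector<int>>& data,
                 std::vector<std::vector<int>>& modes,
                 std::vector<int>& label,                  // label[i] = cluster of point i
                 const std::vector<int>& cardinality) {    // cardinality[j] = #values of attribute j
    const std::size_t n = data.size(), k = modes.size(), d = data[0].size();
    // Assignment step (step (2) above): nearest mode under Hamming distance.
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t best = 0, best_dist = hamming_distance(data[i], modes[0]);
        for (std::size_t l = 1; l < k; ++l) {
            std::size_t dist = hamming_distance(data[i], modes[l]);
            if (dist < best_dist) { best_dist = dist; best = l; }
        }
        label[i] = static_cast<int>(best);
    }
    // Mode-update step: most frequent value per attribute within each cluster.
    for (std::size_t l = 0; l < k; ++l) {
        for (std::size_t j = 0; j < d; ++j) {
            std::vector<int> count(cardinality[j], 0);
            for (std::size_t i = 0; i < n; ++i)
                if (label[i] == static_cast<int>(l)) ++count[data[i][j]];
            int best_val = 0;
            for (int v = 1; v < cardinality[j]; ++v)
                if (count[v] > count[best_val]) best_val = v;
            modes[l][j] = best_val;
        }
    }
}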

3.2. Community Detection via Modularity Optimization

Given a simple graph $G = (V, E)$ with $m$ edges and disjoint communities $C_1, \ldots, C_k$, the modularity is defined as
\[
Q = \sum_{c=1}^{k} \left[ \frac{l_c}{m} - \left( \frac{d_c}{2m} \right)^2 \right],
\tag{3}
\]
where $k$ is the number of communities (clusters), $l_c$ is the total number of edges joining nodes in community $c$, and $d_c$ is the sum of the degrees of the nodes of $c$. Modularity is a scalar value between $-1$ and $1$, with larger values implying better clustering.
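A minimal C++ sketch of formula (3) for an unweighted graph stored as an adjacency list (our own illustration; the representation and the function name are assumptions, not the paper's implementation):

#include <cstddef>
#include <vector>

// Modularity of a partition (formula (3)): Q = sum_c [ l_c/m - (d_c/(2m))^2 ],
// where l_c is the number of edges inside community c, d_c the total degree of
// its nodes, and m the total number of edges. adj is an undirected adjacency
// list (each edge stored in both directions); comm[v] is the community of v.
double modularity(const std::vector<std::vector<int>>& adj,
                  const std::vector<int>& comm, int num_comms) {
    std::vector<double> internal(num_comms, 0.0);   // accumulates 2 * l_c
    std::vector<double> degree(num_comms, 0.0);     // accumulates d_c
    double two_m = 0.0;                              // 2m = sum of all degrees
    for (std::size_t u = 0; u < adj.size(); ++u) {
        degree[comm[u]] += static_cast<double>(adj[u].size());
        two_m += static_cast<double>(adj[u].size());
        for (int v : adj[u])
            if (comm[u] == comm[v]) internal[comm[u]] += 1.0;  // internal edges counted twice
    }
    double q = 0.0;
    for (int c = 0; c < num_comms; ++c)
        q += internal[c] / two_m - (degree[c] / two_m) * (degree[c] / two_m);
    return q;
}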

Example 1. Using Figure 1, we illustrate how to compute the modularity of a graph with respect to a clustering. The graph has six nodes and seven edges ($m = 7$). In Figure 1(a), the nodes are partitioned into two clusters; for each cluster we count its internal edges $l_c$ and its total degree $d_c$, and formula (3) then gives the modularity of this clustering.
Similarly, we compute the modularity of the clustering in Figure 1(b). The clustering with the higher modularity is the one that partitions the nodes into more homogeneous groups, which confirms the intuition behind formula (3).

Since its introduction in 2008, the Louvain method [12] has become one of the most cited methods for the community detection task. It optimizes the modularity by a bottom-up folding process. The algorithm is divided into passes, each of which is composed of two phases that are repeated iteratively. Initially, each node is assigned to a different community, so there are as many communities as there are nodes at the start of the first phase. Then, for each node $i$, the method considers the gain of modularity obtained by moving $i$ from its community to the community of one of its neighbors (a local change). The node $i$ is then placed in the community for which this gain is maximum and positive (if any); otherwise it stays in its original community. This process is applied repeatedly and sequentially to all nodes until no further improvement can be achieved, and the first phase is then complete.
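To see why these local gains are cheap to evaluate, consider the unweighted case and the notation of formula (3). Moving a node $i$ of degree $k_i$ that currently forms its own singleton community into a neighboring community $c$, to which it has $k_{i,c}$ edges, changes the modularity by
\[
\Delta Q = \frac{k_{i,c}}{m} - \frac{d_c\, k_i}{2 m^2}.
\]
This expression follows directly from formula (3) and is our own unweighted restatement of the per-move gain; [12] states the gain for weighted graphs in its own notation. The quantities $d_c$ and $k_{i,c}$ are maintained incrementally, so each candidate move costs constant time.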

Example 2. We demonstrate the Louvain method in Figure 2 with a graph of 13 nodes and 20 edges. Starting from singleton communities (each node in its own community), the first pass of the Louvain method moves each node to the best community selected from its neighbors’ communities, yielding a partition with modularity 0.46375. The second phase of the first pass builds a weighted graph corresponding to this partition by aggregating communities. The second pass repeats the folding process on this weighted graph to reach the final partition with modularity 0.47.

This greedy agglomerative algorithm has several advantages, as stated in [12]. First, its steps are intuitive and easy to implement, and the outcome is unsupervised. Second, the algorithm is extremely fast: computer simulations on large modular networks suggest that its complexity is linear on typical and sparse data. This is due to the fact that the possible gains in modularity are easy to compute and the number of communities decreases drastically after just a few passes, so that most of the running time is concentrated in the first iterations. Third, the multilevel nature of the method produces a hierarchical structure of communities which allows multiresolution analysis; that is, the user can zoom into the graph to observe its structure at the desired resolution.

Note that in the Louvain method, the moves of nodes to gain better modularity are restricted to neighboring (connected) communities. Therefore, each detected community lies entirely within a single connected component; in other words, a community never spans different connected components of a graph.

4. Algorithm

4.1. Estimation of Hamming Distance Threshold

To build the graph $G$ for the dataset $D$, we need to estimate the distance threshold $R$ so that any two data points $x_i$ and $x_j$ are connected if the Hamming distance $d_H(x_i, x_j) \leq R$. As mentioned in Section 3.1, the Hamming distance lies in the set $\{0, 1, \ldots, d\}$. At one extreme ($R = 0$), the graph has the fewest edges, which exist between duplicate data points only. At the other extreme ($R = d$), we get a complete graph: any two nodes are connected. Obviously, some values of the distance threshold make $G$ look more modular than others; that is, its nodes are well clustered in communities and therefore easier to detect.

In this paper, we propose a simple heuristic to estimate $R$ based on the distribution of Hamming distances between data points in $D$, given the number of clusters $k$. With $n$ data points, there are $\binom{n}{2} = n(n-1)/2$ pairwise distances. Trivially assuming that the clusters are of equal size, each cluster will have $n/k$ points, and the number of intracluster distances in each cluster is $\binom{n/k}{2}$. In total, there are $k \binom{n/k}{2}$ intracluster distances. In practice $n \gg k$, so the ratio of intracluster distances over the number of pairwise distances is
\[
\frac{k \binom{n/k}{2}}{\binom{n}{2}} = \frac{n/k - 1}{n - 1} \approx \frac{1}{k}.
\]

In other words, given the cumulative distribution function (CDF) of the pairwise distances, we can estimate $R$ as the smallest distance value $r$ such that $\mathrm{CDF}(r) \geq 1/k$ (and $\mathrm{CDF}(r-1) < 1/k$). Figure 3 illustrates this idea for the ten datasets used in our experiments.
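A minimal C++ sketch of this estimation step (our own illustration; it assumes the pairwise distances have already been summarized into a histogram dist_hist, where dist_hist[r] counts the pairs at distance exactly r):

#include <cstddef>
#include <vector>

// Estimate the Hamming distance threshold R: the smallest distance r whose
// empirical CDF over all pairwise distances reaches 1/k (Section 4.1 heuristic).
int estimate_threshold(const std::vector<long long>& dist_hist, int k) {
    long long total = 0;
    for (long long c : dist_hist) total += c;            // = n(n-1)/2 pairwise distances
    long long cumulative = 0;
    for (std::size_t r = 0; r < dist_hist.size(); ++r) {
        cumulative += dist_hist[r];
        if (static_cast<double>(cumulative) / total >= 1.0 / k)
            return static_cast<int>(r);                   // first r with CDF(r) >= 1/k
    }
    return static_cast<int>(dist_hist.size()) - 1;        // fallback: complete graph
}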

We also observe that the expected Hamming distance between two random data points is large when the attribute values are assumed to be uniformly distributed. Specifically, given the set of attributes $A = \{A_1, \ldots, A_d\}$, the expected Hamming distance between two random data points $x$ and $y$ is
\[
E[d_H(x, y)] = \sum_{j=1}^{d} \left( 1 - \frac{1}{|A_j|} \right),
\]
where $|A_j|$ is the cardinality of the $j$th nonsingleton attribute. The larger $|A_j|$, the larger the expected Hamming distance.
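As a quick illustration of this formula (our own example, not from the paper): for $d$ purely binary attributes with $|A_j| = 2$ for all $j$,
\[
E[d_H(x, y)] = \sum_{j=1}^{d} \left( 1 - \frac{1}{2} \right) = \frac{d}{2},
\]
whereas an attribute with 10 equally likely categories contributes $0.9$ to the expectation, so high-cardinality attributes push the expected pairwise distance toward $d$.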

4.2. Clustering Algorithm

Now we describe our community detection-based clustering scheme (named CD-Clustering), which is outlined in Algorithm 1. The scheme consists of two phases. In the first phase, we compute all pairwise Hamming distances and the CDF of the distance distribution (Lines (1)-(2)). Then, we estimate the distance threshold $R$ using the simple assumption in Section 4.1 (Lines (4)–(6)). In the second phase, we build the graph $G$ in which each node represents a data point. Two nodes are connected by an edge if their Hamming distance is not larger than $R$ (Lines (8)–(11)). In Line (12), we run the Louvain method [12] on $G$ to detect highly cohesive groups of nodes. The top-k detected communities by size are retained (Line (13)). Then, we determine the mode of the data points in each community (Lines (14)-(15)). The remaining data points (i.e., data points that do not belong to any of the top-k communities) are assigned to the nearest mode (Lines (16)-(17)). As we show later in Section 5, except for the Mushroom dataset, the number of remaining data points is very small.

Input: dataset $D$ with $n$ data points, each data point has $d$ attributes. The number of clusters $k$.
Output: $k$ clusters of data points $C_1, \ldots, C_k$.
(1) compute all pairwise Hamming distances
(2) compute CDF of the pairwise Hamming distances
(3) // estimate distance threshold $R$
(4) for $r = 0$ to $d$ do
(5)   if CDF$(r) \geq 1/k$ and CDF$(r-1) < 1/k$ then $R \leftarrow r$
(6)     break
(7) // clustering
(8) $G \leftarrow (V = D, E = \emptyset)$
(9) for each pair $(x_i, x_j)$ do
(10)   if $d_H(x_i, x_j) \leq R$ then
(11)     $E \leftarrow E \cup \{(x_i, x_j)\}$
(12) run Louvain method [12] on $G$
(13) keep top-$k$ communities $C_1, \ldots, C_k$ by size
(14) for each cluster $C_l$ do
(15)   compute the mode $\mu_l$ of $C_l$
(16) for each remaining data point $x$ do
(17)   assign $x$ to the nearest mode $\mu_l$, $1 \leq l \leq k$
(18) return $C_1, \ldots, C_k$

The complexity of CD-Clustering is dominated by the computation of all pairwise Hamming distances and by the Louvain method. All pairwise Hamming distances are computed in $O(n^2 d)$ time. The Louvain method runs empirically in time linear in the number of edges [12]. Again, using the simple assumption that all clusters are of equal size, the number of intracluster distances is approximately $k \binom{n/k}{2} \approx n^2/(2k)$. So the number of edges in $G$ is also $O(n^2/k)$, making the runtime of the Louvain method $O(n^2/k)$. In total, the complexity of CD-Clustering is $O(n^2 d)$. The quadratic complexity is a main drawback of our CD-Clustering scheme, which restricts its application to datasets of 50,000 data points or less. A similar scalability limitation appears in [25], in which the authors need a similarity matrix of size $n \times n$. An approximation of the Hamming distance distribution is possible by considering, for example, the distances from each point to only a sample of $s \ll n$ (instead of all $n-1$) other points. The complexity of this approximation scheme would be reduced to $O(nsd)$. We leave this idea for future work. Table 1 compares the time complexity of our CD-Clustering scheme with the two initialization methods [8, 9].
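As an illustration of this future-work idea (not part of the evaluated implementation), a sampled approximation of the distance histogram might look as follows in C++; the sample size s, the random pairing, and the function name are our assumptions, and the sketch reuses the hamming_distance helper from Section 3.1:

#include <cstddef>
#include <random>
#include <vector>

// Approximate the pairwise Hamming distance histogram by comparing each point
// with only s randomly chosen other points instead of all n-1, giving O(n*s*d)
// work. The resulting histogram can feed the CDF-based threshold estimation.
std::vector<long long> sampled_distance_histogram(
        const std::vector<std::vector<int>>& data, std::size_t s, unsigned seed = 42) {
    const std::size_t n = data.size(), d = data[0].size();
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);
    std::vector<long long> hist(d + 1, 0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t t = 0; t < s; ++t) {
            std::size_t j = pick(rng);
            if (j == i) continue;                          // skip self-comparisons
            ++hist[hamming_distance(data[i], data[j])];
        }
    }
    return hist;
}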

5. Evaluation

In this section, we evaluate the performance of the proposed scheme. The real-world datasets and evaluation metrics are described in Sections 5.1 and 5.2. We show the performance of our method in Section 5.3. The clustering algorithm is implemented in C++ and run on a desktop PC with an Intel Core i7-6700 @ 3.4 GHz and 16 GB of memory. For the sake of reproducibility, we provide our source code with the data (https://gitlab.com/hiepnh.duytan/Research/tree/master/k-modes-community).

5.1. Datasets

We pick ten purely categorical datasets from the UCI Machine Learning Repository [26]; a short description of each dataset follows. Compared to the datasets used in [9], we add three new datasets: Nursery, Chess, and Heart. Note that we treat the missing attribute value “?” as an additional attribute value.

Soybean Small. This dataset consists of 47 cases of soybean disease, each characterized by 35 multivalued categorical variables. These cases are drawn from four populations, each representing one of the following soybean diseases: D1-Diaporthe stem canker, D2-Charcoal rot, D3-Rhizoctonia root rot, and D4-Phytophthora rot. We keep only the 21 nonsingleton attributes.

Mushroom Data. Mushroom dataset consists of 8,124 data objects described by 22 categorical attributes distributed over 2 classes. The two classes are edible (4208 objects) and poisonous (3916 objects). It has missing values in attribute 11.

Zoo Data. It has 101 instances described by 16 attributes and distributed into 7 categories. The first attribute contains a unique animal name for each instance and is removed because it is noninformative. All other attributes are Boolean except for the legs attribute, which corresponds to the number of legs and takes values in the set {0, 2, 4, 5, 6, 8}.

Lung-Cancer Data. This dataset contains 32 instances described by 56 attributes distributed over 3 classes with missing values in attributes 5 and 39.

Breast-Cancer Data. This data has 699 instances with 9 attributes. Each data object is labeled as benign (458 or 65.5%) or malignant (241 or 34.5%). There are 9 instances in attributes 6 and 9 that contain missing attribute values.

Dermatology Data. This dataset contains six types of skin diseases for 366 patients evaluated using 34 clinical attributes, 33 of which are categorical and one numerical. Each categorical attribute value signifies the degree to which the feature is present: 0 means the feature is absent, 3 indicates the largest possible amount, and 1 and 2 indicate relative intermediate values. In our experiment, we discretize the numerical attribute (representing the age of the patient) into 10 categories.

Congressional Vote Data. This dataset includes the votes of each US House of Representatives Congressman on 16 key votes. Each vote can be yes, no, or an unknown disposition. The data has 2 classes with 267 Democrat and 168 Republican instances.

Nursery. This dataset was derived from a hierarchical decision model originally developed to rank applications for nursery schools. It contains 12,960 instances with 8 input attributes distributed over 5 classes.

Chess. This dataset contains 3,196 instances, each of which is a board-description for the chess endgame with 36 features. Each game is labeled one of the two classes: “win” and “nowin.”

Heart Disease. This dataset is the Cleveland heart disease database of 303 patients. The class attribute represents the presence of heart disease in the patient, with values from 0 to 4. There are 13 attributes used in the experiments. We convert 5 numeric attributes (the 1st, 4th, 5th, 8th, and 10th) to categorical ones using intervals of 10, 20, 60, 30, and 0.7, respectively. The 12th and 13th attributes contain missing attribute values.

Table 2 lists the characteristics of the chosen datasets. The columns avg.intra.dist and avg.inter.dist show the average intracluster and intercluster distances for each dataset, respectively.

The column R shows the value of the Hamming distance threshold $R$ estimated from the CDF (see Figure 3). AC is the accuracy of our CD-Clustering; we highlight the accuracy values that are larger than 0.8. The columns m, #comp, and top-k display the number of edges in $G$, the number of connected components in $G$, and the total number of data points in the top-k clusters, respectively. Except for the Mushroom and Chess datasets, the top-k clusters detected by CD-Clustering include all or nearly all of the data points. This result verifies the effectiveness of the simple estimation of $R$ and of the Louvain method. Finally, the column Runtime shows the runtime of CD-Clustering in milliseconds; it grows roughly with the number of pairwise distances, consistent with the complexity analysis in Section 4.2.

5.2. Evaluation Metrics

To evaluate the performance of clustering algorithms, we use the same metrics as in [8, 9, 15]. If the dataset contains $k$ classes, then for a given clustering let $a_i$ denote the number of data objects that are correctly assigned to class $C_i$, let $b_i$ denote the number of data objects that are incorrectly assigned to class $C_i$, and let $c_i$ denote the number of data objects that are incorrectly rejected from class $C_i$. The precision, recall, and accuracy are defined as
\[
PR = \frac{1}{k} \sum_{i=1}^{k} \frac{a_i}{a_i + b_i}, \qquad
RE = \frac{1}{k} \sum_{i=1}^{k} \frac{a_i}{a_i + c_i}, \qquad
AC = \frac{1}{n} \sum_{i=1}^{k} a_i.
\]

We demonstrate how to find the best confusion matrix and compute the precision, recall, and accuracy metrics in the following example.

Example 3. Assume that a dataset of $n$ objects is clustered into $k$ clusters, with the ground-truth and predicted cluster labels as given in Table 3.
To find the best confusion matrix, we evaluate all $k!$ mappings from the set of predicted labels to the ground-truth labels. For example, one such mapping gives us the confusion matrix in Table 4. The value in each cell is the number of (ground-truth, predicted) label pairs appearing in Table 3. Note that the sum of each column is equal to the number of objects in the corresponding predicted cluster. From the confusion matrix we then compute precision, recall, and accuracy as defined above, and the best mapping is the one maximizing these metrics.
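A small C++ sketch of this brute-force search (our own illustration): it enumerates all k! mappings from predicted clusters to ground-truth classes using std::next_permutation and returns the best accuracy; precision and recall follow similarly once the best mapping is fixed. The confusion-matrix layout (ground-truth classes as rows, predicted clusters as columns) is an assumption.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Search all permutations mapping predicted clusters to ground-truth classes
// and return the best accuracy = (sum of matched diagonal counts) / n.
double best_accuracy(const std::vector<std::vector<long long>>& confusion) {
    const std::size_t k = confusion.size();
    long long n = 0;
    for (const auto& row : confusion) n = std::accumulate(row.begin(), row.end(), n);
    std::vector<std::size_t> perm(k);
    std::iota(perm.begin(), perm.end(), std::size_t{0});  // perm[c] = class assigned to cluster c
    long long best = 0;
    do {
        long long correct = 0;
        for (std::size_t c = 0; c < k; ++c) correct += confusion[perm[c]][c];
        best = std::max(best, correct);
    } while (std::next_permutation(perm.begin(), perm.end()));
    return static_cast<double>(best) / static_cast<double>(n);
}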

5.3. Clustering Results

For comparison, we choose the algorithms by Cao et al. [8] and Khan and Ahmad [9] as well as the random k-modes [3]. We rerun the Java implementation provided in [9] and get a confusion matrix for each dataset. Then we find the best evaluation metrics using the brute-force technique in Example 3. Surprisingly, the metrics we obtain on the seven datasets used in [9] do not fully match the reported ones: they match only on the Vote data, we get a better value for Soybean, and worse values on the other five datasets. For Cao’s algorithm, our C++ implementation reproduces the reported results on three of the four datasets tested in [8], namely, Soybean, Mushroom, and Breast-Cancer; worse metrics appear on the Zoo data. However, our results agree with the Python implementation of Cao’s algorithm at [27]. The results for random k-modes with 10,000 runs per dataset are also somewhat different from those in [8, 9].

The clustering results for the ten categorical datasets are summarized in Tables 5–14. In terms of accuracy, precision, and recall, our scheme achieves the following results:
(i) Accuracy: our scheme outperforms or equals the other methods in 7 cases, in particular with large margins on the Lung-Cancer, Breast-Cancer, Dermatology, and Nursery datasets.
(ii) Precision: our scheme outperforms or equals the other methods in 7 cases.
(iii) Recall: our scheme outperforms or equals the other methods in 7 cases.

To better understand the performance of CD-Clustering, we revisit Table 2. There is a strong correlation between the accuracy metric and the gaps among $R$, avg.intra.dist, and avg.inter.dist. If the ground-truth average intracluster and intercluster distances are far apart and $R$ is close to the former, we can get high accuracy (larger than 0.8). This is the case for Soybean, Zoo, Breast-Cancer, Dermatology, and Vote. The two datasets with the lowest accuracy, Nursery and Heart, have the smallest gaps between the ground-truth intracluster and intercluster distances. The three remaining datasets have medium accuracy despite the small distance gap. Out of the ten datasets, our CD-Clustering performs worst, that is, only comparable to or worse than the random k-modes, on Mushroom and Chess. This is reflected in the ratio of top-k size to $n$: 5,366/8,124 (Mushroom) and 2,389/3,196 (Chess). Also, $k$ is equal to 2 in these two datasets. These facts suggest that when $k$ and the intra-/intercluster distance gap are both small, CD-Clustering must struggle harder to find the top-k communities.

6. Conclusion

Rather than using the k-modes algorithm with heuristic initialization methods, we propose in this paper a novel clustering scheme, CD-Clustering, for categorical data. By applying the Louvain method, a widely used community detection technique, CD-Clustering can uncover highly homogeneous groups of categorical data points using only the distance information. CD-Clustering builds the simple graph $G$ by keeping only the pairs whose Hamming distance is within a threshold $R$, which is estimated simply using the number of clusters and the distance distribution. The evaluation against two k-modes initialization techniques confirms the effectiveness of CD-Clustering. In future work, we plan to reduce the complexity of CD-Clustering for better scalability.

Notations

$D$: Dataset with $n$ data points
$A = \{A_1, \ldots, A_d\}$: Set of categorical attributes
$\mu_1, \ldots, \mu_k$: $k$ modes of $D$
$k$: Number of clusters
$d_H(x_i, x_j)$: Hamming distance between $x_i$ and $x_j$
$R$: Hamming distance threshold
$G(D, R)$: Simple graph for $D$ with parameter $R$.

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this paper.