Abstract

The diameter of a cluster is the maximum intracluster distance between pairs of instances within the same cluster, and the split of a cluster is the minimum distance between instances within the cluster and instances outside the cluster. Given a few labeled instances, this paper makes two contributions. First, we present a simple and fast clustering algorithm with the following property: if the ratio of the minimum split to the maximum diameter (RSD) of the optimal solution is greater than one, the algorithm returns optimal solutions for three clustering criteria. Second, we study the metric learning problem: learn a distance metric that makes the RSD as large as possible. Compared with existing metric learning algorithms, one of our metric learning algorithms is computationally efficient: it is a linear programming model rather than the semidefinite programming model used by most existing algorithms. We demonstrate empirically that the supervision and the learned metric can improve the clustering quality.

1. Introduction

Clustering is the unsupervised classification of instances into clusters in a way that attempts to minimize the intracluster distance and to maximize the intercluster distance. Two criteria commonly used to measure the quality of a clustering are diameter and split. The diameter of a cluster is the maximum distance between pairs of instances within the same cluster, and the split of a cluster is the minimum distance between instances within the cluster and instances outside the cluster. Clearly, the diameter of a cluster is a natural indication of homogeneity of the cluster and the split of a cluster is a natural indication of separation between the cluster and other clusters.

Many authors have studied optimization problems related to the diameter or the split of a cluster, for example, minimizing the maximum cluster diameter [1–4], minimizing the sum of cluster diameters or radii [5–8], or maximizing the ratio of the minimum split to the maximum diameter [9]. The well-known single-linkage and complete-linkage clustering algorithms also optimize these two criteria, respectively: the former maximizes the minimum cluster split, and the latter attempts to minimize the maximum cluster diameter.

Ackerman and Ben-David [10] defined a set of axioms that a measure of cluster quality should satisfy: scale invariance, isomorphism invariance, weak local consistency, and cofinal richness, and they showed that the RSD clustering criterion, that is, maximizing the ratio of the minimum split to the maximum diameter, satisfies those axioms. Given data $X$ and a number $k$ of clusters, let $\mathrm{RSD}^*(X, k)$ be the maximum RSD among all possible partitions of $X$ into $k$ clusters. If $\mathrm{RSD}^*(X, k) > 1$, the optimal solution with respect to the RSD criterion has the following property: the distance between each pair of instances in different clusters is larger than the distance between each pair of instances within the same cluster. Hence, we say that data $X$ is well-clusterable if $\mathrm{RSD}^*(X, k) > 1$, and that $X_1$ is more clusterable than $X_2$ if $\mathrm{RSD}^*(X_1, k) > \mathrm{RSD}^*(X_2, k)$.

Ackerman and Ben-David [11] showed that if $\mathrm{RSD}^*(X, k) > 1$, then the optimal solution with respect to the RSD criterion can be found in time polynomial in $n$, where $n$ is the number of instances in $X$. In this paper, we further show that if $\mathrm{RSD}^*(X, k) > 1$, then the optimal solutions with respect to the following criteria can be found using Gonzalez’s algorithm [1] in linear time: maximizing the RSD, maximizing the minimum split, and minimizing the maximum diameter.

However, the condition $\mathrm{RSD}^*(X, k) > 1$ is too strong and unrealistic for real-world data. So a natural question arises: if $X$ is poorly clusterable ($\mathrm{RSD}^*(X, k) \le 1$), can $X$ be made more clusterable by a metric learning approach, so that Gonzalez’s algorithm together with the learned metric performs better than it does with the original metric?

In the clustering literature, there are two common ways to add supervision information into clustering. First, a small portion of labeled training data is added to the unlabeled data; this setting is also called semisupervised learning [12, 13]. Second, instead of specifying class labels, pairwise constraints are specified [14, 15]: a must-link constraint requires that the two involved instances be within the same cluster, whereas the two instances involved in a cannot-link constraint must be in different clusters.

Metric learning can be grouped into two categories, that is, unsupervised and supervised metric learning. In this paper, we focus on supervised metric learning. Supervised metric learning attempts to learn distance metrics that keep instances with the same class label (or with a must-link constraint) close and separate instances with different class labels (or with a cannot-link constraint) far away. Since there are many possible ways to realize this intuition, a great number of algorithms have been developed for supervised metric learning, for example, Local Linear Discriminative Analysis (LLDA) [16], Relevant Components Analysis (RCA) [17], Xing et al.’s algorithm [18], Locally Linear Metric Adaptation (LLMA) [19], Neighborhood Component Analysis (NCA) [20], Discriminative Component Analysis (DCA) [21], Local Fisher Discriminant Analysis (LFDA) [22], Large Margin Nearest Neighbor (LMNN) [23], Local Distance Metric (LDM) [24], Information-Theoretic Metric Learning (ITML) [25], Laplacian Regularized Metric Learning (LRML) [26], Generalized Sparse Metric Learning (GSML) [27], Sparse Distance Metric Learning (SDML) [28], Multi-Instance MEtric Learning (MIMEL) [29], online-reg [30], Constrained Metric Learning (CML) [31], mixture of sparse Neighborhood Components Analysis (msNCA) [32], Metric Learning with Multiple Kernel Learning (ML-MKL) [33], Least Squared residual Metric Learning (LSML) [34], and Distance Metric Learning with eigenvalue (DML-eig) [35].

Overall, empirical studies show that supervised metric learning algorithms can usually outperform unsupervised ones by exploiting either the label information or the side information presented in pairwise constraints. However, despite extensive studies, most of the existing metric learning algorithms have at least one of the following drawbacks: they need to solve a nontrivial optimization problem, for example, a semidefinite programming problem; they have parameters to tune; or the returned solution is only locally optimal.

In this paper, we present two simple metric learning models to make data more clusterable. The two models are computationally efficient, parameter-free, and free of local optima. The rest of this paper is organized as follows. Section 2 gives the notation and the definitions of the clustering criteria used in the paper. Section 3 gives Gonzalez’s farthest-point clustering algorithm for unsupervised learning, presents a nearest neighbor-based clustering algorithm for semisupervised learning, and discusses the properties of the two algorithms. In Section 4, we formulate the problem of making data more clusterable as a convex optimization problem. Section 5 presents the experimental results. We conclude the paper in Section 6.

2. Notation and Preliminaries

We use the following notation in the rest of the paper.
$|\cdot|$: the cardinality of a set.
$X = \{x_1, x_2, \ldots, x_n\}$: the set of instances (in $m$-dimensional space) to be clustered.
$d(x, y)$: the Euclidean distance between $x$ and $y$.
$S_1, \ldots, S_k$: the small subsets of $X$ with given labels, that is, the supervision. In this paper, we assume that either $|S_i| \ge 1$ for $i = 1, \ldots, k$ (the case of semisupervised learning) or $S_i = \emptyset$ for $i = 1, \ldots, k$ (the case of unsupervised learning).
$\Pi(X, k)$: the set of all partitions of $X$ into $k$ nonempty and disjoint clusters $C_1, \ldots, C_k$.

Definition 1. Given the supervision $S_1, \ldots, S_k$, we say that a partition $\pi \in \Pi(X, k)$ respects the semisupervised constraints if $\pi$ satisfies the following conditions: (1) all instances in $S_i$ must be within the same cluster of $\pi$ for $i = 1, \ldots, k$; and (2) any pair of instances $x \in S_i$ and $y \in S_j$ with $i \ne j$ must be in different clusters of $\pi$.

In the rest of the paper, we use $\Pi_S(X, k)$ to denote the subset of $\Pi(X, k)$ that respects the semisupervised constraints, and we require that any partition in the context of semisupervised learning respect the semisupervised constraints.

Definition 2. For a set $C \subseteq X$ of objects, the split of $C$ is defined as $s(C) = \min_{x \in C,\, y \in X \setminus C} d(x, y)$. For a partition $\pi = \{C_1, \ldots, C_k\}$, the split of $\pi$, denoted $s(\pi)$, is the minimum among $s(C_1), \ldots, s(C_k)$.

Definition 3. For a set $C \subseteq X$ of objects, the diameter of $C$ is defined as $D(C) = \max_{x, y \in C} d(x, y)$. For a partition $\pi = \{C_1, \ldots, C_k\}$, the diameter of $\pi$, denoted $D(\pi)$, is the maximum among $D(C_1), \ldots, D(C_k)$.

Definition 4. The unsupervised and semisupervised max-min split problems are defined as, respectively, $\max_{\pi \in \Pi(X, k)} s(\pi)$ and $\max_{\pi \in \Pi_S(X, k)} s(\pi)$.

Definition 5. The unsupervised and semisupervised min-max diameter problems are defined as, respectively, $\min_{\pi \in \Pi(X, k)} D(\pi)$ and $\min_{\pi \in \Pi_S(X, k)} D(\pi)$.

Definition 6. The unsupervised and semisupervised max-RSD problems are defined as, respectively, $\max_{\pi \in \Pi(X, k)} s(\pi)/D(\pi)$ and $\max_{\pi \in \Pi_S(X, k)} s(\pi)/D(\pi)$. For the unsupervised max-RSD problem, Wang and Chen [9] presented an exact algorithm for a special case and a 2-approximation algorithm for the general case; however, the worst-case time complexity of both algorithms is high, and thus they are impractical for large-scale data.
Let $C \subseteq X$; we use $d_{\max}(x, C)$ to denote the maximum distance between the instance $x$ and the instances in $C$, that is, $d_{\max}(x, C) = \max_{y \in C} d(x, y)$; similarly, $d_{\min}(x, C) = \min_{y \in C} d(x, y)$.
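To make these definitions concrete, the following short Python sketch (our own illustration; the function name and the label encoding are assumptions, not part of the paper) computes the split $s(\pi)$, the diameter $D(\pi)$, and $\mathrm{RSD}(\pi) = s(\pi)/D(\pi)$ of a partition given as an array of integer cluster labels.

import numpy as np

def split_diameter_rsd(X, labels):
    # X: (n, m) array of instances; labels: length-n array of cluster indices.
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise d(x, y)
    iu = np.triu_indices(X.shape[0], k=1)                          # each unordered pair once
    same = (labels[:, None] == labels[None, :])[iu]
    diameter = dist[iu][same].max()    # D(pi): maximum intracluster distance
    split = dist[iu][~same].min()      # s(pi): minimum intercluster distance
    return split, diameter, split / diameter

In the terminology of Section 1, a partition $\pi$ with $\mathrm{RSD}(\pi) > 1$ certifies that the data are well-clusterable.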

3. Well-Clusterable Data: Find the Optimal Solution Efficiently

In this section, we show that if $\mathrm{RSD}^*(X, k) > 1$, the max-RSD problem, the max-min split problem, and the min-max diameter problem can be simultaneously solved by Gonzalez’s algorithm for unsupervised learning (Section 3.1) and by a nearest neighbor-based algorithm for semisupervised learning (Section 3.2), respectively. At the same time, we also discuss the properties of the two algorithms for the case of $\mathrm{RSD}^*(X, k) \le 1$.

3.1. Unsupervised Learning

The farthest-point clustering (FPC) algorithm proposed by Gonzalez [1] is shown in Algorithm 1, where nearest neighbor has its literal meaning; that is, the nearest neighbor of $x$ in a set $T$ is $\arg\min_{y \in T} d(x, y)$.

Algorithm: FPC
Input: The input data X and the number k of clusters.
Output: The partition pi of X.
T := ∅;
Randomly select an instance x from X;
T := T ∪ {x};
while (|T| < k)
    x := the instance in X \ T with the maximum d_min(x, T);
    T := T ∪ {x};
end while
Let pi be the partition obtained by assigning each instance x of X to
its nearest neighbor in T (if x ∈ T, the nearest neighbor of
x in T is x itself);
return pi;
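As a concrete illustration of Algorithm 1, here is a minimal Python sketch of FPC that maintains the nearest-center table discussed below; the function and variable names are our own, and the sketch assumes Euclidean distances.

import numpy as np

def fpc(X, k, seed=None):
    # X: (n, m) array; returns cluster labels and the k selected instances (the set T).
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [int(rng.integers(n))]                     # randomly select the first instance
    d_min = np.linalg.norm(X - X[centers[0]], axis=1)    # distance to the nearest selected instance
    labels = np.zeros(n, dtype=int)
    while len(centers) < k:
        nxt = int(np.argmax(d_min))                      # farthest instance from the current T
        centers.append(nxt)
        d_new = np.linalg.norm(X - X[nxt], axis=1)
        closer = d_new < d_min                           # update the nearest-center table
        labels[closer] = len(centers) - 1
        d_min[closer] = d_new[closer]
    return labels, X[centers]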

Theorem 7. For unsupervised learning, if $\mathrm{RSD}^*(X, k) > 1$, then the partition $\pi$ returned by FPC is simultaneously the optimal solution of the max-RSD problem, the max-min split problem, and the min-max diameter problem.

Proof. (a) The proof for the max-RSD problem: let $\pi^* = \{C_1^*, \ldots, C_k^*\}$ be the optimal partition of the max-RSD problem; then $\mathrm{RSD}(\pi^*) = s(\pi^*)/D(\pi^*) > 1$, and we have
$s(\pi^*) > D(\pi^*)$;  (10)
that is, the distance between any pair of instances in different clusters of $\pi^*$ is larger than the distance between any pair of instances within the same cluster of $\pi^*$. We prove the following proposition: any pair of instances in $T$ (see Algorithm 1) must be in different clusters of $\pi^*$; that is, $T$ contains exactly one instance of each cluster $C_i^*$, $i = 1, \ldots, k$. If this holds, then by (10), for any instance $x \in X \setminus T$, its nearest neighbor in $T$ must be the instance of $T$ that lies in the same cluster of $\pi^*$ as $x$, and hence $\pi = \pi^*$.
We prove the proposition by contradiction. Assume that there exists a pair of instances $u$ and $v$ in $T$ that belong to the same cluster $C_i^*$ for some $i$. Without loss of generality, let $u$ be selected into $T$ before $v$. Then $d_{\min}(v, T) \le d(u, v) \le D(\pi^*)$ when selecting $v$ into $T$. Note that $|T| < k$ before selecting $v$; hence there exists at least one cluster $C_j^*$ ($j \ne i$) such that no instance in $T$ belongs to $C_j^*$. By (10), for any $y \in C_j^*$, we have $d_{\min}(y, T) \ge s(\pi^*) > D(\pi^*) \ge d_{\min}(v, T)$; so $v$ has no chance to be selected into $T$, since we always select the instance with the maximum $d_{\min}(\cdot, T)$, a contradiction. Thus the proposition holds.
(b) Any partition other than $\pi$ must separate some pair $u$, $v$ of instances within the same cluster of $\pi$ into different clusters, which, by (10), makes its split at most $d(u, v) \le D(\pi) < s(\pi)$; hence the conclusion for the max-min split problem holds.
(c) Similarly, any partition other than $\pi$ must group some pair $u$, $v$ of instances from different clusters of $\pi$ into the same cluster, which, by (10), makes its diameter at least $d(u, v) \ge s(\pi) > D(\pi)$; hence the conclusion for the min-max diameter problem holds.

Clearly, the time complexity of FPC is $O(kn)$ by maintaining a nearest-neighbor table that records, for each instance $x$, its nearest neighbor in $T$ and the corresponding distance between $x$ and that nearest neighbor. The space complexity is $O(n)$. So, the time complexity and the space complexity are both linear in $n$ for a fixed $k$. Using a more complicated approach, the algorithm can be implemented in $O(n \log k)$ time, but that implementation depends exponentially on the dimension [3].

Now, a natural question arises: if $\mathrm{RSD}^*(X, k) \le 1$, how does the FPC algorithm perform? Although in this paper we cannot give a performance guarantee for the FPC algorithm on the max-RSD problem and the max-min split problem when $\mathrm{RSD}^*(X, k) \le 1$, Gonzalez [1] proved the following theorem (see also [2, 3]).

Theorem 8 (see [1]). FPC is a 2-approximation algorithm for the unsupervised min-max diameter problem with the triangle inequality satisfied, for any $k$. Furthermore, for $k \ge 3$, the $(2 - \varepsilon)$-approximation of the unsupervised min-max diameter problem with the triangle inequality satisfied is NP-complete for any $\varepsilon > 0$.
So, as far as the approximation ratio is concerned, the FPC algorithm is the best possible for the unsupervised min-max diameter problem unless P = NP.

3.2. Semi-Supervised Learning

For semisupervised learning, we present a nearest neighbor-based clustering (NNC) algorithm as shown in Algorithm 2. The algorithm is self-explanatory, and we do not give a further explanation.

Algorithm: NNC
Input: The input data X, the number k of clusters, and the
labeled subsets S_1, ..., S_k of X.
Output: The partition pi of X.
for each unlabelled instance x ∈ X \ (S_1 ∪ ... ∪ S_k), compute d_min(x, S_i)
    for i = 1, ..., k;
Let C_i := S_i for i = 1, ..., k;
for each unlabelled instance x
    i* := arg min_{1 ≤ i ≤ k} d_min(x, S_i);
    C_{i*} := C_{i*} ∪ {x};
end for
return pi = {C_1, ..., C_k};
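A minimal Python sketch of Algorithm 2 follows (our own names; the labeled subsets are passed as lists of row indices into X, which is an assumption about the data layout).

import numpy as np

def nnc(X, labeled_sets):
    # X: (n, m) array; labeled_sets: list of k disjoint index arrays giving S_1, ..., S_k.
    n = X.shape[0]
    labels = np.full(n, -1, dtype=int)
    for i, idx in enumerate(labeled_sets):
        labels[idx] = i                                   # cluster C_i starts as S_i
    for p in np.where(labels == -1)[0]:                   # each unlabelled instance
        # d_min(x_p, S_i) for every labeled subset S_i
        dists = [np.linalg.norm(X[idx] - X[p], axis=1).min() for idx in labeled_sets]
        labels[p] = int(np.argmin(dists))                 # assign x_p to the nearest S_i
    return labels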

Theorem 9. For semisupervised learning, if the optimal value of the semisupervised max-RSD problem is greater than one, then the partition $\pi$ returned by NNC is simultaneously the optimal solution of the semisupervised max-RSD problem, the semisupervised max-min split problem, and the semisupervised min-max diameter problem.

Proof. The proof for the semisupervised max-RSD problem: let $\pi^* = \{C_1^*, \ldots, C_k^*\}$ be the optimal partition of the semisupervised max-RSD problem. Since $\pi^*$ respects the supervision, we can replace $S_i$ by a super-instance $s_i$ for $i = 1, \ldots, k$; then each cluster of $\pi^*$ contains exactly one super-instance (without loss of generality, we assume that $s_i$ is in the cluster $C_i^*$ for $i = 1, \ldots, k$). Let $\pi = \{C_1, \ldots, C_k\}$ be the partition returned by NNC; then, according to the algorithm NNC, each cluster of $\pi$ also contains exactly one super-instance, and without loss of generality, we also assume that $s_i$ is in the cluster $C_i$ for $i = 1, \ldots, k$. For each unlabeled instance $x \in C_i^*$, since the split of $\pi^*$ is larger than its diameter, we have $d_{\min}(x, S_i) < d_{\min}(x, S_j)$ for any $j \ne i$; thus the nearest neighbor of $x$ in $S_1 \cup \cdots \cup S_k$ belongs to $S_i$ and NNC assigns $x$ to $C_i$ for $i = 1, \ldots, k$, and therefore $\pi = \pi^*$.
The proofs for the semisupervised max-min split problem and the semisupervised min-max diameter problem are similar to parts (b) and (c) of the proof of Theorem 7, respectively, and are omitted here.

The time complexity of NNC using a simple implementation is $O\bigl(n \sum_{i=1}^{k} |S_i|\bigr)$, since each unlabeled instance computes its distance to every labeled instance.

The space complexity of NNC is $O(n)$. Since we assume that $S_1, \ldots, S_k$ are small sets, the time and space complexities are also linear in $n$ when $|S_1|, \ldots, |S_k|$ are regarded as constants.

Similar to Theorem 8, we have the following theorem for the semisupervised min-max diameter problem.

Theorem 10. NNC is a 2-approximation algorithm for the semisupervised min-max diameter problem with the triangle inequality satisfied.

Proof. Let $s_1, \ldots, s_k$ be the super-instances defined in the proof of Theorem 9, let $d^* = \max\{\min_{1 \le i \le k} d(x, s_i) : x \text{ is an unlabelled instance}\}$, and let $x^*$ be any unlabelled instance such that $\min_{1 \le i \le k} d(x^*, s_i) = d^*$. Since the optimal partition of the semisupervised min-max diameter problem must respect the supervision, $x^*$ and $s_i$ for some $i$ must be within the same cluster of the optimal solution, so $d^* \le d(x^*, s_i) \le D^{\mathrm{opt}}$, where $D^{\mathrm{opt}}$ denotes the diameter of the optimal solution of the semisupervised min-max diameter problem. Now consider the partition $\pi$ returned by NNC. Since each unlabeled instance is assigned to its nearest neighbor among $s_1, \ldots, s_k$, for any cluster $C_i$ of $\pi$ (assume that the super-instance in $C_i$ is $s_i$) we have $d(x, s_i) \le d^*$ for every instance $x \in C_i$, and $d(x, y) \le d(x, s_i) + d(s_i, y) \le 2d^*$ for every pair $x, y \in C_i$ by the triangle inequality. So, $D(\pi) \le 2d^* \le 2D^{\mathrm{opt}}$, and the theorem holds.

4. The Metric Learning Models

If the given data are poorly clusterable, that is, $\mathrm{RSD}^*(X, k)$ is far less than one, the algorithms FPC and NNC may perform poorly. Given the supervision, we use metric learning to make the supervised data more clusterable, and then the two algorithms can be used with the new metric.

Supervised metric learning attempts to learn distance metrics that keep instances with the same class label (or with a must-link constraint) close and separate instances with different class labels (or with a cannot-link constraint) far away. As discussed in the first section, there are many possible ways to realize this intuition; for example, Xing et al. [18] presented the following model:

minimize over $A$:  $\sum_{(x, y) \in \mathrm{ML}} d_A^2(x, y)$
subject to:  $\sum_{(x, y) \in \mathrm{CL}} d_A(x, y) \ge 1$,   (13)
$A \succeq 0$.   (14)

In the above model, ML denotes the set of must-link constraints, CL denotes the set of cannot-link constraints, $A$ is a Mahalanobis distance matrix, and $d_A(x, y)$ denotes the distance between two instances $x$ and $y$ with respect to $A$; that is, $d_A(x, y) = \sqrt{(x - y)^{T} A (x - y)}$, where $(\cdot)^{T}$ denotes the transpose of a matrix or a vector. The constraint (14) requires that $A$ be a positive semidefinite matrix; that is, $z^{T} A z \ge 0$ for every vector $z$. The choice of the constant 1 on the right-hand side of (13) is arbitrary but not important; changing it to any other positive constant $c$ results only in $A$ being replaced by $c^2 A$.
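For illustration, the distance just defined can be computed as follows (a hypothetical helper of our own, not code from [18]):

import numpy as np

def mahalanobis_distance(x, y, A):
    # d_A(x, y) = sqrt((x - y)^T A (x - y)) for a positive semidefinite matrix A.
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ A @ d))

With $A$ equal to the identity matrix, $d_A$ reduces to the ordinary Euclidean distance.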

Note that the matrix $A$ can be either a full matrix or a diagonal matrix. In natural language, Xing et al.’s model minimizes the sum of squared distances with respect to $A$ between pairs of instances with must-link constraints, subject to the following constraints: (a) the sum of distances with respect to $A$ between pairs of instances with cannot-link constraints is greater than or equal to one, and (b) $A$ is a positive semidefinite matrix.

Xing et al.’s model, as well as most existing metric learning models, is a semidefinite programming problem and is thus computationally expensive and even intractable in high-dimensional spaces in the case of a full matrix.

Inspired by the RSD clustering criterion, we propose two metric learning models: one learns a full matrix and the other learns a diagonal matrix. In this section, the supervision can be given either in the form of labeled sets or in the form of pairwise constraints.

4.1. The Labeled Sets

Given the supervision $S_1, \ldots, S_k$, we want to learn a Mahalanobis distance matrix $A$ such that the minimum split with respect to $d_A$ among $S_1, \ldots, S_k$ is maximized subject to the following constraints: (a) the distance between each pair of instances with the same class label is less than or equal to one and (b) $A$ is a positive semidefinite matrix. Formally, we have the following optimization problem (the case of a full matrix).

The Case of Full Matrix. Consider

maximize  $t$
subject to  $d_A(x, y) \ge t$ for all $x \in S_i$, $y \in S_j$, $i \ne j$,   (17)
$d_A(x, y) \le 1$ for all $x, y \in S_i$, $i = 1, \ldots, k$,   (18)
$A \succeq 0$.

The constraint (17) requires that the scalar variable $t$ (the minimum split) be at most each distance between a pair of instances with different class labels; together with the objective, $t$ becomes the minimum of those distances. The constraint (18) requires that the distance between each pair of instances with the same class label be less than or equal to one. The optimization objective is to maximize $t$. Similar to (13), the choice of the constant 1 on the right-hand side of (18) is arbitrary but not important and can be set to any positive constant.
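As an illustration only, the model above can be handed to a generic convex solver almost verbatim. The sketch below uses CVXPY and works with squared distances (which is equivalent, since squaring is monotone on nonnegative values); the function name, the data layout, and the solver choice are our assumptions, and this is not the code used in the experiments.

import numpy as np
import cvxpy as cp

def learn_full_metric(labeled_sets):
    # labeled_sets: list of k NumPy arrays, each of shape (n_i, m), one per class.
    m = labeled_sets[0].shape[1]
    A = cp.Variable((m, m), PSD=True)   # Mahalanobis matrix, required to be positive semidefinite
    t = cp.Variable()                   # squared minimum split
    cons = []
    for i, Si in enumerate(labeled_sets):
        for j, Sj in enumerate(labeled_sets):
            for x in Si:
                for y in Sj:
                    q = (x - y) @ A @ (x - y)        # squared distance w.r.t. A, affine in A
                    # same label: counterpart of (18); different labels: counterpart of (17)
                    cons.append(q <= 1 if i == j else q >= t)
    prob = cp.Problem(cp.Maximize(t), cons)
    prob.solve()
    return A.value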

The full matrix model is an SDP optimization problem, and, theoretically, the global optimal solution can be found efficiently [36]. However, when $A$ is a full matrix, the number of variables is quadratic in the number of dimensions $m$, and thus the model is prohibitive for problems with a large number of dimensions. To avoid this problem, we can require that $A$ be a diagonal matrix. Since $A$ is diagonal, $A$ is positive semidefinite if and only if $w_l \ge 0$ for $l = 1, \ldots, m$, where $w_l$ is the $l$th diagonal entry. So, learning a diagonal matrix $A$ is equivalent to learning a vector $w = (w_1, \ldots, w_m)$ using the following model (the case of a diagonal matrix).

The Case of Diagonal Matrix. Consider

maximize  $t$
subject to  $d_w^2(x, y) \ge t$ for all $x \in S_i$, $y \in S_j$, $i \ne j$,   (22)
$d_w^2(x, y) \le 1$ for all $x, y \in S_i$, $i = 1, \ldots, k$,   (23)
$w_l \ge 0$ for $l = 1, \ldots, m$,   (24)

where $d_w^2(x, y) = \sum_{l=1}^{m} w_l (x_l - y_l)^2$ and $w_l$ denotes the $l$th component of $w$.

The constraint (24) requires that each component of $w$ be greater than or equal to zero.

Now, since the optimization objective and all constraints are linear, the above optimization problem is a linear programming problem with $m + 1$ variables and $O(k^2 s^2 + m)$ inequality constraints (assuming each $S_i$ has the same size $s$). When $|S_i|$ is small for $i = 1, \ldots, k$, the global optimal solution can be found efficiently using an optimization tool package, for example, the MATLAB linprog function or CVX, a MATLAB package for disciplined convex programming (http://cvxr.com/cvx/download/).
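Below is a minimal sketch of how the diagonal model can be passed to a linear programming solver; we use scipy.optimize.linprog here instead of the MATLAB linprog mentioned above, and the function and variable names are our own.

import numpy as np
from scipy.optimize import linprog

def learn_diag_metric(labeled_sets):
    # labeled_sets: list of k NumPy arrays, each of shape (n_i, m), one per class.
    m = labeled_sets[0].shape[1]
    A_ub, b_ub = [], []                                   # rows of A_ub @ [w, t] <= b_ub
    for i, Si in enumerate(labeled_sets):
        for j, Sj in enumerate(labeled_sets):
            for x in Si:
                for y in Sj:
                    sq = (x - y) ** 2                     # squared coordinate differences
                    if i == j:                            # constraint (23): sum_l w_l sq_l <= 1
                        A_ub.append(np.append(sq, 0.0)); b_ub.append(1.0)
                    else:                                 # constraint (22): t - sum_l w_l sq_l <= 0
                        A_ub.append(np.append(-sq, 1.0)); b_ub.append(0.0)
    c = np.zeros(m + 1); c[-1] = -1.0                     # maximize t, i.e., minimize -t
    bounds = [(0, None)] * (m + 1)                        # constraint (24): w >= 0 (and t >= 0)
    res = linprog(c, A_ub=np.vstack(A_ub), b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:m]                                      # the learned weight vector w

The returned vector $w$ defines the distance $d_w$ used by FPC_Diag and NNC_Diag in Section 5.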

4.2. Pairwise Constraints

If the supervision is given in the form of pairwise constraints, that is, must-link and cannot-link constraints, the models still work after a minor modification. Let ML be the set of must-link constraints and CL the set of cannot-link constraints; then the full matrix model and the diagonal matrix model should be modified as follows: constraint (17) becomes $d_A(x, y) \ge t$ for all $(x, y) \in \mathrm{CL}$, constraint (18) becomes $d_A(x, y) \le 1$ for all $(x, y) \in \mathrm{ML}$, constraint (22) becomes $d_w^2(x, y) \ge t$ for all $(x, y) \in \mathrm{CL}$, and constraint (23) becomes $d_w^2(x, y) \le 1$ for all $(x, y) \in \mathrm{ML}$, respectively.

However, if the supervision is given in the form of pairwise constraints, it is nontrivial to decide whether there is a partition of $X$ into $k$ clusters that satisfies all of those pairwise constraints (we call this the feasibility problem). For cannot-link constraints, Davidson and Ravi showed that the feasibility problem is equivalent to the graph $k$-colorability problem [37] and is thus NP-complete [38], whereas the feasibility problem is trivial if the supervision is given in the form of labeled sets. Of course, if we do not require that all of those pairwise constraints be satisfied, the FPC algorithm can naturally be used together with the metric learned from the pairwise constraints.

Clearly, the metric learning models proposed in this paper are practical only when the cardinalities of the sets of labeled instances or the number of pairwise constraints are small. Otherwise, the problem is usually overconstrained and there is no feasible solution.

5. The Experimental Results

5.1. The Compared Algorithms and Benchmark Datasets

To validate whether semisupervised learning performs better than unsupervised learning, whether metric learning can improve the clustering quality, and whether our metric learning model performs better than Xing et al.’s for the FPC and NNC algorithms, we implemented the following algorithms:
(i) the FPC algorithm as shown in Algorithm 1;
(ii) the NNC algorithm as shown in Algorithm 2;
(iii) the FPC algorithm with our metric learning model (the case of a diagonal matrix) (FPC_Diag); that is, we first use our metric learning model to learn a vector $w$ and then run the FPC clustering algorithm with the learned vector, computing distances as $d_w(x, y) = \sqrt{\sum_{l=1}^{m} w_l (x_l - y_l)^2}$;
(iv) the NNC algorithm with our metric learning model (the case of a diagonal matrix) (NNC_Diag);
(v) the FPC algorithm with Xing et al.’s metric learning algorithm (also using a diagonal matrix) (FPC_Xing); that is, we first use Xing et al.’s metric learning algorithm to learn a vector and then run the FPC clustering algorithm with the learned vector;
(vi) the NNC algorithm with Xing et al.’s metric learning algorithm (also using a diagonal matrix) (NNC_Xing).

We also implemented the following algorithms as baseline approaches. The reason we select $k$-means-based methods for comparison is that $k$-means is very simple and is also a linear-time algorithm when $k$ and the number of repetitions are regarded as constants:
(i) the constrained $k$-means [39] with Xing et al.’s metric learning algorithm (CopK_Xing);
(ii) the pairwise constrained $k$-means with Xing et al.’s metric learning algorithm (PCK_Xing) [40, 41].

For Xing et al.’s metric learning method, the code is downloaded from Xing’s home page: http://www.cs.cmu.edu/~epxing/publications.html.

We conduct experiments on twenty UCI real world datasets obtained from the Machine Learning Repository of the University of California, Irvine [42]. The information about those datasets is summarized in Table 1.

5.2. The Experiments Setup

We first perform the following preprocessing: for a nominal attribute with $v$ different values, we replace these values by the integers $1, \ldots, v$, and then all attributes are normalized to the interval [1, 2].
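A small sketch of the rescaling step (our own helper; it assumes nominal values have already been mapped to integers):

import numpy as np

def normalize_to_1_2(X):
    # Linearly rescale every attribute (column) of X to the interval [1, 2].
    X = np.asarray(X, dtype=float)
    mn, mx = X.min(axis=0), X.max(axis=0)
    span = np.where(mx > mn, mx - mn, 1.0)   # guard against constant attributes
    return 1.0 + (X - mn) / span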

Except for Ecoli, $|S_i|$ is set to five for $i = 1, \ldots, k$. Because the smallest of the eight classes in the Ecoli dataset contains only two instances, $|S_i|$ is set to two for Ecoli.

Xing et al.’s metric learning is carried out on the pairwise constraints derived from the supervision, that is, the must-link and cannot-link constraints induced by the labeled sets $S_1, \ldots, S_k$. In the clustering phase of CopK_Xing and PCK_Xing, it is the centroid of each $S_i$ that participates in the clustering process, which guarantees that all must-link constraints are satisfied.

The stopping condition is that either the number of repetitions exceeds 100 or the objective difference between two consecutive repetitions is less than $10^{-6}$.

We use the Rand Index [43] to measure the clustering quality in our experiments. The Rand Index reflects the agreement of the clustering result with the ground truth. Here, the ground truth is given by the data’s class labels. Let $a$ be the number of instance pairs that are assigned to the same cluster and have the same class label, and let $b$ be the number of instance pairs that are assigned to different clusters and have different class labels. Then, the Rand Index is defined as $\mathrm{RI} = (a + b) / \binom{n}{2}$.
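A short sketch of this computation (our own helper; the names are illustrative):

import numpy as np

def rand_index(cluster_labels, class_labels):
    # (a + b) / C(n, 2): a = pairs in the same cluster with the same class label,
    # b = pairs in different clusters with different class labels.
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    n = len(class_labels)
    iu = np.triu_indices(n, k=1)                                      # each pair counted once
    same_cluster = (cluster_labels[:, None] == cluster_labels[None, :])[iu]
    same_class = (class_labels[:, None] == class_labels[None, :])[iu]
    a = np.count_nonzero(same_cluster & same_class)
    b = np.count_nonzero(~same_cluster & ~same_class)
    return (a + b) / (n * (n - 1) / 2)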

All algorithms are implemented in MATLAB R2009b, and the experiments are carried out on a 2.6 GHz dual-core Pentium PC with 2 GB of RAM.

5.3. The Mean Rand Index

Table 2 summarizes the mean Rand Index and the standard deviation over 20 random runs on the twenty datasets; the boldface value in each row is the highest. Table 2 shows that, although no algorithm performs better than all the others on every dataset, in general we can draw the following conclusions.
(1) The supervision can significantly improve the clustering quality: compared with FPC and FPC_Diag, the mean Rand Index of NNC and NNC_Diag over the twenty datasets increases by about 27 percent and 22 percent, respectively. Note that the increase of the Rand Index contributed by the labeled instances themselves is very small.
(2) Introducing metric learning into an existing algorithm does not always increase its performance. However, in general, the effect of our metric learning model is positive: the win/loss ratio of FPC_Diag to FPC is 11/3, and the win/loss ratio of NNC_Diag to NNC is 8/1, where algorithm A is said to defeat algorithm B if the Rand Index of A is higher than that of B by at least 0.03 (since the standard deviation is somewhat large).
(3) Compared with CopK_Xing and PCK_Xing, NNC_Diag performs a little better: the win/loss ratio of NNC_Diag to CopK_Xing is 5/3, and the win/loss ratio of NNC_Diag to PCK_Xing is 6/3.
(4) For the FPC and NNC clustering algorithms, the proposed metric learning model is better than Xing et al.’s method, especially for FPC. For FPC, Xing et al.’s method significantly decreased the performance on nine datasets, and the mean Rand Index of FPC_Xing even decreases by about 6 percent compared with FPC. The win/loss ratio of NNC_Diag to NNC_Xing is 8/1. This fact suggests that, when selecting a metric learning model for an existing clustering algorithm, the metric learning model should correspond to the clustering criterion of the clustering algorithm.

5.4. The Runtime

Figure 1 depicts the logarithmic graph of the mean runtime (in milliseconds) over 20 random runs, where the runtime of FPC, NNC, CopK, and PCK does not include the metric learning time. The legend Diag denotes the metric learning time of our diagonal matrix model, and the legend Xing denotes the metric learning time of Xing et al.’s model (the diagonal matrix). So, the runtime of FPC_Diag (NNC_Diag) is the sum of the runtimes of FPC (NNC) and Diag. Similarly, the runtime of CopK_Xing (PCK_Xing) is the sum of the runtimes of CopK (PCK) and Xing.

Figure 1 shows that both NNC and FPC are much faster than CopK and PCK, which is consistent with their time complexities: the complexity of FPC and NNC is $O(kn)$, whereas that of CopK and PCK is $O(rkn)$, where $r$ is the number of repetitions of $k$-means. Figure 1 also shows that Xing et al.’s model is slower than our model when the number of dimensions is relatively large, for example, on Ionosphere, Promoters, Sick, and Splice. On the other hand, since the number of inequality constraints is quadratic in the number of class labels, our Diag model is slower than Xing et al.’s model on datasets with a relatively large number of class labels, for example, Ecoli, Mfeat-fac, Mfeat-pix, Yeast, and Zoo.

The experimental results in Table 2 and Figure 1 show that the FPC algorithm is very fast, but its clustering results are unsatisfactory. The NNC algorithm proposed in this paper has the same time complexity as FPC, but its clustering quality is much more satisfactory than that of FPC when a few labeled instances are available.

6. Conclusion

In this paper, we studied the problem related to clusterability. We showed that if the input data are well clusterable, the optimal solutions with respect to the min-max diameter criterion, the max-min split criterion, and the max-RSD criterion can be simultaneously found in linear time for both unsupervised and semisupervised learning. For the max-RSD criterion, we also proposed two convex optimization models to make data more clusterable.

The experimental results on twenty UCI datasets demonstrate that both the supervision and the learned metric can significantly improve the clustering quality. We believe that the proposed NNC algorithm and metric learning models are useful when only a few labeled instances are available.

Usually, the term semisupervised learning describes scenarios where both the labeled data and the unlabeled data affect the performance of a learning algorithm, which is not the case here: the supervised data are used either to induce a nearest neighbor classifier on the unlabeled data or to learn a metric vector. Hence, the supervision information could be utilized more thoroughly in future work.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by China Natural Science Foundation under Grant no. 61273363 and Natural Science Foundation of Guangdong Province under Grant no. 06300170.