Mathematical Problems in Engineering

Volume 2015, Article ID 352849, 12 pages

http://dx.doi.org/10.1155/2015/352849

## Chi-Squared Distance Metric Learning for Histogram Data

^{1}Laboratory of Spatial Information Processing, School of Computer and Information Engineering, Henan University, Kaifeng 475004, China

^{2}Department of Information Engineering, Shengda Trade Economics and Management College of Zhengzhou, Zhengzhou 451191, China

Received 11 December 2014; Revised 25 March 2015; Accepted 27 March 2015

Academic Editor: Davide Spinello

Copyright © 2015 Wei Yang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Learning a proper distance metric for histogram data plays a crucial role in many computer vision tasks. The chi-squared distance is a nonlinear metric and is widely used to compare histograms. In this paper, we show how to learn a general form of chi-squared distance based on the nearest neighbor model. In our method, the margin of a sample is first defined with respect to its nearest hits (nearest neighbors from the same class) and nearest misses (nearest neighbors from different classes), and then a simplex-preserving linear transformation is trained by maximizing the margin while minimizing the distance between each sample and its nearest hits. With the iterative projected gradient method for optimization, we naturally introduce the $\ell_{2,1}$ norm regularization into the proposed method for sparse metric learning. Comparative studies with state-of-the-art approaches on five real-world datasets verify the effectiveness of the proposed method.

#### 1. Introduction

Histograms are frequently used tools in natural language processing and various computer vision tasks, including image retrieval, image classification, shape matching, and object recognition, to represent texture and color features or to characterize rich information in local/global regions of objects. In statistics, a histogram is the frequency distribution of a set of specific measurements over discrete intervals. For many computer vision tasks, each object of interest can be represented as a histogram by using visual descriptors, such as SIFT [1], SURF [2], GIST [3], and HOG [4]. As a result, the resulting histogram inherits some merits of the descriptors, for example, rotation, scale, and translation invariance. These make it an excellent representation for performing classification and recognition of objects.

When histogram representations are adopted, the choice of histogram distance metric has a great impact on the classification performance or recognition accuracy of the specific task. Since a histogram can be considered a vector of probabilities, many bin-to-bin metrics such as the $\ell_1$ distance, the chi-squared distance, and the Kullback-Leibler (KL) divergence can be used directly. These metrics, however, only account for the difference between corresponding bins and are hence sensitive to distortions in visual descriptors as well as quantization effects [5]. To mitigate these problems, many cross-bin distances have been proposed. Rubner et al. [6] propose the Earth Mover's Distance (EMD), which is defined as the minimal cost that must be paid to transform one histogram into the other, by considering the cross-bin information. Diffusion distance [5] exploits the idea of a diffusion process to define the difference between two histograms as a temperature field. The Quadratic-Chi distances (QCS and QCN) [7] take into account cross-bin relationships while reducing the effect of large bins. In particular, for the cross-bin distance, most of the work mainly focuses on how to improve the EMD, and hence many variants have been proposed. EMD-L1 [8] uses the $\ell_1$ distance as the ground distance and significantly simplifies the original linear programming formulation of the EMD. Pele and Werman [9] propose a different formulation of the EMD with a linear-time algorithm for nonnormalized histograms. FastEMD [10] adopts a robust thresholded ground distance and was shown to outperform the EMD in both accuracy and speed. TEMD [11] uses a tangent vector to represent each global transformation. For the methods mentioned above, the determination of the metric is based entirely on a priori knowledge of the features or on handcrafted design. However, a distance metric is problem-specific, and designing a good distance metric manually is extremely difficult.
Aiming at this problem, some researchers have attempted to learn a proper distance metric from histogram training data. Considering that the ground distance, which is the unique variable of the EMD, should be chosen according to the problem at hand, Cuturi and Avis [12] propose a ground metric learning algorithm to learn the ground metric adaptively from the training data. Subsequently, EMDL [13] formulates ground metric learning as an optimization problem in which a ground distance matrix and a flow network for the EMD are learned jointly based on a partial ordering of histogram distances. Noh [14] uses a convex optimization method to perform chi-squared metric learning with relaxation. $\chi^2$-LMNN [15] employs a large-margin framework to learn a generalized chi-squared distance for histogram data and obtains a significant improvement compared to standard histogram metrics and state-of-the-art metric learning algorithms. Le and Cuturi [16] adopt the generalized Aitchison embedding to compare histograms by mapping the probability simplex onto a suitable Euclidean space.

In this paper, we present a novel nearest neighbor-based nonlinear metric learning method, chi-squared distance metric learning (CDML), for normalized histogram data. CDML learns a simplex-preserving linear transformation by maximizing the margin while minimizing the distance between each sample and its nearest hits. In the original space, the learned metric can be considered a cross-bin metric. For sparse metric learning, the $\ell_{2,1}$ norm regularization term is further introduced to enforce row sparsity on the learned linear transformation matrix. Two solving strategies, the iterative projected gradient method and the soft-max method, are used to induce the linear transformation. We demonstrate that our algorithms perform better than the state-of-the-art ones in terms of classification performance.

The remainder of this paper is organized as follows. Section 2 provides a review of supervised metric learning algorithms. Section 3 describes the proposed distance metric learning method; there we also discuss the difference between our method and $\chi^2$-LMNN in detail. The experimental results on five real-world datasets are given in Section 4. Section 5 concludes the paper.

#### 2. Related Work

In this section, we review the related work on supervised distance metric learning. Since the seminal work of Xing et al. [17], which formulates metric learning as an optimization problem, supervised metric learning has been extensively studied in the machine learning community and various algorithms have been proposed. In general, the proposed methods can be roughly cast into three categories: Mahalanobis metric learning, local metric learning, and nonlinear metric learning. The main characteristic of Mahalanobis metric learning is to learn a linear transformation or a positive semidefinite matrix from training data under the Mahalanobis distance metric. Representative methods include neighborhood component analysis [18], large-margin nearest neighbor [19], and information-theoretic metric learning [20]. Neighborhood component analysis [18] learns a linear transformation by directly maximizing a stochastic variant of the expected leave-one-out classification accuracy on the training set. Large-margin nearest neighbor (LMNN) [19] formulates distance metric learning as a semidefinite programming problem by requiring that the $k$-nearest neighbors of each training sample belong to the same class while examples from different classes are separated by a large margin. Information-theoretic metric learning (ITML) [20] formulates distance metric learning as a particular Bregman optimization problem by minimizing the differential relative entropy between two multivariate Gaussians under constraints on the distance function. Bian and Tao [21] formulate metric learning as a constrained empirical risk minimization problem. Wang et al. [22] propose a general kernel classification framework, which can unify many representative and state-of-the-art Mahalanobis metric learning algorithms such as LMNN and ITML. Chang [23] uses a boosting algorithm to learn a Mahalanobis distance metric. Shen et al. 
[24] propose an efficient and scalable approach to the Mahalanobis metric learning problem based on the Lagrange dual formulation. Yang et al. [25] propose a novel multitask framework for metric learning using a common subspace. The motivation of local metric learning is to increase the expressiveness of learned metrics so that more complex problems, such as heterogeneous data, can be better handled. By virtue of involving more learning parameters than its global counterpart, local metric learning is prone to overfitting. One of the early local metric learning algorithms is discriminant adaptive nearest neighbor classification (DANN) [26], which estimates local metrics by shrinking neighborhoods in directions orthogonal to the local decision boundaries and enlarging the neighborhoods parallel to the boundaries. Multiple metrics LMNN [19] learns multiple locally linear transformations in different parts of the sample space under the large-margin framework. Using an approximation error bound of the metric matrix function, Wang et al. [27] formulate local metric learning as linear combinations of basis metrics defined on anchor points over different regions of the instance space. Mu et al. [28] propose a new local discriminative distance metrics algorithm to learn multiple distance metrics. For nonlinear metric learning, there are two ways to conduct metric learning. One strategy is to use the kernel trick to learn a linear metric in the high-dimensional nonlinear feature space induced by a kernel function. The kernelized variants of many Mahalanobis metric learning methods, such as KLFDA [29] and large-margin component analysis [30], have been shown to be efficient in capturing complicated nonlinear relationships between data. Soleymani Baghshah and Bagheri Shouraki [31] formulate nonlinear metric learning as constrained trace ratio problems using both positive and negative constraints. By combining metric learning and multiple kernel learning, Wang et al. 
[32] propose a general framework for learning a linear combination of a number of predefined kernels. Another strategy is to learn nonlinear forms of metrics directly. Based on a convolutional neural network, Chopra et al. [33] propose learning a nonlinear function such that the $\ell_1$ norm in the target space approximates the semantic distance in the input space. GB-LMNN [15] learns a nonlinear mapping directly in function space with gradient boosted regression trees. Support vector metric learning [34] learns a metric for the radial basis function kernel by minimizing the validation error of the SVM prediction at the same time as it trains the SVM classifier. For a comprehensive review of metric learning and its applications we refer the readers to [35–37].

Although metric learning for the Mahalanobis distance has been widely studied, metric learning for the chi-squared distance is largely unexplored. Unlike the Mahalanobis distance, the chi-squared distance is a nonlinear metric, and its general form requires the learned linear transformation to be simplex-preserving. Therefore, existing linear metric learning algorithms cannot be applied to the chi-squared distance directly. $\chi^2$-LMNN adopts the LMNN model to learn a chi-squared distance, but its additional margin hyperparameter is sensitive to the data used and needs to be evaluated on a hold-out set. In addition, it exploits the soft-max method to optimize the objective function, which prevents regularizers from being introduced naturally. The proposed method utilizes the margin of a sample to construct the objective function and adopts the iterative projected gradient method for optimization, and hence overcomes these weaknesses of $\chi^2$-LMNN: regularizers can be incorporated into our model naturally, and no additional parameter needs to be evaluated compared to $\chi^2$-LMNN.

#### 3. Chi-Squared Distance Metric Learning

In this section, we propose a metric learning algorithm termed chi-squared distance metric learning (CDML). This algorithm uses the margin of a sample to construct the objective function, making it well suited to metric learning for histogram data.

In the following, we first introduce the definition of the margin of a sample. Then the motivation and the objective function of CDML are proposed. Finally, the optimization method of the algorithm is discussed.

##### 3.1. The Margin of Sample

Let the training data be $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i$ is sampled from a probability simplex $\Delta = \{\mathbf{x} \in \mathbb{R}^d : \mathbf{x} \geq 0,\ \mathbf{1}_d^\top \mathbf{x} = 1\}$ and $y_i$ is the associated class label; the symbol $\mathbf{1}_d$ denotes a $d$-dimensional column vector whose components are all one. The chi-squared distance between two samples $\mathbf{x}_i$ and $\mathbf{x}_j$ can be computed by
$$\chi^2(\mathbf{x}_i, \mathbf{x}_j) = \frac{1}{2} \sum_{r=1}^{d} \frac{(x_i^r - x_j^r)^2}{x_i^r + x_j^r}, \tag{1}$$
where $x_i^r$ indicates the $r$th feature of the sample $\mathbf{x}_i$.
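As a quick sanity check, the bin-to-bin chi-squared distance can be sketched in a few lines of NumPy. The function name and the small `eps` guard against empty bins are our additions, not part of the paper.

```python
import numpy as np

def chi2_distance(x, y, eps=1e-12):
    """Chi-squared distance between two histograms:
    0.5 * sum_r (x_r - y_r)^2 / (x_r + y_r)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    num = (x - y) ** 2
    den = x + y + eps  # eps guards bins that are empty in both histograms
    return 0.5 * np.sum(num / den)
```

For identical histograms the distance is zero, and for two disjoint one-hot histograms it attains its maximum value of one.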

For each instance $\mathbf{x}$ in the original input space, we can map it into an $m$-dimensional probability simplex space by performing a simplex-preserving linear transformation $\mathbf{x} \mapsto \mathbf{L}\mathbf{x}$, where $\mathbf{L}$ is an element-wise nonnegative matrix of size $m \times d$ ($m \leq d$) whose columns each sum to one. In particular, the set of such simplex-preserving linear transformations can be defined as $\mathcal{S} = \{\mathbf{L} \in \mathbb{R}^{m \times d} : \mathbf{L} \geq 0,\ \mathbf{1}_m^\top \mathbf{L} = \mathbf{1}_d^\top\}$. With the linear transformation matrix $\mathbf{L}$, the chi-squared distance between two instances $\mathbf{x}_i$ and $\mathbf{x}_j$ in the transformed space can be written as
$$\chi^2_{\mathbf{L}}(\mathbf{x}_i, \mathbf{x}_j) = \chi^2(\mathbf{L}\mathbf{x}_i, \mathbf{L}\mathbf{x}_j) = \frac{1}{2} \sum_{r=1}^{m} \frac{\left((\mathbf{L}\mathbf{x}_i)^r - (\mathbf{L}\mathbf{x}_j)^r\right)^2}{(\mathbf{L}\mathbf{x}_i)^r + (\mathbf{L}\mathbf{x}_j)^r}. \tag{2}$$
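A minimal sketch of the transformed distance and the simplex-preserving property follows; `is_simplex_preserving`, `chi2_L`, the tolerance, and the `eps` guard are our illustrative choices, not the paper's.

```python
import numpy as np

def is_simplex_preserving(L, tol=1e-9):
    """True iff L is entrywise nonnegative and each column sums to one,
    so that L maps the probability simplex into the probability simplex."""
    return bool(np.all(L >= -tol) and np.allclose(L.sum(axis=0), 1.0, atol=tol))

def chi2_L(x, y, L, eps=1e-12):
    """Generalized chi-squared distance: plain chi-squared between Lx and Ly."""
    u, v = L @ x, L @ y
    return 0.5 * np.sum((u - v) ** 2 / (u + v + eps))
```

With a column-stochastic `L`, any histogram `x` (nonnegative, summing to one) maps to another histogram `L @ x`, so the chi-squared distance in the transformed space remains well defined.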

For each sample $\mathbf{x}_i$, we call $\mathbf{x}_j$ a hit if $\mathbf{x}_j$ has the same class label as $\mathbf{x}_i$, and the nearest hit is the hit with the minimum distance to $\mathbf{x}_i$. Similarly, we call $\mathbf{x}_j$ a miss if its class label differs from that of $\mathbf{x}_i$, and the nearest miss is the miss with the minimum distance to $\mathbf{x}_i$. Let $h_j(\mathbf{x}_i)$ and $m_k(\mathbf{x}_i)$ be the $j$th nearest hit and the $k$th nearest miss of $\mathbf{x}_i$, respectively. The margin of the sample $\mathbf{x}_i$ [38] with respect to its $j$th nearest hit and $k$th nearest miss is defined as
$$\rho_{jk}(\mathbf{x}_i) = \chi^2_{\mathbf{L}}\left(\mathbf{x}_i, m_k(\mathbf{x}_i)\right) - \chi^2_{\mathbf{L}}\left(\mathbf{x}_i, h_j(\mathbf{x}_i)\right), \tag{3}$$
where $1 \leq j \leq K_1$, $1 \leq k \leq K_2$. Note that $h_j(\mathbf{x}_i)$ and $m_k(\mathbf{x}_i)$ are determined by the generalized chi-squared distance, and the transformation matrix $\mathbf{L}$ affects the margin through the distance metric.
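The margin of a sample can be sketched as follows, using a brute-force scan over the training set; the paper does not prescribe an implementation, and `margin`, its defaults, and the inlined distance helper are ours.

```python
import numpy as np

def chi2_L(x, y, L, eps=1e-12):
    """Generalized chi-squared distance under transformation L."""
    u, v = L @ x, L @ y
    return 0.5 * np.sum((u - v) ** 2 / (u + v + eps))

def margin(i, X, y, L, j=1, k=1):
    """Margin of sample i: distance to its k-th nearest miss minus
    distance to its j-th nearest hit, both under chi2_L."""
    n = len(X)
    d = np.array([chi2_L(X[i], X[t], L) for t in range(n)])
    others = np.arange(n) != i
    hits = np.sort(d[(y == y[i]) & others])  # same class, excluding i itself
    misses = np.sort(d[y != y[i]])           # different class
    return misses[k - 1] - hits[j - 1]
```

A positive margin means the sample's $k$th nearest miss is farther away than its $j$th nearest hit, as required for robust nearest neighbor classification.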

##### 3.2. The Objective Function

Similar to many metric learning algorithms for the Mahalanobis distance, the goal of our algorithm is to learn a simplex-preserving linear transformation that optimizes $k$NN classification. Given an unclassified sample point $\mathbf{x}$, $k$NN first finds its $k$-nearest neighbors in the training set and then assigns the label of the class that appears most frequently among them. Therefore, for robust $k$NN classification, each training sample should share its label with its $k$-nearest neighbors. Obviously, if the margins of all the samples in the training set are greater than zero, then robust $k$NN classification is obtained. By maximizing the margins of all training samples, our distance metric learning problem can be formulated as follows:
$$\min_{\mathbf{L} \in \mathcal{S}} \sum_{i=1}^{n} \sum_{j=1}^{K_1} \sum_{k=1}^{K_2} u\left(\rho_{jk}(\mathbf{x}_i)\right). \tag{4}$$
Here, the utility function $u(\cdot)$ is used to control the contribution of each margin term to the objective function. The constraint $\mathbf{L} \in \mathcal{S}$ ensures that the chi-squared distance in the transformed space is still a well-defined metric.

Note that in (4) maximizing the margins can also be attained by simultaneously increasing the distances between each sample and its nearest hits and the distances to its nearest misses, as long as the latter increase much more. However, we expect each training sample and its nearest hits to form a compact cluster. Therefore, we further introduce a term constraining the distances between each sample and its nearest hits and obtain the following optimization problem:
$$\min_{\mathbf{L} \in \mathcal{S}} \sum_{i=1}^{n} \sum_{j=1}^{K_1} \left( \sum_{k=1}^{K_2} u\left(\rho_{jk}(\mathbf{x}_i)\right) + \lambda\, \chi^2_{\mathbf{L}}\left(\mathbf{x}_i, h_j(\mathbf{x}_i)\right) \right), \tag{5}$$
where $\lambda$ is a balance parameter trading off the effect of the two terms.

Moreover, considering the sparseness of some high-dimensional histogram data, directly learning the transformation matrix is likely to overfit the training data, resulting in poor generalization performance. To address this problem, we introduce the $\ell_{2,1}$ norm regularizer to control the model complexity. With the $\ell_{2,1}$ norm regularization, the metric learning problem can be written as
$$\min_{\mathbf{L} \in \mathcal{S}} \sum_{i=1}^{n} \sum_{j=1}^{K_1} \left( \sum_{k=1}^{K_2} u\left(\rho_{jk}(\mathbf{x}_i)\right) + \lambda\, \chi^2_{\mathbf{L}}\left(\mathbf{x}_i, h_j(\mathbf{x}_i)\right) \right) + \mu \|\mathbf{L}\|_{2,1}, \tag{6}$$
where the regularization term $\|\mathbf{L}\|_{2,1} = \sum_{r=1}^{m} \|\mathbf{L}^r\|_2$ (the sum of the $\ell_2$ norms of the rows of $\mathbf{L}$) guarantees that the parameter matrix is sparse in rows, and $\mu$ is a nonnegative regularization parameter.
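For illustration, the row-sparsity regularizer reduces to a one-liner; the helper name is ours.

```python
import numpy as np

def l21_norm(L):
    """l2,1 norm: the sum over rows of each row's Euclidean (l2) norm.
    Penalizing it drives entire rows of L toward zero (row sparsity)."""
    return float(np.sum(np.linalg.norm(L, axis=1)))
```

Because zeroing a whole row is cheaper under this penalty than spreading small values across many rows, minimizing it tends to switch off entire output dimensions of the transformation.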

##### 3.3. The Optimization Method

For the constrained optimization problem in (5), two methods can be used to solve it. The first strategy is the iterative projected gradient method, which alternates a gradient descent step to minimize the objective with a projection step to ensure that $\mathbf{L}$ remains a simplex-preserving linear transformation matrix. Specifically, on each iteration we take a gradient step $\mathbf{L} \leftarrow \mathbf{L} - \eta \nabla_{\mathbf{L}} f(\mathbf{L})$ and then project $\mathbf{L}$ onto the set $\mathcal{S}$, where $\eta$ is a learning rate and $\nabla_{\mathbf{L}} f(\mathbf{L})$ is the gradient of the objective function with respect to the matrix parameter $\mathbf{L}$. Note that the constraints on $\mathbf{L}$ can be seen as separate probabilistic simplex constraints on each column of $\mathbf{L}$. Therefore, the projection onto $\mathcal{S}$ can be done by performing a probabilistic simplex projection, which can be efficiently implemented with a complexity of $O(m \log m)$ [39], on each column of $\mathbf{L}$. In addition, in order to compute the gradient $\nabla_{\mathbf{L}} f(\mathbf{L})$, we need the partial derivative of the chi-squared distance in (2). Let $\mathbf{u} = \mathbf{L}\mathbf{x}_i$ and $\mathbf{v} = \mathbf{L}\mathbf{x}_j$; the partial derivative of $\chi^2_{\mathbf{L}}(\mathbf{x}_i, \mathbf{x}_j)$ with respect to the entry $\mathbf{L}_{rs}$ of the matrix $\mathbf{L}$ can be given by
$$\frac{\partial \chi^2_{\mathbf{L}}(\mathbf{x}_i, \mathbf{x}_j)}{\partial \mathbf{L}_{rs}} = \frac{(u_r - v_r)(u_r + 3v_r)}{2(u_r + v_r)^2}\, x_i^s - \frac{(u_r - v_r)(3u_r + v_r)}{2(u_r + v_r)^2}\, x_j^s. \tag{7}$$
Generally speaking, the iterative projected gradient method needs a matrix of size $m \times d$ to initialize the linear transformation matrix $\mathbf{L}$; in our work, the rectangular identity matrix is always used. When the iterative projected gradient method is used, various regularizers, such as the Frobenius norm and the $\ell_{2,1}$ norm regularization, can be naturally incorporated into the objective function in (5) without affecting the solution procedure.
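The projection step can be sketched with the well-known sort-based simplex projection; the exact routine of [39] may differ in detail, so treat this as an illustrative implementation with names of our choosing.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex,
    via the classic sort-based algorithm (O(m log m) per vector)."""
    m = len(v)
    u = np.sort(v)[::-1]                       # sort descending
    css = np.cumsum(u)
    # largest index rho with u[rho] * (rho+1) > css[rho] - 1
    rho = np.nonzero(u * np.arange(1, m + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)     # shift so result sums to one
    return np.maximum(v - theta, 0.0)

def project_columns(L):
    """Project every column of L onto the simplex, so that L is again
    a simplex-preserving (column-stochastic) matrix."""
    return np.column_stack([project_simplex(c) for c in L.T])
```

A single projected gradient iteration would then read `L = project_columns(L - eta * grad)` for a step size `eta` and gradient matrix `grad`.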

Another strategy is first to transform the constrained optimization problem in (5) into an unconstrained version by introducing a soft-max function, and then use the steepest gradient descent method for learning. Here the soft-max function is defined as
$$\mathbf{L}_{rs} = \frac{\exp(\mathbf{A}_{rs})}{\sum_{r'=1}^{m} \exp(\mathbf{A}_{r's})}, \tag{8}$$
where the matrix $\mathbf{A} \in \mathbb{R}^{m \times d}$ is an assistant parameter. Obviously, the matrix $\mathbf{L}$ is in the set $\mathcal{S}$ for any choice of $\mathbf{A}$. Thus, we can use the gradient of the objective function with respect to the matrix $\mathbf{A}$ to minimize (5). In particular, the partial derivative of the chi-squared distance in (2) with respect to $\mathbf{A}$ can be computed by the chain rule
$$\frac{\partial \chi^2_{\mathbf{L}}(\mathbf{x}_i, \mathbf{x}_j)}{\partial \mathbf{A}_{rs}} = \sum_{r'=1}^{m} \frac{\partial \chi^2_{\mathbf{L}}(\mathbf{x}_i, \mathbf{x}_j)}{\partial \mathbf{L}_{r's}}\, \mathbf{L}_{r's}\left(\delta_{rr'} - \mathbf{L}_{rs}\right), \tag{9}$$
where $\delta_{rr'}$ is the Kronecker delta; this is used to compute the gradient $\nabla_{\mathbf{A}} f$. The initial value of the matrix $\mathbf{A}$ used for optimization is set to $\mathbf{I} + \mathbf{E}$, where $\mathbf{I}$ is a rectangular identity matrix and $\mathbf{E}$ denotes the matrix of all ones. This solving strategy is named the soft-max method. Note that when the soft-max method is used for optimization, it is not easy to introduce the regularization directly. With either solving method, the proposed algorithm can perform both metric learning and dimensionality reduction.
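The soft-max reparameterization is easy to sketch; the max-subtraction for numerical stability is our addition to the basic column-wise soft-max.

```python
import numpy as np

def softmax_columns(A):
    """Column-wise soft-max reparameterization: for any real matrix A,
    the result is entrywise nonnegative and each column sums to one,
    i.e., it is always a simplex-preserving transformation."""
    E = np.exp(A - A.max(axis=0, keepdims=True))  # stabilize before exp
    return E / E.sum(axis=0, keepdims=True)
```

Because the constraint is satisfied by construction, plain gradient descent on the assistant parameter suffices; no projection step is needed, at the cost of making regularizers on the transformation matrix harder to handle.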

#### 4. Experiments

In this section, we perform a number of experiments on five real-world image datasets to evaluate the proposed methods. In the first experiment, the two solving strategies, the iterative projected gradient method and the soft-max method, are compared in terms of training time and classification error. In the second experiment, we compare the proposed method with the state-of-the-art methods, including four histogram metrics ($\chi^2$, QCN (available at http://www.ariel.ac.il/sites/ofirpele/QC/), QCS (available at http://www.ariel.ac.il/sites/ofirpele/QC/), and FastEMD (available at http://www.ariel.ac.il/sites/ofirpele/FastEMD/code/)) and three metric learning methods (ITML (available at http://www.cs.utexas.edu/~pjain/itml/), LMNN (available at http://www.cse.wustl.edu/~kilian/code/files/mLMNN2.4.zip), and GB-LMNN (available at http://www.cse.wustl.edu/~kilian/code/files/mLMNN2.4.zip)), on the image retrieval dataset corel. As the source code of the closely related method $\chi^2$-LMNN [15] is not publicly available, we further perform full-rank and low-rank metric learning experiments on four datasets (dslr, webcam, amazon, and caltech). Since $\chi^2$-LMNN has also been tested on these datasets, we can make a direct comparison. There are several parameters to be set in our model. The neighborhood parameters are empirically set to 10% of #NumberofTrainingSamples/#NumberofClasses. The two remaining parameters are fixed to 0.5 and 50 in our experiments, respectively. Moreover, the regularization parameter is set to 1 if the $\ell_{2,1}$ regularization is used. The proposed methods are implemented in standard C++. All the experiments are executed on a PC with 8 Intel(R) Xeon(R) E5-1620 CPUs (3.6 GHz) and 8 GB main memory.

##### 4.1. Datasets

Table 1 summarizes the basic information of the five histogram datasets used in our experiments. The dataset corel is often used in the evaluation of histogram distance metrics [7, 10, 11]; it contains 773 landscape images in 10 different classes: people in Africa, beaches, outdoor buildings, buses, dinosaurs, elephants, flowers, horses, mountains, and food. There are 50 to 100 images in each class. All images have two types of representation: SIFT and CSIFT. For SIFT, the Harris-affine detector [1] is used to extract the orientation histogram descriptor. The second representation, CSIFT, is a SIFT-like descriptor for color images. CSIFT takes color edges into account when computing the SIFT and skips the normalization step to preserve more distinctive information. The size of the final histogram descriptor is also 384. For more detailed information, readers are referred to [10]. As in [7], for each kind of descriptor we select 5 images from each class to construct a test set of 50 samples and use the remaining images as training data. Moreover, each histogram descriptor of dimension 384 is further normalized to sum to one.