Complexity

Volume 2017 (2017), Article ID 2691474, 14 pages

https://doi.org/10.1155/2017/2691474

## Kernel Negative *ε* Dragging Linear Regression for Pattern Classification

^{1}Key Laboratory of Modern Teaching Technology, Ministry of Education, Xi’an 710062, China

^{2}Engineering Laboratory of Teaching Information Technology of Shaanxi Province, Xi’an 710119, China

^{3}School of Computer Science, Shaanxi Normal University, Xi’an 710119, China

Correspondence should be addressed to Shigang Liu

Received 27 August 2017; Accepted 9 November 2017; Published 10 December 2017

Academic Editor: Chuan Zhou

Copyright © 2017 Yali Peng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Linear regression (LR) and its variants have been widely used for classification problems. However, they usually predefine a strict binary label matrix, which has no freedom to fit the samples. In addition, they cannot deal with complex real-world applications such as face recognition, where samples may not be linearly separable owing to varying poses, expressions, and illumination conditions. Therefore, in this paper, we propose the kernel negative ε dragging linear regression (KNDLR) method for robust classification on noisy and nonlinear data. First, a technique called negative ε dragging is introduced to relax class labels and is integrated into the LR model for classification, so that the class margin of conventional linear regression is treated properly and robust results are obtained. Then, the data are implicitly mapped into a high dimensional kernel space by the nonlinear mapping determined by a kernel function, which makes the data more linearly separable. Finally, the resulting KNDLR method partially alleviates the problem of overfitting and classifies noisy and deformable data well. Experimental results show that the KNDLR classification algorithm achieves better generalization performance and leads to more robust classification decisions.

#### 1. Introduction

Least squares regression (LSR) has been widely used in many fields of pattern recognition and computer vision. Because LSR is mathematically tractable and computationally efficient, many variants have been proposed. Notable LSR algorithms include weighted LSR [1], partial LSR [2], and other extensions (e.g., nonnegative least squares (NNLS) [3]). In the pattern recognition community, LSR is also referred to as the minimum squared error algorithm [4–6]. Moreover, capable extensions of least squares regression such as regularized least squares regression [7] have also been proposed. Among these extensions, sparse regression [8] and low-rank regression [9, 10] achieve notable performance. The relationship between regression and other methods such as locally linear embedding and local tangent space alignment has also been studied [11]. In addition, LSR has been applied to semisupervised learning. Nie et al. [12] proposed adaptive loss minimization for semisupervised elastic embedding. Fang et al. [13] proposed learning a nonnegative sparse graph for linear regression in semisupervised learning, in which linear regression and graph learning are performed simultaneously to guarantee an overall optimum.

LSR for classification can be described simply as follows. Before conventional least squares regression (CLSR) is applied for classification [3, 14, 15], different fixed class labels are assigned to training samples of different classes. The least squares regression algorithm is then employed to obtain a mapping that transforms training samples into approximations of their class labels. Finally, CLSR uses the obtained mapping to predict the class label of every test sample. In addition to classification problems, least squares regression is also applied to subspace segmentation [16], matrix recovery [17], and feature selection [18].

The recently proposed sparse representation classification (SRC) [19–21] can be regarded as a special form of least squares regression. Unlike LSR, it approximates a test sample by a sparse linear combination of all training samples. Collaborative representation [22] and linear regression classification [23] are similar. An overview of sparse representation is provided in [24]. However, because SRC must solve a set of equations to classify every sample, CLSR is computationally much more efficient than SRC for classification tasks.

Xiang et al. proposed discriminative least squares regression (DLSR) [25]. Its core idea is to achieve, within the framework of least squares regression, a larger class margin than that obtained by CLSR by using the ε dragging technique, which plays a role in enlarging the margin similar to that of other large margin classifiers [26–28]. The idea of using slack variables to relax the model has been widely used in related fields [29]. When the distribution of training samples agrees with that of test samples, a classifier learned from training samples adapts well to test samples. Under this condition, since the classifier learned from training samples has a very large class margin, it also obtains a satisfactory class margin for test samples. Accordingly, the original ε dragging technique performs well and produces high classification accuracy. However, in real-world applications, owing to noise or the deformability of the object, training samples and test samples from the same class may differ greatly. For example, face images are well known to be deformable objects (owing to varying poses, expressions, and illumination conditions). Two face images of the same subject can differ considerably, and this difference may be even greater than that between two face images of two distinct subjects. In this case, a large margin classifier obtained from training samples is usually not suitable for test samples and is likely to classify them poorly. By contrast, reducing the class margin usually achieves better classification accuracy on noisy data. Thus, we focus on determining a proper margin by using the negative ε dragging technique and on producing a robust classifier for pattern classification on noisy and deformable data.

Furthermore, we introduce the kernel trick to improve the dragging linear regression. In machine learning, the kernel trick was originally used to construct nonlinear support vector machines (SVMs) [30–32]. Over the past decade and more, many kernel based approaches have been proposed, such as the well-known kernel principal component analysis (KPCA) [33, 34] and kernel Fisher discriminant analysis (KFDA) [35]. For classification, Yu et al. presented the kernel nearest neighbor (KERNEL-NN) classifier [36], which applies the nearest neighbor classification method in the high dimensional feature space. With an appropriate kernel, the KERNEL-NN classifier can perform better than the NN classifier. Kernel sparse representation classification (KSRC) has also been presented [37, 38]. So far, by using kernel tricks [39], almost all linear learning methods can be generalized to corresponding nonlinear ones. The kernel trick [40] goes a long way toward the goal of classifying heterogeneous data. These kernel based algorithms enhance the capability of the linear algorithms. They first implicitly map the data in the input space into a high or even infinite dimensional kernel feature space [18, 41] by a nonlinear mapping and then perform linear processing in the kernel feature space by using inner products, which can be computed by a kernel function. As a result, these kernel based algorithms perform a nonlinear transformation with respect to the input space.

As is well known, the kernel approach can change the distribution of samples through the nonlinear mapping. If an appropriate kernel function is used, the kernel approach can make the data of different classes more linearly separable, so kernel based algorithms can classify well. This motivated us to integrate the kernel method into linear regression for classification. With an appropriate kernel function, samples from the same class lie closer to each other and samples from distinct classes lie farther apart in the high dimensional feature space. Hence, in that space, it is easier to learn a mapping that converts training samples into their class labels. Namely, a linear transformation matrix learned in the high dimensional feature space maps samples into their class labels more appropriately and has stronger discriminating ability.

Based on the above two aspects, we propose the kernel negative ε dragging linear regression (KNDLR) method in this paper. In KNDLR, samples are first implicitly mapped into a high dimensional feature space, and linear regression with negative ε dragging is then performed in this new feature space. We prove that KNDLR in the high dimensional feature space can be formulated in terms of inner products, which can be computed by a kernel function. Thus KNDLR is easy to implement and has a low computational cost. The classifier generalizes well because it uses the proposed negative ε dragging technique and integrates the kernel approach. Comprehensive experiments demonstrate the superior characteristics of KNDLR. In summary, the contributions of the proposed method are as follows.

(1) It relaxes the strict binary label matrix used in conventional LR into a slack variable matrix, which has more freedom to fit the samples. Proper margins between different classes are achieved by using the negative ε dragging technique. Previous research has usually focused on enlarging the margin between different classes, whereas the proposed negative ε dragging technique takes the opposite approach, which helps to overcome overfitting and to enhance the robustness of the algorithm on unseen samples, for example, test samples.

(2) The kernel approach is also integrated into our method. We show that KNDLR in the high dimensional feature space can be formulated in terms of inner products, which can be computed by the kernel function. Thus KNDLR only needs to evaluate the kernel function rather than operate directly on data in the high dimensional feature space corresponding to the kernel function.

(3) An algorithm named KNDLR is devised for the proposed method. The validity of the algorithm is tested on six image datasets.

The other parts of the paper are organized as follows. Section 2 briefly reviews works related to this paper. In Section 3, our method is presented. In Section 4, analysis of our method is provided. Experimental results are reported in Section 5. Finally, Section 6 offers the conclusion of this paper.

#### 2. Related Works

In this section, we first introduce the CLSR for classification. Then, the kernel trick is briefly reviewed.

##### 2.1. Conventional Least Squares Regressions for Classification

The collection of training samples is represented as a matrix $X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n \times d}$, where $x_i \in \mathbb{R}^d$ is a training sample in the form of a column vector. If a training sample is a two-dimensional image, it is converted into a column vector in advance. The objective function of conventional least squares regression (CLSR) for classification is as follows:

$$\min_{W} \|XW - Y\|_F^2, \tag{1}$$

where $Y \in \mathbb{R}^{n \times c}$ ($c$ is the number of classes) is the binary class label matrix and the $i$th row of $Y$ is the class label vector of the $i$th sample.

For a three-class classification problem, the class label matrix of four samples in CLSR may be

$$Y = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \tag{2}$$

which indicates that the first and second samples are from the first class, the third sample is from the third class, and the fourth sample is from the second class. $W \in \mathbb{R}^{d \times c}$ is the transformation matrix which converts the sample matrix $X$ into the binary class label matrix $Y$, and $\|\cdot\|_F$ stands for the Frobenius norm of a matrix. In the above CLSR for classification, the class label matrix is predefined and fixed.
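To make the CLSR procedure concrete, the following is a minimal NumPy sketch under stated assumptions. The helper names (`one_hot`, `clsr_fit`, `clsr_predict`), the toy feature matrix, and the decision rule (pick the class with the largest regression response) are illustrative choices that do not come from the paper.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Binary label matrix Y: row i has a single 1 in the column of sample i's class."""
    Y = np.zeros((len(labels), num_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def clsr_fit(X, Y):
    """Solve min_W ||X W - Y||_F^2 (minimum-norm solution when it is not unique)."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def clsr_predict(W, X_new):
    """Assumed decision rule: pick the class with the largest regression response."""
    return np.argmax(X_new @ W, axis=1)

# Four samples, three classes, as in the example above (classes indexed from 0).
labels = np.array([0, 0, 2, 1])
Y = one_hot(labels, 3)

# Hypothetical toy features: one row per sample, one column per feature.
X = np.array([[1.0, 0.0, 0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 1.0, 1.0]])

W = clsr_fit(X, Y)
predicted = clsr_predict(W, X)
```

Because the toy rows are linearly independent, the fitted mapping reproduces the fixed binary labels exactly on the training samples. That exact fit to fixed labels is the rigidity the relaxation in Section 3 is meant to loosen.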

##### 2.2. Kernel Trick

The kernel trick is a very powerful technique in machine learning. It has been successfully applied to many methods, such as SVM [31, 32], KPCA [33, 34], and KFDA [35]. By using kernel tricks, a linear algorithm can be easily generalized to a nonlinear algorithm.

A Mercer kernel is generally used in kernel methods. It is a continuous, symmetric, positive semidefinite kernel function. Given a Mercer kernel $k(\cdot, \cdot)$, there is a unique associated reproducing kernel Hilbert space (RKHS) $\mathcal{H}$. Usually, a Mercer kernel can be expressed as

$$k(x, y) = \phi(x)^T \phi(y), \tag{3}$$

where $T$ denotes the transpose of a matrix or vector, $x$ and $y$ are any two points in $\mathbb{R}^d$, and $\phi$ is the implicit nonlinear mapping associated with the kernel function $k$. When implementing kernel methods, we do not need to know what $\phi$ is and just adopt the kernel function defined in (3). Here the kernel function is the connection between the learning algorithm and the data. Linear kernels, polynomial kernels, Gaussian radial basis function (RBF) kernels, and wavelet kernels [18, 40, 41] are commonly used in kernel methods. The polynomial kernel has the form

$$k(x, y) = (x^T y + a)^b, \tag{4}$$

where $a$ is a constant and $b$ is the order of the polynomial, and RBF kernels can be expressed as

$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right), \tag{5}$$

where $\sigma$ is the parameter for RBF kernels and $\|x - y\|$ is the distance between two vectors.
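As an illustration of how a kernel function stands in for inner products in feature space, the sketch below computes Gram matrices for a polynomial and a Gaussian RBF kernel. The function names, the parameter names `a`, `b`, and `sigma`, and the bandwidth convention $\exp(-\|x - y\|^2 / (2\sigma^2))$ are assumptions made for this example. For the homogeneous quadratic kernel on two-dimensional inputs, an explicit feature map exists, so the kernel value can be compared with the explicit inner product.

```python
import numpy as np

def polynomial_kernel(A, B, a=1.0, b=2):
    """Gram matrix with entries (x^T y + a)^b for rows x of A and rows y of B."""
    return (A @ B.T + a) ** b

def rbf_kernel(A, B, sigma=1.0):
    """Gram matrix with entries exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * (A @ B.T)
    # Clip tiny negative values caused by floating point cancellation.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))

def phi_quadratic(A):
    """Explicit feature map for 2-D inputs whose inner product equals (x^T y)^2."""
    x1, x2 = A[:, 0], A[:, 1]
    return np.stack([x1**2, np.sqrt(2.0) * x1 * x2, x2**2], axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))  # hypothetical toy samples, one per row

K_poly = polynomial_kernel(X, X, a=0.0, b=2)       # kernel evaluation only
K_explicit = phi_quadratic(X) @ phi_quadratic(X).T  # explicit inner products
K_rbf = rbf_kernel(X, X, sigma=1.0)
```

`K_poly` and `K_explicit` agree up to rounding, which is the relationship between a kernel and its feature map in action. For the RBF kernel the feature space is infinite dimensional, so only the kernel evaluation is available in practice.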

#### 3. Our Method

##### 3.1. Solving the Optimization Model

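Training samples in the input space are represented as a matrix $X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n \times d}$. Let $\phi$ be the nonlinear mapping function corresponding to a kernel $k$. First, we implicitly employ $\phi$ to map the data from the input space $\mathbb{R}^d$ to a high dimensional kernel feature space $\mathcal{H}$. We have

$$\Phi(X) = [\phi(x_1), \phi(x_2), \ldots, \phi(x_n)]^T. \tag{6}$$

Then, for classification, we should transform the sample set $\Phi(X)$ into a class label matrix. However, the class label matrix $Y$ in CLSR is a strict binary label matrix, which has little freedom to fit the samples. It is expected that the original strict binary constraints in $Y$ can be relaxed into soft constraints so that the label matrix has more freedom to fit the samples and the resulting classifier generalizes well. To this end, the slack variable matrix $\tilde{Y}$, which differs from the relaxed label matrix used in DLSR, is substituted for the original class label matrix $Y$. Taking the four samples in Section 2.1 as an example again, the slack variable class label matrix is defined as follows:

$$\tilde{Y} = \begin{bmatrix} 1 - \varepsilon_{11} & \varepsilon_{12} & \varepsilon_{13} \\ 1 - \varepsilon_{21} & \varepsilon_{22} & \varepsilon_{23} \\ \varepsilon_{31} & \varepsilon_{32} & 1 - \varepsilon_{33} \\ \varepsilon_{41} & 1 - \varepsilon_{42} & \varepsilon_{43} \end{bmatrix}, \quad \varepsilon_{ij} \ge 0. \tag{7}$$

It can be seen that $\tilde{Y}$ helps to properly reduce the class margins of CLSR so as to generalize well. Formally, let $E \in \mathbb{R}^{n \times c}$ be a dragging matrix defined as

$$E_{ij} = \begin{cases} -1, & \text{if } Y_{ij} = 1, \\ +1, & \text{if } Y_{ij} = 0. \end{cases} \tag{8}$$

Meanwhile, let $M \in \mathbb{R}^{n \times c}$ be the dragging coefficient matrix defined as

$$M = [\varepsilon_{ij}]_{n \times c}, \quad \varepsilon_{ij} \ge 0; \tag{9}$$

then $\tilde{Y} = Y + E \odot M$, where $\odot$ is the Hadamard product operator of matrices. Relaxing $Y$ into $\tilde{Y}$ follows an idea opposite to that of the ε dragging technique in DLSR; therefore we call this relaxation the negative ε dragging.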
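The effect of negative ε dragging on the regression targets can be checked numerically. The sketch below builds the dragging matrix and the relaxed targets for the four-sample example. The constant slack value 0.1 and the helper `target_gap` are hypothetical choices for illustration only; in the method itself the slack values are learned.

```python
import numpy as np

# Binary labels of the four-sample, three-class example.
Y = np.array([[1, 0, 0],
              [1, 0, 0],
              [0, 0, 1],
              [0, 1, 0]], dtype=float)

# Dragging matrix: -1 where the label is 1, +1 where the label is 0.
E = np.where(Y == 1, -1.0, 1.0)

# Hypothetical nonnegative slack values (every epsilon set to 0.1 for illustration).
M = np.full_like(Y, 0.1)

# Relaxed targets: true-class entries move down from 1, the others move up from 0.
Y_relaxed = Y + E * M

def target_gap(T, i, j):
    """Euclidean distance between the regression targets of samples i and j."""
    return np.linalg.norm(T[i] - T[j])

# Samples 0 and 3 belong to different classes (first and second class).
gap_before = target_gap(Y, 0, 3)
gap_after = target_gap(Y_relaxed, 0, 3)
```

With these values, the targets of samples from different classes move toward each other, so `gap_after` is smaller than `gap_before`. This is the sense in which the relaxation reduces the class margin rather than enlarging it.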

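By virtue of the kernel feature space $\mathcal{H}$, our method tries to construct a bridge between $\Phi(X)$ and the relaxed label matrix $\tilde{Y} = Y + E \odot M$, where $E$ is the dragging matrix and $M \ge 0$ is the dragging coefficient matrix. In particular, our goal is to learn a linear function that makes $\Phi(X) W \approx Y + E \odot M$ approximately satisfied. Thus our method has the following objective function:

$$\min_{W, M} \|\Phi(X) W - (Y + E \odot M)\|_F^2 + \lambda \|W\|_F^2 \quad \text{s.t. } M \ge 0, \tag{10}$$

where $W$ is the transform matrix and $\lambda$ is a positive regularization parameter.

Since $Y$ is relaxed into $\tilde{Y}$, (10) has more freedom than (1) to fit the samples. From linear algebra, we know that, for any real matrix $A$,

$$\|A\|_F^2 = \operatorname{tr}(A^T A). \tag{11}$$

It is easy to prove that objective function (10) is convex. Thus it has a unique solution. An iterative updating algorithm is devised to solve it. The first step of the algorithm is to solve $W$ by fixing $M$.

Theorem 1. *Given $M$, the optimal $W$ in (10) can be calculated as*

$$W = (\Phi(X)^T \Phi(X) + \lambda I)^{-1} \Phi(X)^T (Y + E \odot M). \tag{12}$$

*Proof.* According to matrix theory, the optimal $W$ can be obtained by taking the derivative of (10) with respect to $W$ and setting it to zero. That is,

$$\Phi(X)^T \left(\Phi(X) W - (Y + E \odot M)\right) + \lambda W = 0, \tag{13}$$

which yields (12).

The second step of our algorithm is to solve $M$ by fixing $W$. Then (10) can be rewritten as $\min_{M} \|\Phi(X) W - Y - E \odot M\|_F^2$ subject to $M \ge 0$. $M$ can be obtained by solving the following optimization problem:

$$\min_{M} \|R - E \odot M\|_F^2 \quad \text{s.t. } M \ge 0, \tag{14}$$

where

$$R = \Phi(X) W - Y. \tag{15}$$

Considering the element in the $i$th row and $j$th column of $M$, we have

$$\min_{M_{ij}} (R_{ij} - E_{ij} M_{ij})^2 \quad \text{s.t. } M_{ij} \ge 0. \tag{16}$$

According to [25], the formula to calculate $M_{ij}$ is

$$M_{ij} = \max(E_{ij} R_{ij}, 0). \tag{17}$$

Therefore, the optimal solution of $M$ is

$$M = \max(E \odot R, 0). \tag{18}$$

In short, the first step of the algorithm is to solve $W$ by fixing $M$, and the second step is to solve $M$ by fixing $W$. In other words, (12) is calculated in the first step, and (15) and (18) are calculated in the second step. These two steps are repeated until the termination condition is satisfied.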
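The two alternating steps can be sketched with an explicit feature matrix, which corresponds to taking the identity as the feature map. This is an illustrative NumPy sketch, not the paper's implementation: the names `F`, `update_W`, `update_M`, and `objective`, the random toy data, the value of `lam`, and the fixed budget of 20 iterations are all assumptions.

```python
import numpy as np

def objective(F, Y, E, W, M, lam):
    """Regularized fitting error between F W and the relaxed targets Y + E * M."""
    return np.linalg.norm(F @ W - (Y + E * M)) ** 2 + lam * np.linalg.norm(W) ** 2

def update_W(F, Y, E, M, lam):
    """Step 1: ridge regression onto the relaxed targets with M fixed."""
    d = F.shape[1]
    return np.linalg.solve(F.T @ F + lam * np.eye(d), F.T @ (Y + E * M))

def update_M(F, Y, E, W):
    """Step 2: elementwise nonnegative solution M = max(E * R, 0) with W fixed."""
    R = F @ W - Y
    return np.maximum(E * R, 0.0)

rng = np.random.default_rng(0)
F = rng.normal(size=(12, 5))            # hypothetical explicit features, one row per sample
labels = rng.integers(0, 3, size=12)    # hypothetical class labels in {0, 1, 2}
Y = np.eye(3)[labels]                   # binary label matrix
E = np.where(Y == 1, -1.0, 1.0)         # negative dragging directions
M = np.zeros_like(Y)                    # start from the strict binary labels
lam = 0.1

history = []
for _ in range(20):                     # assumed fixed iteration budget
    W = update_W(F, Y, E, M, lam)
    M = update_M(F, Y, E, W)
    history.append(objective(F, Y, E, W, M, lam))
```

Each step minimizes the objective exactly over one block of variables, so the recorded values in `history` never increase. The iteration budget is an arbitrary choice for the sketch, since the text leaves the termination condition unspecified at this point. For the objective as coded here, `W = 0` with `M = Y` evaluates to zero, so how early the loop stops affects the classifier that results.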

##### 3.2. Integrating the Kernel Trick into the Optimization Model

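As mentioned above, we should repeatedly calculate (12) and (18). However, (12) and (18) involve $\Phi(X)$, which exists in the kernel feature space $\mathcal{H}$. Fortunately, we do not need to know what $\phi$ is and just adopt the kernel function (3). How to use the kernel function to eliminate the explicit $\Phi(X)$ is presented as follows.

Let

$$K = \Phi(X) \Phi(X)^T, \tag{19}$$

whose element in the $i$th row and $j$th column is $k(x_i, x_j)$.

By using the following formula [42] on matrix manipulations:

$$(P^{-1} + B^T S^{-1} B)^{-1} B^T S^{-1} = P B^T (B P B^T + S)^{-1}, \tag{20}$$

we use $\frac{1}{\lambda} I$, $\Phi(X)$, and $I$ instead of $P$, $B$, and $S$, respectively, having

$$(\Phi(X)^T \Phi(X) + \lambda I)^{-1} \Phi(X)^T = \Phi(X)^T (\Phi(X) \Phi(X)^T + \lambda I)^{-1}. \tag{21}$$

Then, we substitute it into (12); therefore

$$W = \Phi(X)^T (\Phi(X) \Phi(X)^T + \lambda I)^{-1} (Y + E \odot M), \tag{22}$$

$$W = \Phi(X)^T H (Y + E \odot M), \tag{23}$$

where $H = (K + \lambda I)^{-1}$.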
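The identity behind this substitution, $(\Phi^T \Phi + \lambda I)^{-1} \Phi^T = \Phi^T (\Phi \Phi^T + \lambda I)^{-1}$, can be checked numerically when an explicit feature matrix is available. In the sketch below, the random matrix `Phi`, its dimensions, and the value of `lam` are hypothetical stand-ins chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
n, D, lam = 6, 10, 0.5            # n samples, D explicit feature dimensions
Phi = rng.normal(size=(n, D))     # hypothetical explicit stand-in for Phi(X)

# Left side: inverse of a D x D matrix (the feature-space form).
left = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T)

# Right side: inverse of an n x n matrix built from inner products only.
right = Phi.T @ np.linalg.inv(Phi @ Phi.T + lam * np.eye(n))

max_abs_diff = np.abs(left - right).max()
```

The two sides agree up to rounding error. The practical point is that the right side only inverts an $n \times n$ matrix of inner products, which a kernel function can supply even when the feature dimension is very large or infinite.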

Actually, $H = (K + \lambda I)^{-1}$ in (23) stays fixed because it depends only on $X$ and the kernel function used, while $M$ changes during the iteration. Hence, to avoid calculating $\Phi(X)$ directly, in the first step we only need to calculate

$$Q = H (Y + E \odot M), \tag{24}$$

so that $W = \Phi(X)^T Q$.

The second step of the algorithm is to solve $M$ by calculating (15) and (18). By substituting (23) into (15), we have

$$R = \Phi(X) \Phi(X)^T H (Y + E \odot M) - Y = K H (Y + E \odot M) - Y. \tag{25}$$

Hence, in the second step we need to calculate (25) and (18).

Then the predicted label for a test sample $z$ is

$$\hat{y} = W^T \phi(z). \tag{26}$$

Intuitively, $W$ should be calculated by the iteration and then used to calculate the predicted label for the test sample $z$. However, by substituting (23) into (26), we have

$$\hat{y} = (Y + E \odot M)^T H \, \Phi(X) \phi(z) = (Y + E \odot M)^T H \, k_z, \tag{27}$$

where $k_z = [k(x_1, z), k(x_2, z), \ldots, k(x_n, z)]^T$.

Because $H$ depends only on $X$ and the kernel function used, we only need to calculate $M$ by the iteration; after the iteration, the predicted label for a test sample can be obtained by (27). As presented above, directly calculating $\Phi(X)$ can be avoided by using the kernel function.

In summary, we do not need to know what $\phi$ is and just adopt the kernel function during the iteration. The complete algorithm is summarized in Algorithm 1.
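As a rough companion to the two-step iteration and the prediction rule derived in this section, a minimal NumPy sketch might look as follows. It is not Algorithm 1 itself. The function names (`rbf_gram`, `kndlr_fit`, `kndlr_predict`), the intermediate matrices `H` and `Q`, the RBF bandwidth convention, the default parameter values, the fixed iteration budget `n_iter`, and the decision rule (class with the largest response) are all assumptions made for illustration.

```python
import numpy as np

def rbf_gram(A, B, sigma):
    """Kernel matrix with entries exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dist = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * (A @ B.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))

def kndlr_fit(X, labels, num_classes, lam=0.1, sigma=1.0, n_iter=30):
    """Alternate the two steps using kernel evaluations only; return Q = H (Y + E * M)."""
    n = X.shape[0]
    Y = np.eye(num_classes)[labels]           # binary label matrix
    E = np.where(Y == 1, -1.0, 1.0)           # negative dragging directions
    K = rbf_gram(X, X, sigma)                 # inner products in feature space
    H = np.linalg.inv(K + lam * np.eye(n))    # fixed across iterations
    M = np.zeros_like(Y)
    for _ in range(n_iter):                   # assumed stopping rule: fixed budget
        Q = H @ (Y + E * M)                   # step 1: implicit W = Phi(X)^T Q
        R = K @ Q - Y                         # step 2: residual without explicit features
        M = np.maximum(E * R, 0.0)            # elementwise nonnegative update
    return H @ (Y + E * M)

def kndlr_predict(X_train, Q, X_test, sigma=1.0):
    """Responses are K(test, train) @ Q; assumed rule: take the largest response."""
    return np.argmax(rbf_gram(X_test, X_train, sigma) @ Q, axis=1)

# Hypothetical XOR-style toy data: two classes that are not linearly separable.
rng = np.random.default_rng(0)
centers = np.array([[2.0, 2.0], [-2.0, -2.0], [2.0, -2.0], [-2.0, 2.0]])
center_labels = np.array([0, 0, 1, 1])
X_train = np.vstack([c + 0.1 * rng.normal(size=(10, 2)) for c in centers])
y_train = np.repeat(center_labels, 10)

Q = kndlr_fit(X_train, y_train, num_classes=2)
train_pred = kndlr_predict(X_train, Q, X_train)
center_pred = kndlr_predict(X_train, Q, centers)
```

Only kernel evaluations appear in the sketch: training uses the $n \times n$ kernel matrix, and prediction uses the kernel values between test and training samples. The toy data are chosen so that no linear boundary in the input space separates the two classes, which is the setting the kernel mapping is meant to handle.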