Mathematical Problems in Engineering

Volume 2016, Article ID 7347986, 7 pages

http://dx.doi.org/10.1155/2016/7347986

## Feature Scaling via Second-Order Cone Programming

School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China

Received 20 January 2016; Accepted 3 April 2016

Academic Editor: Julien Bruchon

Copyright © 2016 Zhizheng Liang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Feature scaling has attracted considerable attention during the past several decades because of its important role in feature selection. In this paper, a novel algorithm for learning scaling factors of features is proposed. It first assigns a nonnegative scaling factor to each feature of data and then adopts a generalized performance measure to learn the optimal scaling factors. It is of interest to note that the proposed model can be transformed into a convex optimization problem: second-order cone programming (SOCP). Thus the scaling factors of features in our method are globally optimal in some sense. Several experiments on simulated data, UCI data sets, and the gene data set are conducted to demonstrate that the proposed method is more effective than previous methods.

#### 1. Introduction

Selecting relevant and important features has been an active research area in statistics and data mining [1–6]. In real-world applications, one is often confronted with high-dimensional data such as text data and gene data. One important characteristic of these data sets is that some features may carry irrelevant or redundant information. It has been shown that a large number of irrelevant or redundant features degrades the performance of classifiers. In order to improve the comprehensibility of classification models and their classification performance, it is therefore of interest to explore how to reduce or remove irrelevant features of data.

Owing to their good generalization performance, support vector machines (SVMs) have become an effective tool for selecting relevant features over the past several years. In [1], SVMs are used as a subroutine in the feature selection process, and the SVM accuracy is optimized on the resulting subset of features. In [7], the gradient descent method is used to obtain the weights of features in terms of the SVM criterion. Moreover, the relevance of features can also be measured by their scaling factors; thus feature scaling is performed to measure the importance of features. In [8], scaling factors of features are tuned by minimizing the standard SVM empirical risk. Further, some estimates of the generalization error of SVMs [9] are used to automatically tune the scaling factors of features via the gradient descent algorithm. In [10], the smooth leave-one-out error is optimized to obtain the scaling factors of features, while an iterative feature scaling method for linear SVMs is proposed in [11]. In addition, some methods rely on other criteria to perform feature selection. For example, Maji [12] proposed a rough hypercuboid approach in approximation spaces to select relevant features of data. Liu et al. [13] combined global and local structures of data to perform feature selection. Li et al. [14] developed a stable feature selection algorithm. Li and Tang [15] combined nonnegative spectral analysis and redundancy control to select relevant features in the unsupervised case. Wang et al. [16] minimized global redundancy of data to obtain the optimal features. In order to improve the discriminant power, Tao et al. devised an effective discriminative feature selection method in [17]. Although these algorithms continue to contribute to the development of feature selection, some of them depend on the gradient descent method, which makes them sensitive to the choice of the gradient step and prone to falling into local minima.

In this paper, we propose a novel method for feature scaling in terms of the SVM criterion that prevents the scaling factors of features from falling into locally optimal solutions. The proposed method first introduces scaling factors of features into linear support vector machines and then uses a kind of generalized performance measure to learn these scaling factors. To make the generalized performance measure suitable for feature scaling, the measure is modified and formulated as a second-order cone programming problem. Finally, the scaling factors of features are obtained by solving a convex optimization problem. In addition, we carry out experiments on several data sets to evaluate the proposed method.

#### 2. Linear Support Vector Machines (LSVMs)

In this section, we briefly recall the basic idea of LSVMs. This class of algorithms, introduced in [11], has been shown to perform well in real applications.

Given the training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ with input data $\mathbf{x}_i \in \mathbb{R}^{m}$ and the corresponding binary class labels $y_i \in \{-1, +1\}$, the SVM finds an optimal hyperplane that separates the two classes such that the hyperplane has the greatest distance from the closest training vectors of each class. Often, the hyperplane is obtained by solving the following optimization problem:

$$\min_{\mathbf{w}, b} \; \frac{1}{2} \langle \mathbf{w}, \mathbf{w} \rangle \quad \text{s.t. } y_i \big(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\big) \ge 1, \; i = 1, \dots, n, \tag{1}$$

where $\mathbf{w} \in \mathbb{R}^{m}$ and $\langle \cdot, \cdot \rangle$ is the inner product of two vectors. When the data cannot be perfectly separated, a penalty term $C \sum_{i=1}^{n} \xi_i$ is added to the objective function in (1), where $C$ is a positive number. Accordingly, the following optimization problem is constructed:

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2} \langle \mathbf{w}, \mathbf{w} \rangle + C \sum_{i=1}^{n} \xi_i \quad \text{s.t. } y_i \big(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\big) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \dots, n. \tag{2}$$

It is found that (2) can be solved in the dual space of the Lagrange multipliers $\alpha_i$, $i = 1, \dots, n$. In such a case, (2) can be formulated as the following optimization problem:

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle \quad \text{s.t. } \sum_{i=1}^{n} \alpha_i y_i = 0, \; 0 \le \alpha_i \le C. \tag{3}$$

After $\boldsymbol{\alpha}$ and $b$ are obtained, the following decision function is used to classify samples:

$$f(\mathbf{x}) = \operatorname{sign}\Big( \sum_{i=1}^{n} \alpha_i y_i \langle \mathbf{x}_i, \mathbf{x} \rangle + b \Big). \tag{4}$$

Note that although all $n$ training samples appear in (4), only the samples with $\alpha_i \neq 0$ play a role in the decision function. The samples with $\alpha_i \neq 0$ are called support vectors.
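As a concrete illustration of (2)–(4), the following sketch solves the dual problem (3) for a toy two-dimensional data set with a generic solver; the data and the value of `C` are arbitrary choices for the example:

```python
import numpy as np
from scipy.optimize import minimize

# Toy 2-D data: two linearly separable classes (illustrative only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -3.0], [-2.5, -3.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)
C = 10.0                                      # soft-margin penalty
G = (y[:, None] * X) @ (y[:, None] * X).T     # G_ij = y_i y_j <x_i, x_j>

# Dual (3): maximize sum(alpha) - 0.5 * alpha' G alpha
# subject to 0 <= alpha_i <= C and sum_i alpha_i y_i = 0.
obj = lambda a: -(a.sum() - 0.5 * a @ G @ a)
cons = {"type": "eq", "fun": lambda a: a @ y}
res = minimize(obj, np.zeros(n), bounds=[(0.0, C)] * n,
               constraints=[cons], method="SLSQP")
alpha = res.x

# Recover w from the stationarity condition; recover b from a support vector
# (for 0 < alpha_i < C, y_i (w . x_i + b) = 1, so b = y_i - w . x_i).
w = (alpha * y) @ X
sv = np.argmax(alpha)
b = y[sv] - w @ X[sv]

decision = lambda x: np.sign(w @ x + b)       # decision function (4)
print([decision(x) for x in X])
```

On this separable toy set the learned hyperplane classifies every training point correctly, and only the points closest to the boundary receive nonzero multipliers.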

#### 3. Feature Scaling Using Second-Order Cone Programming (SOCP)

In this section, we propose a feature scaling method based on the generalized performance measure. First, it would be helpful to introduce the generalized performance measure.

##### 3.1. Generalized Performance Measure

Note that the generalized performance measure [18] is optimized not only over the function space but also over a convex cone of positive semidefinite matrices. In the case of the SVM criterion, the following generalized performance measure is used to choose proper kernel parameters:

$$\min_{K} \max_{\boldsymbol{\alpha}} \; 2\boldsymbol{\alpha}^{T}\mathbf{e} - \boldsymbol{\alpha}^{T} \operatorname{diag}(\mathbf{y})\, K \, \operatorname{diag}(\mathbf{y})\, \boldsymbol{\alpha} \quad \text{s.t. } \boldsymbol{\alpha}^{T}\mathbf{y} = 0, \; 0 \le \boldsymbol{\alpha} \le C\mathbf{e}, \tag{5}$$

where $K = \sum_{t} \mu_{t} K_{t}$ is a linear combination of different kernel matrices such as Gaussian kernels and polynomial kernels, $\mu_{t} \ge 0$, $K \succeq 0$, and $\operatorname{trace}(K) = c$.

Instead of adopting multiple kernels, we consider the linear kernel in this paper. It is clear that the Gram matrix in linear support vector machines can be written as a linear combination over the dimensions of the features. In other words, letting $\mathbf{x}^{d} \in \mathbb{R}^{n}$ denote the vector formed by the $d$th feature of all $n$ training samples, the following equation is constructed:

$$K = X X^{T} = \sum_{d=1}^{m} \mathbf{x}^{d} \big(\mathbf{x}^{d}\big)^{T}. \tag{6}$$

In [19], each feature of data is associated with an indicator value. In this paper, based on a similar idea, we introduce the scaling factors $\beta_d \ge 0$ ($d = 1, \dots, m$) of the features of data into the Gram matrix $K$. Accordingly, one can obtain

$$K_{\boldsymbol{\beta}} = \sum_{d=1}^{m} \beta_d \, \mathbf{x}^{d} \big(\mathbf{x}^{d}\big)^{T}, \tag{7}$$

where $\sum_{d=1}^{m} \beta_d = m$. From (7), it is observed that the $d$th feature is removed if $\beta_d = 0$. Further, if the scaling factors $\beta_d$ are sparse, this corresponds to selecting part of the features. If the scaling factors of features are not sparse, a large value of $\beta_d$ indicates a useful and important feature. As a result, feature selection can be performed by removing the features that correspond to small scaling factors [8, 9]. In addition, it should be pointed out that the condition $\beta_d \ge 0$ should be imposed in order that the matrix $K_{\boldsymbol{\beta}}$ is positive semidefinite. If the kernel in (5) takes the form $K_{\boldsymbol{\beta}}$ in (7), then one has

$$\min_{\boldsymbol{\beta}} \max_{\boldsymbol{\alpha}} \; 2\boldsymbol{\alpha}^{T}\mathbf{e} - \boldsymbol{\alpha}^{T} \operatorname{diag}(\mathbf{y})\, K_{\boldsymbol{\beta}} \, \operatorname{diag}(\mathbf{y})\, \boldsymbol{\alpha} \quad \text{s.t. } \boldsymbol{\alpha}^{T}\mathbf{y} = 0, \; 0 \le \boldsymbol{\alpha} \le C\mathbf{e}, \; \beta_d \ge 0, \; \sum_{d=1}^{m} \beta_d = m. \tag{8}$$
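The decomposition in (7) can be checked numerically; the data and scaling factors below are arbitrary choices for the example. Scaling each feature by the square root of its scaling factor and forming the ordinary Gram matrix reproduces the scaled Gram matrix of (7), and setting a scaling factor to zero is equivalent to deleting that feature:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))          # 5 samples, 4 features
beta = np.array([1.5, 0.0, 0.8, 1.7])    # nonnegative scaling factors

# Scaled Gram matrix as a sum of rank-one per-feature matrices, as in (7).
K_beta = sum(b * np.outer(X[:, d], X[:, d]) for d, b in enumerate(beta))

# Equivalent form: scale each feature by sqrt(beta_d), then take the Gram matrix.
X_scaled = X * np.sqrt(beta)
assert np.allclose(K_beta, X_scaled @ X_scaled.T)

# A zero scaling factor removes the feature: the same matrix results from
# deleting column 1 of X before forming the Gram matrix.
X_drop = np.delete(X, 1, axis=1) * np.sqrt(np.delete(beta, 1))
assert np.allclose(K_beta, X_drop @ X_drop.T)
```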

Applying an idea in [18], one can recast (8) as a semidefinite programming problem, which can be solved using a general-purpose solver such as SeDuMi [20] or SDPT3 [21]. Note that $K_{\boldsymbol{\beta}}$ is a linear combination of rank-one matrices $\mathbf{x}^{d}(\mathbf{x}^{d})^{T}$. Equation (8) can also be formulated as a quadratically constrained quadratic program (QCQP), which is a special case of SDP.

As will be shown in the following, directly solving (8) may not be suitable for feature scaling. To determine why this happens, in the following we first analyze the characteristics of (8). To this end, we start by stating the following definition.

*Definition 1.* If a pair of variables $(\boldsymbol{\alpha}^{*}, \boldsymbol{\beta}^{*})$ exists such that

$$f(\boldsymbol{\alpha}, \boldsymbol{\beta}^{*}) \le f(\boldsymbol{\alpha}^{*}, \boldsymbol{\beta}^{*}) \le f(\boldsymbol{\alpha}^{*}, \boldsymbol{\beta}) \tag{9}$$

for all feasible $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, where $f(\boldsymbol{\alpha}, \boldsymbol{\beta})$ denotes the objective function of (8), then $(\boldsymbol{\alpha}^{*}, \boldsymbol{\beta}^{*})$ is a saddle point of (8).

The right-hand side of (9) says that $\boldsymbol{\beta}^{*}$ minimizes $f(\boldsymbol{\alpha}^{*}, \boldsymbol{\beta})$ over feasible $\boldsymbol{\beta}$. The left-hand side of (9) says that $\boldsymbol{\alpha}^{*}$ maximizes $f(\boldsymbol{\alpha}, \boldsymbol{\beta}^{*})$ over feasible $\boldsymbol{\alpha}$. Accordingly, it follows that $(\boldsymbol{\alpha}^{*}, \boldsymbol{\beta}^{*})$ is a saddle point if and only if $\min_{\boldsymbol{\beta}} \max_{\boldsymbol{\alpha}} f(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \max_{\boldsymbol{\alpha}} \min_{\boldsymbol{\beta}} f(\boldsymbol{\alpha}, \boldsymbol{\beta})$ [22]. If saddle points of (8) exist, strong duality holds and the optimal values of $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are obtained simultaneously. One can also observe that (8) is a linear programming problem with respect to $\boldsymbol{\beta}$ if the optimal value of $\boldsymbol{\alpha}$ is given. This shows that the optimal $\boldsymbol{\beta}$ can be obtained by searching over the extreme points of a linear programming (LP) problem once $\boldsymbol{\alpha}$ is known. Although nonextreme points may also be optimal for an LP, there always exist extreme points that attain the optimal value, so one may obtain an extreme point of the linear programming problem as the optimal $\boldsymbol{\beta}$. Because (8) contains only one equality constraint with respect to $\boldsymbol{\beta}$, such an extreme point is overly sparse: only a few scaling factors are nonzero. As a result, one cannot evaluate the importance of most features in such a case.

Thus, in some cases only a few scaling factors are nonzero, which makes the solution unsuitable for feature selection because too few features are chosen. To deal with this difficulty, we add the L2-norm of $\boldsymbol{\beta}$ to the objective function of (8). Accordingly, (8) takes the following form:

$$\min_{\boldsymbol{\beta}} \max_{\boldsymbol{\alpha}} \; 2\boldsymbol{\alpha}^{T}\mathbf{e} - \boldsymbol{\alpha}^{T} \operatorname{diag}(\mathbf{y})\, K_{\boldsymbol{\beta}} \, \operatorname{diag}(\mathbf{y})\, \boldsymbol{\alpha} + \lambda \|\boldsymbol{\beta}\|_{2}^{2} \quad \text{s.t. } \boldsymbol{\alpha}^{T}\mathbf{y} = 0, \; 0 \le \boldsymbol{\alpha} \le C\mathbf{e}, \; \beta_d \ge 0, \; \sum_{d=1}^{m} \beta_d = m, \tag{10}$$

where $\lambda > 0$ is a regularization parameter.

Likewise, one can transform (10) into a semidefinite programming (SDP) problem.
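The effect of the added L2-norm term can be seen in a small numerical sketch. The linear objective below is a hypothetical stand-in for the LP in the scaling factors that arises when the Lagrange multipliers are fixed: minimizing it over a simplex puts all the mass on a single vertex (the overly sparse case), while adding an L2 penalty spreads the weight across coordinates:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linear objective over the simplex {b >= 0, sum(b) = 1},
# standing in for the LP in the scaling factors when alpha is fixed.
c = np.array([0.30, 0.31, 0.33, 0.36])
cons = [{"type": "eq", "fun": lambda b: b.sum() - 1.0}]
bnds = [(0.0, None)] * 4
b0 = np.full(4, 0.25)

# Pure LP: the optimum sits at a vertex, all mass on the smallest c_d.
lp = minimize(lambda b: c @ b, b0, bounds=bnds, constraints=cons,
              method="SLSQP").x
print(np.round(lp, 3))    # concentrated on one coordinate

# With an L2 penalty the optimum moves into the interior of the simplex.
lam = 1.0
reg = minimize(lambda b: c @ b + lam * b @ b, b0, bounds=bnds,
               constraints=cons, method="SLSQP").x
print(np.round(reg, 3))   # weight spread over all coordinates
```

Here the regularized solution can be verified by hand: the KKT conditions give $b_d = (\nu - c_d)/(2\lambda)$ with $\nu$ chosen so the weights sum to one, so every coordinate stays strictly positive.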

##### 3.2. Second-Order Cone Programming (SOCP) Problems

Solving SDPs remains computationally expensive even with the advances in interior point methods. One way to reduce this computational cost is to transform (10) into a second-order cone programming (SOCP) problem. Second-order cone programs are convex optimization problems in which a linear function is minimized over the intersection of an affine manifold with the Cartesian product of second-order cones. Interior point methods that solve SOCPs directly have a much better worst-case complexity than those for SDPs. As a result, solving SOCP problems is more efficient than solving SDP problems in practice. It is also noted that SOCP has found many applications in machine learning in recent years [23, 24]. Miyashiro and Takano [25] used mixed-integer second-order cone programming formulations for variable selection in linear regression. More interestingly, the SOCP technique can be used to solve support vector machine problems [26, 27].

Applying the techniques in [28, 29], one can formulate (10) as an SOCP problem, denoted by (11). One can solve (11) using the MOSEK optimization software (http://www.mosek.com/). From (11), one can see that it contains the L1-norm in the constraints; we refer to (11) as L1L2SOCP for clarity. In fact, the L1-norm constraint may be removed from (11), since an equality constraint on $\boldsymbol{\beta}$ is already present. In such a case, we refer to (11) as the L2-norm SOCP (L2SOCP). It should be pointed out that the computational complexity of solving (11) grows with both the number of features and the number of samples. This method is therefore effective when the training set contains a large number of samples and becomes less effective when the number of features is very large. It should also be noted that one generally obtains nonsparse optimal solutions of $\boldsymbol{\beta}$ when the L2-norm of $\boldsymbol{\beta}$ is used. The scaling factors of features obtained by solving (11) are globally optimal, which distinguishes our method from previous methods, where the scaling factors are often only locally optimal. In order to obtain the decision function, the Lagrange multipliers $\boldsymbol{\alpha}$ and the bias $b$ can be recovered from the optimal solution through equations involving the pseudoinverse of a matrix and an identity matrix.
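As an illustration of the optimization in (10) on toy data (not the SOCP formulation (11) itself), the following sketch minimizes the inner dual value plus an L2 term over the scaling factors by nesting two generic solvers. The data, the regularization parameter `lam`, and the normalization `sum(beta) = m` are assumptions for the example; the gradient of the inner maximum with respect to the scaling factors follows Danskin's theorem:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, m, C, lam = 20, 4, 1.0, 0.1

# Toy data: features 0 and 1 carry the class signal, 2 and 3 are noise.
y = np.repeat([1.0, -1.0], n // 2)
X = rng.standard_normal((n, m))
X[:, 0] += 2.0 * y
X[:, 1] += 1.0 * y

def dual(beta):
    """Inner maximum of (10): soft-margin dual value for the scaled kernel."""
    Xs = X * np.sqrt(np.maximum(beta, 0.0))
    G = (y[:, None] * Xs) @ (y[:, None] * Xs).T
    res = minimize(lambda a: -(2.0 * a.sum() - a @ G @ a),
                   np.full(n, 0.5 * C), bounds=[(0.0, C)] * n,
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}],
                   method="SLSQP")
    return -res.fun, res.x

def outer_obj(beta):
    val, _ = dual(beta)
    return val + lam * beta @ beta

def outer_grad(beta):
    # Danskin: d(max_a ...)/d(beta_d) = -(sum_i a_i y_i x_id)^2 at the maximizer.
    _, a = dual(beta)
    s = (a * y) @ X
    return -(s ** 2) + 2.0 * lam * beta

# Outer minimization over {beta >= 0, sum(beta) = m}.
outer = minimize(outer_obj, np.full(m, 1.0), jac=outer_grad,
                 bounds=[(0.0, None)] * m,
                 constraints=[{"type": "eq", "fun": lambda b: b.sum() - m}],
                 method="SLSQP")
beta = outer.x
print(np.round(beta, 2))   # informative features should get larger weights
```

Minimizing the margin-based measure rewards features that separate the classes, so the learned scaling factors concentrate on the informative dimensions; the paper's actual method obtains the same kind of solution globally by solving the convex problem (11).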

#### 4. Experimental Results

In this section, we carry out the experiments on simulated data, UCI data sets, and the gene data set to evaluate the proposed optimization model.

##### 4.1. Effect of Irrelevant Features

To evaluate the proposed method in the presence of irrelevant features, we generate Gaussian data from two classes whose covariance matrices are both identity matrices. We compare the proposed method with classical support vector machines (CSVMs), SVMs with multiple parameters based on the radius-margin bound (RW), and SVMs with multiple parameters based on the span bound (Span). All the algorithms are trained on a training set consisting of 100 samples from each class and are tested on an independent test set of 1000 samples. In order to reduce variations in performance, the experimental results are averaged over 20 runs. Figure 1(a) shows the error rate of each method as the number of irrelevant features increases. Figure 1(b) shows the scaling factors obtained by our method.