Journal of Optimization

Volume 2016 (2016), Article ID 2659012, 7 pages

http://dx.doi.org/10.1155/2016/2659012

## Evidence Maximization Technique for Training of Elastic Nets

^{1}Moscow Institute of Physics and Technology, Moscow 141700, Russia
^{2}Institute for Systems Analysis, Russian Academy of Sciences, Prospekt 60-Let Octyabria 9, Moscow 117312, Russia
^{3}MV Lomonosov Moscow State University, Leninskie Gory 1, Moscow 119991, Russia

Received 15 February 2016; Revised 10 May 2016; Accepted 15 May 2016

Academic Editor: Manlio Gaudioso

Copyright © 2016 Igor Dubnov et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

This paper presents a technique of evidence maximization for automatic tuning of regularization parameters of elastic nets, which allows tuning many parameters simultaneously. This technique was applied to handwritten digit recognition. Experiments showed its ability to train either models with high accuracy of recognition or highly sparse models with reasonable accuracy.

#### 1. Introduction

One of the important aspects of machine learning is choosing an appropriate subset of the (possibly huge) set of all virtually available features, such that the trained model depends only on this subset of features. A good choice (*feature selection*, [1]) can both speed up the training and improve the quality of its result. It depends not only on the particular problem, but also on the data available for training.

Feature selection can either precede the learning itself (e.g., entropy-based or correlation analysis) or be a built-in part of the learning process (e.g., learning with $\ell_1$-regularization, such as LASSO regression and $\ell_1$-SVM) [2]. This paper deals with the latter case only.

It is known that learning with $\ell_1$-regularization can produce rather sparse models which depend on rather few features, but learning with $\ell_2$-regularization usually produces more accurate models. In [3] a mixed regularization called "*elastic net*" was proposed. Let $f(x; w)$ be a model parameterized by $w$, predicting response $y$ by feature vector $x$, and let $L(f(x; w), y)$ be the cost of prediction $f(x; w)$ provided the true response is $y$. Then training of such a model with elastic net regularization on the set of samples $(x_1, y_1), \dots, (x_N, y_N)$ using the loss minimization (a.k.a. ERM—empirical risk minimization) method or, briefly, "training of an elastic net" is the minimization problem

$$\sum_{j=1}^{N} L(f(x_j; w), y_j) + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2 \to \min_w, \tag{1}$$

where $\|\cdot\|_1$ and $\|\cdot\|_2$ stand for $\ell_1$- and $\ell_2$-norms, respectively, and $\lambda_1$ and $\lambda_2$ are nonnegative regularization parameters. It is shown experimentally in [3] that by varying the parameters $\lambda_1$ and $\lambda_2$ one can balance between the sparsity of the model and the accuracy of its prediction.
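For concreteness, the minimization problem above can be sketched in code. The following is a minimal illustration for a linear least-squares loss; the function name and the plain-Python style are ours, not the paper's:

```python
def elastic_net_objective(w, X, y, lam1, lam2):
    """Empirical risk with the elastic net penalty:
    sum_j (w . x_j - y_j)^2 + lam1 * ||w||_1 + lam2 * ||w||_2^2."""
    loss = sum((sum(wi * xi for wi, xi in zip(w, x)) - t) ** 2
               for x, t in zip(X, y))
    l1 = sum(abs(wi) for wi in w)       # sparsity-inducing term
    l2sq = sum(wi * wi for wi in w)     # accuracy-stabilizing term
    return loss + lam1 * l1 + lam2 * l2sq
```

Increasing `lam1` relative to `lam2` pushes the minimizer toward sparse weight vectors, which is exactly the trade-off the paper exploits.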

In this paper elastic nets are used to regularize multiclass logistic regression. A method of tuning more general regularization parameters than $\lambda_1$ and $\lambda_2$ above is described. This method is tested on a handwritten digit recognition problem.

The rest of this paper is organized as follows. Section 2 presents the mathematical model and the elastic net in detail. Section 3 describes the learning algorithm and the evidence maximization technique for tuning regularization parameters of the elastic net; this technique is the main subject of this paper. Section 4 describes experiments with elastic nets for digit recognition. Section 5 exposes the results of experiments. Section 6 summarizes the main results of experiments and discusses further possible applications of the proposed technique.

#### 2. Mathematical Model

Consider multinomial classification in both its deterministic and probabilistic variants: given a feature vector $x \in \mathbb{R}^d$, either predict the correct label of one of $K$ classes to which the vector belongs or estimate the conditional probability of each class label. Probabilistic classification is considered primary, and in deterministic classification a class label (usually the most probable one) will be predicted.

Let $\tilde{x} = (1, x) \in \mathbb{R}^{d+1}$ stand for the *augmented feature vector*. To estimate the class probabilities the multinomial linear logistic regression model

$$P(y = k \mid x; W) = \frac{\exp(w_k \tilde{x})}{\sum_{l=1}^{K} \exp(w_l \tilde{x})}, \quad k = 1, \dots, K, \tag{2}$$

will be trained. The model parameter matrix $W$ consists of $(d+1)$-dimensional rows $w_k$. To train the model means to choose some "good" parameter $W$.
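The model above can be sketched as follows. This is an illustrative stand-alone implementation; the function name and the max-subtraction trick for numerical stability are our additions:

```python
import math

def softmax_probs(W, x):
    """P(y = k | x) for multinomial linear logistic regression:
    row w_k of W times the augmented feature vector (1, x),
    normalized by softmax."""
    xt = [1.0] + list(x)                 # augmented feature vector
    scores = [sum(wki * xi for wki, xi in zip(wk, xt)) for wk in W]
    m = max(scores)                      # stabilize the exponentials
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

With all rows of `W` equal the distribution is uniform; the deterministic prediction is simply the index of the largest probability.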

To do this we use a training dataset $(X, Y) = ((x_1, y_1), \dots, (x_N, y_N))$ of couples which are supposed to be i.i.d. random. It can also be written in a transposed way as $X = (x_1, \dots, x_N)$ and $Y = (y_1, \dots, y_N)$. Training tries to maximize the posterior of $W$ given some prior $P(W)$ and the training set $(X, Y)$. Since

$$P(W \mid X, Y) = \frac{P(Y \mid X, W)\, P(W)}{P(Y \mid X)} \tag{3}$$

and the denominator does not depend on $W$, maximization of posterior probability is equivalent to maximization of the numerator or of its logarithm:

$$\log P(W) + \log P(Y \mid X, W) \to \max_W. \tag{4}$$

The second summand in (4) is the log likelihood of the model $W$, while the first one depends on the choice of the prior.

Let the $K \times d$ matrix $M$ stand for $W$ without the bias column $b$. The prior is usually taken independent of the bias, so $P(W) = P(M)$. In the simplest cases when spherical Gaussian or Laplacian distributions are taken as priors, training (4) turns into an optimization problem with $\ell_2$- or $\ell_1$-regularization, respectively.

Similarly, elastic nets are obtained from the prior

$$P(M \mid \lambda_1, \lambda_2) = \frac{\exp(-\lambda_1 \|M\|_1 - \lambda_2 \|M\|_2^2)}{C(\lambda_1, \lambda_2)^{Kd}}, \tag{5}$$

where

$$C(\lambda_1, \lambda_2) = \int_{-\infty}^{\infty} e^{-\lambda_1 |m| - \lambda_2 m^2}\, dm = 2 \sqrt{\frac{\pi}{\lambda_2}}\, e^{\lambda_1^2 / (4 \lambda_2)}\, \Phi\!\left(-\frac{\lambda_1}{\sqrt{2 \lambda_2}}\right) \tag{6}$$

(remember that the space of $M$ is $Kd$-dimensional) and $\Phi$ denotes the cumulative function of the standard one-dimensional Gaussian distribution:

$$\Phi(t) = \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{t} e^{-u^2 / 2}\, du. \tag{7}$$

To simplify calculations, instead of the function $\Phi$ we use

$$\Psi(t) = e^{t^2 / 2}\, \Phi(-t). \tag{8}$$

For instance, the normalization factor becomes

$$C(\lambda_1, \lambda_2) = 2 \sqrt{\frac{\pi}{\lambda_2}}\, \Psi\!\left(\frac{\lambda_1}{\sqrt{2 \lambda_2}}\right). \tag{9}$$
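The closed form of the one-dimensional normalization factor $C(\lambda_1, \lambda_2) = \int \exp(-\lambda_1 |m| - \lambda_2 m^2)\,dm$ can be checked numerically. This is an illustrative sketch, assuming the standard Gaussian cdf is expressed through `math.erfc`; the function names are ours:

```python
import math

def C(lam1, lam2):
    """Closed form of the normalization factor:
    C = 2*sqrt(pi/lam2) * exp(lam1^2/(4*lam2)) * Phi(-lam1/sqrt(2*lam2)),
    using Phi(-t) = erfc(t/sqrt(2)) / 2."""
    t = lam1 / math.sqrt(2.0 * lam2)
    phi_neg_t = 0.5 * math.erfc(t / math.sqrt(2.0))
    return (2.0 * math.sqrt(math.pi / lam2)
            * math.exp(lam1 ** 2 / (4.0 * lam2)) * phi_neg_t)

def C_numeric(lam1, lam2, half_width=12.0, n=40000):
    """Brute-force trapezoidal check of the same integral."""
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n + 1):
        m = -half_width + i * h
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * math.exp(-lam1 * abs(m) - lam2 * m * m)
    return total * h
```

For $\lambda_1 = 0$ the factor reduces to the Gaussian integral $\sqrt{\pi / \lambda_2}$, a quick sanity check of the formula.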

Plugging (5) into (4) turns training of the elastic net into the optimization problem:

$$-\log P(Y \mid X, W) + \lambda_1 \|M\|_1 + \lambda_2 \|M\|_2^2 \to \min_W. \tag{10}$$

Both prior (5) and the regularization summands in (10) are isotropic with respect to all features. However, the features themselves might be unequal by their nature. To respect such an inequality we partition all features into groups of features of the same nature. For example, all pixel values of an image have the same nature and will belong to the same group of features, while computed features, such as the aspect ratio, fall into other groups.

Let us fix a partition

$$\{1, \dots, d\} = I_1 \cup \dots \cup I_G \tag{11}$$

of the set of feature indices into $G$ disjoint subsets of cardinalities $d_1, \dots, d_G$ and define separate regularization parameters $\lambda_1^{(g)}$ and $\lambda_2^{(g)}$ for each group. Let $M^{(g)}$ denote the submatrix of $M$ formed by the columns with indices in $I_g$. Then training of generic elastic net (10) turns into

$$-\log P(Y \mid X, W) + \sum_{g=1}^{G} \left(\lambda_1^{(g)} \|M^{(g)}\|_1 + \lambda_2^{(g)} \|M^{(g)}\|_2^2\right) \to \min_W \tag{12}$$

and training of the elastic net for linear logistic regression (2) turns into

$$-\sum_{j=1}^{N} \log \frac{\exp(w_{y_j} \tilde{x}_j)}{\sum_{l=1}^{K} \exp(w_l \tilde{x}_j)} + \sum_{g=1}^{G} \left(\lambda_1^{(g)} \|M^{(g)}\|_1 + \lambda_2^{(g)} \|M^{(g)}\|_2^2\right) \to \min_W. \tag{13}$$
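The group-wise penalty described above can be sketched directly. Here the parameter matrix is represented as a list of rows and each group as a list of column indices; these representation choices and the function name are our assumptions:

```python
def grouped_penalty(M_rows, groups, lam1, lam2):
    """Group-wise elastic net penalty: for each feature group g,
    lam1[g] * ||M_g||_1 + lam2[g] * ||M_g||_2^2, where M_g is the
    submatrix of columns whose indices lie in groups[g]."""
    total = 0.0
    for g, idx in enumerate(groups):
        l1 = sum(abs(row[j]) for row in M_rows for j in idx)
        l2sq = sum(row[j] ** 2 for row in M_rows for j in idx)
        total += lam1[g] * l1 + lam2[g] * l2sq
    return total
```

With a single group this reduces to the isotropic penalty of the plain elastic net.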

It is easy to see that optimization problem (13) is convex for any training set and nonnegative $\lambda_1^{(g)}$ and $\lambda_2^{(g)}$. The choice of values of regularization parameters $\lambda_1^{(g)}$ and $\lambda_2^{(g)}$, which is the subject of this paper, will be discussed later in Section 3.2.

#### 3. Learning Technique

##### 3.1. Nonsmooth Convex Optimization

Standard gradient methods are not applicable to minimization problems (10) and (13) because they contain the nonsmooth terms $\|M\|_1$ and $\|M^{(g)}\|_1$. So the algorithm proposed by Nesterov in [4] for minimization of sums of smooth and simple nonsmooth convex functions is used. Nesterov’s algorithm provides the best convergence rate at a moderate number of steps (less than the number of variables, which is equal to $K(d+1)$ in (10) and (13)) among all known methods of nonsmooth optimization [5].

Nesterov’s algorithm can exploit strong convexity ($\mu$-convexity) of the target function and converges the faster, the bigger $\mu$ can be guaranteed in advance. The target function in (13) is not strongly convex in the bias column $b$, but it would be strongly convex if $\ell_2$-regularization were applied to all parameters including $b$.
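As an illustration of how the nonsmooth $\ell_1$ term is handled, here is a simplified proximal-gradient sketch with soft-thresholding. It deliberately omits Nesterov's momentum and is not the exact algorithm of [4]; all names are ours:

```python
def soft_threshold(v, tau):
    """Prox operator of tau * |.|: shrinks v toward zero."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0

def prox_gradient(grad_smooth, w, lam1, lam2, step, iters):
    """Simplified proximal-gradient sketch for minimizing
    smooth(w) + lam1 * ||w||_1 + lam2 * ||w||_2^2 (no momentum)."""
    for _ in range(iters):
        g = grad_smooth(w)
        # gradient step on the smooth part plus the quadratic penalty
        w = [wi - step * (gi + 2.0 * lam2 * wi) for wi, gi in zip(w, g)]
        # prox step on the nonsmooth l1 part
        w = [soft_threshold(wi, step * lam1) for wi in w]
    return w
```

For example, minimizing $(w - 3)^2 + |w|$ this way converges to $w = 2.5$, where the smooth gradient balances the $\ell_1$ subgradient.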

Consider the following modification of problem (13).

(1) Estimate the bias column $b$ as

$$\hat{b}_k = \log \frac{N_k}{N}, \quad k = 1, \dots, K, \tag{14}$$

where $N_k$ is the number of training samples of class $k$. The estimate $\hat{b}$ is a solution of the minimization problem

$$-\sum_{j=1}^{N} \log \frac{\exp(b_{y_j})}{\sum_{l=1}^{K} \exp(b_l)} \to \min_b, \tag{15}$$

which is nothing but maximum likelihood training of the featureless logistic regression model.

(2) Choose some $\lambda_0 > 0$ and instead of (13) solve

$$-\sum_{j=1}^{N} \log \frac{\exp(w_{y_j} \tilde{x}_j)}{\sum_{l=1}^{K} \exp(w_l \tilde{x}_j)} + \lambda_0 \|b - \hat{b}\|_2^2 + \sum_{g=1}^{G} \left(\lambda_1^{(g)} \|M^{(g)}\|_1 + \lambda_2^{(g)} \|M^{(g)}\|_2^2\right) \to \min_W. \tag{16}$$
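Step (1) has a closed form — the bias of the featureless model is just the log class frequency — which can be sketched directly (function name ours):

```python
import math
from collections import Counter

def bias_estimate(labels):
    """Closed-form bias column of the featureless logistic model:
    b_k = log(N_k / N), the log frequency of class k."""
    n = len(labels)
    counts = Counter(labels)
    return {k: math.log(c / n) for k, c in counts.items()}
```

Any constant shift of all $b_k$ gives the same class probabilities, so this is one representative of the family of maximizers of (15).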

The target function in (16) is strongly convex with parameter $\mu = 2 \min(\lambda_0, \lambda_2^{(1)}, \dots, \lambda_2^{(G)})$, which is positive whenever all these regularization parameters are positive.

##### 3.2. Evidence Maximization

To train elastic nets (10), (13), or (16) successfully, some reasonable values of the regularization parameters $\lambda_1^{(g)}$ and $\lambda_2^{(g)}$ (hyperparameters) are required. In machine learning problems with one or at most two hyperparameters (e.g., in SVM [1]) their values can be found by grid search. However, there are $2G + 1$ hyperparameters in generalized elastic net (16) and we are interested in the case $G \gg 1$. In this case, a reasonable way to optimize them is evidence maximization. The use of evidence maximization for estimation of hyperparameters of ridge regression and other Gaussian-based models is well known [6]. For non-Gaussian elastic nets the evidence of hyperparameters can be neither computed nor maximized exactly and will be approximated rather roughly.

Let prior $P(M \mid \lambda_1, \lambda_2)$ depend on two hyperparameters $\lambda_1$ and $\lambda_2$ as in (5). Then posterior (3) with $\lambda_1$ and $\lambda_2$ indicated explicitly is

$$P(W \mid X, Y, \lambda_1, \lambda_2) = \frac{P(Y \mid X, W)\, P(W \mid \lambda_1, \lambda_2)}{P(Y \mid X, \lambda_1, \lambda_2)}. \tag{17}$$

The denominator is ignored in maximization of posterior (3) because it does not depend on $W$. However, it does depend on $\lambda_1$ and $\lambda_2$. This denominator is called the *evidence* of parameters $\lambda_1$ and $\lambda_2$ with respect to the training set $(X, Y)$. Despite its special name, it is a usual likelihood: not the likelihood of a single model as in (4), but the likelihood of the whole probability space of models defined by hyperparameters $\lambda_1$ and $\lambda_2$.

For prior (5) the evidence of the pair $(\lambda_1, \lambda_2)$ is

$$P(Y \mid X, \lambda_1, \lambda_2) = \int P(Y \mid X, W)\, \frac{\exp(-\lambda_1 \|M\|_1 - \lambda_2 \|M\|_2^2)}{C(\lambda_1, \lambda_2)^{Kd}}\, dW \tag{18}$$

and evidence maximization is equivalent to the minimization

$$-\log \int P(Y \mid X, W)\, e^{-\lambda_1 \|M\|_1 - \lambda_2 \|M\|_2^2}\, dW + Kd \log \left(2 \sqrt{\frac{\pi}{\lambda_2}}\, \Psi\!\left(\frac{\lambda_1}{\sqrt{2 \lambda_2}}\right)\right) \to \min_{\lambda_1, \lambda_2}. \tag{19}$$

The normalization factor is rewritten using formula (9) here.

The gradient of (19) is

$$\frac{\partial}{\partial \lambda_1} = \mathbb{E} \|M\|_1 + Kd\, \frac{\partial \log C(\lambda_1, \lambda_2)}{\partial \lambda_1}, \qquad \frac{\partial}{\partial \lambda_2} = \mathbb{E} \|M\|_2^2 + Kd\, \frac{\partial \log C(\lambda_1, \lambda_2)}{\partial \lambda_2}, \tag{20}$$

where $\mathbb{E}$ stands for the expectation with respect to the posterior distribution of $W$ proportional to

$$P(Y \mid X, W)\, e^{-\lambda_1 \|M\|_1 - \lambda_2 \|M\|_2^2}. \tag{21}$$

To minimize (19), instead of traditional gradient steps the transformation

$$\lambda_1 \leftarrow \lambda_1\, \frac{-Kd\, \partial \log C(\lambda_1, \lambda_2) / \partial \lambda_1}{\mathbb{E} \|M\|_1}, \qquad \lambda_2 \leftarrow \lambda_2\, \frac{-Kd\, \partial \log C(\lambda_1, \lambda_2) / \partial \lambda_2}{\mathbb{E} \|M\|_2^2} \tag{22}$$

is used iteratively.

Formulas (20) imply that each point of maximum of the evidence is a fixed point of transformation (22). No convergence of transformation (22) is guaranteed, but in the experiments several iterations of this transformation allowed training a more accurate model.
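For intuition, in the pure Gaussian special case this fixed-point idea reduces to a closed-form step, familiar from evidence-based tuning of ridge-type models [6]. The function name and the parameterization (prior density proportional to $e^{-\lambda_2 \|M\|_2^2}$, so a stationary point satisfies $\mathbb{E}\|M\|_2^2 = Kd / (2\lambda_2)$) are our assumptions:

```python
def ridge_evidence_update(n_params, expected_sq_norm):
    """Gaussian special case of the evidence fixed point: at an
    evidence maximum E||M||_2^2 = n_params / (2 * lambda_2), so
    iterate lambda_2 <- n_params / (2 * E||M||_2^2)."""
    return n_params / (2.0 * expected_sq_norm)
```

If the posterior concentrates (small expected squared norm), the update increases the regularization strength, and vice versa, which matches the self-balancing behavior of transformation (22).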

For modified elastic net (16), transformation (22) turns into

$$\lambda_1^{(g)} \leftarrow \lambda_1^{(g)}\, \frac{-K d_g\, \partial \log C(\lambda_1^{(g)}, \lambda_2^{(g)}) / \partial \lambda_1^{(g)}}{\mathbb{E} \|M^{(g)}\|_1}, \qquad \lambda_2^{(g)} \leftarrow \lambda_2^{(g)}\, \frac{-K d_g\, \partial \log C(\lambda_1^{(g)}, \lambda_2^{(g)}) / \partial \lambda_2^{(g)}}{\mathbb{E} \|M^{(g)}\|_2^2} \tag{23}$$

for $g = 1, \dots, G$ and

$$\lambda_0 \leftarrow \frac{K}{2\, \mathbb{E} \|b - \hat{b}\|_2^2}. \tag{24}$$

The expectations $\mathbb{E} \|M^{(g)}\|_1$, $\mathbb{E} \|M^{(g)}\|_2^2$, and $\mathbb{E} \|b - \hat{b}\|_2^2$ cannot be computed exactly because the posterior is rather complicated and high-dimensional. They are estimated using the diagonal Laplace approximation [7] of the posterior at the model trained by (16) instead of the posterior itself.
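Under a diagonal Gaussian (Laplace) approximation with per-coordinate means and standard deviations, the needed moments have closed forms; $\mathbb{E}|w|$ is the mean of a folded normal distribution. This sketch and its function names are ours:

```python
import math

def folded_normal_mean(mu, sigma):
    """E|w| for w ~ N(mu, sigma^2): the folded normal mean,
    sigma*sqrt(2/pi)*exp(-mu^2/(2 sigma^2)) + mu*(1 - 2*Phi(-mu/sigma))."""
    phi = 0.5 * (1.0 + math.erf(-mu / (sigma * math.sqrt(2.0))))
    return (sigma * math.sqrt(2.0 / math.pi)
            * math.exp(-mu * mu / (2.0 * sigma * sigma))
            + mu * (1.0 - 2.0 * phi))

def laplace_moments(mus, sigmas):
    """Moments needed by the updates, summed coordinate-wise under
    the diagonal approximation: E||M||_1 and E||M||_2^2."""
    e_l1 = sum(folded_normal_mean(m, s) for m, s in zip(mus, sigmas))
    e_l2sq = sum(m * m + s * s for m, s in zip(mus, sigmas))
    return e_l1, e_l2sq
```

When the posterior mean dominates the spread, $\mathbb{E}|w| \approx |\mu|$; when it is centered at zero, $\mathbb{E}|w| = \sigma \sqrt{2/\pi}$.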

##### 3.3. Stopping Criterion

To stop either training (16) with fixed regularization parameters or iterations of transformations (23) and (24) of $\lambda_0$, $\lambda_1^{(g)}$, and $\lambda_2^{(g)}$, the following validation technique is used. The available dataset is partitioned into a training set and a validation set. The first one is used to train elastic nets (16), while the second one is used to decide whether further training becomes senseless and should be stopped. Namely, training of the elastic net is stopped if the validation likelihood has not increased after several (about 30) last optimization steps, and tuning of the regularization parameters is stopped if the validation likelihood of the trained model has not increased after several (about 5) last iterations.

This criterion is a kind of well-known early stopping method [8]. On the one hand, such early stopping speeds up the training significantly. On the other hand, it is a regularization technique [9] by itself and can hide the effect of tuning the regularization parameters via evidence maximization, which is the subject of the study here. To find a balance, the delays between the validation likelihood ceasing to increase and the actual stopping were chosen empirically.
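The stopping rule can be sketched as a patience check on the history of validation likelihoods (an illustration; the paper's exact bookkeeping may differ, and the function name is ours):

```python
def should_stop(history, patience):
    """Early stopping: stop when the validation likelihood has not
    improved during the last `patience` entries of `history`."""
    if len(history) <= patience:
        return False
    best_before = max(history[:-patience])
    return max(history[-patience:]) <= best_before
```

With `patience=30` this matches the inner-level criterion for optimization steps, and with `patience=5` the outer-level criterion for hyperparameter iterations.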

#### 4. Experiments

The method described in Sections 2 and 3 was applied to recognition of handwritten digits from the MNIST database (see [10]). This database contains grayscale raster images of 28 × 28 pixels each, which belong to one of 10 classes (the digits 0–9). Traditionally it is partitioned into 60,000 samples for training and 10,000 for testing. A part of the training samples was left out for validation, providing the training and validation sets used by the stopping criterion.

Both to make linear logistic regression more powerful and to test the proposed method of estimation of numerous regularization parameters, more features were added to the model. Besides the primary features (the pixel intensities), several groups of secondary features were generated. Then all the features, both secondary and primary, were normalized to zero mean and unit variance.

The following groups of secondary features were used in the experiments.

(1) Horizontal and vertical components of the gradient of the pixel intensity (2 × 784 = 1568 features).

(2) Amplitudes and phases of the discrete Fourier transform [11] of the pixel intensity (1568 features).

(3) Projection histograms [11], that is, the number of nonzero pixels and the positions of the first and the last one within each row and each column of the image (3 × 56 = 168 features).

(4) The corner metric matrix of the image, which for each pixel contains the estimated “likelihood” of being a corner point. The corner metric matrix is calculated by the MATLAB function *cornermetric* [12] (784 features).

(5) The local standard deviation matrix, which for each pixel contains the standard deviation of the intensity over the 9-by-9 neighborhood of the pixel. The local standard deviation is calculated by the MATLAB function *stdfilt* [12] (784 features).

This amounts to 784 primary and 4872 secondary features, 5656 features in total.
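As an example of the secondary features, projection histograms can be computed as follows (a sketch for binarized images; the function name and list-of-rows representation are ours). For a 28 × 28 image this yields 3 × (28 + 28) = 168 features:

```python
def projection_histograms(image):
    """Projection-histogram features of a binary image (list of rows):
    per row and per column, the count of nonzero pixels and the
    positions of the first and last nonzero pixel (-1 if none)."""
    def line_features(line):
        nz = [i for i, v in enumerate(line) if v]
        return [len(nz), nz[0] if nz else -1, nz[-1] if nz else -1]
    rows = [line_features(r) for r in image]
    cols = [line_features(col) for col in zip(*image)]
    return [f for feats in rows + cols for f in feats]
```

Each row and column contributes exactly three numbers, which is where the per-line factor of 3 in the feature count comes from.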

Remember that the proposed learning technique consists of two levels: the inner level is training of elastic net (16) with fixed regularization parameters using Nesterov’s optimization algorithm, and the outer level, inspired by the maximum evidence principle, is the iterative transformations (23) and (24) of $\lambda_1^{(g)}$, $\lambda_2^{(g)}$, and $\lambda_0$. Several different partitions (11) of features into groups were tried.

Each row in Tables 1, 2, and 3 represents a single experiment: elastic net (16) trained with some fixed partition and initial regularization parameters. Each experiment was repeated several times. Estimated intervals of the measured values, shown in the tables, are intervals of two standard deviations around the mean.