Evidence Maximization Technique for Training of Elastic Nets
This paper presents a technique of evidence maximization for automatic tuning of regularization parameters of elastic nets, which allows tuning many parameters simultaneously. This technique was applied to handwritten digit recognition. Experiments showed its ability to train either models with high accuracy of recognition or highly sparse models with reasonable accuracy.
1. Introduction
One of the important aspects of machine learning is choosing an appropriate subset of the (possibly huge) set of all available features, such that the trained model depends only on this subset of features. A good choice (feature selection) can both speed up the training and improve the quality of its result. It depends not only on the particular problem, but also on the data available for training.
Feature selection can either precede the learning itself (e.g., entropy-based or correlation analysis) or be a built-in part of the learning process (e.g., learning with $L_1$-regularization, such as LASSO regression and $L_1$-SVM). This paper deals with the latter case only.
It is known that learning with $L_1$-regularization can produce rather sparse models that depend on few features, whereas learning with $L_2$-regularization usually produces more accurate models. A mixed regularization called the “elastic net” has been proposed to combine the two. Consider a parameterized model that predicts a response from a feature vector, and a cost function that measures the cost of a prediction given the true response. Then training such a model with elastic net regularization on a set of samples by loss minimization (also known as empirical risk minimization, ERM), or, briefly, “training of an elastic net”, is the problem of minimizing the total cost over the training samples plus the $L_1$- and $L_2$-norm terms of the model parameter, weighted by nonnegative regularization parameters λ and μ. It has been shown experimentally that by varying λ and μ one can balance between the sparsity of the model and the accuracy of its prediction.
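The objective described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the flat list representation of the parameter, and the use of the squared $L_2$-norm without any extra constant factor are choices made here, not necessarily those of the original formulation.

```python
def elastic_net_penalty(w, l1_weight, l2_weight):
    """Elastic net penalty for a flat list of weights.

    Assumed form: l1_weight * ||w||_1 + l2_weight * ||w||_2^2.
    """
    l1_norm = sum(abs(v) for v in w)
    l2_norm_sq = sum(v * v for v in w)
    return l1_weight * l1_norm + l2_weight * l2_norm_sq


def regularized_risk(sample_losses, w, l1_weight, l2_weight):
    """Empirical risk (sum of per-sample losses) plus the elastic net penalty."""
    return sum(sample_losses) + elastic_net_penalty(w, l1_weight, l2_weight)
```

With `l2_weight = 0` the penalty reduces to pure $L_1$-regularization, and with `l1_weight = 0` to pure $L_2$-regularization.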
In this paper elastic nets are used to regularize multiclass logistic regression. A method of tuning more general regularization parameters than λ and μ above is described. This method is tested on a handwritten digit recognition problem.
The rest of this paper is organized as follows. Section 2 presents the mathematical model and the elastic net in detail. Section 3 describes the learning algorithm and the evidence maximization technique for tuning regularization parameters of the elastic net; this technique is the main subject of this paper. Section 4 describes experiments with elastic nets for digit recognition. Section 5 presents the results of the experiments. Section 6 summarizes the main results of the experiments and discusses further possible applications of the proposed technique.
2. Mathematical Model
Consider multinomial classification in both its deterministic and probabilistic variants: given a feature vector, either predict the correct label among the classes to which the vector may belong, or estimate the conditional probability of each class label. Probabilistic classification is considered primary; in deterministic classification a most probable class label will be predicted.
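The feature vector is augmented with a constant component that accounts for the bias. To estimate the class probabilities, multinomial linear logistic regression model (2) will be trained. The model parameter is a matrix with one row per class, each row having the dimension of the augmented feature vector. To train the model means to choose some “good” value of this parameter.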
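A minimal Python sketch of such a model is given below. The names `predict_proba` and `predict_label`, the list-of-rows representation of the parameter matrix, and the convention that the bias is stored in the first column are assumptions made for illustration only.

```python
import math


def predict_proba(W, x):
    """Class probabilities of a multinomial linear logistic regression model.

    W: list of rows, one per class; row[0] is assumed to be the bias.
    x: list of feature values (not yet augmented).
    """
    z = [1.0] + list(x)  # augmented feature vector
    scores = [sum(wj * zj for wj, zj in zip(row, z)) for row in W]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(W, x):
    """Deterministic prediction: index of a most probable class."""
    p = predict_proba(W, x)
    return max(range(len(p)), key=p.__getitem__)
```

With an all-zero parameter matrix the predicted distribution is uniform over the classes.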
To do this we use a training dataset of (feature vector, label) pairs, which are assumed to be i.i.d. random. The dataset can also be written in a transposed way, as the collection of all feature vectors together with the collection of all labels. Training tries to maximize the posterior probability of the parameter given some prior and the training set. By Bayes' rule (3), the posterior is the product of the prior and the likelihood divided by a denominator that does not depend on the parameter, so maximization of the posterior probability is equivalent to maximization of the numerator or of its logarithm (4). The second summand in (4) is the log likelihood of the model, while the first one depends on the choice of the prior.
Consider the parameter matrix without its bias column. The prior is usually taken independent of the bias, so it is a function of this reduced matrix only. In the simplest cases, when spherical Gaussian or Laplacian distributions are taken as priors, training (4) turns into an optimization problem with $L_2$- or $L_1$-regularization, respectively.
Similarly, elastic nets are obtained from prior (5), whose normalization factor depends on the dimension of the space of the reduced parameter matrix and involves the cumulative distribution function of the standard one-dimensional Gaussian distribution. To simplify calculations, a simpler function is used instead of this one; for instance, the normalization factor then takes form (9).
Both prior (5) and the regularization summands in (10) are isotropic with respect to all features. However, the features themselves may differ in nature. To respect such differences we partition all features into groups of features of the same nature. For example, all pixel values of the image have the same nature and will belong to the same group of features, while computed features or the aspect ratio fall into other groups.
Let us fix a partition (11) of the set of feature indices into subsets and define separate regularization parameters λ and μ for each group. Then training of the generic elastic net (10) turns into problem (12), and training of the elastic net for linear logistic regression (2) turns into problem (13).
It is easy to see that optimization problem (13) is convex for any training set and any nonnegative λ and μ. The choice of the values of the regularization parameters λ and μ, which is the subject of this paper, is discussed in Section 3.2.
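The group-wise regularization can be illustrated by the following Python sketch. The function name, the `groups` list mapping each feature index to its group, and the squared form of the $L_2$ term are assumptions of this example rather than details taken from the formulas above.

```python
def grouped_elastic_net_penalty(W, groups, l1_weights, l2_weights):
    """Elastic net penalty with separate weights for each feature group.

    W: list of rows (one per class) WITHOUT the bias column.
    groups[j]: index of the group that feature j belongs to.
    l1_weights[g], l2_weights[g]: nonnegative weights of group g.
    """
    total = 0.0
    for row in W:
        for j, value in enumerate(row):
            g = groups[j]
            total += l1_weights[g] * abs(value) + l2_weights[g] * value * value
    return total
```

When all features share one group, this reduces to the ordinary elastic net penalty with two scalar parameters.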
3. Learning Technique
3.1. Nonsmooth Convex Optimization
Standard gradient methods are not applicable to minimization problems (10) and (13) because they contain nonsmooth $L_1$-norm terms. So the algorithm proposed by Nesterov for minimization of sums of smooth and simple nonsmooth convex functions is used. Nesterov’s algorithm provides the best convergence rate at a moderate number of steps (less than the number of variables in (10) and (13)) among all known methods of nonsmooth optimization.
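To illustrate how a “simple nonsmooth” term is handled, the sketch below shows one plain proximal gradient step for an elastic net penalty. It is not Nesterov’s accelerated algorithm itself, only the basic building block that such composite methods rely on. The function names and the assumed penalty form `l1_weight * |w| + l2_weight * w**2` per coordinate are illustrative.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exact zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def prox_elastic_net(value, step, l1_weight, l2_weight):
    """Proximal operator of step * (l1_weight*|u| + l2_weight*u**2)."""
    return soft_threshold(value, step * l1_weight) / (1.0 + 2.0 * step * l2_weight)


def proximal_gradient_step(w, smooth_grad, step, l1_weight, l2_weight):
    """Gradient step on the smooth part followed by the proximal map."""
    return [
        prox_elastic_net(wi - step * gi, step, l1_weight, l2_weight)
        for wi, gi in zip(w, smooth_grad)
    ]
```

The soft-thresholding inside the proximal map is what sets small weights exactly to zero, which is the source of sparsity in the trained model.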
Nesterov’s algorithm can exploit strong convexity of the target function: the larger the strong convexity parameter that can be guaranteed in advance, the faster it converges. The target function in (13) is not strongly convex in the bias column, but it would be strongly convex if $L_2$-regularization were applied to all parameters including the bias.
Consider the following modification of problem (13).
(1) Estimate the bias column from the numbers of training samples of each class. The estimate is the solution of a minimization problem that is nothing but maximum likelihood training of the featureless logistic regression model.
(2) Choose a value of the additional parameter and, instead of (13), solve modified problem (16).
The target function in (16) is strongly convex with nonnegative parameter .
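The bias estimate of step (1) can be sketched as follows. For a featureless softmax model the maximum likelihood solution reproduces the empirical class frequencies, so the logarithm of each frequency is one valid choice of bias (it is determined only up to a common additive constant). The function name is hypothetical, and the sketch assumes that every class occurs at least once in the training labels.

```python
import math
from collections import Counter


def estimate_bias(labels, num_classes):
    """Maximum likelihood bias of a featureless softmax model.

    Returns log(N_k / N) for each class k, where N_k is the number of
    training samples of class k. Assumes N_k > 0 for every class.
    """
    counts = Counter(labels)
    total = len(labels)
    return [math.log(counts[k] / total) for k in range(num_classes)]
```

Applying the softmax to this bias vector returns exactly the class frequencies of the training set.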
3.2. Evidence Maximization
To train elastic nets (10), (13), or (16) successfully, some reasonable values of the regularization parameters λ and μ (hyperparameters) are required. In machine learning problems with one or at most two hyperparameters (e.g., in SVM) their values can be found by grid search. However, generalized elastic net (16) has a pair of hyperparameters for every group of features, and we are interested in the case of many groups. In this case, a reasonable way to optimize them is evidence maximization. The use of evidence maximization for estimation of hyperparameters of ridge regression and other Gaussian-based models is well known. For non-Gaussian elastic nets the evidence of hyperparameters can be neither computed nor maximized exactly and will be approximated rather roughly.
Let the prior depend on two hyperparameters λ and μ, as in (5). Then posterior (3), with λ and μ indicated explicitly, takes form (17). The denominator is ignored in maximization of posterior (3) because it does not depend on the model parameter. However, it depends on λ and μ. This denominator is called the evidence of parameters λ and μ with respect to the training set. Despite its special name, it is a usual likelihood: not the likelihood of a single model as in (4), but the likelihood of the whole probability space of models defined by hyperparameters λ and μ.
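For orientation, the relation between the posterior and the evidence can be written in generic notation. The symbols below ($W$ for the model parameter, $D$ for the training set) are assumptions of this sketch and may differ from the notation of formulas (3) and (17).

```latex
P(W \mid D, \lambda, \mu)
  = \frac{P(D \mid W)\, P(W \mid \lambda, \mu)}{P(D \mid \lambda, \mu)},
\qquad
P(D \mid \lambda, \mu)
  = \int P(D \mid W)\, P(W \mid \lambda, \mu)\, dW .
```

The integral over all models is what makes the evidence hard to compute for non-Gaussian priors.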
For prior (5) the evidence of the pair (λ, μ) is given by (18), and evidence maximization is equivalent to minimization problem (19).
The normalization factor is rewritten using formula (9) here.
The gradient of (19) is given by formulas (20), which involve expectations with respect to the posterior distribution of the model parameter.
To minimize (19), instead of traditional gradient steps, transformation (22) is applied iteratively.
Formulas (20) imply that each point of maximum of the evidence is a fixed point of transformation (22). Convergence of the iterations of transformation (22) is not guaranteed, but in the experiments several iterations of this transformation produced a more accurate model.
The expectations in (20) cannot be computed exactly because the posterior is rather complicated and high-dimensional. They are estimated using a diagonal Laplace approximation of the posterior at trained model (16) instead of the posterior itself.
3.3. Stopping Criterion
To stop either training (16) with fixed regularization parameters or iterations of transformations (23) and (24) of the regularization parameters, the following validation technique is used. The available dataset is partitioned into a training set and a validation set. The first one is used to train elastic nets (16), while the second one is used to decide whether further training is pointless and should be stopped. Namely, training of the elastic net is stopped if the validation likelihood has not increased during the last several (about 30) optimization steps, and tuning of the regularization parameters is stopped if the validation likelihood of the trained model has not increased during the last several (about 5) iterations.
This criterion is a kind of the well-known early stopping method. On the one hand, such early stopping speeds up the training significantly. On the other hand, it is a regularization technique by itself and can hide the effect of tuning the regularization parameters via evidence maximization, which is the subject of the study here. To find a balance, the delays between the moment when the validation likelihood stops increasing and the actual stopping were chosen empirically.
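The stopping rule can be sketched as a small helper that inspects the history of validation log likelihoods. The function name and the exact comparison (no value among the last `patience` entries exceeds the best earlier value) are assumptions; the text only states that training stops when the likelihood has not increased for several steps.

```python
def should_stop(validation_log_likelihoods, patience):
    """Return True if the validation log likelihood has not improved
    during the last `patience` recorded steps.

    validation_log_likelihoods: values recorded after each step, oldest first.
    """
    if len(validation_log_likelihoods) <= patience:
        return False  # not enough history yet
    best_before = max(validation_log_likelihoods[:-patience])
    best_recent = max(validation_log_likelihoods[-patience:])
    return best_recent <= best_before
```

In the setting described above, `patience` would be about 30 for the inner optimization steps and about 5 for the outer iterations over the regularization parameters.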
4. Experiments
The method described in Sections 2 and 3 was applied to recognition of handwritten digits from the MNIST database. This database contains 70,000 grayscale raster images of 28 × 28 pixels each, which belong to one of 10 classes. Traditionally it is partitioned into 60,000 samples for training and 10,000 for testing. A part of the training samples was left out for validation.
Both to make linear logistic regression more powerful and to test the proposed method of estimating numerous regularization parameters, more features were added to the model. Besides the primary features (the pixel intensities), several groups of secondary features were generated. Then all the features, both secondary and primary, were normalized to zero mean and unit variance.
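The normalization step can be sketched as follows. The function name is hypothetical, the population variance (division by the number of samples) is used, and constant features are mapped to zero; the last two choices are assumptions, since the text does not specify them.

```python
import math


def standardize_columns(rows):
    """Normalize each feature (column) to zero mean and unit variance.

    rows: list of samples, each a list of feature values.
    Returns (scaled_rows, means, stds). Constant columns become all zeros.
    """
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(d)
    ]
    scaled = [
        [(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(d)]
        for r in rows
    ]
    return scaled, means, stds
```

The returned means and standard deviations, computed on the training set, would be reused to transform validation and test samples.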
The following groups of secondary features were used in experiments.
(1) Horizontal and vertical components of the gradient of the pixel intensity.
(2) Amplitudes and phases of the discrete Fourier transform of the pixel intensity.
(3) Projection histograms, that is, the number of nonzero pixels and positions of the first and the last one within each row and each column of the image.
(4) The corner metric matrix of the image, which for each pixel of the image contains the estimated “likelihood” of being a corner point. The corner metric matrix is calculated by MATLAB function cornermetric.
(5) The local standard deviation matrix, which for each pixel of the image contains the standard deviation of the intensity over the 9-by-9 neighborhood of the pixel. The local standard deviation is calculated by MATLAB function stdfilt.
Together with the 784 primary features (pixel intensities), these groups form the full feature set.
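As an illustration of the projection histograms of item (3), the following Python sketch computes the three values for every row and every column of an image. The function name and the encoding of empty lines (count 0, positions -1) are assumptions; the text does not say how lines without nonzero pixels are treated.

```python
def projection_features(image):
    """Projection histogram features of a grayscale image.

    image: list of rows, each a list of pixel intensities.
    For every row and then every column, emits three values:
    number of nonzero pixels, index of the first one, index of the last one.
    Lines without nonzero pixels are encoded as [0, -1, -1] (an assumption).
    """
    def line_features(line):
        nonzero = [i for i, v in enumerate(line) if v != 0]
        if not nonzero:
            return [0, -1, -1]
        return [len(nonzero), nonzero[0], nonzero[-1]]

    features = []
    for row in image:
        features.extend(line_features(row))
    for column in zip(*image):
        features.extend(line_features(column))
    return features
```

Under this encoding an image with `h` rows and `w` columns yields `3 * (h + w)` values.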
Recall that the proposed learning technique consists of two levels: the inner level is training of elastic net (16) with fixed regularization parameters using Nesterov’s optimization algorithm, and the outer level, inspired by the maximum evidence principle, is iterative transformations (23) and (24) of λ and μ. Several different partitions (11) of features into groups were tried.
Each row in Tables 1, 2, and 3 represents a single experiment, namely, elastic net (16) trained with some λ and μ. Each experiment was repeated multiple times. The intervals of the measured values shown in the tables are intervals of two standard deviations around the mean. The following quantities are reported.
Sparseness. It is the share of features unused in the model.
Mean Log Likelihood. It is the mean over the 10,000-element test set of minus logarithm of the predicted probability of the true class label of the sample.
Error. It is the misclassification rate measured on the same 10,000-element test set, provided the most probable class is predicted.
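The three reported quantities can be computed as in the following sketch. The function names are hypothetical, and a feature is assumed to be “unused” when its weights are zero for every class.

```python
import math


def sparseness(W, tol=0.0):
    """Share of features whose weights are zero in every class row.

    W: list of rows (one per class) WITHOUT the bias column.
    """
    num_features = len(W[0])
    unused = sum(
        1 for j in range(num_features) if all(abs(row[j]) <= tol for row in W)
    )
    return unused / num_features


def mean_neg_log_likelihood(probas, labels):
    """Mean of minus log predicted probability of the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probas, labels)) / len(labels)


def error_rate(probas, labels):
    """Misclassification rate when a most probable class is predicted."""
    wrong = sum(
        1
        for p, y in zip(probas, labels)
        if max(range(len(p)), key=p.__getitem__) != y
    )
    return wrong / len(labels)
```

Here `probas` is a list of predicted class distributions, one per test sample, and `labels` holds the true class indices.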
Sparseness of the trained model appears due to $L_1$-regularization in the elastic net and increases with the corresponding regularization parameter.
4.1. Constant Regularization Parameters
First, several control experiments with fixed scalar values of regularization parameters λ and μ were performed. Their results are shown in Table 1.
The minimal average test error, 1.81%, was achieved for one of these parameter pairs.
4.2. Tuning Regularization Parameters by Evidence Maximization
Next, experiments with automatic tuning of regularization parameters λ and μ were performed. Since all features had been normalized, the learning was started from the same initial values of λ and μ for all groups. The results are shown in Table 2. Each row represents the elastic net obtained by the described two-level learning process for a certain partition (11) of features.
Several different partition schemes were tested.
Trivial partition: all features belong to the same group.
Rough partition: primary features, horizontal and vertical components of the gradients, amplitudes and phases of the Fourier transform, and three other types of secondary features each form a separate group.
Partitions into squares: the whole image (28 × 28 pixels) is split into equal squares and, roughly speaking, the groups are formed by features of a certain type calculated for certain squares. The exceptions are projection histograms, calculated not for squares but for rows or columns of squares, and amplitudes and phases of the Fourier transform, both partitioned into equal squares in the frequency space. The total number of groups of such a partition is therefore determined by the number of squares.
Fine partition: each feature forms a separate group.
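The assignment of pixels to square groups can be sketched as follows. The function name and the row-major numbering of squares are assumptions, and the sketch requires the image size to be divisible by the number of squares per side.

```python
def square_group_index(row, col, image_size, squares_per_side):
    """Index of the square that pixel (row, col) falls into.

    The image of image_size x image_size pixels is split into
    squares_per_side x squares_per_side equal squares, numbered row by row.
    Assumes image_size is divisible by squares_per_side.
    """
    side = image_size // squares_per_side
    return (row // side) * squares_per_side + (col // side)
```

For example, a 28 × 28 image split into 4 × 4 squares gives 16 groups of 49 pixels each for every per-pixel feature type.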
These experiments show that the evidence maximization technique allows one to obtain more accurate elastic nets than elastic nets with guessed scalar regularization parameters. Indeed, compare the last column of Table 1 with the corresponding rows of Table 2. These rows represent elastic nets trained with multidimensional regularization parameters whose values can hardly be guessed.
4.3. Sparse Elastic Net
Last, we performed a series of experiments trying to train very sparse but reasonably accurate models. Sparseness of the model trained with an elastic net depends mostly on the weight(s) of its $L_1$-regularization. In the described technique these parameters are tuned in order to get elastic nets with higher evidence. However, experiments show that iterations of transformations (23) and (24) with the stopping criterion of Section 3.3 tend to stop before they reach any (local!) maximum of the evidence, and where they stop depends on the initial parameters λ and μ.
Experiments of Section 4.2 (Table 2) started from the same moderate initial values for all groups. Then sparseness was low but the trained models made more accurate predictions. If the $L_1$-regularization parameter is large enough, optimization problem (16) has a unique solution, the sparsest one, which is not accurate. Starting iterations from a large initial value of this parameter allows one to get a sparse elastic net with reasonable accuracy.
Table 3 shows the results of training elastic nets with such starting parameters. These results are discussed in the following section.
5. Results and Discussion
5.1. Accuracy of the Trained Model
The best model trained with the evidence maximization technique shown in Table 2 has 1.69% average test error, which is significantly less than the 1.81% obtained by guessing scalar regularization parameters (Table 1). In our experiments each learning run with evidence maximization took only a few reestimations of the regularization parameters. So the numbers of elastic nets trained to fill in Tables 1 and 2 are comparable (moreover, not all guesses are shown in Table 1).
The evidence maximization technique allows one to guess only an appropriate partitioning of the features instead of particularly good values of the regularization parameters. Still, this technique is not fully automated. Neither of the two obvious extreme partitions (the roughest and the finest ones) leads to the best model. The 1.83% in the first row of Table 2, compared to the 1.81% achieved in Table 1, shows that evidence maximization does not necessarily lead to the best accuracy. But it can be used when regularization parameters are multidimensional and naive attempts to guess good values for them are infeasible.
The obtained accuracy is much lower than the best state-of-the-art results obtained by convolutional neural networks, deep learning, and augmentation of the training dataset. But the elastic net with precisely tuned regularization parameters can achieve higher accuracy than other traditional models of the same complexity (e.g., 1- or 2-layer neural networks or SVM with Gaussian kernel).
5.2. Sparseness of the Trained Model
In some practical classification problems high sparseness of the model takes priority over its high accuracy. The proposed method allows one to train models with various tradeoffs between sparsity and accuracy.
The last elastic net shown in Table 3 provides test error 2.28% and sparseness 88.62%, so only 11.38% of features are used. Compared to the most accurate elastic net from Table 2, the error increased by 0.59 percentage points, while the number of used features decreased more than sevenfold. This result was achieved by tuning individual regularization parameters for each feature, starting from the biggest reasonable initial value.
6. Conclusion
This paper describes a method of machine learning based on a technique of adjusting the regularization parameters of elastic nets inspired by the evidence maximization principle. The method is able to cope with multidimensional regularization parameters using only rough, simple ideas about their initial values and about the nature of the features used in the models to be learned.
This method was tested on the MNIST database of handwritten digits and produced a more accurate elastic net than could be trained with traditional grid search over one or two scalar regularization parameters. It also produced very sparse models with reasonable accuracy.
Still, the primary goal of the proposed method of learning lies beyond the scope of this paper. It is to develop a mechanism of feature selection based on training of elastic nets with a controlled tradeoff between their sparseness and accuracy. In the future the proposed method is going to be applied to other machine learning problems, including problems with a very large number of features.
The authors declare that they have no competing interests.
This work was partially supported by the Russian Foundation for Basic Research Grants no. 15-29-06081 “ofi-m” and no. 16-07-00616 “A.”
I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” The Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
Y. Nesterov, Gradient Methods for Minimizing Composite Objective Function, 2007.
P. Richtarik and M. Schmidt, “Modern convex optimization methods for large-scale empirical risk minimization,” in Proceedings of the International Conference on Machine Learning, July 2015.
A. I. Prilepko and D. Ph. Kalinichenko, Asymptotic Methods and Special Functions, MIPI, 1980.
D. F. Morgado, A. Antunes, and A. M. Mota, “Regularization versus early stopping: a case study with a real system,” in Proceedings of the 2nd IFAC Conference Control Systems Design, Bratislava, Slovakia, 2003.
The MathWorks Inc, MATLAB Image Processing Toolbox documentation, http://www.mathworks.com/help/images/.