Abstract

Partial least squares (PLS) regression is an alternative to ordinary least squares (OLS) regression used in the presence of multicollinearity. As with any other modelling method, PLS regression requires a reliable model selection tool. Cross validation (CV) is the most commonly used tool, with advantages in both precision and accuracy, but it also has drawbacks; we therefore propose the L-curve criterion as an alternative, since it takes into account the shrinking nature of PLS. A theoretical justification for the use of the L-curve criterion is presented, as well as an application on both simulated and real data. The application shows how this criterion generally outperforms cross validation and generalized cross validation (GCV) in mean squared prediction error and computational efficiency.

1. Introduction

Partial least squares regression is an alternative to ordinary least squares regression in the presence of multicollinearity; it was first developed by Wold [1] as an algorithm that calculates the principal components of a dataset with missing values. PLS regression is used in many fields such as chemometrics, where collinearity is a common problem; this problem also arises in fields such as genomics and computational biology. When using PLS regression, one is confronted with the problem of the stopping rule (the number of components to retain in the final model); cross validation is used in the literature as the standard way of choosing the final PLS model.

Despite its efficiency, cross validation can fail to detect the optimal model; in this paper, we propose the L-curve criterion as an alternative model selection tool. The first use of L-curve plots in regularization methods goes back to Lawson and Miller [2, 3], and Hansen was the first to use these plots to compute regularization parameters [4]. A cubic spline-based algorithm was later developed for discrete L-curves [5], as well as the adaptive pruning algorithm [6] used in this paper. We compare cross validation with an L-curve algorithm commonly used in regularization methods such as the conjugate gradient algorithm and Tikhonov regularization, exploiting the equivalence between PLS regression and an implementation of the conjugate gradient algorithm called CGLS (conjugate gradient least squares). Previous research has tackled similar comparisons; for example, Gao et al. [7] compared generalized cross validation and the L-curve criterion to assess the performance of a new method for determining the optimal regularization parameter. Our comparison also includes generalized cross validation, even though it has not been commonly used in PLS regression.

The first section gives a short presentation of PLS regression, the second discusses cross validation and its possible drawbacks, the third presents the generalized cross validation method, and the fourth presents the L-curve algorithm and shows that it is suitable for PLS regression; an application on both real and simulated data concludes the paper. Our comparison shows that the L-curve algorithm outperforms both cross validation and generalized cross validation on real data and gives promising results on the simulated datasets.

2. PLS Regression and Model Selection

Consider the minimization problem
$$\min_{\beta}\ \|y - X\beta\|^{2},$$
with $X$ the $n \times m$ matrix of the values of the predictive variables, $y$ the dependent variable, and $\beta$ the coefficient vector of the regression.

Instead of the original predictors, PLS regression uses a set of latent variables called scores, $t_k = X_{k-1} w_k$ (with $X_{k-1}$ being the deflated matrix $X$ at the $k$th iteration of PLS regression). The latent variables, or PLS components, are calculated iteratively, based on the following decomposition:
$$X = \sum_{k} t_k p_k^{\top} + E,$$
with $E$ being the error, the $p_k$ a set of vectors called loadings, and $w_k$ being the weight vector of length $m$.
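For illustration, the following Python sketch computes the scores, weights, and loadings of a univariate (PLS1) fit. It assumes a centred predictor matrix X and a centred response y; the NIPALS-style deflation and the function name are illustrative choices, not a prescription of this paper.

import numpy as np

def pls1_components(X, y, n_components):
    """Minimal PLS1 sketch: X (n x m) and y (n,) are assumed centred."""
    Xk = X.copy()                          # deflated matrix X_k at each iteration
    yk = y.copy()
    W, T, P = [], [], []                   # weights, scores, loadings
    for _ in range(n_components):
        w = Xk.T @ yk                      # weight vector (direction of maximal covariance)
        w /= np.linalg.norm(w)
        t = Xk @ w                         # score (latent variable)
        p = Xk.T @ t / (t @ t)             # loading
        Xk = Xk - np.outer(t, p)           # deflate X
        yk = yk - t * (t @ yk) / (t @ t)   # deflate y
        W.append(w); T.append(t); P.append(p)
    return np.array(W).T, np.array(T).T, np.array(P).T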

When all PLS components are retained, the PLS estimator becomes the ordinary least squares estimator; that is why it is necessary to know the optimal number of components to include in the final model. The main goal of PLS regression is to predict future observations, which is where cross validation comes in handy, because it tests the predictive ability of the model rather than its fit alone.

3. Cross Validation

The main idea behind cross validation is that the value of a model lies in its ability to predict future values. In K-fold cross validation, the observations are randomly split into $K$ equal subsets $S_1, \ldots, S_K$; for $k = 1, \ldots, K$, we fit the model on the observations outside $S_k$ and use the estimated parameters to predict the values of $S_k$, which provides an estimate of the prediction error.

In PLS regression, K is often taken equal to n (the number of observations), and this form of cross validation is called leave-one-out. The steps of cross validation in PLS, according to Tenenhaus [8], are as follows:

Step 1. At the $k$th PLS iteration, the estimation error and the prediction error are calculated as
$$\mathrm{RSS}_k = \sum_{i=1}^{n}\left(y_i - \hat{y}_{k,i}\right)^{2}, \qquad \mathrm{PRESS}_k = \sum_{i=1}^{n}\left(y_i - \hat{y}_{k,(-i)}\right)^{2},$$
where $\hat{y}_{k,i}$ is the estimation of $y_i$ by the model with $k$ components that includes all observations, and $\hat{y}_{k,(-i)}$ by the model that excludes the $i$th observation.

Step 2. Calculate
$$Q_k^2 = 1 - \frac{\mathrm{PRESS}_k}{\mathrm{RSS}_{k-1}}.$$
A PLS component is retained if $Q_k^2 \geq 0.0975$, that is, if $\sqrt{\mathrm{PRESS}_k} \leq 0.95\,\sqrt{\mathrm{RSS}_{k-1}}$.
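The two steps above can be sketched as follows in Python; scikit-learn's PLSRegression is used here only as a convenient stand-in for any PLS implementation, and the threshold 0.0975 corresponds to the usual form of Tenenhaus's rule.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def q2_leave_one_out(X, y, max_components, threshold=0.0975):
    """Return the number of PLS components retained by the Q2 rule (X, y: NumPy arrays)."""
    n = len(y)
    rss_prev = np.sum((y - y.mean()) ** 2)   # RSS_0: model with no component
    retained = 0
    for k in range(1, max_components + 1):
        # PRESS_k: squared prediction errors, each observation left out in turn
        press = 0.0
        for i in range(n):
            mask = np.arange(n) != i
            pls = PLSRegression(n_components=k, scale=False).fit(X[mask], y[mask])
            press += (y[i] - pls.predict(X[i:i + 1]).ravel()[0]) ** 2
        q2 = 1.0 - press / rss_prev
        if q2 < threshold:                   # component k rejected: stop
            break
        retained = k
        pls_full = PLSRegression(n_components=k, scale=False).fit(X, y)
        rss_prev = np.sum((y - pls_full.predict(X).ravel()) ** 2)   # RSS_k
    return retained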
Cross validation is computationally expensive, especially when working with big data, and it performs poorly when the number of predictors is greater than the number of observations; there is also a risk of overfitting when the number of observations is large. This is what motivated the search for a new model selection tool as an alternative to cross validation, one that takes into consideration the regularizing nature of PLS regression.

4. Generalized Cross Validation

Generalized cross validation (GCV) [9] serves the purpose of choosing the optimal parameter of a regularization method. It is based on predictive error estimates and is mainly used in numerical applications of regularization methods, such as Tikhonov regularization and conjugate gradient regularization. The method is described as successful by Hansen [10], as it is based on statistical considerations. The parameter chosen by GCV minimizes the function [9]
$$\mathrm{GCV}(\lambda) = \frac{\left\|\left(I_n - A_{\lambda}\right)y\right\|^{2}}{\left[\operatorname{trace}\left(I_n - A_{\lambda}\right)\right]^{2}},$$
where $A_{\lambda}$ is the influence matrix of the regularized fit, $\hat{y}_{\lambda} = A_{\lambda}\,y$.

This formula was first conceived for ridge regression, where $A_{\lambda} = X\left(X^{\top}X + \lambda I\right)^{-1}X^{\top}$.

The differences between the existing versions of the GCV come from how the trace of the matrix in the denominator is estimated. To our knowledge, GCV has not yet been used with PLS regression, but it has been used with the regularized conjugate gradient method [11].

In the case of PLS regression, the above GCV formula becomes [11]
$$\mathrm{GCV}(k) = \frac{\left\|y - X\hat{\beta}_{k}\right\|^{2}}{\left[\operatorname{trace}\left(I_n - A_{k}\right)\right]^{2}},$$
with $A_k$ the projection associated with the Krylov subspace of dimension $k$:
$$A_k = X W_k \left(W_k^{\top} X^{\top} X W_k\right)^{-1} W_k^{\top} X^{\top},$$
where $W_k$ is the matrix formed by the elements of a basis of the Krylov subspace $\mathcal{K}_k\!\left(X^{\top}X,\ X^{\top}y\right)$ of dimension $k$.

Various estimates have been proposed for the trace of the matrix $I_n - A_k$; we use the one proposed in [11], which is similar to that used in [6]. This results in the GCV function that serves as our stopping criterion.

Using the expression above, our stopping rule is the parameter $k$ that minimizes $\mathrm{GCV}(k)$ throughout the iterations of the PLS algorithm. The work of Hansen et al. [6] demonstrated the robustness of the L-curve criterion compared with the GCV in the case of truncated singular value decomposition and regularizing conjugate gradient iterations.
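As a sketch of this stopping rule, the code below evaluates a GCV function of the form $\|y - X\hat{\beta}_k\|^2/(n-k)^2$, i.e., it assumes the common approximation $\operatorname{trace}(I_n - A_k)\approx n - k$, which may differ from the exact estimate of [11]; scikit-learn's PLSRegression again serves as the fitting routine, and the function name is illustrative.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def gcv_choice(X, y, max_components):
    """Pick k minimizing GCV(k) = ||y - X beta_k||^2 / (n - k)^2 (assumed trace approximation)."""
    n = len(y)
    scores = []
    for k in range(1, max_components + 1):
        pls = PLSRegression(n_components=k, scale=False).fit(X, y)
        residual = y - pls.predict(X).ravel()
        scores.append(residual @ residual / (n - k) ** 2)
    return int(np.argmin(scores)) + 1      # components are counted from 1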

It is also worth noting that GCV is not as popular as the classical cross validation procedure described by Tenenhaus [8], at least in the field of chemometrics and its applications. The main drawback of the GCV method is that it is computationally expensive, as it requires fitting all candidate models, which becomes quite a burden when dealing with massive datasets that sometimes contain thousands of predictive variables.

5. The L-Curve Algorithm

To understand the parameter choice problem, consider solving $Ax = b$, where $A$ is a matrix and $b$ a vector. When $A$ is ill conditioned, the solution becomes unstable because computing an inverse of $A$ is difficult; this is where regularization intervenes, offering an alternative set of solutions $x_k$ through an iterative process, with $k$ being the regularization parameter, which differs according to the chosen method.

The L-curve criterion is a graphical parameter choice method used in regularization algorithms such as Tikhonov regularization and truncated singular value decomposition (principal component regression); these regularization methods have the main goal of stabilizing the solution by including additional information.

There are two types of L-curve algorithms: the first is for continuous regularization methods and the second for discrete ones; the L-curve is discrete when the parameter $k$ varies over a discrete set. In PLS regression, $k$ represents the dimension of the subspace chosen for the optimization; it also represents the number of latent factors retained in the model. The goal of the criterion is to find the right balance between the two quantities plotted in log-log scale, the norm of the residual error and the norm of the solution; this balance is reached at the corner of the L-curve, which is the point of maximum curvature. Figure 1 presents an example of the L-curve on a test problem.
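The two quantities that make up the discrete L-curve of a PLS fit can be collected as follows (an illustrative Python sketch; the PLS implementation and the function name are not prescribed by the paper):

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def l_curve_points(X, y, max_components):
    """Return (log residual norm, log solution norm) for k = 1..max_components."""
    pts = []
    for k in range(1, max_components + 1):
        pls = PLSRegression(n_components=k, scale=False).fit(X, y)
        beta_k = np.ravel(pls.coef_)                   # PLS coefficient vector
        residual = y - pls.predict(X).ravel()
        pts.append((np.log(np.linalg.norm(residual)),
                    np.log(np.linalg.norm(beta_k))))
    return np.array(pts)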

The corner can be computed analytically, by fitting a function to the discrete set of points (Hansen and O'Leary [5] used a cubic spline), or geometrically, for instance with the method of triangles [12]. We have chosen the adaptive pruning algorithm (Algorithm 1) [6], an algorithm developed to find the corner of a discrete L-curve.

(1) Initialize the list of corner candidates: L = ∅.
(2) Stage one: while the pruned L-curve can still be refined
(3)  Double the number of line segments to keep.
(4)  Create a pruned L-curve consisting of the largest line segments.
(5)  For each corner location routine
(6)   Locate the corner using the pruned L-curve.
(7)   Add the corner to the list: L = L ∪ {corner}.
(8) end
(9) Stage two: if L contains a single candidate, then take it as the corner; return.
(10) Otherwise, for each candidate point in L
(11)  Compute the slope associated with the point in L.
(12)  If the slope criterion is satisfied, then take the corresponding point as the corner; return.
(13) Otherwise let the corner be the last candidate found.

6. The Adaptive Pruning Algorithm [6]

Consider a discrete L-curve given by a set of points $P_i = \left(\log\|y - X\hat{\beta}_i\|,\ \log\|\hat{\beta}_i\|\right)$; specifically, let there be three points $P_j$, $P_k$, and $P_l$ with $j < k < l$. We define the oriented angle $\theta(j,k,l)$ at $P_k$ between the segments $P_k P_j$ and $P_k P_l$. The principle is to calculate all possible angles and associate the corner with the point whose oriented angle is closest to $-\pi/2$.

To avoid selecting a local corner, the algorithm defines a sequence of pruned L-curves; from each, two candidate corners are chosen. This produces a short list of candidate points. The next step is to choose the best point that meets the angle criterion, which is taken as the final corner of the L-curve; for more details, we refer to Hansen et al. [6].
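The angle criterion alone, without the pruning stage, can be sketched as follows; this simplified Python routine is only an illustration of the idea and should not be mistaken for the adaptive pruning algorithm of [6], whose pruning and slope checks are omitted here.

import numpy as np

def simple_corner(points):
    """Simplified corner pick on a discrete L-curve (not the pruning algorithm of [6]).

    points: array of shape (K, 2) with (log residual norm, log solution norm),
    ordered by increasing k; at least three points are assumed.
    """
    best_k, best_gap = 1, np.inf
    for k in range(1, len(points) - 1):
        a = points[k] - points[k - 1]                          # incoming segment direction
        b = points[k + 1] - points[k]                          # outgoing segment direction
        turn = np.arctan2(a[0] * b[1] - a[1] * b[0], a @ b)    # signed turn angle at P_k
        gap = abs(turn + np.pi / 2)                            # distance to a right-angle turn
        if gap < best_gap:
            best_k, best_gap = k, gap
    return best_k + 1                                          # corner as a number of components

The points produced by the l_curve_points sketch given earlier could feed this routine directly.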

This algorithm is implemented in the Regularization Tools package for MATLAB developed by Hansen [13]. The advantage of the adaptive pruning algorithm is that it takes into account the possibility of finding a local corner, an occurrence that is common when the data are of high dimension.

7. The L-Curve Criterion and PLS Regression

The L-curve criterion uses the logic behind every regularization method, which is to find a compromise between the size of the solution $\|\hat{\beta}_k\|$, which increases with $k$, and the residual error $\|y - X\hat{\beta}_k\|$, which decreases with $k$, by plotting both quantities and finding the point of maximum curvature.

The necessary assumptions for using the L-curve criterion are that the norm of the residual error and the norm of the regularized solution are monotonic functions of the parameter $k$ [10]; we show next that PLS regression meets these assumptions.

7.1. The Norm of the Solution

In order to prove that the norm of the solution is a monotonic function of the parameter k, we will use the equivalence between PLS regression and the conjugate gradient algorithm applied to the normal equation.

Proposition 1. The conjugate gradient algorithm applied to the normal equations $X^{\top}X\beta = X^{\top}y$ with starting vector $\beta_0 = 0$ is equivalent to PLS regression performed on $X$ and $y$.
Proposition 1 means that PLS regression is a particular case of the conjugate gradient algorithm applied to the normal equations; the proof is given in our previous paper [14]. We will use this link to justify the choice of the L-curve and to prove that PLS meets its assumptions.
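Proposition 1 is easy to check numerically; in the sketch below, a textbook conjugate gradient run on the normal equations (started at zero) is compared with scikit-learn's PLSRegression on centred, unscaled data, and the two coefficient vectors are expected to agree up to numerical tolerance. The data and variable names are illustrative.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def cg_normal_equations(X, y, n_iter):
    """Conjugate gradient on X'X beta = X'y with beta_0 = 0 (textbook CG)."""
    A, b = X.T @ X, X.T @ y
    beta = np.zeros(X.shape[1])
    r = b - A @ beta
    p = r.copy()
    for _ in range(n_iter):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        beta = beta + alpha * p
        r_new = r - alpha * Ap
        p = r_new + (r_new @ r_new) / (r @ r) * p
        r = r_new
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8)); y = rng.normal(size=30)
Xc, yc = X - X.mean(axis=0), y - y.mean()          # PLSRegression centres internally
k = 3
pls = PLSRegression(n_components=k, scale=False).fit(Xc, yc)
print(np.allclose(np.ravel(pls.coef_), cg_normal_equations(Xc, yc, k), atol=1e-6))  # expected: True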
The next proposition is a direct consequence of the previous one; it is an application of a theorem of Hestenes and Stiefel, which will help to prove that the norm of the PLS estimator is a monotonic function of the parameter $k$.

Proposition 2. The PLS estimators $\hat{\beta}_1, \ldots, \hat{\beta}_k$ are distinct, and each of them minimizes the expression $\|y - X\beta\|^{2}$, with $\beta \in \mathcal{K}_k\!\left(X^{\top}X,\ X^{\top}y\right)$, the Krylov subspace of dimension $k$.

The residuals $r_0, r_1, \ldots, r_{k-1}$ of the conjugate gradient iterations are orthogonal according to Hestenes and Stiefel [15]; therefore, they form an orthogonal basis for the Krylov subspace $\mathcal{K}_k\!\left(X^{\top}X,\ X^{\top}y\right) = \operatorname{span}\{r_0, r_1, \ldots, r_{k-1}\}$, which means that the norm of the PLS estimator at the $k$th iteration can be expressed through its coordinates in this orthogonal basis.

Since the norm of the residual is a decreasing function of $k$, one can show that the norm of the PLS estimator is an increasing function of the parameter $k$; therefore, it is monotonic.

7.2. The Norm of the Residual Error

We now consider the residual error $\|y - X\hat{\beta}_k\|$ of the $k$th PLS iteration.

According to Hansen [10], this residual error for the conjugate gradient algorithm applied to the normal equations is a decreasing function of $k$ when the initial vector is $\beta_0 = 0$, which is equivalent to saying that the residual error of PLS regression is also a decreasing function of the parameter $k$.
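Both monotonicity properties can be verified numerically, for instance with the following sketch on synthetic data (illustrative settings): the solution norms should be non-decreasing and the residual norms non-increasing as $k$ grows.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 10)); y = rng.normal(size=40)
Xc, yc = X - X.mean(axis=0), y - y.mean()

sol_norms, res_norms = [], []
for k in range(1, 10):
    pls = PLSRegression(n_components=k, scale=False).fit(Xc, yc)
    beta_k = np.ravel(pls.coef_)
    sol_norms.append(np.linalg.norm(beta_k))               # ||beta_k|| should increase with k
    res_norms.append(np.linalg.norm(yc - Xc @ beta_k))      # ||y - X beta_k|| should decrease with k
print(np.all(np.diff(sol_norms) >= -1e-10), np.all(np.diff(res_norms) <= 1e-10))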

8. Application

The goal of this application is to compare the performance of three model selection tools in PLS regression: cross validation, generalized cross validation, and the L-curve criterion used in regularization methods. The comparison includes both simulated and real datasets; the algorithm used for the L-curve is the adaptive pruning algorithm for discrete regularization [6].

8.1. Real Data

We use a dataset called Cornell (Table 1), presented in [8, 16]. These data measure the effect of 7 chemical components on the motor octane rating, for a sample of 12 different mixtures.

Since we cannot calculate the mean squared error of the regression coefficient (the true coefficient is unknown for real data), we compare the values of R squared and the mean squared error of the response variable for cross validation, GCV, and the L-curve.

The comparison between CV, GCV, and the L-curve on the real dataset Cornell (Table 2) shows that the latter gives a better model, with an improvement in the mean squared error of y and in R squared, which means a better predictive power for the PLS model.

We note that the L-curve selected the model with five components (Figure 2), GCV selected the model with four components, and cross validation selected the one with only the first three components.

The elapsed time for each method is compared in Table 3.

According to Table 3, GCV takes slightly more time to select the optimal model than the other two methods. The difference increases as the dimension of the variable space grows, and this will be more apparent in the simulation.

8.2. Simulated Data

We compare cross validation, generalized cross validation, and the L-curve algorithm on simulated datasets, each with the same number of observations. The number of variables for each dataset is chosen randomly, by selecting m in the interval [10, 90]. The goal is to explore the robustness of the methods with respect to the dimension of the variable space.

The datasets are generated from a multivariate normal distribution with zero mean and one of two covariance matrices, corresponding to medium and high collinearity.

The regression vector is generated from a uniform distribution in the interval [0, 3], and we consider three different signal-to-noise ratios for the standard deviation of the error.
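A possible way of generating such datasets is sketched below; the compound-symmetry covariance, the correlation levels, and the signal-to-noise ratio shown are illustrative assumptions, since the exact simulation settings are not restated here.

import numpy as np

def simulate_dataset(n, m, rho, snr, rng):
    """Generate one dataset: X ~ N(0, Sigma), beta ~ U[0, 3], y = X beta + noise.

    rho controls collinearity (e.g. 0.5 medium, 0.9 high) through a compound-symmetry
    covariance; snr fixes the noise standard deviation. These particular choices are
    illustrative, not those of the paper.
    """
    sigma = np.full((m, m), rho) + (1.0 - rho) * np.eye(m)   # compound-symmetry covariance
    X = rng.multivariate_normal(np.zeros(m), sigma, size=n)
    beta = rng.uniform(0.0, 3.0, size=m)
    signal = X @ beta
    noise_sd = np.std(signal) / snr                          # noise std dev from the SNR
    y = signal + rng.normal(0.0, noise_sd, size=n)
    return X, y, beta

rng = np.random.default_rng(2023)
X, y, beta = simulate_dataset(n=100, m=50, rho=0.9, snr=1.0, rng=rng)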

We have 6 different configurations, each generated 500 times; for every generated dataset we estimate the PLS model using CV, GCV, and the L-curve algorithm to choose the number of components to keep in the final model. We compare the mean over the 500 replications of the mean squared error (MSE) of both the regression coefficient and the estimated value of the dependent variable for each model selection method; the results are summarized in Tables 4–6.

The MSE of the regression coefficient is calculated using the expression (Euclidean norm)
$$\mathrm{MSE}(\hat{\beta}) = \|\hat{\beta} - \beta\|^{2},$$
with $\beta$ being the true regression coefficient.

The first problem encountered with cross validation is that, for a signal-to-noise ratio equal to 1, the quantity $Q_k^2$ is negative in all cases throughout the 500 datasets, which means that it fails to select a model (in this case, we kept the first component for the comparison). GCV gives the smallest MSE for the response variable, but the highest for the regression coefficient; cross validation yields a low MSE for the regression coefficient, but it inflates that of the response variable remarkably. Meanwhile, the L-curve gives reasonable results: a better MSE of the regression coefficient than GCV and an almost similar MSE for the response variable.

For the comparison of the speed of the three methods on simulated data, we chose ten datasets of 100 observations each, with the number of variables varying from 90 to 99. Table 6 presents the mean elapsed time over the ten datasets. GCV takes roughly 75.04 seconds to select the optimal model, making it the slowest, followed by CV with 11.75 seconds, while the L-curve consistently gives the lowest execution time, at 0.047 seconds.

9. Conclusion

In this paper, we presented a model selection tool that is commonly used in regularization methods and applied it to PLS regression, exploiting the link that exists between partial least squares regression and the conjugate gradient algorithm. To assess the contribution of this model selection tool in PLS, we compared it with two versions of cross validation: one is the most popular method for model choice in PLS, owing to its simplicity; the other is popular in the field of numerical analysis and less common in statistical modelling. The L-curve is suitable because it takes into account the balance between the size of the shrunk solution and the residual error. Our study showed that the L-curve outperforms the two other methods on real data and is much less computationally expensive on artificial data, with results almost similar to those of the GCV method. Weighing the main disadvantage of GCV, namely its high computational cost, the L-curve appears to be the better choice when working with massive datasets, which are common in many fields of research that require PLS regression.

Data Availability

Previously reported data were used as real data to support this study and are available in [8, 16], which are cited in the text.

Disclosure

An earlier version of this work has been presented as an oral communication in the International SM2A Conference in Meknes, Morocco, 2017.

Conflicts of Interest

The authors declare that they have no conflicts of interest.