Abstract

Partial least squares (PLS) regression is used as an alternative to ordinary least squares (OLS) regression in the presence of multicollinearity, a situation that is common in chemical engineering problems. In addition to the linear form of PLS, there are nonlinear versions, such as the quadratic PLS (QPLS2). QPLS2 differs from the regular PLS algorithm in that it uses a quadratic regression instead of an OLS regression when calculating the latent variables. In this paper we propose a robust version of QPLS2 that overcomes its sensitivity to outliers by using the Blocked Adaptive Computationally Efficient Outlier Nominators (BACON) algorithm. Our hybrid method is tested on both real and simulated data.

1. Introduction

Since it was developed by Wold [1], PLS regression has become a classic way to deal with correlated predictors in regression analysis; the method is popular in many fields, such as genomics and chemometrics. Many statisticians have shown interest in its mathematical properties: De Jong [2] proved that the PLS estimator is a regularized version of the ordinary least squares estimator, and the same result was later demonstrated algebraically by Goutis et al. [3]. As data exhibiting nonlinear behavior emerged in many fields, a new version of PLS regression was needed that captures the nonlinearity and provides more parsimonious models. Wold [4] developed the first nonlinear version of the PLS algorithm by substituting a quadratic regression for OLS in the calculation of the PLS components. Wold [5] also proposed the spline PLS algorithm. Another nonlinear algorithm, based on neural networks, was proposed in [6] to deal with the nonlinearity of meteorological data.

PLS regression is sensitive to outliers and leverage points. Several robust versions have therefore been proposed in the literature, but only for linear PLS. Hubert [7] proposed two robust versions of the SIMPLS algorithm based on a robust estimation of the variance-covariance matrix. Kondylis and Hadi [8] used the BACON algorithm to eliminate outliers, resulting in a robust linear PLS.

In this work we aim to obtain a robust version of the quadratic PLS algorithm (QPLS2) by using the BACON algorithm. The method is validated on real and simulated data.

2. Nonlinear PLS Regression

Every linear regression method is based on the following optimization problem:
$$\min_{\beta} \lVert y - X\beta \rVert^{2},$$
where $X$ is the matrix containing the values of the independent variables, $y$ is the dependent variable, and $\beta$ is the vector of regression coefficients.

Instead of the original predictors, PLS regression uses a set of latent variables called scores, $t = Ew$ (with $E$ the deflated version of the initial matrix $X$). The latent variables (also called the PLS components) are calculated iteratively, based on the decomposition
$$X = TP^{T} + E,$$
where $T$ is the matrix of scores, $E$ is the error, $P$ is a set of vectors called the loadings, and $w$ is a weight vector of length $p$. As mentioned in the introduction, because many datasets exhibit nonlinear behavior, researchers have proposed new PLS algorithms to capture this nonlinearity. In this work we use the quadratic nonlinear PLS proposed by Wold [4].

The quadratic nonlinear PLS is a PLS algorithm that assumes a nonlinear relation between the two blocks of variables. Instead of the OLS regression between the scores used in the linear PLS algorithm,
$$u = bt + e,$$
Wold et al. [4] used a quadratic regression:
$$u = c_0 + c_1 t + c_2 t^{2} + e,$$
where $t$ and $u$ are the scores of the predictor and response blocks and $e$ is the residual.

Every regression method performs poorly in the presence of outliers. Because outliers make the estimates unstable, many approaches have been developed to overcome this problem, such as filtering the outliers from the dataset or giving them lower weights to minimize their effect on the estimation process. The next section focuses on the BACON algorithm, an approach that deletes the outliers to obtain a clean dataset.
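The difference between a linear and a quadratic inner relation can be illustrated on synthetic scores. The following Python sketch is not the authors' code: the data are simulated, and the helper name `fit_inner_relation` is hypothetical. It fits both relations by least squares and compares their residual sums of squares.

```python
import numpy as np

def fit_inner_relation(t, u, degree):
    """Least-squares polynomial fit of u on t.

    Returns the coefficients (highest power first) and the fitted values.
    """
    coefs = np.polyfit(t, u, deg=degree)
    return coefs, np.polyval(coefs, t)

# Synthetic scores with a known quadratic relation (illustrative only).
rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, size=200)
u = 0.5 + 1.0 * t + 2.0 * t**2 + rng.normal(scale=0.05, size=200)

_, u_lin = fit_inner_relation(t, u, degree=1)            # linear inner relation
coefs_quad, u_quad = fit_inner_relation(t, u, degree=2)  # quadratic inner relation

sse_lin = float(np.sum((u - u_lin) ** 2))
sse_quad = float(np.sum((u - u_quad) ** 2))
print(f"SSE linear: {sse_lin:.3f}, SSE quadratic: {sse_quad:.3f}")
```

Because the linear model is nested in the quadratic one, the quadratic fit can never have a larger residual sum of squares on the fitting data; the gain is substantial only when the relation between the scores is actually curved.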

3. Robust PLS Regression

3.1. Outliers Detection and Robust Regression

Robust regression is a way of dealing with outliers, that is, observations that come from a different distribution than the bulk of the data or that result from measurement errors, and that can harm the quality of the estimation. Like OLS regression, PLS regression is sensitive to outliers [8]. Detecting them is therefore necessary in order to obtain stable estimates and accurate predictions.

Many researchers proposed methods of dealing with the outlier problem in PLS regression. Hubert [7] used two robust estimations of the variance-covariance matrix in the SIMPLS algorithm, and Kondylis and Hadi [8] used the BACON algorithm for outlier detection. Both approaches proved to be a significant improvement over the regular PLS.

The BACON algorithm [9] starts with a subset of observations of size $m$ that is presumed to be free of outliers, and then iteratively adds the observations that are consistent with this initial subset. The observations left out at the end are declared outliers.

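First, an initial subset is chosen. Then a distance is defined and used as the criterion for including an observation in the subset. Two distances are used in the literature:
$$d_i(\bar{x}, S) = \sqrt{(x_i - \bar{x})^{T} S^{-1} (x_i - \bar{x})}, \quad i = 1, \dots, n,$$
$$d_i(\tilde{x}) = \lVert x_i - \tilde{x} \rVert, \quad i = 1, \dots, n,$$
where $\bar{x}$ is the mean and $S$ is the variance-covariance matrix of the entire data set, and $x_i$ represents the $i$th observation. The first distance is called the Mahalanobis distance, and the second is simply the distance of the observation from the median $\tilde{x}$. The detailed steps of the algorithm are as follows:
(1) Select an initial subset $X_b$.
(2) Compute the distances ($\bar{x}_b$ is the mean of $X_b$, and $S_b$ is the covariance matrix of $X_b$):
$$d_i(\bar{x}_b, S_b) = \sqrt{(x_i - \bar{x}_b)^{T} S_b^{-1} (x_i - \bar{x}_b)}, \quad i = 1, \dots, n.$$
(3) Set the new subset $X_b$ to all the points that satisfy
$$d_i(\bar{x}_b, S_b) < c_{npr}\, \chi_{p,\alpha/n},$$
where $\chi^{2}_{p,\alpha}$ is the $1-\alpha$ percentile of the chi-square distribution with $p$ degrees of freedom and
$$c_{npr} = c_{np} + c_{hr}, \quad c_{np} = 1 + \frac{p+1}{n-p} + \frac{2}{n-1-3p}, \quad c_{hr} = \max\left\{0, \frac{h-r}{h+r}\right\},$$
with $h = \lfloor (n+p+1)/2 \rfloor$ and $r$ the size of the current subset.
(4) Repeat (2) and (3) until the subset does not change.
(5) The final subset $X_b$ is the dataset free from outliers.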
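The iteration described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: it uses the Mahalanobis distance, an initial subset of `c * p` points (the multiplier `c = 4` is an assumption), the correction factor of the original BACON paper [9], and the Wilson-Hilferty approximation to the chi-square percentile so that only NumPy and the standard library are needed. All function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def chi2_quantile(prob, df):
    """Wilson-Hilferty approximation to the chi-square quantile (not exact)."""
    z = NormalDist().inv_cdf(prob)
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * np.sqrt(a)) ** 3

def mahalanobis(X, center, cov):
    """Mahalanobis distance of every row of X from `center` under `cov`."""
    diff = X - center
    solved = np.linalg.solve(cov, diff.T).T
    return np.sqrt(np.einsum("ij,ij->i", diff, solved))

def bacon(X, alpha=0.05, c=4, max_iter=100):
    """Return a boolean mask of the observations retained as outlier-free."""
    n, p = X.shape
    h = (n + p + 1) // 2
    c_np = 1.0 + (p + 1) / (n - p) + 2.0 / (n - 1 - 3 * p)
    chi = np.sqrt(chi2_quantile(1.0 - alpha / n, p))
    # Step (1): initial subset = the c*p points closest to the overall mean.
    d = mahalanobis(X, X.mean(axis=0), np.cov(X, rowvar=False))
    subset = np.zeros(n, dtype=bool)
    subset[np.argsort(d)[: c * p]] = True
    for _ in range(max_iter):
        Xb = X[subset]
        r = Xb.shape[0]
        # Step (2): distances from the mean and covariance of the current subset.
        d = mahalanobis(X, Xb.mean(axis=0), np.cov(Xb, rowvar=False))
        # Step (3): keep the points below the corrected chi-square cutoff.
        c_hr = max(0.0, (h - r) / (h + r))
        new_subset = d < (c_np + c_hr) * chi
        # Step (4): stop when the subset no longer changes.
        if np.array_equal(new_subset, subset):
            break
        subset = new_subset
    return subset

# Illustrative data: 200 clean points plus 10 shifted outliers.
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 3))
outliers = 8.0 + 0.5 * rng.normal(size=(10, 3))
X = np.vstack([clean, outliers])
keep = bacon(X)
print(int(keep[:200].sum()), "clean points kept,", int(keep[200:].sum()), "outliers kept")
```

The mask returned by `bacon` plays the role of step (5): rows where it is `True` form the retained subset, and the remaining rows are the nominated outliers.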

3.2. Robust Nonlinear PLS

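We merge the BACON algorithm with the quadratic PLS, with the goal of obtaining a robust version of the algorithm:
(1) Run the BACON algorithm on the dataset $X$ using distance (6), and keep the outcome $X_b$. Then delete the observations of the dependent variable related to the outliers to obtain $Y_b$ (free from outliers).
(2) For every PLS dimension, repeat until convergence of $t$ ($u$ is initialized as the first column of $Y_b$):
(i) Calculate the weights: $w = X_b^{T} u / (u^{T} u)$, normalized so that $\lVert w \rVert = 1$.
(ii) Calculate the scores: $t = X_b w$.
(iii) Fit $u$ to $t$ using the quadratic function and calculate the prediction of $u$ using the nonlinear estimates: $\hat{u} = \hat{c}_0 + \hat{c}_1 t + \hat{c}_2 t^{2}$.
(iv) Calculate $q = Y_b^{T} \hat{u} / (\hat{u}^{T} \hat{u})$.
(v) Update $u = Y_b q / (q^{T} q)$.
(vi) Update $w$ as described in (i).
(vii) Calculate the new value of $t$: $t = X_b w$.
(3) Calculate the loadings using the final value of $t$: $\mathbf{p} = X_b^{T} t / (t^{T} t)$, where $\mathbf{p}$ is the loading vector (a column of $P$).
(4) Deflate $X_b$ and $Y_b$: $E = X_b - t\,\mathbf{p}^{T}$ and $F = Y_b - \hat{u}\, q^{T}$.
(5) If an additional dimension is required, replace $X_b$ and $Y_b$ with $E$ and $F$ and repeat steps (2) to (4).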
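The steps above can be sketched in Python under several assumptions that are not stated in the text: the data are column-centered after outlier removal, the weights are normalized to unit length, the quadratic fit and $q$ are recomputed once from the converged scores before deflation, and the outlier mask `keep` is supplied by a separate outlier detector such as BACON. The function names `qpls2_component` and `robust_qpls2` are hypothetical, and the example data are synthetic.

```python
import numpy as np

def qpls2_component(X, Y, tol=1e-8, max_iter=200):
    """Extract one component with a quadratic inner relation (sketch)."""
    u = Y[:, 0].copy()                          # u starts as the first column of Y
    w = X.T @ u / (u @ u)                       # (i) weights
    w /= np.linalg.norm(w)
    t = X @ w                                   # (ii) scores
    for _ in range(max_iter):
        coefs = np.polyfit(t, u, deg=2)         # (iii) quadratic fit of u on t
        u_hat = np.polyval(coefs, t)
        q = Y.T @ u_hat / (u_hat @ u_hat)       # (iv)
        u = Y @ q / (q @ q)                     # (v)
        w = X.T @ u / (u @ u)                   # (vi) same formula as (i)
        w /= np.linalg.norm(w)
        t_new = X @ w                           # (vii)
        converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
        t = t_new
        if converged:
            break
    # Assumption: refit with the final t so the deflation uses consistent values.
    u_hat = np.polyval(np.polyfit(t, u, deg=2), t)
    q = Y.T @ u_hat / (u_hat @ u_hat)
    p_load = X.T @ t / (t @ t)                  # (3) loadings
    E = X - np.outer(t, p_load)                 # (4) deflation of X
    F = Y - np.outer(u_hat, q)                  #     and of Y
    return t, E, F

def robust_qpls2(X, Y, keep, n_components=1):
    """Drop flagged rows, center, then extract components one by one."""
    Xb = X[keep] - X[keep].mean(axis=0)
    Yb = Y[keep] - Y[keep].mean(axis=0)
    scores = []
    for _ in range(n_components):
        t, Xb, Yb = qpls2_component(Xb, Yb)     # (5) continue on the residuals
        scores.append(t)
    return np.column_stack(scores), Xb, Yb

# Synthetic example: one hidden driver, quadratic link to Y, 10 gross outliers.
rng = np.random.default_rng(0)
n = 200
z = rng.uniform(-1.0, 1.0, size=n)
X = np.outer(z, [1.0, -0.5, 0.8, 0.3]) + rng.normal(scale=0.05, size=(n, 4))
Y = np.outer(z + z**2, [1.0, -0.5]) + rng.normal(scale=0.05, size=(n, 2))
X[-10:] += 5.0
keep = np.ones(n, dtype=bool)
keep[-10:] = False                              # stand-in for the BACON output
T, E, F = robust_qpls2(X, Y, keep)
Y_centered = Y[keep] - Y[keep].mean(axis=0)
explained_y = 1.0 - np.sum(F**2) / np.sum(Y_centered**2)
print(f"share of Y variance explained by one component: {explained_y:.3f}")
```

Separating outlier removal from component extraction mirrors the two-stage structure of the method: any detector that returns a row mask can be substituted without touching the PLS loop.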

4. Application

The goal of this application is to compare the performance of the robust quadratic PLS with the original quadratic PLS. The comparison is conducted on both simulated and real data.

4.1. Real Data

We use the dataset presented in [4], which contains 8 different formulations of cosmetic products, as predictive variables, and 11 dependent variables presenting quality indicators collected in an experiment on 17 individuals.

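Since we cannot calculate the mean squared error for this dataset, we compare the percentage of explained variance of the robust and original quadratic PLS:
$$R^{2}(X; t_h) = \frac{1}{p} \sum_{j=1}^{p} \operatorname{cor}^{2}(x_j, t_h), \qquad R^{2}(Y; t_h) = \frac{1}{q} \sum_{k=1}^{q} \operatorname{cor}^{2}(y_k, t_h),$$
where $x_j$ and $y_k$ are the columns of the predictor and response blocks, $t_h$ is the latent component of the $h$th PLS iteration, $q$ is the number of dependent variables, and $p$ is the number of predictive variables.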
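One common way to compute an explained-variance percentage of this kind is the average squared correlation between each variable of a block and a latent component. The sketch below assumes that definition; the function name `explained_variance` is hypothetical and the small arrays are illustrative.

```python
import numpy as np

def explained_variance(block, t):
    """Mean squared correlation between each column of `block` and the scores t."""
    block_c = block - block.mean(axis=0)
    t_c = t - t.mean()
    cross = block_c.T @ t_c
    denom = np.sqrt(np.sum(block_c**2, axis=0) * np.sum(t_c**2))
    return float(np.mean((cross / denom) ** 2))

t = np.array([-1.0, 0.0, 1.0])
block = np.column_stack([2.0 * t + 1.0,                 # perfectly correlated with t
                         np.array([1.0, -2.0, 1.0])])   # uncorrelated with t
share = explained_variance(block, t)
print(share)  # 0.5: the average of squared correlations 1 and 0
```

With several components, the total explained variance is obtained by summing this quantity over the components, provided the components are mutually uncorrelated.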

Table 1 compares the original and robust quadratic PLS: the latter increases the explained variance of the dependent variables from 68% to 91%, a considerable improvement. This suggests that the dataset contains outliers that affected the estimation in the original quadratic PLS.

4.2. Simulated Data

In this section, a contamination study is used to assess the quality of the proposed robust method, following these steps:
(1) The nonlinear function presented in [10] is used to generate a dataset with 500 observations and 6 variables (its input is generated from a uniform distribution).
(2) The dataset is randomly contaminated by adding a small percentage of data (5%, 10%, and 15%) drawn from a multivariate normal distribution.
(3) We first apply the quadratic PLS to the generated data, and then the robust quadratic PLS described previously.
(4) We compare the original quadratic PLS with the proposed robust quadratic PLS using the explained variance, the predictive mean squared error, and the predicted residual error sum of squares (PRESS).
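The contamination step can be sketched as follows. Several details are assumptions made only for illustration: the clean data are a uniform placeholder rather than the function from [10], contamination is implemented by replacing randomly chosen rows (so that the sample size stays at 500), and the mean shift and identity covariance of the contaminating normal distribution are arbitrary. The function name `contaminate` is hypothetical.

```python
import numpy as np

def contaminate(data, rate, rng, shift=10.0):
    """Replace a random fraction `rate` of the rows with multivariate normal draws.

    Returns the contaminated copy and the indices of the replaced rows.
    """
    n, k = data.shape
    n_out = int(round(rate * n))
    idx = rng.choice(n, size=n_out, replace=False)
    out = data.copy()
    out[idx] = rng.multivariate_normal(np.full(k, shift), np.eye(k), size=n_out)
    return out, idx

rng = np.random.default_rng(1)
clean = rng.uniform(0.0, 1.0, size=(500, 6))   # placeholder, NOT the function from [10]
contaminated = {rate: contaminate(clean, rate, rng) for rate in (0.05, 0.10, 0.15)}
for rate, (data, idx) in contaminated.items():
    print(f"rate {rate:.2f}: {len(idx)} contaminated rows out of {data.shape[0]}")
```

Keeping the indices of the replaced rows makes it possible to check afterwards how many of the planted outliers a detector actually removes.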

The simulation is repeated 1000 times. The reported explained variance, predictive mean squared error, and PRESS are the means of the values calculated over the simulated datasets.

For a 5% contamination rate (Table 2), the original quadratic PLS yields a total explained variance of 73%, whereas the robust quadratic PLS reaches 99%, a considerable improvement. The 10% and 15% contamination rates show a similar improvement in the explained variance of the dependent variable.

The dataset of 500 observations was then split into two parts. The first part contained 400 observations and was used to estimate two models: one with the original quadratic PLS and one with the robust quadratic PLS. We then calculated the root mean squared error of prediction (RMSEP) of the dependent variable on the 100 left-out observations.
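The hold-out evaluation can be sketched as below. The quadratic polynomial fitted with `np.polyfit` is only a stand-in for the PLS models, the data are synthetic, and the random 400/100 split is an assumption (the text does not say how the split was made). The function name `rmsep` is hypothetical.

```python
import numpy as np

def rmsep(y_true, y_pred):
    """Root mean squared error of prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic data with a quadratic signal and noise of standard deviation 0.1.
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=500)
y = 1.0 + x + 2.0 * x**2 + rng.normal(scale=0.1, size=500)

perm = rng.permutation(500)
train, test = perm[:400], perm[400:]             # 400 for fitting, 100 held out
coefs = np.polyfit(x[train], y[train], deg=2)    # stand-in model
error = rmsep(y[test], np.polyval(coefs, x[test]))
print(f"RMSEP on the 100 held-out observations: {error:.3f}")
```

Because the held-out observations are never used in fitting, the RMSEP reflects predictive rather than in-sample accuracy.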

A comparison across the three contamination rates (Table 3) shows that the robust quadratic PLS yields a smaller prediction error in every case. The same table presents the PRESS for each rate, calculated by leaving out 10% of the observations; the PRESS is likewise lower for the robust quadratic PLS.
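A PRESS computed by leaving out 10% of the observations can be sketched as a 10-fold cross-validation. Reading "leaving out 10%" as ten disjoint folds is an assumption, as are the synthetic data and the polynomial stand-in models; the names `press_kfold`, `make_poly_fit`, and `poly_predict` are hypothetical.

```python
import numpy as np

def press_kfold(x, y, fit, predict, n_folds=10, seed=0):
    """Sum of squared prediction errors over held-out folds."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    press = 0.0
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        model = fit(x[train], y[train])
        residuals = y[held_out] - predict(model, x[held_out])
        press += float(np.sum(residuals**2))
    return press

def make_poly_fit(degree):
    """Return a fitting function for a polynomial of the given degree."""
    def fit(xs, ys):
        return np.polyfit(xs, ys, deg=degree)
    return fit

def poly_predict(model, xs):
    return np.polyval(model, xs)

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=500)
y = 1.0 + x + 2.0 * x**2 + rng.normal(scale=0.1, size=500)

press_lin = press_kfold(x, y, make_poly_fit(1), poly_predict)
press_quad = press_kfold(x, y, make_poly_fit(2), poly_predict)
print(f"PRESS linear: {press_lin:.1f}, PRESS quadratic: {press_quad:.1f}")
```

Passing the fitting and prediction routines as arguments lets the same function score the original and the robust model on identical folds, which keeps the comparison fair.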

Figures 1, 2, and 3 compare the predicted and actual values of the simulated dataset for both the quadratic and the robust quadratic PLS regression. For all contamination rates, the proposed robust quadratic PLS gives markedly better predictions than the original one.

5. Conclusion

PLS regression has developed considerably since it was first introduced. The nonlinear nature of the data encountered in chemical engineering motivated the development of nonlinear PLS methods. In this paper we proposed a robust version of the quadratic nonlinear PLS, a hybrid of the quadratic PLS algorithm and the BACON algorithm, in order to overcome the problems caused by outliers. Our method outperformed the quadratic PLS on both real and simulated data.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.