Journal of Applied Mathematics

Volume 2018, Article ID 7696302, 5 pages

https://doi.org/10.1155/2018/7696302

## Robust Nonlinear Partial Least Squares Regression Using the BACON Algorithm

^{1}Department of Mathematics, University Mohamed the First, Oujda 60000, Morocco

^{2}Department of Management, Faculty of Social Sciences, Oujda 60000, Morocco

Correspondence should be addressed to Abdelmounaim Kerkri; krkabdelmounaim@gmail.com

Received 11 June 2018; Revised 18 August 2018; Accepted 18 September 2018; Published 2 October 2018

Academic Editor: Lucas Jodar

Copyright © 2018 Abdelmounaim Kerkri et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Partial least squares regression (PLS regression) is used as an alternative to ordinary least squares (OLS) regression in the presence of multicollinearity, which is common in chemical engineering problems. In addition to the linear form of PLS, there are other versions based on a nonlinear approach, such as the quadratic PLS (QPLS2). QPLS2 differs from the regular PLS algorithm in that it uses quadratic regression instead of OLS regression when calculating the latent variables. In this paper we propose a robust version of QPLS2 that overcomes its sensitivity to outliers by using the Blocked Adaptive Computationally Efficient Outlier Nominators (BACON) algorithm. Our hybrid method is tested on both real and simulated data.

#### 1. Introduction

After it was developed by Wold [1], PLS regression became a classic way to overcome correlation among predictors in regression analysis; the method is popular in many fields such as genomics and chemometrics. Many statisticians showed interest in its mathematical properties; De Jong [2] proved that the PLS estimator is a regularized version of the ordinary least squares estimator. The same result was later demonstrated algebraically by Goutis et al. [3]. As data showing nonlinear behavior emerged in many fields, a new version of PLS regression was needed that captures the nonlinearity and provides more parsimonious models. Wold [4] developed the first nonlinear version of the PLS algorithm by substituting OLS with a quadratic regression to calculate the PLS components. Wold [5] also proposed the spline PLS algorithm. A further nonlinear algorithm, based on neural networks, was proposed in [6] to deal with the nonlinearity of meteorological data.

PLS regression is sensitive to outliers and leverage points. Several robust versions have therefore been proposed in the literature, but only for linear PLS. Hubert [7] proposed two robust versions of the SIMPLS algorithm by using a robust estimation of the variance-covariance matrix. Kondylis and Hadi [8] used the BACON algorithm to eliminate outliers, resulting in a robust linear PLS.

In this work we attempt to obtain a robust version of the quadratic PLS algorithm QPLS2, by using the BACON algorithm. An application on real and simulated data is used to validate the method.

#### 2. Nonlinear PLS Regression

Every linear regression method is based on the following optimization problem:

$$\min_{\beta} \lVert y - X\beta \rVert^{2} \tag{1}$$

where $X$ is an $n \times p$ matrix presenting the values of the independent variables, $y$ is the dependent variable, and $\beta$ is the vector of regression coefficients.

Instead of regular predictors, PLS regression uses a set of latent variables called scores: $t_k = X_{k-1} w_k$ (with $X_{k-1}$ the deflated version of the initial matrix $X$). The latent variables (also called the PLS components) are iteratively calculated, based on the decomposition:

$$X = T P^{T} + E \tag{2}$$

where $E$ is the error, $P$ is a set of vectors called the loadings, and $w_k$ is a weight vector of length $p$.

As mentioned in the introduction, because data with nonlinear behavior are frequently encountered, many researchers proposed new PLS algorithms to capture the nonlinearity of these datasets. In this work we use the quadratic nonlinear PLS as proposed by Wold [4].
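To make the score, weight, and loading vocabulary concrete, the following is a minimal sketch of extracting a single linear PLS component for one dependent variable. It assumes centered data and a NIPALS-style update; the function name `pls1_component` and the simulated data are illustrative and do not come from the paper.

```python
import numpy as np

def pls1_component(X, y):
    """Extract one PLS component from centered X (n x p) and centered y (n,)."""
    w = X.T @ y                      # weight vector of length p
    w = w / np.linalg.norm(w)        # normalize to unit length
    t = X @ w                        # score (latent variable)
    p_load = X.T @ t / (t @ t)       # loading vector
    E = X - np.outer(t, p_load)      # deflated predictor matrix
    return w, t, p_load, E

# Simulated data with two nearly collinear predictors (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=30)
y = X @ np.array([1.0, 0.5, 0.0, 1.0]) + 0.1 * rng.normal(size=30)
X = X - X.mean(axis=0)
y = y - y.mean()

w, t, p_load, E = pls1_component(X, y)
```

After deflation, the columns of `E` are orthogonal to the score `t`, which is why the next component can be extracted from `E` without repeating information already captured.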

The quadratic nonlinear PLS is a PLS algorithm that supposes the existence of nonlinear relations between the two blocks of variables. Instead of the OLS regression used in the linear PLS algorithm,

$$u = b\,t + h \tag{3}$$

Wold et al. [4] used a quadratic regression:

$$u = c_0 + c_1 t + c_2 t^{2} + h \tag{4}$$

where $u$ and $t$ are the scores of the dependent and independent blocks and $h$ is the residual.

Like other least squares fits, this regression performs poorly in the presence of outliers. Because the resulting estimates are unstable, many approaches have been developed to overcome this problem, such as filtering the outliers from the dataset or giving them lower weights to minimize their effect on the estimation process. The next section focuses on the BACON algorithm, an approach that deletes the outliers to obtain a clean dataset.
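The difference between a linear and a quadratic inner relation can be illustrated with a short sketch. It assumes the quadratic relation has the usual form $u = c_0 + c_1 t + c_2 t^2$ and fits it by least squares; the helper name `fit_inner_quadratic` and the simulated scores are hypothetical.

```python
import numpy as np

def fit_inner_quadratic(t, u):
    """Least-squares fit of u = c0 + c1*t + c2*t**2; returns (c, u_hat)."""
    T = np.column_stack([np.ones_like(t), t, t**2])
    c, *_ = np.linalg.lstsq(T, u, rcond=None)
    return c, T @ c

# Simulated scores with a curved relation (illustrative only).
rng = np.random.default_rng(1)
t = np.linspace(-2, 2, 50)
u = 0.5 + 1.0 * t - 0.8 * t**2 + 0.05 * rng.normal(size=50)

c, u_hat = fit_inner_quadratic(t, u)

# Linear inner relation u = b*t, for comparison.
b = (t @ u) / (t @ t)
sse_quad = np.sum((u - u_hat) ** 2)
sse_lin = np.sum((u - b * t) ** 2)
```

Because the linear relation is a special case of the quadratic one, `sse_quad` can never exceed `sse_lin`; the gap between them indicates how much curvature the linear fit misses.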

#### 3. Robust PLS Regression

##### 3.1. Outliers Detection and Robust Regression

Robust regression is a way of dealing with outliers, which are observations that come from a different distribution than the bulk of the data. They can also result from measurement errors, and they can harm the quality of the estimation. Just like OLS regression, PLS regression is sensitive to outliers [8]. Detecting them is therefore necessary in order to obtain stable estimations and accurate predictions.

Many researchers proposed methods of dealing with the outlier problem in PLS regression. Hubert [7] used two robust estimations of the variance-covariance matrix in the SIMPLS algorithm, and Kondylis and Hadi [8] used the BACON algorithm for outlier detection. Both approaches proved to be a significant improvement over the regular PLS.

The BACON algorithm [9] starts with a subset of observations of size $m$ that is supposedly free of outliers, and then iteratively adds the observations that are consistent with this subset. The observations left out at the end are the outliers.

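The initial subset is chosen first. A distance is then defined and used as the criterion for including an observation in the initial subset. Two distances are used in the literature:

$$d_i(\bar{x}, S) = \sqrt{(x_i - \bar{x})^{T} S^{-1} (x_i - \bar{x})}, \quad i = 1, \dots, n \tag{5}$$

$$d_i = \lVert x_i - \tilde{x} \rVert, \quad i = 1, \dots, n \tag{6}$$

where $\bar{x}$ is the mean and $S$ is the variance-covariance matrix of the entire data set, and $x_i$ represents the $i$th observation. The first distance is called the Mahalanobis distance, and the second is simply the distance of the observation from the median $\tilde{x}$. The detailed steps of the algorithm are as follows:

(1) Select an initial set $X_b$.

(2) Compute the distances ($\bar{x}_b$ is the mean of $X_b$, and $S_b$ is the covariance matrix of $X_b$):

$$d_i(\bar{x}_b, S_b) = \sqrt{(x_i - \bar{x}_b)^{T} S_b^{-1} (x_i - \bar{x}_b)}, \quad i = 1, \dots, n \tag{7}$$

(3) Set the new subset $X_b$ to all the points that have $d_i(\bar{x}_b, S_b) < c_{npr}\,\chi_{p,\alpha/n}$, where $\chi^{2}_{p,\alpha/n}$ is the $1 - \alpha/n$ percentile of the chi-square distribution with $p$ degrees of freedom and

$$c_{npr} = c_{np} + c_{hr}, \quad c_{np} = 1 + \frac{p+1}{n-p} + \frac{2}{n-1-3p}, \quad c_{hr} = \max\left\{0, \frac{h-r}{h+r}\right\} \tag{8}$$

with $h = \lfloor (n+p+1)/2 \rfloor$ and $r$ the size of the current subset.

(4) Repeat (2) and (3) until the subset does not change.

(5) $X_b$ is the dataset free from outliers.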
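The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' implementation: the initial subset is taken as the `c * p` points closest to the coordinatewise median (with `c = 4` as an assumed default), the iteration uses Mahalanobis distances with the correction factor from the BACON paper [9], and the chi-square percentile is approximated with the Wilson-Hilferty formula to keep the example free of extra dependencies (`scipy.stats.chi2.ppf` would give the exact value). The names `bacon` and `chi2_quantile` are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def chi2_quantile(q, df):
    """Wilson-Hilferty approximation to the q-quantile of chi-square(df)."""
    z = NormalDist().inv_cdf(q)
    return df * (1 - 2 / (9 * df) + z * np.sqrt(2 / (9 * df))) ** 3

def bacon(X, alpha=0.05, c=4, max_iter=100):
    """Return a boolean mask marking the rows of X kept as outlier-free."""
    n, p = X.shape
    m = min(c * p, n)
    # Initial basic subset: the m points closest to the coordinatewise median.
    d0 = np.linalg.norm(X - np.median(X, axis=0), axis=1)
    subset = np.zeros(n, dtype=bool)
    subset[np.argsort(d0)[:m]] = True

    h = (n + p + 1) // 2
    c_np = 1 + (p + 1) / (n - p) + 2 / (n - 1 - 3 * p)   # assumes n > 3p + 1
    chi = np.sqrt(chi2_quantile(1 - alpha / n, p))

    for _ in range(max_iter):                 # safety cap on the iterations
        r = subset.sum()
        mean = X[subset].mean(axis=0)
        cov = np.atleast_2d(np.cov(X[subset], rowvar=False))
        diff = X - mean
        # Mahalanobis distance of every observation to the current subset.
        d = np.sqrt(np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff))
        c_npr = c_np + max(0.0, (h - r) / (h + r))
        new_subset = d < c_npr * chi
        if np.array_equal(new_subset, subset):
            break
        subset = new_subset
    return subset

# Simulated data: 95 clean points plus 5 shifted outliers (illustrative only).
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
X[:5] += 10.0
mask = bacon(X)
X_clean = X[mask]
```

Starting from points near the median keeps the first mean and covariance estimates uncontaminated, and the factor `c_npr` widens the threshold while the subset is still small.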

##### 3.2. Robust Nonlinear PLS

We merge the BACON algorithm with the quadratic PLS, with the goal of obtaining a robust version of the algorithm:

(1) Run the BACON algorithm on the dataset $X$ using distance (6), and keep the outcome $X_c$. Then delete the observations in the dependent variable related to the outliers to obtain $Y_c$ (free from outliers).

(2) For every PLS dimension, repeat until convergence of $t$ ($u$ is the first column of $Y_c$):

(i) Calculate the weights: $w^{T} = u^{T} X_c / (u^{T} u)$

(ii) Calculate the scores: $t = X_c w$

(iii) Fit $u$ to $t$ using the quadratic function and calculate the prediction of $u$ using the nonlinear estimates: $\hat{u} = c_0 + c_1 t + c_2 t^{2}$

(iv) Calculate $q^{T} = \hat{u}^{T} Y_c / (\hat{u}^{T} \hat{u})$

(v) Update $u = Y_c q / (q^{T} q)$

(vi) Update $w$ as described in (i).

(vii) Calculate the new value of $t$: $t = X_c w / (w^{T} w)$

(3) Calculate the loadings using the final value of $t$: $p^{T} = t^{T} X_c / (t^{T} t)$

(4) Deflate $X_c$ and $Y_c$: $E = X_c - t p^{T}$, $F = Y_c - \hat{u} q^{T}$

(5) If an additional dimension is required, replace $X_c$ and $Y_c$ with $E$ and $F$ and repeat steps (2) to (4).
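The iteration for one dimension can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes the outliers have already been removed (the boolean mask `keep` below is a hypothetical stand-in for the BACON output), it normalizes the weight and loading vectors, and it refreshes the weights by recomputing them from the updated `u` rather than applying any more elaborate correction. The function name `qpls_component` is an assumption.

```python
import numpy as np

def qpls_component(X, Y, max_iter=100, tol=1e-10):
    """One quadratic-PLS component from centered X (n x p) and Y (n x q)."""
    u = Y[:, 0].copy()                        # start from the first column of Y
    t_old = np.zeros(X.shape[0])
    for _ in range(max_iter):
        w = X.T @ u / (u @ u)                 # weights
        w /= np.linalg.norm(w)
        t = X @ w                             # scores
        T = np.column_stack([np.ones_like(t), t, t**2])
        c, *_ = np.linalg.lstsq(T, u, rcond=None)
        u_hat = T @ c                         # quadratic prediction of u
        q = Y.T @ u_hat / (u_hat @ u_hat)     # loadings of the Y block
        q /= np.linalg.norm(q)
        u = Y @ q                             # updated u for the next pass
        if np.linalg.norm(t - t_old) < tol * np.linalg.norm(t):
            break
        t_old = t
    p_load = X.T @ t / (t @ t)                # loadings of the X block
    E = X - np.outer(t, p_load)               # deflated X
    F = Y - np.outer(u_hat, q)                # deflated Y
    return t, u_hat, p_load, q, E, F

# Simulated data with a curved X-Y relation (illustrative only).
rng = np.random.default_rng(3)
X = rng.normal(size=(42, 3))
z = X[:, 0] + 0.5 * X[:, 1]
Y = np.column_stack([z + 0.5 * z**2, 2 * z - 0.3 * z**2])
Y += 0.05 * rng.normal(size=Y.shape)

keep = np.ones(42, dtype=bool)                # hypothetical BACON mask
keep[:2] = False                              # pretend two rows were flagged
Xc = X[keep] - X[keep].mean(axis=0)
Yc = Y[keep] - Y[keep].mean(axis=0)

t, u_hat, p_load, q, E, F = qpls_component(Xc, Yc)
```

To extract a further dimension under the same assumptions, call `qpls_component(E, F)` on the deflated matrices.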

#### 4. Application

The goal of this application is to compare the performance of the robust quadratic PLS with the original quadratic PLS. The comparison is conducted on both simulated and real data.

##### 4.1. Real Data

We use the dataset presented in [4], which contains 8 different formulations of cosmetic products, as predictive variables, and 11 dependent variables presenting quality indicators collected in an experiment on 17 individuals.

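Since we cannot calculate the mean squared error, we compare the percentage of explained variance in both the robust and original quadratic PLS:

$$R_d(X; t_h) = \frac{1}{p} \sum_{j=1}^{p} \operatorname{cor}^{2}(x_j, t_h), \qquad R_d(Y; t_h) = \frac{1}{q} \sum_{k=1}^{q} \operatorname{cor}^{2}(y_k, t_h)$$

where $t_h$ is the latent component of the $h$th PLS iteration, $q$ is the number of dependent variables, and $p$ is the number of predictive variables.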
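One common way to compute such a percentage is the mean squared correlation between each variable of a block and the latent component. The sketch below assumes that definition, which may differ in detail from the measure used in the paper; the function name `explained_variance` and the toy data are hypothetical.

```python
import numpy as np

def explained_variance(M, t):
    """Mean squared correlation between the columns of M and the score t."""
    Mc = M - M.mean(axis=0)
    tc = t - t.mean()
    r = (Mc.T @ tc) / (np.linalg.norm(Mc, axis=0) * np.linalg.norm(tc))
    return np.mean(r**2)

# Toy check: columns that are exact linear functions of t are fully explained.
t = np.linspace(0.0, 1.0, 20)
M_exact = np.column_stack([2 * t + 1, -3 * t])
full = explained_variance(M_exact, t)

# Unrelated columns give a value between 0 and 1.
rng = np.random.default_rng(4)
partial = explained_variance(rng.normal(size=(20, 5)), t)
```

Summing this quantity over the retained components gives a cumulative percentage, provided the components are mutually uncorrelated.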

Table 1 compares the original and robust quadratic PLS and shows that the latter improves the explained variance in the dependent variables from 68% to 91%, a considerable gain. This suggests that the dataset contained outliers that affected the estimation in the case of the original quadratic PLS.