Advances in Operations Research
Volume 2011, Article ID 263762, 18 pages
http://dx.doi.org/10.1155/2011/263762
Research Article

Outlier-Resistant L1 Orthogonal Regression via the Reformulation-Linearization Technique

J. Paul Brooks and Edward L. Boone

Department of Statistical Sciences and Operations Research, Virginia Commonwealth University, P.O. Box 843083, 1015 Floyd Avenue, Richmond, VA 23284, USA

Received 9 September 2010; Revised 7 January 2011; Accepted 14 January 2011

Academic Editor: I. L. Averbakh

Copyright © 2011 J. Paul Brooks and Edward L. Boone. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Assessing the linear relationship between a set of continuous predictors and a continuous response is a well-studied problem in statistics and data mining. L2-based methods such as ordinary least squares and orthogonal regression can be used to determine this relationship. However, both of these methods become impaired when influential values are present. This problem becomes compounded when outliers confound standard diagnostics. This work proposes an L1-norm orthogonal regression method (L1OR) formulated as a nonconvex optimization problem. Solution strategies for finding globally optimal solutions are presented. Simulation studies are conducted to assess the resistance of the method to outliers and the consistency of the method. The method is also applied to real-world data arising from an environmental science application.

1. Introduction and Background

Data analysts are often posed with the problem of determining the relationship between several variables and a response variable. The standard technique when all variables are defined on a continuous domain is ordinary least squares regression (OLS). When outliers, or unusual observations, are present in data, traditional regression techniques become impaired. Methods such as M-regression (M-R) use M estimates to reduce the impact of outliers. These methods are not designed for developing errors-in-variables models in which both the predictors and the response have measurement error or are considered random components. An example of such a situation is studying the relationship between pH and alkalinity in freshwater habitats, where both measurements are subject to error.

Orthogonal regression (L2OR) is used when uncertainty is known to be present in both the independent and dependent variables. This assumption contrasts with OLS, where the predictors are assumed to be known without measurement error. Furthermore, orthogonal regression measures distances orthogonal to the fitted hyperplane, whereas in OLS residuals are measured as the vertical distances of observations to the fitted surface.

1.1. Previous Work on Robust Orthogonal Regression

The sensitivity of L2OR to outliers has been noted, and other investigators have worked to develop robust methods [1–3]. The work of Zamar [3] includes the use of S and M estimates for orthogonal regression. Späth and Watson [4] introduce a method for incorporating the L1 norm for measuring distances in orthogonal regression.

L2OR can be formulated as equivalent to finding the last principal component, or the direction of minimum variation, in principal component analysis (PCA). Hence, any robust PCA method can be used for robust orthogonal regression. Two main approaches for robust PCA are (1) to find robust estimates of the covariance matrix (in traditional PCA, the principal components are eigenvectors of the covariance matrix) and (2) to use a robust measure of dispersion. Research in the former area includes [5–11]. Robust estimates of dispersion in PCA have been investigated in [12–16]; each of these works is based on a projection pursuit approach.

Our approach is closely related to those developed by Späth and Watson [4] and Kwak [16] in the manner in which we incorporate the L1 norm into an orthogonal regression procedure. Späth and Watson [4] measure the error of an observation as the L1 distance to its orthogonal projection onto a fitted hyperplane. Kwak [16] finds successive directions of maximum variation by maximizing the L1 distance to the L2 projection of points onto a line. In contrast to these methods, our approach is to directly find the direction of least variation by maximizing the L1 distance between points and their L2 projections onto a vector (see Figure 1). Also, the methods presented in [4, 16] guarantee only local minima of the respective optimization problems, while we present a method for deriving globally optimal solutions.

Figure 1: Two-dimensional illustration of different methods for incorporating the L1 norm into orthogonal regression. In traditional orthogonal regression, the sum of distances of points (x, y) to b^T(x, y) b is maximized, the sum of distances of points (x, y) to a^T(x, y) a is minimized, and the sum of the magnitudes of a^T(x, y) a is maximized. As noted in the text, each of these distance measures can be modified to incorporate the L1 norm to derive different results. In this paper, the approach is to maximize the sum of L1 distances of points (x, y) to b^T(x, y) b, which is illustrated by d1 + d2.

These three methods can be viewed as approximating a maximum likelihood estimator (MLE) for the linear errors-in-variables model with independent errors following a Laplace distribution (see [23, 24]). The MLE for such a model corresponds to a hyperplane that minimizes the sum of L1 projections. Zwanzig [25] considers an L1 estimator for a nonlinear generalization of the errors-in-variables model and shows that, under certain assumptions on the error distribution, the estimator is consistent. When applied to the setting of L1 orthogonal linear regression, the estimator is similar to the approach of Späth and Watson [4].

1.2. Traditional Orthogonal Regression

Suppose we are given observations with continuous predictors and responses (x_i, y_i) ∈ ℝ^d × ℝ, i = 1, …, n. L2OR seeks an orthogonal projection of the data onto a hyperplane such that the sum of the orthogonal distances of the points (x_i, y_i) to the hyperplane is minimized. We assume throughout this work that the medians have been subtracted from samples and that the fitted hyperplane passes through the origin. We note that for large values of d, the coordinate-wise median may not be a good estimate of the center of a data cloud (see [26]).

In L2OR, the sum of squared orthogonal distances of (x_i, y_i) to the hyperplane defined by b^T(x, y) = 0 is minimized. The vector b is normal to the best-fitting hyperplane and is the direction of least variation of the data. Because b is the direction of least variation, the sum of squared distances of observations to their projections along b is maximized. Therefore, we can find b by solving the following optimization problem:

\[
[L_2\text{OR}] \qquad \max_{\mathbf{b}} \; \sum_{i=1}^{n} \left\| (\mathbf{x}_i, y_i) - \mathbf{b}^T (\mathbf{x}_i, y_i)\, \mathbf{b} \right\|_2^2, \tag{1.1}
\]

subject to

\[
\mathbf{b}^T \mathbf{b} = 1. \tag{1.2}
\]

The decision variables are the entries of the vector b ∈ ℝ^{d+1}. The term b^T(x_i, y_i) b represents the orthogonal projection of observation i along b in terms of the original coordinates of the data.
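For the bivariate case, the direction of least variation can be computed in closed form as the smallest eigenvector of the 2 × 2 scatter matrix. The following Python sketch is illustrative only (the paper's own experiments use R and C, and the function name is ours):

```python
import math

def l2or_direction(points):
    """Return the unit normal b of the best-fit L2 orthogonal regression
    line through the origin for bivariate data: the eigenvector of the
    2x2 scatter matrix belonging to the smaller eigenvalue."""
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    syy = sum(y * y for _, y in points)
    # Smaller eigenvalue of [[sxx, sxy], [sxy, syy]] via trace/determinant.
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    lam = tr / 2 - math.sqrt(max(tr * tr / 4 - det, 0.0))
    # An eigenvector for lam satisfies (sxx - lam)*b1 + sxy*b2 = 0.
    if abs(sxy) > 1e-12:
        b = (sxy, lam - sxx)
    else:
        b = (1.0, 0.0) if sxx <= syy else (0.0, 1.0)
    norm = math.hypot(b[0], b[1])
    return (b[0] / norm, b[1] / norm)
```

For points lying exactly on the line y = x, the recovered normal is proportional to (1, −1), as expected.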

In this paper, we present a new outlier-resistant method for orthogonal regression called L1OR. The direction of least variation in the data is found by maximizing the L1 distances of observations to their projection points along a vector. The fitted hyperplane is orthogonal to the direction of least variation. The problem is formulated as a nonconvex optimization problem. We describe how globally optimal solutions can be derived based on a reformulation-linearization technique (RLT) developed by Sherali and Tuncbilek [27]. We present results of applying L1OR to simulated data contaminated with outliers and compare the results to robust methods for orthogonal regression. The consistency of L1OR is assessed using simulated data. Finally, L1OR is applied to data collected for the evaluation of freshwater habitats, where uncertainty resides in both the dependent and independent variables.

2. Finding the Best-Fit Hyperplane

Suppose that instead of maximizing the sum of the squared perpendicular distances of observations to their projections along the direction of least variation, we maximize the sum of the L1 distances. Using the L1 metric reduces the impact of outlying observations.

In Figure 1, we illustrate different methods of incorporating the L1 norm into an orthogonal regression procedure for a two-dimensional example. The fitted hyperplane is defined by its normal vector b, representing an approximation of the direction of least variation in the data. The vector a spans the space defined by the fitted hyperplane. Our approach is to maximize the sum of L1 distances of points to their projections on b. The L1 distance of (x, y) to its L2 projection on b is given by d1 + d2 in the figure. The procedure proposed by Späth and Watson [4] minimizes the sum of L1 distances of points to their L2 projections in a fitted hyperplane. The distance of (x, y) to its projection in the fitted subspace is indicated by d3 + d4. The procedure introduced by Kwak [16] maximizes the sum of L1 magnitudes of the projections of points onto the fitted hyperplane. In Figure 1, this magnitude is given by d5 + d6. When these three distances are measured using the L2 norm, the same regression plane is optimal [28]; however, because the distances in each case are measured using the L1 norm, the resulting regression planes will not always coincide. The L1 projection of (x, y) onto the fitted hyperplane is given by (x1, y1); an MLE approach would minimize the sum of L1 distances of points to their L1 projections.

Maximizing the sum of the L1 distances of points to a line passing through the origin is written as

\[
\max_{\mathbf{b}} \sum_{i=1}^{n} \left\| (\mathbf{x}_i, y_i) - \mathbf{b}^T (\mathbf{x}_i, y_i)\, \mathbf{b} \right\|_1
= \max_{\mathbf{b}} \left\{ \sum_{i=1}^{n} \sum_{j=1}^{d} \left| x_{ij} - b_j \left( \sum_{k=1}^{d} x_{ik} b_k + y_i b_{d+1} \right) \right|
+ \sum_{i=1}^{n} \left| y_i - b_{d+1} \left( \sum_{k=1}^{d} x_{ik} b_k + y_i b_{d+1} \right) \right| \right\}. \tag{2.1}
\]
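To make the objective in (2.1) concrete, a minimal Python sketch (illustrative only; the function name is ours) that evaluates the sum of L1 distances for a given unit vector b:

```python
def l1or_objective(points, b):
    """Sum over observations of the L1 distance between a point and its
    L2 projection along the unit vector b (the quantity maximized in (2.1))."""
    total = 0.0
    for p in points:
        t = sum(bj * pj for bj, pj in zip(b, p))        # b^T (x_i, y_i)
        total += sum(abs(pj - t * bj) for pj, bj in zip(p, b))
    return total
```

For points on the line y = x, the objective is largest when b is the normal (1, −1)/√2 and zero when b points along the line, matching the interpretation of b as the direction of least variation.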

The objective function is nonlinear and nonconvex. As with [L2OR], the optimal hyperplane is defined by b^T(x, y) = 0. Let r_ij be the L1 residual for component j of observation i. Also, let a = b + 1, where 1 is a vector of 1's, so that all a_j variables are nonnegative. This substitution is necessary for our solution method, which is explained below. The math program can then be formulated as

\[
[L_1\text{OR}] \qquad \max \; \sum_{i=1}^{n} \sum_{j=1}^{d+1} r_{ij}, \tag{2.2}
\]

subject to

\[
r_{ij} = \left[ (\mathbf{x}_i, y_i) - (\mathbf{a} - \mathbf{1})^T (\mathbf{x}_i, y_i)\, (\mathbf{a} - \mathbf{1}) \right]_j \quad \text{if } z_{ij} = 0, \;\; \forall i, j,
\]
\[
r_{ij} = \left[ -(\mathbf{x}_i, y_i) + (\mathbf{a} - \mathbf{1})^T (\mathbf{x}_i, y_i)\, (\mathbf{a} - \mathbf{1}) \right]_j \quad \text{if } z_{ij} = 1, \;\; \forall i, j,
\]
\[
(\mathbf{a} - \mathbf{1})^T (\mathbf{a} - \mathbf{1}) = 1, \qquad \mathbf{0} \le \mathbf{a} \le \mathbf{2}, \tag{2.3}
\]
\[
z_{ij} \in \{0, 1\}, \quad i = 1, \ldots, n; \; j = 1, \ldots, d+1. \tag{2.4}
\]

The quantities 0, 1, and 2 are vectors with each coordinate having the value 0, 1, and 2, respectively. The objective function is now linear, and the first three sets of constraints are defined by nonconvex functions.

To derive globally optimal solutions for [L1OR], we combine the use of branch-and-bound for integer programming with branch-and-bound for the reformulation-linearization technique (RLT) as described in [27]. A subproblem will refer to a linear mixed-integer program (MIP) that corresponds to a node in a branch-and-bound tree for the RLT. Each subproblem can be converted to a linear MIP by expressing the conditional constraints as

\[
r_{ij} \le \left[ (\mathbf{x}_i, y_i) - (\mathbf{a} - \mathbf{1})^T (\mathbf{x}_i, y_i)\, (\mathbf{a} - \mathbf{1}) \right]_j + M z_{ij}, \quad \forall i, j,
\]
\[
r_{ij} \le \left[ -(\mathbf{x}_i, y_i) + (\mathbf{a} - \mathbf{1})^T (\mathbf{x}_i, y_i)\, (\mathbf{a} - \mathbf{1}) \right]_j + M \left( 1 - z_{ij} \right), \quad \forall i, j, \tag{2.5}
\]

for a sufficiently large constant M.

The following is a summary of RLT applied to [L1OR].
(i) Subproblem optimization. Select a subproblem to solve. Each subproblem is a linear MIP that relaxes the nonconvex constraints. If all subproblems are solved, then the incumbent solution is optimal.
(ii) Check for new bound. If the solution satisfies the original nonconvex constraints, the current solution is feasible. Update the incumbent solution and objective value if appropriate.
(iii) Fathom. Fathom if (1) the solution satisfies the original constraints, (2) the subproblem is infeasible, or (3) the objective value for the subproblem is less than the incumbent objective value.
(iv) Branch. Select a variable for branching, creating two subproblems.
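The subproblem/bound/fathom/branch cycle above can be sketched on a toy problem. The Python fragment below is not the paper's formulation: it maximizes the single bilinear term a1·a2 on the segment a1 + 2·a2 = 2, bounds it from above with the same kind of bound-factor (McCormick) estimates that RLT uses, and branches by splitting the bounds on a2. All names and the toy problem itself are ours.

```python
def toy_rlt_bb(tol=1e-6):
    """Toy RLT-style branch and bound: maximize a1*a2 subject to
    a1 + 2*a2 = 2 with 0 <= a1 <= 2 (hence 0 <= a2 <= 1)."""
    def f(a2):                       # true (nonconvex) objective on the segment
        return (2 - 2 * a2) * a2

    def upper_bound(l2, u2):
        l1, u1 = 2 - 2 * u2, 2 - 2 * l2            # induced bounds on a1
        def ub(a2):
            a1 = 2 - 2 * a2
            return min(u2 * a1 + l1 * a2 - l1 * u2,   # from (a1-l1)(u2-a2) >= 0
                       u1 * a2 + l2 * a1 - l2 * u1)   # from (u1-a1)(a2-l2) >= 0
        # ub is the min of two linear functions of a2 (concave, piecewise
        # linear): its max is at an endpoint or where the two pieces cross.
        c1, d1 = l1 - 2 * u2, 2 * u2 - l1 * u2
        c2, d2 = u1 - 2 * l2, 2 * l2 - l2 * u1
        cands = [l2, u2]
        if abs(c1 - c2) > 1e-12:
            x = (d2 - d1) / (c1 - c2)
            if l2 < x < u2:
                cands.append(x)
        best = max(cands, key=ub)
        return ub(best), best

    incumbent, arg = f(0.0), 0.0
    stack = [(0.0, 1.0)]
    while stack:
        l2, u2 = stack.pop()                       # subproblem optimization
        bound, where = upper_bound(l2, u2)
        if f(where) > incumbent:                   # check for new bound
            incumbent, arg = f(where), where
        if bound <= incumbent + tol:               # fathom
            continue
        mid = 0.5 * (l2 + u2)                      # branch
        stack += [(l2, mid), (mid, u2)]
    return incumbent, arg
```

The true maximum is a1·a2 = 0.5 at a2 = 0.5; the relaxation gap shrinks as the bound intervals shrink, so every branch is eventually fathomed.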

A flow-chart detailing the steps in the RLT branch-and-bound process is included in Figure 2.

Figure 2: Flowchart of the steps involved in the RLT branch-and-bound procedure applied to a nonlinear mixed-integer program.

We now describe the construction of the root subproblem for RLT. For each occurrence of a_j a_k in the constraints, substitute a new variable A_jk into the formulation. Also, add constraints of the form

\[
(2 - a_j)(2 - a_k) \ge 0, \quad j, k = 1, \ldots, d+1,
\]
\[
(a_k - 0)(2 - a_j) \ge 0, \quad j, k = 1, \ldots, d+1,
\]
\[
(a_k - 0)(a_j - 0) \ge 0, \quad j, k = 1, \ldots, d+1, \tag{2.6}
\]

but again replace occurrences of a_j a_k with A_jk. The presence of 0 in the constraints reflects the lower bounds on the a_j variables; these lower bounds will be changed during the optimization algorithm as described below. The result is a linear MIP that is a relaxation of [L1OR] [27].
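The bound-factor products in (2.6) generalize to arbitrary bounds l_j ≤ a_j ≤ u_j, which is what branching later exploits. The following small Python generator (our own naming and constraint encoding, for illustration only) expands the products and substitutes A_jk:

```python
def rlt_bound_factor_constraints(lower, upper):
    """Linearized RLT bound-factor constraints for variables a_0..a_{p-1}
    with lower[j] <= a_j <= upper[j].  Each product a_j*a_k is replaced by
    the variable A_jk.  A constraint (const, lin, quad) encodes
    const + sum_j lin[j]*a_j + sum_{(j,k)} quad[(j,k)]*A_jk >= 0."""
    def lin(*terms):
        d = {}
        for idx, coef in terms:
            d[idx] = d.get(idx, 0.0) + coef        # handles j == k correctly
        return d

    p = len(lower)
    cons = []
    for j in range(p):
        for k in range(j, p):
            lj, uj, lk, uk = lower[j], upper[j], lower[k], upper[k]
            key = (j, k)
            # (u_j - a_j)(u_k - a_k) >= 0 -> u_j*u_k - u_k*a_j - u_j*a_k + A_jk >= 0
            cons.append((uj * uk, lin((j, -uk), (k, -uj)), {key: 1.0}))
            # (a_j - l_j)(u_k - a_k) >= 0 -> -l_j*u_k + u_k*a_j + l_j*a_k - A_jk >= 0
            cons.append((-lj * uk, lin((j, uk), (k, lj)), {key: -1.0}))
            # (a_k - l_k)(u_j - a_j) >= 0 -> -l_k*u_j + u_j*a_k + l_k*a_j - A_jk >= 0
            cons.append((-lk * uj, lin((k, uj), (j, lk)), {key: -1.0}))
            # (a_j - l_j)(a_k - l_k) >= 0 -> l_j*l_k - l_k*a_j - l_j*a_k + A_jk >= 0
            cons.append((lj * lk, lin((j, -lk), (k, -lj)), {key: 1.0}))
    return cons
```

Substituting A_jk = a_j a_k back in recovers the original valid inequalities, so the linearized system is a relaxation of the nonconvex feasible set.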

We now describe the branching procedure. The optimal solution to the relaxation is feasible for [L1OR] if A_jk = a_j a_k for all j, k. If this condition is not satisfied, then choose a variable a_j with A_jk ≠ a_j a_k for some k, with current value \(\bar{a}_j\), and create two new subproblems. One of the new subproblems will have constraints of the form

\[
(\bar{a}_j - a_j)(2 - a_k) \ge 0, \quad k = 1, \ldots, d+1,
\]
\[
(a_k - 0)(\bar{a}_j - a_j) \ge 0, \quad k = 1, \ldots, d+1,
\]
\[
(a_k - 0)(a_j - 0) \ge 0, \quad k = 1, \ldots, d+1,
\]
\[
a_j \le \bar{a}_j. \tag{2.7}
\]

Again, replace all occurrences of a_j a_k with A_jk to create linear constraints. The other new subproblem will have the linearized form of the constraints

\[
(2 - a_j)(2 - a_k) \ge 0, \quad k = 1, \ldots, d+1,
\]
\[
(a_j - \bar{a}_j)(2 - a_k) \ge 0, \quad k = 1, \ldots, d+1,
\]
\[
(a_k - 0)(a_j - \bar{a}_j) \ge 0, \quad k = 1, \ldots, d+1,
\]
\[
a_j \ge \bar{a}_j. \tag{2.8}
\]

As nodes in the branch-and-bound tree are traversed, the bounds on the a_j variables are successively tightened. Sherali and Tuncbilek [27] prove that either the search for optimal solutions terminates with a globally optimal solution in finitely many steps, or else any accumulation point of solutions along an infinite branch of the branch-and-bound tree is a globally optimal solution.

3. Simulation Studies

In this section, the ability of L1OR to resist the effects of two types of outliers is assessed using simulation studies. The approach is compared to L2OR and several robust procedures. The consistency of L1OR is also assessed using a simulation study.

[L1OR] MIP subproblems are solved using CPLEX 12.1. If provable optimality is not achieved for a MIP subproblem after 2 minutes, the best-known integer feasible solution is used. We implemented our own branch-and-bound algorithm for applying RLT in a C program, with a time limit of 7200 CPU seconds for each instance. Problems are solved on machines with two 2.6 GHz Opteron processors and 2 GB RAM.

L1OR is compared to a robust approach based on projection pursuit [12], a τ scale-based orthogonalized Gnanadesikan-Kettenring estimate [29] (hereafter τ-OGK), and a method based on PCA-L1 [16]. The projection pursuit approach is applied by using the method for principal component analysis described in [15]. The method is modified for orthogonal regression by taking the last robust principal component as the coefficients of the orthogonal regression hyperplane. We denote this method by ppOR-mad or ppOR-qn, with the suffix indicating the scale function used. The other methods are denoted by τ-OGK and PCA-L1. For PCA-L1, the initial vector is set to w_0 = argmax_{x_i} ‖x_i‖_2 (see [16]).

L2OR and ppOR models are derived using the prcomp() and PCAgrid() functions, respectively, called in the R environment for statistical computing [30]. The function PCAgrid() is in the pcaPP library [31]. R code for the τ-OGK estimator was provided by an anonymous referee. We implemented the PCA-L1 method [16] in a C program.

3.1. Vertical Outliers

A simulation study is conducted to assess the ability of L1OR to detect linear relationships in bivariate data in the presence of vertical outliers. Vertical outliers have significant variation only in their response-variable values. The simulation design varies the number of contaminated observations (C) and the contamination magnitude (m). Each method is run on 30 datasets with 100 observations under each treatment condition. For this study, C is varied as follows: no contamination (C = 0), moderate contamination (C = 10), and high contamination (C = 25). The magnitude of contamination m is varied as follows: low magnitude (m = 1), moderate magnitude (m = 10), and large magnitude (m = 50).

The data are sampled in the following manner.
(i) Generate the uncontaminated data: x_i ~ U[−1, 1] and y_i = x_i + ε_i, where ε_i ~ N(0, 0.1), for i = 1, …, 100 − C.
(ii) Generate the contaminated data: x_i ~ U[0.5, 1] and y_i ~ |N(0, m × 0.1)|, for i = 101 − C, …, 100.
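This sampling scheme is straightforward to reproduce; one ambiguity is whether N(0, 0.1) denotes a variance or a standard deviation. The Python sketch below (function name and seeding are ours) reads the second parameter as a standard deviation:

```python
import random

def simulate_vertical_outliers(C, m, n=100, seed=0):
    """Sample one dataset from the vertical-outlier design: n - C clean
    points on y = x with Gaussian noise, then C vertical outliers.
    N(0, s) is read here as a normal with standard deviation s."""
    rng = random.Random(seed)
    data = [(x, x + rng.gauss(0.0, 0.1))
            for x in (rng.uniform(-1.0, 1.0) for _ in range(n - C))]
    data += [(rng.uniform(0.5, 1.0), abs(rng.gauss(0.0, m * 0.1)))
             for _ in range(C)]
    return data
```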

An example dataset with fitted models, generated using m = 10 and C = 25, is given in Figure 3(a).

Figure 3: Examples of datasets used in the simulation experiments, with fitted models: (a) a dataset with vertical outliers generated using parameters m = 10 and C = 25, (b) a dataset with clustered leverage outliers with n = 100 generated using ε = 0.25, and (c) a dataset with errors in both variables sampled from a Laplace distribution with n = 200.

To evaluate each method's ability to accurately fit the known underlying model, the following model discrepancy D is used:

\[
D\left(f, \hat{f}\right) = \int_{-1}^{1} \left| f(x) - \hat{f}(x) \right| \, dx, \tag{3.1}
\]

where f is the known model and \(\hat{f}\) is the estimated model. Note that D corresponds to the area between f and \(\hat{f}\). If the estimated model is close to the true model, then D will be small. For each simulation, D is computed and recorded. From these results, the average model discrepancy \(\bar{D}\) and its standard error are computed.
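Since both the true and fitted models here are lines, D can be approximated with any quadrature rule. A midpoint-rule sketch in Python (illustrative; the original computation may have been done differently):

```python
def model_discrepancy(f, f_hat, a=-1.0, b=1.0, steps=10000):
    """Midpoint-rule approximation of D = integral over [a, b] of
    |f(x) - f_hat(x)| dx, the area between the two models."""
    h = (b - a) / steps
    mids = (a + (i + 0.5) * h for i in range(steps))
    return h * sum(abs(f(x) - f_hat(x)) for x in mids)
```

For f(x) = x and an estimate with slope 0.5, the area over [−1, 1] is 0.5, which the rule recovers to high accuracy.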

To analyze the simulation, the means and standard deviations of D are computed for each setting of m and C; they can be found in Table 1. For all configurations with m ≤ 10, L1OR has lower means and standard deviations than all other methods tested, indicating superior performance in resisting outlier contamination under such conditions. For m = 50, L1OR performs worse than the robust methods, with the exception of PCA-L1, but better than the outlier-sensitive L2OR. In the case of extreme contamination (C = 25, m = 50), L2OR and PCA-L1 are extremely sensitive to outliers, as indicated by large values of D. The best-performing method for this configuration is ppOR-qn. L1OR has a mean discrepancy that is only 0.34 more than that of ppOR-qn but is at least 1.28 less than those of the outlier-sensitive methods. Overall, this suggests that L1OR performs well when no contamination is present and in the presence of larger levels of contamination, but performance degrades relative to some of the robust methods when the contamination magnitude is very large.

Table 1: Mean (standard deviation) of D for L1OR, L2OR, ppOR-mad, ppOR-qn, τ-OGK, and PCA-L1 with contamination magnitudes m = 1, 10, and 50 and contamination levels C = 0, 10, and 25.
3.2. Clustered Leverage Outliers

The ability of L1OR to detect linear relationships in bivariate data with outliers is further analyzed with a simulation using datasets with clustered leverage outliers. Clustered leverage outliers have very similar values but lie far from the rest of the dataset. The simulation design varies the number of observations (n) and the contamination level (ε). For each treatment condition and replication, a dataset is generated without contamination, and a companion dataset is generated by replacing the first ⌈εn⌉ observations with contaminated data. There are 50 replications for each treatment condition. For this experiment, ε is varied as follows: low contamination (ε = 0.05), moderate contamination (ε = 0.10), and high contamination (ε = 0.25).

The data are sampled as follows.
(i) Generate the uncontaminated data: (x_i, y_i) ~ N(0, Σ), for i = 1, …, n.
(ii) Generate the contaminated data: (x_i, y_i) ~ N(m, 10^{-2} I), for i = 1, …, ⌈εn⌉.

The covariance matrix Σ is varied across replications. First, a 2 × 2 matrix A is generated such that each entry is sampled from a N(0, 1) distribution. The QR decomposition A = QR is calculated. Let B = Q diag(sgn(⟨R⟩)), where ⟨·⟩ denotes the vector of diagonal elements and sgn(·) gives the vector of signs of the elements of its argument. Then Σ is sampled from a Wishart(B, 5) distribution. The means m for the contaminated data are generated such that (1) the Mahalanobis distance of m from the distribution N(0, Σ) is at least \(2\chi^2_{0.99,2}\), (2) min{x_i : i = 1, …, n} ≤ m_1 ≤ max{x_i : i = 1, …, n}, and (3) min{y_i : i = 1, …, n} ≤ m_2 ≤ max{y_i : i = 1, …, n}.

An example dataset with 100 observations and fitted models, generated using ε = 0.25, is given in Figure 3(b).

Each method is evaluated based on the similarity of the models fit on the companion uncontaminated and contaminated datasets. The similarity measure S is defined as the absolute value of the inner product

\[
S\left(\mathbf{b}_1, \mathbf{b}_2\right) = \left| \mathbf{b}_1 \cdot \mathbf{b}_2 \right|, \tag{3.2}
\]

where b_1 and b_2 are the vectors of coefficients derived for the uncontaminated and contaminated datasets, respectively. The values of S lie between 0 and 1, with larger values indicating that the models are in agreement and that the outliers do not affect the estimate.
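A direct transcription of (3.2) in Python follows. The fitted normals are unit vectors by the constraint b^T b = 1, so the explicit normalization below is redundant for them; it is included defensively so that S always falls in [0, 1] for arbitrary coefficient vectors (a choice that is ours, not stated in the text):

```python
import math

def model_similarity(b1, b2):
    """S(b1, b2) = |b1 . b2| for two coefficient vectors, normalized to
    unit length; 1 means the two fitted hyperplanes coincide."""
    dot = sum(u * v for u, v in zip(b1, b2))
    n1 = math.sqrt(sum(u * u for u in b1))
    n2 = math.sqrt(sum(v * v for v in b2))
    return abs(dot) / (n1 * n2)
```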

The means across replications and the percentage of instances with S ≥ 0.90 for each value of n and ε are contained in Table 2. For ε = 0.05 and 0.10, the performance of L1OR is nearly constant as n is increased. There is a slight degradation in performance for larger values of n, which is likely due to the increased computational complexity of those instances (see Section 5). For ε = 0.05, all methods have high mean values of S and high percentages of instances with S ≥ 0.9, including the outlier-sensitive L2OR. For ε = 0.10, L1OR and all of the robust methods have larger mean values of S than L2OR. The ppOR-qn estimator has the most consistent performance across different values of n for ε = 0.10, with mean values of S above 0.94 for each. The L1OR estimator has mean values above 0.93 for n ≤ 100, but performance degrades for n = 200. The τ-OGK estimator has the highest or second-highest mean value of S for n ≤ 100. For ε = 0.25, the performance of L1OR lags behind the robust methods. For n ≤ 50, the performance is similar to that of L2OR. For n ≥ 100, the mean values of S are less than those for L2OR. For ε = 0.25, the preferred estimator appears to be ppOR-mad, as it has the highest or second-highest value of S for each n.

Table 2: Mean of S / percentage of instances with S ≥ 0.9 for L1OR, L2OR, ppOR-mad, ppOR-qn, τ-OGK, and PCA-L1 with sample sizes n = 25, 50, 100, and 200 and contamination levels ε = 0.05, 0.10, and 0.25.
3.3. Consistency

The consistency of L1OR is assessed by performing tests on instances with various sample sizes. Bivariate data (x_i, y_i), i = 1, …, n, are generated such that x_i = ν_i + ε_i, where ν_i ~ U[−1, 1] and ε_i ~ Laplace(0, 0.5), and y_i = x_i + ξ_i, where ξ_i ~ Laplace(0, 0.5). The sample sizes tested are n = 10, 25, 50, 100, and 200; 100 datasets are generated for each value of n. The rlaplace() function in the R package rmutil [32] is used to sample from the Laplace distribution. An example dataset with 200 observations and the fitted L1OR model is given in Figure 3(c).
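If rmutil is unavailable, Laplace draws are easy to generate by inverting the CDF. The sketch below assumes the second parameter is the scale b, giving density exp(−|x − μ|/b)/(2b); whether rmutil's rlaplace() uses this same parameterization is a detail worth checking against its documentation:

```python
import math
import random

def rlaplace(n, mu=0.0, b=0.5, seed=None):
    """Draw n samples from a Laplace(mu, b) distribution (scale
    parameterization) by inverting the CDF."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        u = rng.random() - 0.5                     # u in [-0.5, 0.5)
        out.append(mu - b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u)))
    return out
```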

Figure 4 depicts the standard error of the absolute value of the slope as a function of sample size. As the sample size increases, the standard error rapidly approaches zero, indicating that the procedure is consistent. For large sample sizes, L1OR should provide good estimates.

Figure 4: Plot of the standard error of the absolute value of the slope in a bivariate experiment as a function of sample size (n).

4. An Environmental Example

The pH and alkalinity of the water in which fish live are known to impact their overall health. Alkalinity is a measure of the ability of a solution to neutralize acids. Researchers expect pH and alkalinity to be highly correlated. However, the relationship between the two variables is difficult to estimate in many datasets due to low variation in pH across streams and the presence of outliers. The dataset for this example is a subset of values collected across the state of Ohio, resulting in 312 observations. Various subsets of this dataset have been considered previously by Norton [17], Lipkovich et al. [18], Noble et al. [19], and Boone et al. [20], with varying degrees of success in estimating the relationship between pH and alkalinity. For the purposes of this work, both pH and alkalinity have been normalized. Note that in these data both pH and alkalinity have measurement error, and hence an orthogonal regression method should be used. The same computational settings as in the simulation studies are used for this analysis.

Figure 5 shows the scatter plot of pH versus alkalinity. There appears to be a linear relationship between alkalinity and pH. Also notice the vertical and leverage outliers present in the data. The Pearson correlation coefficient for the relationship between pH and alkalinity is r = 0.3366, which is biased downward due to the outliers in the data. Furthermore, since the correlation is biased downward, extracting a pH-alkalinity component and using it as a predictor would not be prudent. Hence, a regression method is needed that is insensitive to outliers and influential points. With the exception of ppOR-mad, the robust methods all demonstrate resistance to the outliers in the measurements of pH when compared to the outlier-sensitive L2OR. The method based on PCA-L1 produces the model that appears least affected by outliers, followed by the τ-OGK estimator and then L1OR.

Figure 5: Scatter plot of pH versus alkalinity with fitted models from L1OR, L2OR, ppOR-mad, ppOR-qn, τ-OGK, and PCA-L1.

Table 3 shows a summary of the regression models for each method. Here the standard errors are bootstrap standard errors based on 100 bootstrap samples. The bootstrap standard errors vary widely across the methods, with L1OR having the most stable estimates, as evidenced by smaller standard errors, followed by τ-OGK and PCA-L1. With the exception of ppOR-mad, the P values indicate that the relationship between pH and alkalinity is statistically significant; for L1OR, τ-OGK, and PCA-L1, the P value is less than .00001. The L1OR, τ-OGK, and PCA-L1 estimators appear to be the best choices for these data, with PCA-L1 producing the best estimate and L1OR providing the most stable estimate.
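The bootstrap standard errors reported here follow the usual resample-and-refit recipe. A generic Python sketch (the estimator argument is a placeholder standing in for any of the orthogonal regression fits; names are ours):

```python
import random

def bootstrap_se(data, estimator, B=100, seed=0):
    """Bootstrap standard error: refit the estimator on B resamples drawn
    with replacement and return the standard deviation of the B estimates."""
    rng = random.Random(seed)
    n = len(data)
    stats = []
    for _ in range(B):
        sample = [data[rng.randrange(n)] for _ in range(n)]
        stats.append(estimator(sample))
    mean = sum(stats) / B
    return (sum((s - mean) ** 2 for s in stats) / (B - 1)) ** 0.5
```

An estimator that is invariant to resampling (e.g., fit to data with no variation) yields a bootstrap standard error of zero, while any estimator that responds to the data yields a positive one.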

Table 3: Summary of regression models for alkalinity on pH using L1OR, L2OR, ppOR-mad, ppOR-qn, τ-OGK, and PCA-L1.

An expansion of this problem is to consider how alkalinity, pH, and the qualitative habitat evaluation index (QHEI), a habitat measure, impact the index of biotic integrity (IBI). QHEI measures the quality of the habitat in which fish reside [21]. QHEI is determined from the following six measures: stream substrate, in-stream cover, channel morphology, riparian and bank condition, pool and riffle quality, and gradient. Higher values correspond to better habitat quality, and lower values correspond to poorer habitat quality. IBI measures the health of the fish community. Low values of IBI correspond to only tolerant species being present, low community organization, and a high proportion of fish with physical anomalies. High values correspond to highly organized fish communities, many intolerant species present, and high diversity among species [22]. The data consist of 312 observations from the same sites.

Orthogonal regression models are fit to the data with IBI as the response and QHEI, pH, and alkalinity as predictors. The method presented by Croux and Haesbroeck [10] (hereafter CH) is used instead of τ-OGK because of the increased number of variables. The CH method is a robust PCA based on finding eigenvectors of a robust estimate of the covariance matrix.

Table 4 shows the coefficients, bootstrap standard errors, t values, and P values for the regressions using each method. Notice that the L1OR estimates of the coefficients for pH and alkalinity are the most stable, as indicated by lower standard errors. For each method, the standard errors for the QHEI coefficient estimates are the smallest. The CH estimate of the QHEI coefficient has the lowest standard error and is the largest positive estimate, which seems to agree best with biological expectations: the better the habitat the fish have to live in, the better the health of the fish community. For all methods except L2OR and PCA-L1, the coefficients indicate a positive correlation between IBI and pH and a negative correlation between IBI and alkalinity. While none of the variables in any of the regressions is statistically significant, this dataset provides an example of how the regression coefficients from orthogonal regression in the presence of outliers may be suspect.

Table 4: Summary of regression models for IBI on QHEI, pH, and alkalinity (ALK) using L1OR, L2OR, ppOR-mad, ppOR-qn, CH, and PCA-L1.

5. Computation Time

The solution method proposed for L1OR is more computationally intensive than the other methods used for comparison in this paper; the alternative methods solve every instance used here within a few seconds. In this section, we evaluate the computational performance of our implementation of L1OR.

Tables 5–8 contain data on the computational performance of L1OR in each of the experiments conducted. In each table, the first column(s) indicate the configuration of the data: for Table 5, the contamination level C and contamination magnitude m; for Table 6, the sample size n, the contamination level ε, and whether the data have outliers; for Table 7, the sample size n; and for Table 8, the number of variables d. The second column, % Optimal, indicates the percentage of instances solved to optimality, meaning that all MIP subproblems were solved to optimality and the RLT branch-and-bound tree was fully explored. The third column, Avg. MIPs Solved, contains the average number of MIPs solved for each configuration. The fourth column, Avg. MIPs Suboptimal, contains the average number of MIPs that were not solved to optimality within the 120 CPU second time limit. The fifth column, Avg. Time-to-Term. (s), contains the average of the lesser of the CPU seconds before the RLT branch-and-bound tree is fully explored and 7200 seconds. The last column, Avg. Time to Best Soln. (s), contains the average time to find the best feasible solution.

Table 5: Computational performance of the L1OR implementation for the simulation with vertical outliers.
Table 6: Computational performance of the L1OR implementation for the simulation with clustered leverage outliers.
Table 7: Computational performance of the L1OR implementation for the consistency experiment.
Table 8: Computational performance of the L1OR implementation for the bootstrap simulations.

With the exception of the bootstrap samples for the environmental data with d = 4 (Table 8), the RLT branch-and-bound tree is fully explored in every instance. However, for many of these instances, at least one of the MIP subproblems is not solved to optimality. The solution taken in these instances is therefore not "provably" optimal. All instances with n ≤ 25 are solved to optimality. As n is increased to 50 and larger, fewer instances are solved to optimality. For n ≤ 100, the number of MIP subproblems that are not solved to optimality is, on average, less than 5% of the subproblems solved in those instances. For n = 200 in the simulation with clustered leverage outliers and in the consistency experiment, about 10% of the MIPs are not solved to optimality. For the bootstrap simulation with d = 4, more than half of the MIPs are not solved to optimality.

In the simulation with vertical outliers (Table 5), more instances are solved to optimality when the outlier contamination is larger. In contrast, in the simulation with clustered leverage outliers (Table 6), fewer contaminated instances are solved to optimality than their companion datasets without contamination. The number of instances solved to optimality also appears to decrease as the contamination level increases.

In the simulation with vertical outliers (Table 5), at least one MIP is not solved to optimality in most instances; nevertheless, except in the case of extreme outlier contamination, 𝐿1OR performed competitively with the robust methods. Similarly, in the consistency experiment (Table 7), the percentage of instances solved to optimality decreases dramatically as 𝑛 increases, yet the standard error for the slope estimates continues to decrease. For these instances, then, the time limit on the MIP subproblems does not appear to hamper the ability to find good solutions.

Only one experiment, the bootstrap simulation with 𝑑=4, used data with more than two variables. The degradation in computational performance in the shift from 𝑑=2 to 𝑑=4 in the bootstrap simulation is more dramatic than the degradation observed when 𝑛 is increased in the bivariate experiments. This phenomenon is likely due to the increase in the number of nonlinear constraints needed to produce the RLT relaxation.
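The quadratic growth can be made concrete with a back-of-the-envelope count. In a degree-2 RLT relaxation in the spirit of Sherali and Tuncbilek, each distinct pairwise product of the 𝑑 coefficient variables requires its own linearization variable along with accompanying bound-factor constraints; the sketch below counts only the product terms, which is enough to show the growth in 𝑑.

```python
def rlt_product_terms(d):
    """Number of distinct degree-2 products b_i * b_j (i <= j) of the d
    coefficient variables that a degree-2 RLT relaxation must replace
    with linearization variables; the count grows quadratically in d."""
    return d * (d + 1) // 2

# Moving from d = 2 to d = 4 more than triples the number of
# linearization variables (3 -> 10), and each product also brings
# its own bound-factor product constraints in the relaxation.
```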

6. Discussion

This work introduces a new 𝐿1 orthogonal regression technique that is designed to be resistant to outliers. We develop a method for deriving globally optimal solutions for problem instances. In simulation studies, the method shows promise as an outlier-resistant technique. An application to an environmental example further demonstrates that the method produces results that are more resistant to outliers than traditional orthogonal regression and competes with other robust methods. Hence, this method gives data analysts who deal with errors-in-variables data contaminated with outliers a resistant alternative to orthogonal regression.

The computational studies presented here indicate that different robust or outlier-resistant methods are suitable in different situations, and there is no clearly superior method. The pcaPP-mad method is among the best performers in the presence of vertical and clustered leverage outliers in simulated data but has perhaps the poorest estimate in the real-world example that contains both types of outliers. PCA-𝐿1 is among the poorest performers in the presence of vertical and clustered leverage outliers in simulated data but produces some of the best estimates in the real-world analysis. The inconsistency of the results for PCA-𝐿1 may be due to the method's dependence on having a good starting point for finding a good locally optimal solution. The 𝐿1OR method presented here performs best relative to the other methods in the presence of moderate contamination by vertical outliers but suffers in cases of extreme contamination.

Traditional orthogonal regression (𝐿2OR) can be formulated as a special case of PCA. The approach presented in this work for formulation and optimization can potentially be adapted to develop an outlier-resistant method for PCA. An outlier-resistant PCA algorithm would be useful for data analysts who work with contaminated data. Another possible extension is an outlier-resistant factor analysis procedure for analyzing categorical data.
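As a concrete illustration of the PCA connection, the bivariate 𝐿2OR fit can be recovered in closed form from the centered second moments: the fitted line passes through the centroid in the direction of the first principal component of the data. The sketch below is a minimal, pure-Python version for two variables; it is an illustration of the standard 𝐿2OR/PCA equivalence, not of the 𝐿1OR method proposed here.

```python
from math import atan2, cos, sin

def l2_orthogonal_line(x, y):
    """Fit a line minimizing total squared orthogonal distance (2-D L2OR),
    i.e., the first principal component of the centered data.
    Returns (centroid, unit direction vector)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # centered second moments (unnormalized covariance entries)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    # angle of the major axis of the covariance ellipse
    theta = 0.5 * atan2(2 * sxy, sxx - syy)
    return (mx, my), (cos(theta), sin(theta))
```

For exactly collinear data the recovered direction matches the generating slope; for noisy data the line minimizes orthogonal (not vertical) squared distances, which is what distinguishes 𝐿2OR from OLS.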

Acknowledgment

The authors would like to thank two anonymous referees for numerous suggestions for improving the content and presentation of this work.
