Abstract

We propose a robust quadratic regression model to handle statistical inaccuracy. Unlike traditional robust statistical approaches, which mainly focus on eliminating the effect of outliers, the proposed model employs the recently developed robust optimization methodology and minimizes the worst-case residual errors. First, we give an equivalent, tractable semidefinite programming formulation for the robust least squares model with a ball uncertainty set. The result is then generalized to robust models under the $\ell_1$- and $\ell_\infty$-norm criteria with general ellipsoid uncertainty sets. In addition, we establish a robust regression model for per capita GDP and energy consumption in the energy-growth problem under the conservation hypothesis. Finally, numerical experiments are carried out to verify the effectiveness of the proposed models and to demonstrate the effect of the uncertainty perturbation on the robust models.

1. Introduction

Traditional regression analysis is a useful tool for modeling linear or nonlinear relationships in observed data. In the simplest linear regression model, there is only one explanatory variable $x$ (the regressor) and one dependent variable $y$ (the regressand), which is assumed to be an affine function of $x$; this is further extended to a polynomial regression model, where $y$ is an $n$th-order polynomial of $x$. A multivariate regression model, in turn, contains more than one explanatory variable. For regression models to work well, several specific assumptions on the model and the observed data are required.

Consider the following standard multivariate linear regression model: $$y = X\beta + \varepsilon,$$ where $(X, y)$ are given observed data, $\beta$ is the vector of unknown coefficients, and $\varepsilon$ is a random error vector. The components of $\varepsilon$ are assumed to have zero mean and constant variance and to be independent of each other. Besides the assumption on the random errors, there is another important assumption, weak exogeneity, under which the explanatory variables are known deterministic values. Under this assumption, one can arbitrarily transform their values and construct any complex functional relationship between the regressors and the regressand. For example, in this case the polynomial regression is merely a linear regression with regressors $x, x^2, \ldots, x^n$.

Although this weak exogeneity assumption makes the linear regression model very powerful for fitting the given data or predicting the regressand for known regressors, it may lead to overfitting or inconsistent estimates [1]. In practice, this assumption may be quite unreasonable in some cases. For instance, in the process of collecting data, there is often unavoidable observation noise that makes the observed data quite inaccurate. Furthermore, in statistics, incomplete sampling can sometimes give only an approximation of the true values.

Research on regression models with imprecise data has been reported. One way to handle noisy observations is the measurement error model, or errors-in-variables model, where it is assumed that there exist some unknown latent (or true) variables that follow the true functional relationship, and the actual observations are affected by certain random noise [2]. Based on different assumptions about the random noise, there are a variety of regression models, such as the method of moments [3], which is based on the third- (or higher-) order joint cumulants of observable variables, and the Deming regression [4], which assumes that the ratio of the noise variances is known. A brief historical overview of linear regression with errors in variables can be found in [5].

In addition to the errors-in-variables models, studies on robust regression models, motivated by the theory of robust optimization under uncertainty, have been reported. In this setting, the perturbations are deterministic and unknown but bounded. El Ghaoui and Lebret [6] study robust linear regression with bounded uncertainty sets under the least squares criterion. They utilize second-order cone programming (SOCP) and semidefinite programming to minimize the worst-case residual errors. Shivaswamy et al. [7] propose SOCP formulations for robust linear classification and regression models when the first two moments of the uncertain data are computable. Ben-Tal et al. [8] provide an excellent framework for robust linear classification and regression. Based on general assumptions on the uncertainty sets, they provide explicit equivalent formulations for robust least squares, $\ell_1$, $\ell_\infty$, and Huber penalty regressions. For more results regarding robust classification and regression that are similar to this work, we refer the readers to [9–11]. Also according to [8], the traditional robust statistical approaches (see [12]), which try to reject the outliers in the data, differ from the point of view taken in this paper, where we intend to minimize the maximal (worst-case) residual errors. To reconcile the two views, however, a two-step approach can be easily implemented. First, the outliers are identified, and the related data are removed. Then, our proposed method is applied to limit the effect of imprecise data. We employ this approach in the real energy-growth regression problem in Section 3.

Besides regression models, there is a wide variety of forecasting models, such as support vector machines, decision trees, neural networks, and Bayes classifiers. For example, [13] utilizes a support vector machine based on a trend-based segmentation method for financial time series forecasting. The proposed models have been tested on various stocks from the American stock market with different trends. [14] proposes a new adaptive local linear prediction method to reduce the parameter uncertainties in the prediction of a chaotic time series. Real hydrological time series are used to validate the effectiveness of the proposed methods. More related literature can be found in [15] (chaotic time series analysis), [16] (fractal time series), and [17] (knowledge-based Green's kernel for support vector regression). Compared with these models, we focus on the handling of statistical inaccuracy. The regression model is an appropriate basis for developing effective and tractable robust models.

In this paper, we extend the robust linear regression model to general multivariate quadratic regression and provide equivalent tractable formulations. Unlike the simple extension from the classical linear model to classical polynomial (or even general nonlinear) models under the weak exogeneity assumption, the perturbation of explanatory variables in the quadratic terms affects the model in a complex nonlinear manner. Although [8, 12] have discussed the robust polynomial interpolation problem, only an upper bound and the corresponding suboptimal coefficients are given. They further conjecture that the proposed problem cannot be solved exactly in polynomial time. Our proposed robust multivariate quadratic regression model also requires solving a complex biquadratic min-max optimization problem. However, under certain assumptions on the uncertainty sets, we can obtain a series of equivalent semidefinite programming formulations for robust quadratic regression under different residual error criteria.

In particular, we first extend the traditional quadratic regression model by introducing the separable ball ($\ell_2$-norm) uncertainty set and formulate the optimal robust regression problem as a min-max problem that minimizes the maximal residual error. By utilizing the S-lemma [18] and the Schur complement lemma, we provide an equivalent semidefinite programming formulation for the robust least squares quadratic regression model with a ball uncertainty set. This result is then generalized to models with general ellipsoid uncertainty sets and under the $\ell_1$- and $\ell_\infty$-norm criteria. Furthermore, the robust quadratic regression models are applied to the economic growth and energy consumption regression problem. We take the per capita GDP as the explanatory variable and the per capita energy consumption as the dependent variable. Under the conservation hypothesis, we establish a corresponding robust model. Finally, we test the proposed model on different historical data sets and compare our models with the classical regression models.

The paper proceeds as follows. In Section 2, we present a general robust quadratic regression model, give an equivalent tractable semidefinite programming formulation for the robust least squares quadratic regression model with a ball uncertainty set, and further generalize the result. In Section 3, the proposed models are applied to the energy-growth problem. Numerical experiments are carried out in Section 4, and Section 5 concludes the paper and gives future research directions.

2. Robust Quadratic Regression Models

2.1. General Robust Models

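Consider the standard multivariate quadratic regression model: $$y = x^T A x + b^T x + c,$$ where $x \in \mathbb{R}^n$ denotes the $n$-dimensional explanatory data, $y \in \mathbb{R}$ denotes the dependent data, and $A$, $b$, and $c$ are unknown coefficients that will be determined based on certain minimal criteria.

Given a set of data $\{(x_i, y_i)\}_{i=1}^{m}$, where $x_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$, we utilize the $\ell_p$-norm to measure the prediction error: $$\left\| \left( y_i - x_i^T A x_i - b^T x_i - c \right)_{i=1}^{m} \right\|_p.$$

In traditional regression models, we assume that the explanatory data are precise and reliable. Based on this weak exogeneity assumption, the quadratic regression can be expressed as the following linear regression: $$y_i = X_i \bullet A + b^T x_i + c, \quad i = 1, \ldots, m,$$ where $X_i = x_i x_i^T$ are the problem data and the linear operator $\bullet$ for matrices $X$ and $A$ is defined as $X \bullet A = \operatorname{tr}(X^T A)$. Therefore, we can easily solve the above linear regression model for $p = 2$ (the least squares regression) and $p = 1, \infty$.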
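As a minimal sketch of this reduction, the Python snippet below fits a one-dimensional quadratic by treating the squared regressor, the regressor itself, and a constant as three ordinary regressors and solving the normal equations of the resulting linear least squares problem. The function names (`solve_linear_system`, `fit_quadratic_ls`) and the toy data are illustrative assumptions rather than part of the model above, and only the least squares criterion is covered.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_quadratic_ls(xs, ys):
    """Least squares fit of y = a*x^2 + b*x + c via the normal equations."""
    # Transformed regressors: under weak exogeneity they are just known numbers.
    features = [[x * x, x, 1.0] for x in xs]
    gram = [[sum(f[i] * f[j] for f in features) for j in range(3)]
            for i in range(3)]
    moment = [sum(f[i] * y for f, y in zip(features, ys)) for i in range(3)]
    return tuple(solve_linear_system(gram, moment))


# Noise-free toy data generated from y = 2x^2 - 3x + 1 (an assumption for the demo).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x * x - 3.0 * x + 1.0 for x in xs]
a, b, c = fit_quadratic_ls(xs, ys)
print(round(a, 6), round(b, 6), round(c, 6))
```

On noise-free data the fit recovers the generating coefficients up to rounding error. Once the regressors themselves are perturbed, the transformed features are no longer known constants, which is why the robust model cannot be handled this way.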

To relax the weak exogeneity assumption, we assume that the real data are contained in the following uncertainty set:

To minimize the worst-case residual error, we establish the following robust quadratic regression model:

From the computational perspective, although the robust linear regression problem (where the coefficients of the quadratic terms are set to zero) can be solved efficiently for a large variety of uncertainty sets, the robust quadratic regression problems are much more difficult. In fact, for general uncertainty sets and the least squares criterion, even the inner maximization problem, which has a convex biquadratic polynomial as the objective function and a general convex set as the feasible set, is in general not solvable in polynomial time. Next we introduce some meaningful uncertainty sets and provide the corresponding tractable equivalent formulations.

2.2. Separable Ball Uncertainty Sets Model

In this subsection, we consider the following separable ball uncertainty set: where and . Thus the inner problem (IP) is of the following form (here we first consider square of the original objective function):

Note that for the inner problem, the separable uncertainty set and the summation form of the objective function allow us to decompose it into small scale subproblems with quadratic objective function and ball constraints. The quadratic objective function and constraints motivate us to use the following S-lemma to obtain an equivalent solvable reformulation.
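To illustrate the decomposition, the following sketch evaluates one such subproblem exactly in the simplest setting. It assumes a scalar regressor, a perturbation of the regressor only (the response is held fixed), and an interval as the one-dimensional ball. These are simplifying assumptions made here, and the function names are hypothetical. In this scalar case the residual is a quadratic in the perturbed regressor, so its squared maximum over the interval is attained at an endpoint or at the vertex of the parabola. This is a direct check, not the S-lemma reformulation.

```python
def worst_case_squared_residual(a, b, c, x_nominal, y, radius):
    """Exact max of (a*u^2 + b*u + c - y)^2 over |u - x_nominal| <= radius."""
    def residual(u):
        return a * u * u + b * u + c - y

    lo, hi = x_nominal - radius, x_nominal + radius
    candidates = [lo, hi]
    if a != 0.0:
        vertex = -b / (2.0 * a)  # stationary point of the residual
        if lo <= vertex <= hi:
            candidates.append(vertex)
    return max(residual(u) ** 2 for u in candidates)


def worst_case_sum_of_squares(a, b, c, data, radius):
    """Separable set: the worst case of the sum is the sum of the worst cases."""
    return sum(worst_case_squared_residual(a, b, c, x, y, radius)
               for x, y in data)


# Toy coefficients and observations (assumptions for the demo).
data = [(1.0, 1.0), (0.25, 2.0)]
nominal = worst_case_sum_of_squares(1.0, 0.0, 0.0, data, 0.0)
robust = worst_case_sum_of_squares(1.0, 0.0, 0.0, data, 0.5)
print(nominal, robust)
```

For the second observation the maximum is reached at the vertex rather than at an endpoint, which shows why checking the endpoints alone would underestimate the worst case.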

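Lemma 1 (inhomogeneous version of S-lemma [8]). Let $A$, $B$ be symmetric matrices of the same size, and let the quadratic form $x^T A x + 2a^T x + \alpha$ be strictly positive at some point. Then the implication $$x^T A x + 2a^T x + \alpha \ge 0 \;\Longrightarrow\; x^T B x + 2b^T x + \beta \ge 0$$ holds true if and only if $$\exists \lambda \ge 0: \begin{pmatrix} B - \lambda A & b - \lambda a \\ b^T - \lambda a^T & \beta - \lambda \alpha \end{pmatrix} \succeq 0.$$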

We can obtain the following equivalent semidefinite programming for the separable robust least square quadratic regression model.

Proposition 2. The robust least square quadratic regression model with separable uncertainty set is equivalent to the following semidefinite programming: where

Proof. First consider the inner maximization subproblem. It is obvious that
If , we have that where .
If , we can utilize the S-lemma as follows: Note that in the last step, if , then there exists such that quadratic form is strictly positive; thus the condition of the S-lemma holds. Similarly we have that Thus the inner maximization problem is equivalent to the following semidefinite program: Note that based on the Schur complement lemma, the second-order cone constraint can also be formulated as the following semidefinite constraint: We thus complete the proof by embedding the equivalent semidefinite program into the outer problem.

Owing to advances in interior-point algorithms for conic programming, the above semidefinite program can be solved efficiently in polynomial time. There are several efficient and free software packages for solving semidefinite programs, such as SDPT3 [19]. Next we make several extensions based on the separable robust least squares quadratic regression model.

2.3. Ellipsoid Uncertainty Sets and Other Norm Criteria

The above result on standard ball uncertainty set can be further extended to that on the following general ellipsoid uncertainty set: where . Linear transformation operator allows us to impose more restrictions on the uncertainty set. For example, if we choose the diagonal matrix , we can put different weights on deviation of components of ; general matrix can further restrict the correlated deviation of different components.

To obtain the corresponding reformulation, we only need to modify the first two constraints based on the S-lemma as follows:

We further consider the robust quadratic regression models with the $\ell_1$-norm and $\ell_\infty$-norm criteria. Note that for the $\ell_1$-norm criterion, the inner maximization problem is of the following form: And for the $\ell_\infty$-norm criterion, we have the following equivalent reformulation:
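Why both criteria stay tractable can be sketched as follows, assuming that the uncertainty set is a Cartesian product with one factor per observation, as in the separable setting of Section 2.2. The symbols $z_i$, $\mathcal{U}_i$, and $r_i$ are notation introduced here for illustration only: $z_i$ is the uncertain data of observation $i$, $\mathcal{U}_i$ its uncertainty set, and $r_i(z_i)$ its residual for fixed coefficients.

```latex
\max_{z \in \mathcal{U}_1 \times \cdots \times \mathcal{U}_m} \sum_{i=1}^{m} |r_i(z_i)|
  = \sum_{i=1}^{m} \max_{z_i \in \mathcal{U}_i} |r_i(z_i)|,
\qquad
\max_{z \in \mathcal{U}_1 \times \cdots \times \mathcal{U}_m} \max_{1 \le i \le m} |r_i(z_i)|
  = \max_{1 \le i \le m} \, \max_{z_i \in \mathcal{U}_i} |r_i(z_i)|.
```

In both cases the inner problem splits into $m$ single-observation problems. Each worst-case absolute residual can then be bounded by a scalar $t_i$ through the two conditions $r_i(z_i) \le t_i$ and $-r_i(z_i) \le t_i$ for all $z_i \in \mathcal{U}_i$, which are quadratic implications of the kind the S-lemma handles.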

Using an approach similar to that in Proposition 2, both can be further reformulated as semidefinite programs.

Proposition 3. The separable robust quadratic regression models under the $\ell_1$-norm and $\ell_\infty$-norm criteria are equivalent to the following semidefinite programs, respectively: where

3. Robust Energy-Growth Regression Models

Studies have been reported on the causal relationship between economic growth and energy consumption. In this section, we try to apply the proposed robust quadratic regression model to the energy-growth problem.

The seminal paper of J. Kraft and A. Kraft [20] first studies the causal relationship for the USA. In a recent survey, Ilhan [21] categorizes the causal relationships into four types: no causality, unidirectional causality running from economic growth to energy consumption, the reverse case, and bidirectional causality. Note that the resulting relationships depend on the selected data and analysis approaches. Sometimes the results obtained from different approaches conflict with each other even when using data from the same country. For example, using the Toda-Yamamoto causality test method, Bowden and Payne [22] show that energy consumption plays an important role in economic growth in the USA based on historical data from 1949 to 2006, while using the same method Soytas and Sari [23] find that no causality exists between them based on USA data from 1960 to 2006. On the other hand, based on the same USA data from 1947 to 1990, Cheng [24] and Stern [25] reach different conclusions on causality by utilizing different analysis approaches.

Unlike the previous energy-growth studies, we attempt to provide a long-run stationary regression model between the per capita GDP (G) and the per capita energy consumption (EC). The underlying assumption of our model is similar to the traditional “conservation hypothesis,” which states that an increase in real GDP will cause an increase in energy consumption [21]. The “per capita” perspective provides us with new insight into the causality and with new regression models. Figures 1 and 2 demonstrate the relationship between per capita energy consumption and per capita GDP in the USA and Germany, respectively. From the subfigures on the left-hand side, we can see that in both countries there is a gradual increase in the economy, while the per capita energy consumption may decrease after reaching a certain level; the subfigures on the right-hand side inspire us to establish a nonlinear regression model to characterize the relationship.

To reduce the effect of imprecise statistical data, we employ the proposed robust quadratic regression model and put different weights on the residual errors at different time points. Specifically, we establish the following weighted robust quadratic regression model: where the weight factor $w_t$ represents the relative importance of the predicted residual error in the $t$th year. We could set $w_t = 0$ for an abnormal data point and set $w_t$ as an increasing function of $t$ to emphasize the importance of recent data. The uncertainty set is defined as where . Parameter controls the relative amplitude of the fluctuation in observed data.

The weighted robust quadratic regression model can be summarized as follows.
(1) Solve the classical quadratic regression model using the nominal values .
(2) Based on the quadratic regression, remove the data with the first largest residual errors and set weights value .
(3) Solve the equivalent semi-definite programming problem and return the final weighted robust quadratic regression model.
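Steps (1) and (2) of this procedure can be sketched in plain Python as follows. The weight rule (zero for flagged points, otherwise proportional to the time index), the number of removed points, and all function names are hypothetical choices made for illustration. Step (3) needs a conic solver and is not shown; a weighted classical refit stands in for it here only to show the effect of the weights.

```python
def solve_linear_system(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_weighted_quadratic(xs, ys, weights):
    """Weighted least squares fit of y = a*x^2 + b*x + c (classical model)."""
    feats = [[x * x, x, 1.0] for x in xs]
    gram = [[sum(w * f[i] * f[j] for w, f in zip(weights, feats))
             for j in range(3)] for i in range(3)]
    moment = [sum(w * f[i] * y for w, f, y in zip(weights, feats, ys))
              for i in range(3)]
    return tuple(solve_linear_system(gram, moment))


def screening_weights(xs, ys, num_removed):
    """Steps (1)-(2): nominal fit, then zero weight for the largest residuals."""
    a, b, c = fit_weighted_quadratic(xs, ys, [1.0] * len(xs))
    residuals = [abs(y - (a * x * x + b * x + c)) for x, y in zip(xs, ys)]
    ranked = sorted(range(len(xs)), key=lambda i: residuals[i], reverse=True)
    flagged = set(ranked[:num_removed])
    total = len(xs)
    # Hypothetical rule: zero for flagged points, increasing in t otherwise.
    return [0.0 if t in flagged else (t + 1) / total for t in range(total)]


# Toy series on y = x^2 with one artificial abnormal observation at t = 3.
xs = [float(t) for t in range(8)]
ys = [x * x for x in xs]
ys[3] += 20.0
weights = screening_weights(xs, ys, num_removed=1)
coeffs = fit_weighted_quadratic(xs, ys, weights)
print(weights)
print([round(v, 6) for v in coeffs])
```

In this toy run the abnormal observation receives zero weight, and the weighted refit on the remaining points recovers the underlying curve up to rounding error.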

4. Numerical Experiments

In this section, we verify the effectiveness of the proposed robust quadratic regression models on several data sets. The equivalent semidefinite programming problem is solved by the SDPT3 solver [19]. The numerical experiments are implemented in MATLAB 7.7.0 and run on an Intel(R) Core(TM)2 CPU E7400.

First we test the proposed robust least squares quadratic regression (LS-RQR) model with German data from 1960 to 2006. As previously discussed, after the preliminary quadratic regression analysis, we remove the data with the first largest residual errors, where . Then, for the rest of the data, we establish the classical least squares quadratic regression (LS-CQR) and LS-RQR models, respectively.

Table 1 lists the computation results for LS-CQR and LS-RQR with a series of values. The listed Err value represents the mean square error from the nominal value, and represents the run time for solving the optimization problem. It is seen that the resulting robust model exhibits smaller absolute values of , , and with the increase of value; that is, the regression curve becomes flatter as the model parameters are less precise. One evident drawback of the robust model is that the mean square error increases as the uncertainty increases. Figure 3 plots the regression curves for different models and also supports our analysis of the effect of increasing data uncertainty on robust regression.

To demonstrate the effectiveness of the robust models, we test the worst-case performance of the resulting models when varies from 0 to 0.1. Specifically, for each value, we randomly generate 500 groups of data from the defined uncertainty set and then calculate the maximal residual error at each data point. Figure 4 plots the worst-case error of the LS-CQR model and the LS-RQR models with , and . It is seen that the error of the LS-CQR model increases rapidly, and LS-RQR with has the flattest error curve. Figure 4 also indicates that it is critical to accurately estimate the variability of the data and set a proper value for . In our case, we recommend LS-RQR with , which is almost always better than the traditional LS-CQR model.
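The sampling test described above can be sketched as follows. The snippet assumes a scalar regressor and a relative box perturbation of both coordinates, that is, each nominal value is scaled by a factor drawn uniformly from one minus to one plus the relative radius. This shape of the uncertainty set, the function name, and the toy data are assumptions made for illustration. Random sampling only yields a lower bound on the true worst-case error.

```python
import random


def sampled_worst_case_errors(coeffs, data, rel_radius, num_groups=500, seed=0):
    """Largest absolute residual seen at each data point over random draws."""
    a, b, c = coeffs
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    worst = [0.0] * len(data)
    for _ in range(num_groups):
        for i, (x_nom, y_nom) in enumerate(data):
            # Assumed uncertainty set: independent relative perturbations.
            x = x_nom * (1.0 + rng.uniform(-rel_radius, rel_radius))
            y = y_nom * (1.0 + rng.uniform(-rel_radius, rel_radius))
            worst[i] = max(worst[i], abs(y - (a * x * x + b * x + c)))
    return worst


# Toy model y = x^2 with nominal points lying exactly on the curve.
coeffs = (1.0, 0.0, 0.0)
data = [(1.0, 1.0), (2.0, 4.0), (3.0, 9.0)]
results = {r: sampled_worst_case_errors(coeffs, data, r) for r in (0.0, 0.05, 0.1)}
for radius, errors in results.items():
    print(radius, [round(e, 4) for e in errors])
```

Running the same routine for each fitted model over a grid of radii gives one worst-case error curve per model, which is the kind of comparison that Figure 4 reports.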

Next we test the proposed RQR models under the $\ell_1$-norm (L1-RQR) and $\ell_\infty$-norm (LI-RQR) criteria on the same data set. Figure 5 plots the corresponding regression curves for the same uncertainty set . For the same value, the LI-RQR model can be considered the most robust one, and the L1-RQR and L2-RQR models are similar. Notably, this differs from the traditional meaning of robust regression. For example, [26] refers to the $\ell_1$-norm regression as the robust regression model in the sense that the corresponding model is insensitive to large residual errors (corresponding to the outliers). However, after removing the possible abnormal data points, here we try to make our regression analysis insensitive to the worst-case residual errors at each data point.

Finally we apply the proposed RQR model to more data sets, including USA data from 1870 to 2006, Switzerland data from 1965 to 2006, and Belgium data from 1960 to 2006. Figures 6, 7, and 8 give the resulting regression models and the worst-case residual errors for different values. It is seen that the proposed RQR models still almost always outperform the CQR model, especially for large uncertainty sets. Based on the robust quadratic regression models, these three countries reach the highest per capita energy consumption points at per capita GDP value around while the peak values vary from to Ton.

5. Conclusions and Future Work

In this paper, we studied the multivariate quadratic regression model with imprecise statistical data. Unlike the traditional robust statistical approaches, which focus on detecting outliers and eliminating their effects, we employed the recently developed robust optimization framework and uncertainty set theory.

In particular, we first extended the existing robust linear regression results to the robust least squares quadratic regression model with the separable ball uncertainty set. The specific form of the uncertainty set allowed us to use the well-known S-lemma and give a tractable equivalent semidefinite programming formulation. We further generalized the result to robust models under the $\ell_1$- and $\ell_\infty$-norm criteria with general ellipsoid uncertainty sets. Next the proposed robust models were applied to the energy-growth problem. Under the classical conservation hypothesis, we employed the traditional quadratic regression model to remove the abnormal data and established a robust quadratic regression model for the per capita GDP and per capita energy consumption. Finally the proposed models were tested on the historical data of Germany, the USA, Switzerland, and Belgium. From the numerical experiments, we found that (1) the amplitude of the uncertainty perturbation plays a critical role in the robust models; (2) with the increase of , the robust model has a flatter curve; (3) for the same value, compared with the $\ell_1$- and $\ell_2$-norm models, the $\ell_\infty$-norm model is the most robust one; (4) as expected, the robust approach provides a series of robust regression models that can reduce the worst-case residual errors when the observed data contain noise.

For further research, robust polynomial (nonlinear) regression models are interesting in their own right. Although we may always reduce them to the linear regression model with a polynomially (or nonlinearly) transformed uncertainty set, it is still worth studying whether the resulting regression models are solvable for quadratic regression with coupled uncertainty sets.

Acknowledgment

This work was supported by Geological Survey Project of China (nos. 1212010881801, 1212011120995).