Abstract

Quantile regression estimates are robust to outliers in the y direction but are sensitive to leverage points. The least trimmed quantile regression (LTQReg) method was put forward to overcome the effect of leverage points. The LTQReg method trims the observations with the largest residuals according to a prespecified trimming percentage. However, leverage points do not always produce large residuals, and hence the trimming percentage should be determined by the contamination ratio of the data rather than fixed in advance by the researcher. In this paper, we propose a modified least trimmed quantile regression method based on reweighted least trimmed squares. The robust Mahalanobis distance and GM6 weights based on Gervini and Yohai's (2003) cutoff point are employed to determine the trimming percentage and to detect leverage points. A simulation study and real data are considered to investigate the performance of our proposed methods.

1. Introduction

Quantile regression (QReg) has received much attention since the seminal work of Koenker and Bassett [1] and can be considered one of the important statistical breakthroughs of recent decades. Its desirable properties have led to applications in a wide range of fields such as medicine, finance, economics, agriculture, and the environment [2, 3]. QReg extends the mean regression model to the conditional quantiles of the response variable distribution and is therefore able to provide a much more detailed picture of the stochastic relationship among random variables.

Consider the following regression model:

$$y = X\beta + \varepsilon, \qquad (1)$$

where $y$ is an $n \times 1$ vector of the response variable, $X$ is an $n \times p$ matrix whose rows $x_i^{\top}$ are the covariate vectors, $\beta$ is a $p \times 1$ vector of unknown parameters, and $\varepsilon$ is an $n \times 1$ vector of error terms. For any quantile $\tau$ in the interval (0, 1), the parameter $\beta(\tau)$ can be estimated consistently as the solution to the following optimization problem:

$$\hat{\beta}(\tau) = \arg\min_{\beta} \sum_{i=1}^{n} \rho_{\tau}\big(y_i - x_i^{\top}\beta\big), \qquad (2)$$

where $\rho_{\tau}(\cdot)$ is the check function, defined as

$$\rho_{\tau}(u) = u\,\big(\tau - I(u < 0)\big), \qquad (3)$$

where $I(\cdot)$ denotes the indicator function.
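For illustration, the check function in (3) and a basic quantile fit can be sketched in a few lines of Python; the snippet below is a minimal example of our own, using the statsmodels QuantReg implementation on synthetic data rather than the paper's data.

```python
import numpy as np
import statsmodels.api as sm

def check_loss(r, tau):
    """Koenker-Bassett check function rho_tau(r) = r * (tau - I(r < 0))."""
    r = np.asarray(r, dtype=float)
    return r * (tau - (r < 0).astype(float))

# Small synthetic example (not the paper's data).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.standard_normal(100)

X = sm.add_constant(x)                     # design matrix with an intercept column
fit = sm.QuantReg(y, X).fit(q=0.5)         # median regression (tau = 0.5)
resid = y - fit.predict(X)
print(fit.params)                          # estimated intercept and slope
print(check_loss(resid, 0.5).sum())        # value of the objective in (2)
```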

One of the important advantages of quantile regression is its insensitivity to outliers in the response and to heavy-tailed error distributions. This robustness of QReg to outliers arises from the nature of the check function shown in (3) (see [3–5]). Similar to M-estimator regression, however, QReg is not robust when the predictor variables contain outliers, which are called high leverage points (HLPs) [6]. There have been several attempts to overcome the effect of HLPs and maximize the breakdown point of QReg. Giloni et al. [7] proposed a weighting method to increase the breakdown point and cope with HLPs, based on the blocked adaptive computationally efficient outlier nominators (BACON) method proposed by Billor et al. (2000), in which a clean subset is chosen via their algorithm. The limitation of the weighting method is that it can only be used with a small number of regressors (often one or two). Rousseeuw and Hubert [8] proposed regression depth as an extended version of the regression quantile and pointed out that the depth quantiles are robust to HLPs. Adrover et al. [9] presented a robust estimation method that is unaffected by leverage points and at the same time maximizes the breakdown point. The disadvantages of the weighting method and the depth quantiles are their computational complexity and nonstandard asymptotic distributions (Neykov et al. [10]).

Recently, the least trimmed quantile regression was proposed by Neykov et al. [10] to reduce the effects of HLPs. This method is a generalization of the location estimator proposed by Tableman [11] and the least trimmed absolute deviation estimator proposed by Hawkins and Olive [12]. Neykov et al. [10] proved the consistency of the least trimmed quantile regression estimator and discussed its breakdown point. The limitation of this method is that the trimming percentage is a constant, so the amount of trimmed data may be lower or higher than the actual contamination percentage of the data. The least trimmed quantile method minimizes the quantile residuals in (2) over a subset of size h out of the sample of size n. However, it is important to note that leverage points do not necessarily produce large residuals. Therefore, this method does not correctly detect high leverage points.

In this paper, we propose a new algorithm that develops the least trimmed quantile regression method and overcomes these disadvantages of the existing methods. The proposed algorithm integrates the reweighted least trimmed squares method proposed by Čížek [13] with QReg to determine the trimming percentage and uses the robust Mahalanobis distance (RMD) to identify the HLPs. In addition, we employ the Gervini and Yohai [14] technique to compute the cutoff point and new weights for QReg.

2. Least Trimmed Quantile Regression (LTQReg)

The least trimmed squares (LTS) method is a robust estimation technique proposed by Rousseeuw [15] that minimizes the following objective function:

$$\hat{\beta}_{\mathrm{LTS}} = \arg\min_{\beta} \sum_{i=1}^{h} r_{(i)}^{2}(\beta), \qquad (4)$$

where $r_{(i)}^{2}(\beta)$ is the $i$-th order statistic of the squared residuals, $r_{(1)}^{2} \le r_{(2)}^{2} \le \cdots \le r_{(n)}^{2}$, and $h$ is the size of a subset out of $n$. The trimming constant is $h = \lfloor n(1-\alpha) \rfloor$, where $\alpha$ is the trimming ratio. When $h = \lfloor n/2 \rfloor + \lfloor (p+1)/2 \rfloor$, the highest breakdown point of the LTS estimator (0.50) is achieved [16]. Roozbeh and Hamzah [17] developed the LTS method for restricted semiparametric regression models. Based on the LTS method, robust ridge and non-ridge type estimators were developed by Roozbeh [18] for semiparametric regression models with dependent errors. Roozbeh et al. [19] introduced alternative robust estimators based on a penalization scheme, using nonlinear integer programming.
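For concreteness, the LTS objective in (4) is simply the sum of the h smallest squared residuals at a candidate coefficient vector; a minimal sketch in Python (our own notation, not the authors' code) is:

```python
import numpy as np

def lts_objective(y, X, beta, h):
    """Sum of the h smallest squared residuals at the candidate coefficients beta."""
    r2 = (np.asarray(y) - np.asarray(X) @ np.asarray(beta)) ** 2
    return np.sort(r2)[:h].sum()
```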

Neykov et al. [10] proposed the least trimmed quantile regression (LTQReg) as an efficient and robust method to overcome the effect of HLPs on QReg. LTQReg is defined as

$$\hat{\beta}_{\mathrm{LTQ}}(\tau) = \arg\min_{\beta} \sum_{i=1}^{h} \rho_{\tau}\big(y - x^{\top}\beta\big)_{(i)}, \qquad (5)$$

where $\rho_{\tau}(\cdot)$ is defined as in (2) and (3) and $\rho_{\tau}(y - x^{\top}\beta)_{(1)} \le \cdots \le \rho_{\tau}(y - x^{\top}\beta)_{(n)}$ are the ordered check-function losses. Neykov et al. [10] proved that when the trimming constant $h$ is chosen appropriately (approximately $\lfloor (n+p+1)/2 \rfloor$, where $p$ is the maximum number of explanatory variables), the breakdown point of the LTQReg estimator is asymptotically equal to 0.50; see also Müller [20] and Neykov et al. [10] on the choice of $h$.
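Analogously, the LTQReg objective in (5) replaces squared residuals with check-function losses and sums the h smallest of them; a hypothetical helper in the same spirit is:

```python
import numpy as np

def ltqreg_objective(y, X, beta, h, tau):
    """Sum of the h smallest check-function losses rho_tau at the candidate beta."""
    r = np.asarray(y) - np.asarray(X) @ np.asarray(beta)
    loss = r * (tau - (r < 0).astype(float))    # rho_tau(r) as in (3)
    return np.sort(loss)[:h].sum()
```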

The LTQReg method is based on the smallest quantile errors in order to reduce the influence of leverage points. In this situation, a natural question arises: will the error values of QReg and LS be large for all leverage points? We answer this question with the following example. Consider the simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, with a sample size of n = 50, where the independent variable is uniformly distributed and the intercept $\beta_0$ and slope $\beta_1$ are fixed at their true values.

In order to illustrate the effect of leverage points and outliers on the error term, 20% of the observations are contaminated by replacing the first 10 observations with contaminated values. We consider three cases of contamination: outliers only, HLPs only, and both outliers and HLPs simultaneously; depending on the case, the first 10 values of the explanatory variable and/or the dependent variable are replaced by outlying values. Least squares (LS) and QReg at three quantiles (0.25, 0.50, and 0.75) were then applied to the data. In this example, we want to investigate whether LS and QReg produce large errors in all contamination scenarios, which would make them suitable for LTQReg.
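A sketch of this illustrative setup in Python follows; the true coefficients and the contamination magnitudes are placeholders of our own choosing, since the paper's specific values are not reproduced here. The snippet fits LS and median regression to each contaminated sample and compares the average absolute residuals of the first 10 observations.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.standard_normal(n)   # illustrative true model, not the paper's values

def contaminate(x, y, case):
    """Replace the first 10 observations with outlying values for the given case."""
    x, y = x.copy(), y.copy()
    if case in ("outliers", "both"):
        y[:10] += 30.0                        # shift responses upward (y-direction outliers)
    if case in ("hlp", "both"):
        x[:10] += 30.0                        # shift covariates (high leverage points)
    return x, y

for case in ("outliers", "hlp", "both"):
    xc, yc = contaminate(x, y, case)
    X = sm.add_constant(xc)
    ls = sm.OLS(yc, X).fit()
    qr = sm.QuantReg(yc, X).fit(q=0.5)
    print(case,
          np.abs(ls.resid[:10]).mean().round(2),            # LS residuals of contaminated points
          np.abs(yc - qr.predict(X))[:10].mean().round(2))   # QReg residuals of contaminated points
```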

For all three cases of contamination, the fitted residuals are plotted in Figures 1–3.

Figures 1–3 clarify the influence of outliers, HLPs, and both on LS and QReg at different quantiles. In Figures 1 and 3, we can clearly see that when the data are contaminated by outliers, the first 10 observations have the largest residuals for both LS and QReg at the different quantiles. On the contrary, in Figure 2, when the data are contaminated by HLPs, the residuals of LS and QReg are hardly affected. From Figures 1–3, we conclude that outlying observations have a direct effect on the residuals, whereas leverage points have little or no effect on them. Hence, the LTQReg method is not an effective method for reducing the effect of leverage points, because it is based on trimming the largest (n − h) residuals.

3. Modified Least Trimmed Quantile Regression (MLTQReg)

In this section, we discuss the modified LTQReg method, which determines the contamination rate of the data and hence the best trimming percentage. Three modifications are discussed, all based on the reweighted least trimmed squares (RWLTS) method proposed by Čížek [13], which relies on hard rejection weights [16]; these weights are combined with the LTQReg method to robustify the weighted least squares fit. The hard rejection weights in the RWLTS method are defined as

$$w_i = I\big(|r_i|/\hat{\sigma} \le t_n\big), \qquad (6)$$

where $r_i/\hat{\sigma}$ are the standardized regression residuals and $t_n$ is the adaptive cutoff point of Gervini and Yohai [14]. The cutoff point is computed by comparing the empirical distribution function $F_n$ of the absolute standardized residuals with the distribution function $F_0$ of the absolute residuals under the assumed model. The fraction of unusual observations in the sample can be measured as

$$d_n = \sup_{t \ge k} \big\{F_0(t) - F_n(t)\big\}^{+}, \qquad (7)$$

where $k = 2.5$ [16] and $\{\cdot\}^{+}$ denotes the positive part. Therefore, the cutoff value $t_n$ is set to the $(1 - d_n)$ quantile of $F_n$ as follows:

$$t_n = F_n^{-1}(1 - d_n) = \min\{t : F_n(t) \ge 1 - d_n\}. \qquad (8)$$
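A minimal sketch of this adaptive cutoff computation, assuming a standard normal reference model for the absolute standardized residuals, could look like the following (our own implementation, not the authors').

```python
import numpy as np
from scipy.stats import norm

def gy_cutoff(std_resid, k=2.5):
    """Gervini-Yohai adaptive cutoff for absolute standardized residuals.

    F0 is the distribution of |e| under standard normal errors (assumption);
    Fn is the empirical distribution of the absolute standardized residuals.
    """
    a = np.sort(np.abs(std_resid))
    n = a.size
    F0 = 2.0 * norm.cdf(a) - 1.0                 # P(|Z| <= t) at the sorted sample points
    Fn_minus = np.arange(n) / n                  # empirical cdf just below each sorted point
    gap = np.where(a >= k, F0 - Fn_minus, 0.0)   # only compare in the tail t >= k
    d_n = max(gap.max(), 0.0)                    # estimated fraction of unusual observations
    t_n = np.quantile(a, 1.0 - d_n)              # cutoff: (1 - d_n) quantile of Fn
    return d_n, t_n

def hard_rejection_weights(std_resid, k=2.5):
    """Weights w_i = I(|r_i|/sigma <= t_n) as in (6)."""
    _, t_n = gy_cutoff(std_resid, k)
    return (np.abs(std_resid) <= t_n).astype(float)
```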

The procedure of the reweighted least trimmed squares method [13] can be described in two steps. The first step is to determine the trimming constant h based on the weights given in (6), defined as

$$h = \sum_{i=1}^{n} w_i. \qquad (9)$$

The second step is to apply the LTS method with the trimming constant h computed in the first step.

To increase the breakdown point of the proposed method, a high-breakdown estimator such as LTS, LMS, or S is used as the initial estimator (see [13, 14]), and the robust weights are used to improve efficiency.

Next, we will describe three algorithms based on the RWLTS to improve the LTQReg [10].

3.1. Reweighted Least Trimmed Quantile Regression (RWLTQReg)

In this method, we combine the RWLTS with LTQReg to determine the trimming constant. The algorithm for this method can be described as follows:

Step 1. Take the LTQReg estimator as an initial estimate with a high breakdown point and compute the standardized residuals $r_i/\hat{\sigma}$, $i = 1, \ldots, n$.

Step 2. Calculate the hard rejection weights for the standardized residuals as

$$w_i = I\big(|r_i|/\hat{\sigma} \le t_n\big), \qquad (10)$$

where $t_n$ is the Gervini and Yohai [14] cutoff point shown in (8).

Step 3. Calculate the trimming constant based on the weights in equation (10) from the formula $h = \sum_{i=1}^{n} w_i$.

Step 4. Apply the LTQReg based on the algorithm proposed by Neykov et al. [10] for a subset of size $h$. This procedure can be described as follows:
(i) Set $j = 1$ and select a subset $H_1$ of size $h$ from the sample.
(ii) For the subset $H_j$, use QReg to estimate the coefficients $\hat{\beta}_j$.
(iii) For all observations in the sample, compute the residuals and order the corresponding check-function losses as $\rho_{\tau}(r)_{(1)} \le \cdots \le \rho_{\tau}(r)_{(n)}$.
(iv) Set $j = j + 1$ and take as the new subset $H_j$ the observations with the $h$ smallest losses.
(v) Repeat steps (ii), (iii), and (iv) for the new subset until convergence.
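The whole RWLTQReg procedure can be sketched as below; this is our own schematic Python implementation under simplifying assumptions (a MAD estimate of the residual scale, random subsets with concentration steps for the LTQReg fits, and statsmodels' QuantReg for the quantile regressions), not the authors' code.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def gy_weights(std_resid, k=2.5):
    """Hard rejection weights from the Gervini-Yohai adaptive cutoff (Section 3)."""
    a = np.sort(np.abs(std_resid))
    n = a.size
    gap = np.where(a >= k, (2.0 * norm.cdf(a) - 1.0) - np.arange(n) / n, 0.0)
    d_n = max(gap.max(), 0.0)                       # estimated fraction of outliers
    t_n = np.quantile(a, 1.0 - d_n)                 # adaptive cutoff t_n
    return (np.abs(std_resid) <= t_n).astype(float)

def ltqreg(y, X, tau, h, rng, n_starts=20, n_csteps=20):
    """LTQReg by concentration steps: iterate QReg fits on the h smallest check losses."""
    n, p = X.shape
    best_beta, best_obj = None, np.inf
    for _ in range(n_starts):
        idx = rng.choice(n, size=max(h, p + 1), replace=False)
        for _ in range(n_csteps):
            fit = sm.QuantReg(y[idx], X[idx]).fit(q=tau)
            r = y - X @ fit.params
            loss = r * (tau - (r < 0).astype(float))
            new_idx = np.argsort(loss)[:h]
            if set(new_idx) == set(idx):            # subset stabilized: converged
                break
            idx = new_idx
        obj = np.sort(loss)[:h].sum()
        if obj < best_obj:
            best_beta, best_obj = fit.params, obj
    return best_beta

def rwltqreg(y, X, tau, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    # Step 1: initial high-breakdown LTQReg fit (h0 ~ 75% of n is our choice here).
    beta0 = ltqreg(y, X, tau, h=int(0.75 * n), rng=rng)
    # Step 2: hard rejection weights on MAD-standardized residuals (MAD scale is an assumption).
    r = y - X @ beta0
    s = 1.4826 * np.median(np.abs(r - np.median(r)))
    w = gy_weights(r / s)
    # Step 3: data-driven trimming constant h.
    h = int(w.sum())
    # Step 4: final LTQReg fit with the adaptive h.
    return ltqreg(y, X, tau, h=h, rng=rng)
```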

3.2. Modified Least Trimmed Quantile Regression Based on RMD (RMD-LTQReg)

In this algorithm, we use the modified least trimmed quantile regression (MLTQReg) with the RMD to detect the leverage points. The algorithm is as follows:

Step 1. Compute the RMD as

$$\mathrm{RMD}_i = \sqrt{\big(x_i - T(X)\big)^{\top} C(X)^{-1} \big(x_i - T(X)\big)}, \qquad (11)$$

where $T(X)$ and $C(X)$ are the location and shape estimates of the minimum volume ellipsoid (MVE).

Step 2. Compute the leverage weights $\pi_i = I(\mathrm{RMD}_i \le K)$, where $K$ is the cutoff point computed as

$$K = \mathrm{median}(\mathrm{RMD}_i) + c\,\mathrm{MAD}(\mathrm{RMD}_i), \qquad (12)$$

where $\mathrm{MAD}(\mathrm{RMD}_i) = \mathrm{median}\,\big|\mathrm{RMD}_i - \mathrm{median}_j(\mathrm{RMD}_j)\big|/0.6745$ and c is a constant, 2 or 3.

Step 3. As in Steps 1 and 2 of the RWLTQReg algorithm, compute the hard rejection weights $w_i$.

Step 4. Find the final weights by combining $w_i$ with $\pi_i$ as follows:

$$w_i^{*} = w_i\,\pi_i. \qquad (13)$$

Step 5. Hence, the trimming constant is $h = \sum_{i=1}^{n} w_i^{*}$.

Step 6. Apply Step 4 of the RWLTQReg algorithm for a subset of size $h$ from the sample, setting the selection probability of the leverage points to zero to ensure that we do not start with a bad subset (one that contains leverage points), following Rousseeuw and Van Driessen [21]; this means the initial subset satisfies the condition of being clean of leverage points.
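A schematic version of the RMD-based weighting is given below; for convenience we compute the robust location and shape with scikit-learn's MinCovDet (an MCD estimate) instead of the MVE used in the paper, and the default c = 3 and the reuse of the residual-based weights from the previous sketch are our own choices.

```python
import numpy as np
from sklearn.covariance import MinCovDet

def rmd_leverage_weights(X_pred, c=3.0):
    """Leverage weights pi_i = I(RMD_i <= K) with K = median(RMD) + c * MAD(RMD).

    MinCovDet (MCD) is used here in place of the MVE location/shape estimator.
    """
    mcd = MinCovDet(random_state=0).fit(X_pred)
    rmd = np.sqrt(mcd.mahalanobis(X_pred))               # robust Mahalanobis distances
    mad = np.median(np.abs(rmd - np.median(rmd))) / 0.6745
    K = np.median(rmd) + c * mad                          # cutoff as in (12)
    return (rmd <= K).astype(float)

def combined_weights(w_resid, pi_leverage):
    """Final weights w_i * pi_i and the resulting data-driven trimming constant h."""
    w_final = w_resid * pi_leverage
    return w_final, int(w_final.sum())

# usage sketch (w comes from the Gervini-Yohai weights of the previous sketch;
# the intercept column is excluded when computing the robust distances):
# pi = rmd_leverage_weights(X[:, 1:])
# w_final, h = combined_weights(w, pi)
```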

3.3. Modified Least Trimmed Quantile Regression Based on GM6 Method (GM6-LTQReg)

The GM-estimator was proposed by Schweppe (see [22]) to reduce the influence of leverage points. Adrover et al. [9] showed that the breakdown point of the GM-estimator is never higher than $1/(p+1)$. Coakley and Hettmansperger [23] proposed the GM6-estimator to increase the breakdown point of the GM-estimator by using the least trimmed squares (LTS) estimator as the initial fit and the RMD based on the MVE to downweight leverage points. In this paper, we suggest using the GM6 weights to modify the LTQReg; the procedure of this modification is given by the following algorithm:

Step 1. For $i = 1, \ldots, n$, compute an initial estimate of the coefficients and the corresponding residuals $r_i$ using the high-breakdown LTQReg estimator of Neykov et al. [10].

Step 2. Compute the scale of the residuals $s_e$ as

$$s_e = 1.4826\,\mathrm{median}_i\,\big|r_i - \mathrm{median}_j(r_j)\big|. \qquad (14)$$

Step 3. Determine the standardized residuals $r_i/(s_e\,\pi_i)$, where $\pi_i$ is an initial weight computed as

$$\pi_i = \min\!\left\{1,\; \frac{\chi^{2}_{p,0.95}}{\mathrm{RMD}_i^{2}}\right\}. \qquad (15)$$

Step 4. Hard rejection weights for the standardized residuals can be computed as

$$w_i = I\big(|r_i|/(s_e\,\pi_i) \le t_n\big), \qquad (16)$$

where $t_n$ is the Gervini and Yohai [14] cutoff point shown in (8).

Step 5. Hence, the trimming parameter is computed as $h = \sum_{i=1}^{n} w_i$.

Step 6. Apply Step 4 of the RWLTQReg algorithm for a subset of size $h$.
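The GM6-style weighting can be sketched as follows; the chi-square-based initial leverage weights and the MAD residual scale are common GM6 choices that we adopt here as assumptions about the unspecified formulas, and MinCovDet again stands in for the MVE.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def gm6_initial_weights(X_pred, level=0.95):
    """Initial leverage weights pi_i = min(1, chi2_{p,level} / RMD_i^2) (our assumption)."""
    p = X_pred.shape[1]
    mcd = MinCovDet(random_state=0).fit(X_pred)          # MCD as a stand-in for the MVE
    rmd_sq = mcd.mahalanobis(X_pred)                     # squared robust distances
    return np.minimum(1.0, chi2.ppf(level, df=p) / rmd_sq)

def gm6_standardized_residuals(resid, pi):
    """Residuals standardized by a MAD-type scale and downweighted by pi_i."""
    s_e = 1.4826 * np.median(np.abs(resid - np.median(resid)))
    return resid / (s_e * pi)

# usage sketch: pass these standardized residuals to the Gervini-Yohai hard
# rejection weights, set h to the sum of those weights, and rerun LTQReg on h points.
```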

4. Simulation Study

In this section, a Monte Carlo simulation study is presented to compare the performance of the existing methods LTQReg [10] and QReg [1] with our proposed methods RWLTQReg, RMD-LTQReg, and GM6-LTQReg.

Following Neykov et al. [10], two explanatory variables ($x_1$ and $x_2$) are generated with large sample sizes (n = 100 and 200) from the following classical heteroscedastic multiple linear regression model:

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i, \qquad (17)$$

where the coefficients $\beta_0$, $\beta_1$, and $\beta_2$ are fixed at their true values and the error term $\varepsilon_i$ follows the assumed heteroscedastic distribution. Two experiments are considered with different distributions for the explanatory and response variables and three levels of contamination (10%, 20%, and 30%). The trimming percentage for the LTQReg is taken as 0.20 and 0.30.

4.1. The First Experiment

In this experiment, the variables in the model follow a uniform distribution (Unif). The variables are contaminated at different percentages: the explanatory variables $x_j$, j = 1, 2, and the response variable are contaminated by replacing clean observations with values drawn from outlying distributions.

4.2. The Second Experiment

In this experiment, the distribution of the explanatory variables is set as the normal distribution. The variables are contaminated at different percentages: the explanatory variables $x_j$, j = 1, 2, and the response variable are contaminated by replacing clean observations with values drawn from outlying distributions.

The contamination is carried out by replacing clean data with outlying data in both the explanatory and response variables. Let $\epsilon$ denote the contamination level; the explanatory variables are contaminated by replacing $\lfloor n\epsilon \rfloor$ clean observations with outlying observations, and the response variable is contaminated in the same way.
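The replacement-type contamination used in both experiments can be sketched as below; the shift magnitudes and the use of the first $\lfloor n\epsilon \rfloor$ observations are placeholders of our own, since the specific outlying distributions are not reproduced here.

```python
import numpy as np

def contaminate_by_replacement(X, y, eps, x_shift=20.0, y_shift=30.0):
    """Replace the first floor(n * eps) observations with outlying values.

    x_shift and y_shift are illustrative magnitudes, not the paper's settings.
    """
    Xc, yc = np.array(X, dtype=float), np.array(y, dtype=float)
    m = int(np.floor(eps * len(yc)))
    Xc[:m] = Xc[:m] + x_shift        # leverage points (x direction)
    yc[:m] = yc[:m] + y_shift        # outliers (y direction)
    return Xc, yc

# example: 20% contamination of a generated sample
# Xc, yc = contaminate_by_replacement(X, y, eps=0.20)
```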

Figures 4 and 5 show the scatter of the generated data. It can be clearly seen that these data contain leverage points (outlying in the x direction), outliers (outlying in the y direction), and influential observations (outlying in both the x and y directions at the same time).

At the three quantiles 0.25, 0.50, and 0.75, the generated data with different contamination percentages (10%, 20%, and 30%) are fitted via the proposed methods (RWLTQReg, RMD-LTQReg, and GM6-LTQReg) and the existing methods (QReg and LTQReg). The root mean square error (RMSE) and mean absolute error (MAE) of the model and the standard error (SE) of the parameter estimates are computed to evaluate our proposed methods.
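For reference, the evaluation criteria can be computed straightforwardly; the helpers below are a plain sketch of RMSE, MAE, and the SE of the parameter estimates across replications (our own code for these standard definitions).

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean square error of the fitted model."""
    return np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))

def mae(y, y_hat):
    """Mean absolute error of the fitted model."""
    return np.mean(np.abs(np.asarray(y) - np.asarray(y_hat)))

def se_over_replications(beta_hats):
    """Standard error of each coefficient estimate across simulation replications."""
    return np.std(np.asarray(beta_hats), axis=0, ddof=1)
```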

In Tables 1 and 2, we report the RMSE and MAE values for the first and second experiments. In these tables, the RMSE and MAE values for all methods at the three quantiles are shown in the rows, and the three levels of contamination are shown in the columns. LTQReg (20%) and LTQReg (30%) denote the least trimmed quantile method with 20% and 30% trimming, respectively. The results in these tables are averages over 100 replications for the two experiments of the simulation study.

Tables 1 and 2 show that the RMSE and MAE values for the QReg method at the different quantiles and levels of contamination are the highest; that is, this method is more affected than the other methods by outlying data in both the x and y directions. On the contrary, the proposed RMD-LTQReg method has the lowest RMSE and MAE values in most cases, indicating that its performance is better than that of the others. The RWLTQReg performs better than the remaining methods except RMD-LTQReg. In addition, when the contamination levels are 20% and 30%, GM6-LTQReg performs better than LTQReg (20%) in most cases and better than LTQReg (30%) at the 30% contamination level.

In Figures 6–9, the SE values for the parameters estimated by the RMD-LTQReg method are the smallest in almost all cases, which indicates that the RMD-LTQReg method performs best among all studied methods. These figures also show clearly that the QReg method has high SE values, leading to the worst performance. The remaining methods give close results in most cases and vary in some others; therefore, it is difficult to determine which of them is better than the others.

5. Real Data Applications

In this section, the “Star Cluster CYG OB1” dataset is considered to verify the performance of our proposed methods.

5.1. Star Cluster CYG OB1

The Star Cluster CYG OB1 dataset has been used by many researchers, such as Rousseeuw and Leroy [24], Adrover et al. [9], and Neykov et al. [10]. This dataset contains 47 observations with one explanatory variable, the logarithm of the effective temperature at the surface of the star; the dependent variable is the logarithm of its light intensity. Rousseeuw and Leroy [24] showed that the scatterplot of this dataset reveals two groups of observations. The first group includes the majority of the data, 43 stars, whereas the second group consists of the remaining four stars (observations 11, 20, 30, and 34), which are classified as leverage points [24].

In this example, we consider three quantiles (0.25, 0.50, and 0.75) to examine robustness of our proposed methods.

Table 3 presents the RMSE and MAE values of all proposed and existing estimation methods at each quantile. The QReg method clearly has the highest RMSE and MAE values, whereas the RMD-LTQReg method, followed by the RWLTQReg method, performs best because they have the smallest RMSE and MAE values; moreover, the RMD-LTQReg method detects the HLPs correctly. We can also see that LTQReg (30%) is better than both GM6-LTQReg and LTQReg (20%).

Figure 10 shows the fitted residuals of the regression quantiles for the existing and proposed methods. The QReg method is dramatically affected by the leverage points. Even though the RWLTQReg has the lowest RMSE and MAE values in some cases, it is also affected by leverage points: it trims the observations that have high residuals but fails to trim the leverage points. The LTQReg (20%), LTQReg (30%), and GM6-LTQReg methods give similar fits in the figure and are also affected by the HLPs, although they perform better than QReg. The proposed RMD-LTQReg method shows good performance owing to its ability to trim the leverage points.

6. Conclusions

In this paper, we proposed a new estimation method, called modified least trimmed quantile regression, to overcome the impact of leverage points in the data. We proposed three methods based on the hard rejection weights used in reweighted least trimmed squares (Čížek [13]) to determine the trimming constant and to reduce the influence of leverage points. In our proposed methods, the cutoff point of Gervini and Yohai [14] is employed for QReg. Moreover, reweighted least trimmed squares, GM6 weights, and the robust Mahalanobis distance are developed for quantile regression.

To investigate the performance of our proposed methods, a simulation study and real data were considered. The results indicate that LTQReg performs poorly when the data contain leverage points because it trims observations with high residuals, whereas leverage points do not always produce high residuals. Although RWLTQReg performs well, as evidenced by its small RMSE, MAE, and SE values, it is not able to remove the leverage points. The same holds for GM6-LTQReg: even though it can determine the trimming parameter, it is also affected by leverage points. From the results, it is clear that RMD-LTQReg is the best estimation method, as it avoids the effect of leverage points.

Data Availability

The “Star Cluster CYG OB1” dataset is considered to verify the performance of our proposed method. This dataset is obtained from the data forming the basis for the main sequence in a Hertzsprung–Russell diagram of the Star Cluster CYG OB1, and it has been used by many researchers such as Rousseeuw and Leroy [24], Adrover et al. [9], and Neykov et al. [10]. It is available in the package “robustbase” in R.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.