Abstract

Identification of high leverage points is crucial because they are responsible for inaccurate predictions and invalid inferential statements, as they have a larger impact on the computed values of various estimates. It is essential to classify high leverage points into good and bad leverage points because only the bad leverage points have an undue effect on the parameter estimates. It is now evident that when a group of high leverage points is present in a data set, the existing robust diagnostic plot fails to classify them correctly. This problem is due to the masking and swamping effects. In this paper, we propose a new robust diagnostic plot that correctly classifies the good and bad leverage points by reducing both masking and swamping effects. The formulation of the proposed plot is based on the Modified Generalized Studentized Residuals. We investigate the performance of our proposed method by employing a Monte Carlo simulation study and some well-known data sets. The results indicate that the proposed method improves the rate of detection of bad leverage points and reduces the swamping and masking effects.

1. Introduction

In regression problems, several versions of outliers exist, such as residual outliers, vertical outliers, and high leverage points. Any observation that has a large residual is referred to as a residual outlier. Vertical outliers (VO) or y-outliers are those observations which are extreme or outlying in the y-coordinate. On the other hand, high leverage points (HLPs) are those observations which are extreme or outlying in the x-coordinate. HLPs can be classified into good leverage points (GLPs) and bad leverage points (BLPs). GLPs are those outlying observations in the explanatory variables that follow the pattern of the majority of the data, while BLPs are the opposite. BLPs have a larger impact on the computed values of various estimates, whereas GLPs contribute to the efficiency of an estimate (see [14]). As such, only BLPs should be downweighted; GLPs should not be given low weights in the computation of the weighting function of any robust method. Nonetheless, it is now evident that most robust methods attempt to reduce the effect of outliers by downweighting them, irrespective of whether they are good or bad leverage points.

There are a number of good papers in the literature on the detection of HLPs (see [5–8]). However, those detection methods focus mainly on the identification of HLPs without taking into consideration their classification into good and bad leverage points. This classification is very important, as only the BLPs are responsible for misleading conclusions about the fitted regression model.

It is not easy to capture the existence of several versions of outliers in multiple regression analysis by using a graphical method ([6]). If only one independent variable is being considered, the four types of data points can easily be observed from a scatter plot of y against the x variable. However, for more than one predictor variable, it is difficult to detect these outliers from a scatter plot. Not much work has been focused on classifying HLPs into good and bad leverage points. Rousseeuw and Van Zomeren [2] proposed a robust diagnostic plot or outlier map which is more efficient than the nonrobust plot for classifying observations into four types of data points, namely, regular or good observations, vertical outliers, GLPs, and BLPs. We suspect that this plot fails to detect multiple outliers and high leverage points. Pison and Van Aelst [9] suggested a new plot that uses robust distances, obtained from robust location and scatter estimators, to identify outliers and empirical influences and to assess the influence of observations. Similar to the Rousseeuw and Van Zomeren plot, they constructed a graphical tool for multivariate models. Although the Pison and Van Aelst plot can be very useful in the model building stage to evaluate the quality of a fit based on the number of detected outliers, it does not focus on classifying unusual data into VO, GLP, and BLP (see [9]). Hubert et al. [10] introduced a new diagnostic plot based on robust principal component analysis (PCA), denoted by ROBPCA (see [10]). The ROBPCA can distinguish between 4 types of observations for high dimensional data. The regular observations are close to the PCA subspace. GLPs lie close to the PCA subspace but far from the regular observations. We can also have VOs, which have a large orthogonal distance to the PCA subspace. The BLPs have a large orthogonal distance and a projection on the PCA subspace that is remote from the typical projections. To draw the diagnostic plot or outlier map, the robust score distance of each observation is plotted on the horizontal axis and the orthogonal distance of each observation on the vertical axis. They suggested that the cut-off point for the horizontal axis is $\sqrt{\chi^2_{k,0.975}}$ and that the approximate cut-off value for the vertical axis equals $(\hat{\mu} + \hat{\sigma}z_{0.975})^{3/2}$, where $\hat{\mu}$ and $\hat{\sigma}$ are the estimated mean and standard deviation, respectively. The ROBPCA method is efficient for high dimensional data but not for low dimensional data.

As such, we propose a suitable plot in this regard. The Modified Generalized Studentized Residuals for the identification of multiple vertical outliers and multiple high leverage points are discussed in Section 2. In Section 3, we propose a new robust diagnostic plot for classifying observations into the four categories. The proposed plot is based on the Modified Generalized Studentized Residuals (MGti). A simulation study and three numerical examples are presented in Sections 4 and 5, respectively. Finally, some concluding remarks are given in Section 6.

2. The Modified Generalized Studentized Residuals for the Identification of Multiple Outliers and Multiple High Leverage Points

Consider a multiple linear regression model as follows:
$$Y = X\beta + \varepsilon, \tag{1}$$
where $Y$ is an $n \times 1$ vector of observations of the dependent variable, $X$ is an $n \times (p+1)$ matrix of independent variables, $\beta$ is a $(p+1) \times 1$ vector of unknown regression parameters, $\varepsilon$ is an $n \times 1$ vector of random errors with identical normal distribution $N(0, \sigma^2)$, and $p$ is the number of independent variables. The linear regression model in (1) can be rewritten as follows:
$$y_i = x_i^T\beta + \varepsilon_i, \quad i = 1, 2, \ldots, n. \tag{2}$$
The ordinary least squares (OLS) estimates for the linear regression in (1) are given by
$$\hat{\beta} = (X^TX)^{-1}X^TY, \tag{3}$$
and the residuals can be expressed in terms of the true disturbances as follows:
$$\hat{\varepsilon} = (I - H)\varepsilon, \tag{4}$$
where $H = X(X^TX)^{-1}X^T$ is a projection matrix or hat matrix denoted by "$H$." The diagonal elements of the "$H$" matrix are called the hat values, denoted by $h_{ii}$ and given by
$$h_{ii} = x_i^T(X^TX)^{-1}x_i, \quad i = 1, 2, \ldots, n. \tag{5}$$
The $h_{ii}$ values are often used as a classical diagnostic method to identify high leverage points. However, the $h_{ii}$ mostly fail to detect HLPs because they suffer from masking and swamping effects ([6, 7, 12, 13]). The masking problem happens when we fail to detect some outliers that are hidden by other outliers (an outlier is wrongly identified as an inlier), whereas swamping happens when we declare some inliers to be outliers ([14]). According to Barnett and Lewis (1994), masking is "the tendency for the presence of extreme observations not declared as outliers to mask the discordancy of more extreme observations under investigation as outliers" ([15]). On the other hand, the errors corresponding to the outliers may be very large, and this may lead to the swamping problem, in which clean observations are declared as outliers by mistake. Hadi [13] suggested a single-case-deleted measure called the potentials matrix. The diagonal elements of the potentials matrix, denoted by "$p_{ii}$," are given by (see [1, 13, 16])
$$p_{ii} = x_i^T\bigl(X_{(i)}^TX_{(i)}\bigr)^{-1}x_i, \tag{6}$$
where $X_{(i)}$ is the $X$ matrix excluding the $i$th row. We can rewrite $p_{ii}$ as a function of $h_{ii}$ as
$$p_{ii} = \frac{h_{ii}}{1 - h_{ii}}. \tag{7}$$
Although the potentials matrix is efficient in identifying a single outlier, it is not successful in identifying multiple HLPs (see [3, 17]). To remedy this problem, Imon [17] proposed the Generalized Potentials, denoted by "GP," as a diagnostic method for multiple HLPs. The GP diagnostic method is able to detect multiple HLPs, but it is not adequately effective in identifying the exact number of HLPs. This is due to the choice of the initial basic subset, which is prone to masking effects. Habshah et al. [3] developed the Diagnostic Robust Generalized Potential (DRGP) to improve the rate of detection of HLPs. They divided the data into two sets: a remaining set which contains the clean data and a deletion set which contains all suspect data. Let us denote the set of good cases (remaining in the analysis) by "$R$" and the set of bad cases (deleted from the analysis) by "$D$." The DRGP consists of two steps: in the first step, a robust method is used to identify the suspected HLPs, and in the second step, the generalized potential diagnostic approach is used to confirm our suspicion. Habshah et al. [3] pointed out that the low leverage points (if any) are put back into the estimation of the remaining subset sequentially (the observation with the smallest $p_{ii}$ is substituted first), and the $p_{ii}$ values are then recomputed. This process is continued until all members of the deletion set have been checked to determine whether or not they can be declared as HLPs.
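To make (5) and (7) concrete, the following is a minimal sketch in Python using only numpy; the function names hat_values and potentials are ours, and an intercept column is assumed to be included in the design matrix.

```python
import numpy as np

def hat_values(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X', as in (5).

    X is the full design matrix, intercept column included.
    """
    H = X @ np.linalg.solve(X.T @ X, X.T)
    return np.diag(H)

def potentials(X):
    """Hadi's potentials p_ii = h_ii / (1 - h_ii), the identity in (7)."""
    h = hat_values(X)
    return h / (1.0 - h)

# Example: leverages for a small design with one extreme x-value.
X = np.column_stack([np.ones(6), [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]])
print(hat_values(X))   # the last case has by far the largest h_ii
print(potentials(X))
```

Note how the potential stretches large leverages: as $h_{ii} \to 1$, $p_{ii} \to \infty$, which makes extreme cases easier to separate from the bulk.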

The suspected HLPs are determined by the robust Mahalanobis distance (RMD), based on the minimum volume ellipsoid (MVE) developed by Rousseeuw and Leroy [6], as
$$\mathrm{RMD}_i = \sqrt{\bigl(x_i - T(X)\bigr)^T C(X)^{-1}\bigl(x_i - T(X)\bigr)}, \quad i = 1, 2, \ldots, n, \tag{8}$$
where $T(X)$ and $C(X)$ are the robust location and shape estimates of the MVE, respectively. Habshah et al. [3] suggested using the following cut-off value for the robust Mahalanobis distance:
$$\mathrm{Median}(\mathrm{RMD}_i) + 3\,\mathrm{MAD}(\mathrm{RMD}_i), \tag{9}$$
where MAD is the median absolute deviation.
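A sketch of (8) and (9) follows. Since the MVE estimator is not available in scikit-learn, the code substitutes the MCD (MinCovDet) as a comparable high-breakdown location/shape estimator; this substitution, and the usual normal-consistency scaling of the MAD, are our assumptions, not part of the paper.

```python
import numpy as np
from sklearn.covariance import MinCovDet  # MCD stand-in for MVE

def robust_mahalanobis(X, random_state=0):
    """Robust Mahalanobis distances as in (8), with MCD in place of MVE."""
    fit = MinCovDet(random_state=random_state).fit(X)
    return np.sqrt(fit.mahalanobis(X))  # mahalanobis() returns squared distances

def rmd_cutoff(rmd):
    """Cut-off (9): Median(RMD) + 3 * MAD(RMD).

    MAD is scaled by 1/0.6745 for consistency at the normal model
    (a common convention; the paper does not state the scaling).
    """
    med = np.median(rmd)
    mad = np.median(np.abs(rmd - med)) / 0.6745
    return med + 3.0 * mad

# Usage: flag suspected HLPs from the explanatory variables only.
# X = ...  # (n, p) matrix of predictors, no intercept column
# rmd = robust_mahalanobis(X)
# suspected = np.where(rmd > rmd_cutoff(rmd))[0]
```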

Suppose that "$R$" contains $(n - d)$ cases after the $d$ cases in "$D$" have been deleted. Once the remaining set has been determined, the second step of the DRGP is carried out to confirm the suspected HLPs by using the GP, denoted by $p_{ii}^*$ and defined as
$$p_{ii}^* = \begin{cases} \dfrac{h_{ii}^{(-D)}}{1 - h_{ii}^{(-D)}}, & i \in R, \\[2mm] h_{ii}^{(-D)}, & i \in D, \end{cases} \tag{10}$$
where
$$h_{ii}^{(-D)} = x_i^T\bigl(X_R^TX_R\bigr)^{-1}x_i, \quad i = 1, 2, \ldots, n. \tag{11}$$
Observations whose $p_{ii}^*$ values are larger than the following threshold,
$$\mathrm{Median}\bigl(p_{ii}^*\bigr) + c\,\mathrm{MAD}\bigl(p_{ii}^*\bigr), \tag{12}$$
where $c$ can be taken as a constant value of 2 or 3, are declared as HLPs. The studentized residuals (internally studentized residuals) and R-studentized residuals (externally studentized residuals) are widely used measures for the identification of outliers (see Cook and Weisberg [18]). The studentized residuals are defined as follows:
$$t_i = \frac{\hat{\varepsilon}_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}, \tag{13}$$
where $\hat{\sigma}$ is the standard deviation estimator of the residuals. The special case of (13) is called the R-studentized residual (denoted by $t_i^*$), given by (Chatterjee and Hadi [12])
$$t_i^* = \frac{\hat{\varepsilon}_i}{\hat{\sigma}_{(i)}\sqrt{1 - h_{ii}}}, \tag{14}$$
where $\hat{\sigma}_{(i)}$ is the standard deviation estimator of the residuals excluding the $i$th case. These two measures ($t_i$ and $t_i^*$) are also not very successful in detecting multiple outliers, owing to masking and swamping effects ([3, 17]). Imon [17] suggested a Generalized Studentized Residual (denoted by GSR) based on a group of deletions to identify multiple outliers. The generalized version of regression diagnostics first requires the selection of a deletion group "$D$" that contains all suspected influential cases, with the remaining cases in "$R$." The suspected influential cases comprise outliers and HLPs considered separately, whereby outliers and HLPs are identified using the robust Reweighted Least Squares (RLS) residuals (Rousseeuw and Leroy [6]) and the GP (Imon [8]), respectively. The union of the set of suspected outliers and the set of suspected HLPs forms the members of the deletion set, which has $d$ observations. However, this initial basic subset is not very stable, since it is based on the GP, which suffers from masking and swamping effects. In this regard, the DRGP is employed to remedy this problem. The Modified Generalized Studentized Residuals (MGti) are formulated based on the RLS and the DRGP as initial estimators. Once the "$D$" set is identified, the vector of the estimated parameters in the remaining group, denoted by $\hat{\beta}_R$, is defined as
$$\hat{\beta}_R = \bigl(X_R^TX_R\bigr)^{-1}X_R^TY_R. \tag{15}$$
The $i$th deletion residual is given by
$$\hat{\varepsilon}_i^{(-D)} = y_i - x_i^T\hat{\beta}_R. \tag{16}$$
The $i$th externally studentized residual for the remaining group "$R$" is given by
$$t_i^{(-D)} = \frac{\hat{\varepsilon}_i^{(-D)}}{\hat{\sigma}_{R(i)}\sqrt{1 - h_{ii}^{(-D)}}}, \quad i \in R. \tag{17}$$
Thus, the diagonal elements of the hat matrix are given by
$$h_{ii}^{(-D)} = x_i^T\bigl(X_R^TX_R\bigr)^{-1}x_i. \tag{18}$$
By utilizing the results of Rao [19], adding a point $i \in D$ to the "$R$" set gives
$$X_{R+i}^TX_{R+i} = X_R^TX_R + x_ix_i^T, \tag{19}$$
and the corresponding estimate is given by
$$\hat{\beta}_{R+i} = \hat{\beta}_R + \frac{\bigl(X_R^TX_R\bigr)^{-1}x_i\,\hat{\varepsilon}_i^{(-D)}}{1 + h_{ii}^{(-D)}}. \tag{20}$$
Hence, the formulation of the externally studentized residual for $i \in D$ is defined as
$$t_i^{(-D)} = \frac{\hat{\varepsilon}_i^{(-D)}}{\hat{\sigma}_R\sqrt{1 + h_{ii}^{(-D)}}}, \quad i \in D. \tag{21}$$
Subsequently, the Modified Generalized Studentized Residuals (MGti) for the whole data set are formulated by combining (17) and (21) as follows:
$$\mathrm{MGt}_i = \begin{cases} \dfrac{\hat{\varepsilon}_i^{(-D)}}{\hat{\sigma}_{R(i)}\sqrt{1 - h_{ii}^{(-D)}}}, & i \in R, \\[2mm] \dfrac{\hat{\varepsilon}_i^{(-D)}}{\hat{\sigma}_R\sqrt{1 + h_{ii}^{(-D)}}}, & i \in D, \end{cases} \tag{22}$$
where $\hat{\sigma}_R$ is the standard deviation of group $R$ and $\hat{\sigma}_{R(i)}$ is the standard deviation of group $R$ excluding the $i$th case.
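The pieces above assemble into the following sketch of (15)–(22), assuming the deletion set D has already been produced by the RLS/DRGP step; the leave-one-out variance update used for $\hat{\sigma}_{R(i)}$ is the standard OLS identity, which we assume matches the authors' intent.

```python
import numpy as np

def mgt(X, y, D):
    """Modified Generalized Studentized Residuals MGt_i, as in (22).

    X : (n, k) design matrix, intercept column included.
    y : (n,) response vector.
    D : indices of the suspected cases (from RLS residuals and DRGP).
    """
    n, k = X.shape
    D = np.asarray(D, dtype=int)
    in_D = np.zeros(n, dtype=bool)
    in_D[D] = True
    XR, yR = X[~in_D], y[~in_D]

    G = np.linalg.inv(XR.T @ XR)
    beta_R = G @ XR.T @ yR                    # (15)
    e = y - X @ beta_R                        # deletion residuals (16)
    h = np.einsum('ij,jk,ik->i', X, G, X)     # h_ii^{(-D)} from (18)

    df = (n - len(D)) - k                     # residual df of the R-fit
    s2_R = (e[~in_D] ** 2).sum() / df         # sigma_R^2
    t = np.empty(n)
    for i in range(n):
        if in_D[i]:
            t[i] = e[i] / np.sqrt(s2_R * (1.0 + h[i]))          # (21)
        else:
            # sigma_{R(i)}^2 via the usual leave-one-out identity
            s2_i = (df * s2_R - e[i] ** 2 / (1.0 - h[i])) / (df - 1)
            t[i] = e[i] / np.sqrt(s2_i * (1.0 - h[i]))          # (17)
    return t
```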

3. New Diagnostic Plots for Classifying Observations into Four Categories

Rousseeuw and Van Zomeren [2] proposed a robust diagnostic plot which is more effective than the nonrobust plot for classifying observations into regular observations, vertical outliers, GLPs, and BLPs. The Rousseeuw and Van Zomeren plot draws the standardized least median of squares (LMS) residuals against the robust Mahalanobis distance (RMD) based on the minimum volume ellipsoid (MVE); this plot is denoted by LMS-RMD. The nonrobust plot draws the Studentized OLS residuals against the Mahalanobis distance (MD), and we call this plot the OLS-MD plot. We suspect that the robust LMS-RMD diagnostic plot is not very effective in classifying observations into their respective categories, since it is based on the robust Mahalanobis distance, which suffers from swamping effects. Moreover, this plot uses studentized residuals, which are not very successful in identifying multiple outliers. Habshah et al. [3] showed that the DRGP is very successful in detecting multiple HLPs. In addition, we anticipate that the newly proposed MGti is able to detect multiple outliers. As such, we propose improving the classification method of Rousseeuw and Van Zomeren [2] by plotting MGti versus DRGP, as shown in Table 1. Our proposed diagnostic plot is called the MGt-DRGP plot. The basic rules for classifying an observation using the new proposed method are as follows (see Habshah and Mohammed [20]); a sketch of these rules in code is given after the list:
(i) regular observation (RO): an observation is declared an RO if |MGti| does not exceed its cut-off value and $p_{ii}^*$ does not exceed the threshold in (12);
(ii) vertical outlier (VO): an observation is declared a VO if |MGti| exceeds its cut-off value and $p_{ii}^*$ does not exceed the threshold in (12);
(iii) GLP: an observation is declared a GLP if |MGti| does not exceed its cut-off value and $p_{ii}^*$ exceeds the threshold in (12);
(iv) BLP: an observation is declared a BLP if |MGti| exceeds its cut-off value and $p_{ii}^*$ exceeds the threshold in (12).
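As promised above, here is a minimal sketch of the four classification rules. The cut-off values t_cut (for |MGti|) and p_cut (the threshold in (12)) are taken as inputs, since the rules are stated in terms of those cut-offs rather than fixed constants.

```python
import numpy as np

def classify(mgt_vals, pstar, t_cut, p_cut):
    """Label each case RO, VO, GLP, or BLP from MGt_i and p*_ii."""
    mgt_vals = np.asarray(mgt_vals, dtype=float)
    pstar = np.asarray(pstar, dtype=float)
    big_resid = np.abs(mgt_vals) > t_cut   # outlying in residual
    big_lev = pstar > p_cut                # outlying in x-space (HLP)

    labels = np.empty(len(mgt_vals), dtype=object)
    labels[~big_resid & ~big_lev] = "RO"   # regular observation
    labels[ big_resid & ~big_lev] = "VO"   # vertical outlier
    labels[~big_resid &  big_lev] = "GLP"  # good leverage point
    labels[ big_resid &  big_lev] = "BLP"  # bad leverage point
    return labels
```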

4. Monte Carlo Simulation Study

In this section, a Monte Carlo simulation study is designed to evaluate the performance of our newly proposed MGt-DRGP plot in classifying observations into regular observations, vertical outliers, and good and bad HLPs. The MGt-DRGP plot is compared with two existing plots, namely, the OLS-MD and LMS-RMD plots. The performances of these plots are evaluated based on the rate of correct detection of BLPs and the rates of masking and swamping. A good plot is one that has a higher percentage of correct detection of BLPs and smaller rates of masking and swamping. The experiments consider two and three explanatory variables. In each experiment, four sample sizes, the largest being n = 200, and different percentages of BLPs (α up to 0.20) are considered. The regular observations are generated according to a normal distribution with mean 0 and variance 1. In order to generate HLPs in a data set, the first 100α% of the regular observations in the x and y variables are replaced with the corresponding percentage of BLPs. To generate the BLPs, the first value is kept fixed at 10 and subsequent values are created by multiplying the value's index, i, by 2. Each experimental run involves 5000 replications. The results of the Monte Carlo simulation study are summarized in Tables 2 and 3, which present the percentage of correct detection of BLPs and the masking and swamping rates at different levels of contamination and different sample sizes.
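The following sketch generates one replicate of a contaminated sample in the spirit of this design. The regression model for the clean data (unit slopes, standard normal errors) and the exact growth rule for the BLP values are our assumptions, since the paper does not spell them out here.

```python
import numpy as np

def one_replicate(n=100, p=2, alpha=0.10, seed=None):
    """Clean N(0,1) data with the first 100*alpha% of cases turned into BLPs."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    y = X.sum(axis=1) + rng.standard_normal(n)  # assumed clean model
    n_bad = int(round(alpha * n))
    for i in range(n_bad):
        # first BLP value fixed at 10; later values grow with the index
        # (our reading of the paper's "multiply the index by 2" rule)
        val = 10.0 if i == 0 else 10.0 + 2.0 * i
        X[i, :] = val
        y[i] = -val   # y breaks the clean pattern, making the point "bad"
    return X, y, np.arange(n_bad)  # data plus the true BLP indices
```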

The results clearly indicate that the MGt-DRGP plot has a superior ability to identify the correct number of BLPs compared to the OLS-MD and LMS-RMD plots, regardless of the number of regressor variables, the contamination rate, and the sample size. At a low level of contamination, the MGt-DRGP plot identifies the BLPs perfectly, without any masking or swamping. The LMS-RMD plot has a high percentage of detection of BLPs; however, it also has a high percentage of swamping, due to the weakness of the RMD method, which tends to swamp some low leverage points. Although the OLS-MD plot has a low rate of swamping, its detection of BLPs is very poor due to masking effects.

At moderate and high levels of contamination (up to α = 0.20), the MGt-DRGP plot still outperforms the other plots. The performance of the LMS-RMD plot deteriorates, with a smaller percentage of correct detection and increasing rates of masking and swamping. The performance of the OLS-MD plot becomes very poor as the level of contamination increases. The MGt-DRGP plot consistently has the highest percentage of correct detection and the lowest rates of swamping and masking, irrespective of the number of independent variables, the level of contamination, and the sample size.

5. Example and Discussion

In this section, three examples are considered to assess the performance of our proposed diagnostic method.

5.1. Artificial Data Set

Our first example is an artificial data set given by Kamruzzaman and Imon [11]. This data set has three variables and contains 20 observations that are generated independently from a uniform distribution. We take the third variable as the response variable and the rest as the predictor variables. The boxplot in Figure 1 shows that this artificial data set does not have any outlier or high leverage point. We evaluate the performance of our proposed diagnostic method in two situations. In the first situation, we modify the data to see the effect of one BLP on the plots: the original data is modified by substituting one good observation with one BLP (case 20), where the response and predictor values ($y$, $x_1$, and $x_2$) of that case are replaced by the arbitrarily large values displayed in brackets in Table 4. In the second situation, we observe the combined effect of VO, GLP, and BLP: the original data is modified by replacing 5 good observations with 2 VOs (cases 1 and 2), 2 GLPs (cases 3 and 4), and 1 BLP (case 20). To create a VO, only the response value of the case is replaced by an arbitrarily large value, and to create a GLP, the predictor values ($x_1$ and $x_2$) of the case are replaced by arbitrarily large values. Table 4 exhibits the original and the modified artificial data.

The three plots (OLS-MD, LMS-RMD, and MGt-DRGP) were then applied to both modified artificial data sets. The values of the diagnostic methods and the corresponding plots are displayed in Table 5 and Figure 2, respectively. Since only a single BLP is present in the first modified data set, all the classical and robust diagnostic plots are able to correctly detect the BLP (case 20), as shown in Figures 2(a), 2(b), and 2(c). However, for the second modified data set, it is interesting to observe that only the MGt-DRGP plot detects and classifies the 5 outlying observations (cases 1, 2, 3, 4, and 20) into VO, GLPs, and BLPs correctly (see Figure 2(f)). Although the LMS-RMD plot is able to detect and classify those 5 outlying observations correctly, it swamps 4 observations (cases 8, 10, 15, and 18), as shown in Figure 2(e). The classical OLS-MD plot detects only 1 outlying observation (case 2) and masks 4 observations (cases 1, 3, 4, and 20), as shown in Figure 2(d).

Next, we would like to justify which plot has classified the suspect observations correctly. Since removing suspect observations leads to a drastic change in the coefficient estimates, a good classification plot is one whose removals correspond to the highest percentage changes in the various estimates. The percentage change in estimate (PCE) is computed as
$$\mathrm{PCE}_j = \left|\frac{\hat{\beta}_j - \hat{\beta}_j^{(-D)}}{\hat{\beta}_j}\right| \times 100\%, \tag{23}$$
where $\hat{\beta}_j$ is the OLS parameter estimate from the original data, $\hat{\beta}_j^{(-D)}$ is the OLS estimate for the data set excluding the suspected cases (VO and BLPs), and $|\cdot|$ denotes the absolute value. Another criterion of a good plot is that, after deleting the suspected bad influential observations, there is a significant reduction in the standard errors of the estimates. Table 6 presents the regression coefficients, the standard errors of the coefficients, the coefficient of determination, the t-test values, and the PCE values for the second modified data set in different situations: when we remove the single outlier (case 2) detected by the OLS-MD plot, and when we remove the outlying observations detected by the LMS-RMD and MGt-DRGP plots (cases 1, 2, 8, 10, 15, 18, and 20 and cases 1, 2, and 20, respectively). The results in Table 6 clearly show that removing the suspected influential observations detected by the MGt-DRGP plot results in the smallest standard errors for the coefficients and the largest PCE values.
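A short sketch of the PCE computation in (23) follows; the coefficient vectors from the full and reduced fits are assumed to be aligned component-wise, and the helper ols in the usage comment is hypothetical.

```python
import numpy as np

def pce(beta_full, beta_reduced):
    """Percentage change in estimate, |(b - b_(-D)) / b| * 100, per (23)."""
    b = np.asarray(beta_full, dtype=float)
    b_red = np.asarray(beta_reduced, dtype=float)
    return 100.0 * np.abs((b - b_red) / b)

# Usage: compare OLS fits with and without the suspected VO and BLP cases D.
# (ols is a hypothetical helper returning the OLS coefficient vector.)
# pce(ols(X, y), ols(np.delete(X, D, axis=0), np.delete(y, D)))
```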

5.2. Phosphorus Data Set

The second example is the phosphorus data set, taken from Snedecor and Cochran [21]. The concentrations of phosphorus in parts per million in each of 18 soils were measured to investigate the source from which corn plants obtain their phosphorus. This data set has two predictor variables: $x_1$ is the concentration of inorganic phosphorus in the soil and $x_2$ is the concentration of organic phosphorus in the soil. The response variable $y$ is the phosphorus content of corn grown in the soil at 20°C. Chatterjee and Hadi [1] showed, by using the Studentized OLS residuals, that this data set has a single outlier (case 17). Table 7 shows the absolute values of MD, RMD, DRGP, the Studentized OLS residuals, the Studentized LMS residuals, and MGti. The MD did not detect any HLP, whereas the RMD detected 2 HLPs (cases 1 and 6) and the DRGP detected 3 HLPs (cases 1, 6, and 10). The Studentized OLS residuals detected only one outlier (case 17), whereas the Studentized LMS residuals and MGti detected 2 outliers (cases 10 and 17).

Next, we examine the classifications made by the OLS-MD, LMS-RMD, and MGt-DRGP plots. From Figures 3, 4, and 5, we can clearly see that the OLS-MD plot identified one vertical outlier (case 17). The LMS-RMD plot identified 2 vertical outliers (cases 10 and 17) and 2 GLPs (cases 1 and 6), whereas the MGt-DRGP plot classified the observations into their respective categories: cases 1 and 6 as GLPs, one vertical outlier (case 17), and one BLP (case 10). As with the artificial data, we assess the performance of our proposed plot. Table 8 shows the standard errors of the regression coefficients in different situations: when we remove the single outlier (case 17) detected by the OLS-MD plot and when we remove the influential observations (cases 10 and 17) detected by the LMS-RMD and MGt-DRGP plots. We also compare the effect of removing the influential observations (cases 10 and 17) with that of removing the GLPs (cases 1 and 6) identified by both the LMS-RMD and MGt-DRGP plots. The results of the analysis indicate that omitting the influential observations (cases 10 and 17) has the greatest effect on the standard errors of the parameter estimates. It is interesting to observe that keeping the GLPs (cases 1 and 6) reduces the standard errors of the estimates compared to removing these two observations from the data set.

5.3. Aircraft Data Set

The last example is the aircraft data set, taken from Gray [22]. This data set contains 23 cases with four predictor variables (aspect ratio, lift-to-drag ratio, weight of the plane, and maximal thrust); the response variable is the cost. From the results in Table 9, we can see that the Studentized OLS residuals identified one outlier (case 22), whereas the standardized LMS residuals and MGti identified cases 16 and 22 and cases 16, 19, and 22 as outliers, respectively. Moreover, the MD identified one HLP (case 14), while the RMD and DRGP detected cases 14, 20, and 22 and cases 19 and 22 as HLPs, respectively.

The classification of the data into regular observations, vertical outliers, and good and bad leverage points is shown in Figures 6, 7, and 8. It can be observed from Figure 6 that the nonrobust OLS-MD plot identified one vertical outlier (case 22) and one GLP (case 14). The LMS-RMD plot in Figure 7 detected one vertical outlier (case 16), one BLP (case 22), and 2 GLPs (cases 14 and 20), while the MGt-DRGP plot in Figure 8 identified one vertical outlier (case 16), two BLPs (cases 19 and 22), and one GLP (case 21).

It can be seen from Table 10 that the standard errors of the estimates when removing observations 16, 19, and 22 (identified by the MGt-DRGP plot) are smaller than those when removing case 22 alone or cases 16 and 22, identified by the OLS-MD and LMS-RMD plots, respectively.

6. Conclusion

In this paper, we proposed a new method for the identification of BLPs by means of a diagnostic plot. Most of the time, the classical OLS-MD plot fails to correctly identify BLPs, and the robust LMS-RMD plot is also not very successful in classifying observations into the four categories. In this regard, we proposed the new MGt-DRGP plot, which is very successful in classifying observations into regular observations, vertical outliers, and good and bad leverage points. The Monte Carlo simulation study clearly shows that the MGt-DRGP plot can detect BLPs correctly with very low rates of masking and swamping. It is interesting to observe that the OLS-MD plot suffers from the masking problem, while the LMS-RMD plot suffers from the swamping problem.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.