Abstract

In the statistical analysis of data, a model might be awfully fitted with the presence of outliers. Besides, it has been well established to use residuals for identification of outliers. The asymptotic properties of residuals can be utilized to contribute diagnostic tools. However, it is now evident that most of the existing diagnostic methods have failed in identifying multiple outliers. Therefore, this paper proposed a diagnostic method for the identification of multiple outliers in GLM, where traditionally used outlier detection methods are effortless as they undergo masking or swamping dilemma. Hence, an investigation was carried out to determine the capability of the proposed GSCPR method. The findings obtained from the numerical examples indicated that the performance of the proposed method was satisfactory for the identification of multiple outliers. Meanwhile, in the simulation study, two scenarios were considered to assess the validity of the proposed method. The proposed method consistently displayed higher percentage of correct detection, as well as lower rates of swamping and masking, regardless of the sample size and the contamination levels.

1. Introduction

Generalized linear model (GLM) is a continuation of the familiar linear regression model for modeling a nonnormal response variable [1]. In the statistical analysis of data, the model might be awfully fitted with the presence of outliers. In fact, any individual observation that appears to depart in some way from the remainder of that set of data is called an outlier [2]. Hence, the identification of outliers is a necessary step to obtain appropriate results in GLM [3]. Moreover, it has been well established to make use of residuals for the identification of outliers [4]. However, there is evidence that the maximum likelihood estimates for GLM most probably get distorted when the sample size, , of the data set is small [5]. The distribution of residuals, sometimes, differs from the distribution of the true residuals to the order , where is the sample size. Furthermore, the asymptotic properties of residuals for the selected regression model are available in the literature. For instance, Cordeiro and McCullagh [5] derived the formulae for first-order biases of maximum likelihood estimates of linear parameters, linear predictors, dispersion parameter, and fitted values in GLM. Cordeiro [6] defined the adjusted Pearson residuals where the mean and the variance are approximately zero and one. Later, Cordeiro and Simas [7] obtained an explicit formula for the density of the Pearson residuals to order , which hold for all continuous GLM and defined corrected residuals for these models. Recently, Anholeto et al. [8] acquired the matrix formula of order for the first two moments of Pearson residuals and adjusted Pearson residuals in beta regression models. These asymptotic properties of residuals can be utile to contribute as diagnostic tools [7].

However, it is now evident that a majority of the existing diagnostic methods are inadequate in identifying multiple outliers. A single case diagnostic method cannot correctly identify outliers if multiple outliers exist in a data set. The reasons are that a group of outliers is able to distort the fitting of a model as the outliers can have artificially tiny residuals that appear as inliers [9]. This type of troublesomeness is cognized as the masking effect. The opposite effect of masking is swamping, where the inliers may resemble the outliers. Thus, the outlier detection methods in GLM sustain from masking or swamping effects.

In addition, several published papers have considered the detection of multiple outliers. For example, Lee and Fung [10] proposed an adding-back procedure with an initial high breakdown point for the detection of multiple outliers in GLM and nonlinear regressions. On the other hand, Imon and Hadi [9] proposed a generalized version of standardized Pearson residuals based on group deletion method (GSPR) to overcome the difficulty of multiple outliers detection in logistic regression.

These have motivated the researchers to modify the corrected Pearson residuals [7] to adapt to the problems related to multiple outliers. Hence, this paper presents the development of a method for the identification of multiple outliers by employing corrected Pearson residuals based on group deletions. This method is called Generalized Standardized Corrected Pearson Residuals (GSCPR). Furthermore, a few frequently used diagnostics related to residuals have been briefly discussed for the identification of outliers in Section 2. In Section 3, the proposed GSCPR method is described for the identification of multiple outliers. Next, the usefulness of this proposed method is examined in Section 4 via a real data set, and finally, Section 5 reports the Monte Carlo simulations.

2. Residuals in GLM

As mentioned earlier, it is often very important to be aware of the existence of outliers in a data set. The outliers have been found to tremendously influence the covariate pattern, and hence, their existence may mislead the interpretation of the statistical analysis. The deviance residuals and the Pearson residuals are two common types of residuals in GLM. In this paper, the focus was only on the Pearson residuals. Hilbe and Robinson [11] pointed out that, by normalizing Pearson or deviance residuals to a standard deviation of 1.0, the entire deviance residual tends to normalize the residual better than the Pearson residual. Although deviance residual is the favorable statistics in GLM at the time the study was conducted, more tests developed based on the Pearson residuals are essential. The Pearson residuals are defined as , where is the th fitted values and is variance function. Meanwhile, the standardized Pearson residuals are defined as , where is the th fitted values, is the variance function, and hat-values for GLM is the th diagonal element of the matrix , where is a diagonal matrix with diagonal elements . is the vector of observations corresponding to the th case. Furthermore, an observation is stated as a residual outlier when its corresponding or exceeds a quantity in absolute term. A popular and well-reasoned choice for could be 3 as it matches the three-sigma distance rule practice in the normal theory [12]. Another suitable constant between 3 and 5 is considered if the cutoff value 3 identifies too many observations as outliers [13].

In general, the distribution of the Pearson residuals deviates from its true distribution by terms of order . In addition, the mean and the variance of its true distribution are zero and one, respectively. Cordeiro [6] claimed that the adjusted Pearson residuals have closer distribution to standard normal distribution compared to the corresponding moments of the unmodified residuals. The adjusted Pearson residual corrects the residuals to equal mean and variance, but its distribution is not equivalent to the distribution of the true Pearson residuals to order . Hence, the corrected Pearson residuals were proposed to remedy this problem. According to Cordeiro and Simas [7], it is essential to discern the - and the -asymptotic in GLM. The -asymptotic was applied in this method; that is, the dispersion parameter, , is fixed, whereas the size of the number of observations, , becomes large. This asymptotic theory is widely concerned about estimation and hypothesis testing with regard to the unknown parameter . Nonetheless, these methods are restricted to data set with continuous distributions, such as normal, gamma, and inverse Gaussian distribution. The corrected Pearson residuals for continuous GLMs are defined as , where is the Pearson residuals and is a function of order created to produce residual with the same distribution of to order . The correction function is equivalent to

Furthermore, the term in the equation is equivalent to . This expression can be easily implemented to any continuous model as only , , and need to be calculated. The equations for the correction for some important GLM are shown in the work conducted by Cordeiro and Simas [7]. Besides that, the values such as , for several useful link functions, , , , and for various distributions, are shown by Cordeiro and Simas [7]. Meanwhile, for normal models, There is also . As for gamma models,   =  . There are also   − .

Then,Furthermore, for inverse Gaussian models, and .

Next, and .

Then,Moreover, the QQ plot was employed for the quantiles of the corrected Pearson residuals against the quantiles of the estimated shifted gamma distribution to detect outliers. The corrected Pearson residuals were found to be superior than the uncorrected Pearson residuals in discerning the discrepancies among the fitted model and the data.

3. Identification of Multiple Outliers by Using GSCPR

The diagnostics of multiple outliers are crucial because it is difficult to ensure that the data set possesses only single outlier in real life problem. The explanation from above is comprehensive and emphasized that the diagnostic methods for the identification of multiple outliers are influenced by masking or swamping effect. For identification of multiple outliers, a group-deleted version of the residuals is used to develop effective diagnostics. By assuming that observations in a set of observations are excluded, hence, only (-) cases of observations are considered for model fittings. Denoting a set of “remaining” observations in the study by and a set of “deleted” observations by , it is assumed that the “deleted” observations are the last of rows of , , and so that , , and . Let be the corresponding vector of the estimated coefficients when a group of observations indexed by is excluded. The corrected Pearson residuals for GLMs are defined as ; thus, the th deletion of the corrected Pearson residuals is , where , is the th deletion for the fitted values and is a function of order . The respective deletion variances, (for gamma distribution), and deletion leverages, , for the overall data set are defined asOn the other hand, concerning logistic regression, Imon and Hadi [9] defined the th GSPR asThe GSPR were obtained by using the principles of the scaled types residuals and linear regression like approximation modus. According to Imon and Hadi [9], the implantation of the scaled types residuals principle enabled the residuals for the set and set to be measured on a similar scale.

In order to obtain the standardized corrected Pearson residuals, slight modifications were made to the standardized Pearson residuals; that is, , where is the Pearson residuals. Thus, the standardized corrected Pearson residuals are also defined as   +  or equivalent to .

By using the principles of the scaled types residuals, the th GSCPR for GLM is defined asTherefore, the GSCPR method is summarized in the following.

Step 1. For each point, calculate the corrected Pearson residuals, .

Step 2. An th point with exceeding the cutoff point of value ±3 is suspected as outlier. These points are taken into consideration to be assigned to the deleted set . The rest of the points are assigned to set.

Step 3. Calculate the values based on the decided and sets in Step 2.

Step 4. Any deleted points in accordance with exceeding value 3 in absolute term are finalized and declared as the outlier.

The choice of deleted observations played a pivotal role in the method as the exclusion of this group decided the residuals resulting from the set and the set. Besides, all suspected outliers were included in the initial deletion set, since the entire GSCPR set would have been faulty if any outlier was left in the set. Meanwhile, in logistic regression, Imon and Hadi [9] suggested a diagnostic-robust approach, which employed graphical procedures or robust procedures to identify suspect outliers at the early stage. The next stage employed the diagnostic tools to the resulting residuals so as to assign the inliers (if any), which were incorrectly identified as outliers at the early stage, back into the estimation subset. In this study, corrected Pearson residuals were applied to detect the suspected outliers in the early stage, and then, the suspected points were declared as outliers if every member subset in set fulfilled the rule given in Step 4. Otherwise, the observations were placed back to set. Moreover, the deletion of set was continuously inspected by recalculating the GSCPR until every member in the final deletion set individually fulfilled the rule outlined in Step 4. The member subset in this final set was, eventually, declared as outliers.

4. Example Using Real Data Set

A real data set was used to illustrate the capability of the newly proposed tool for the identification of multiple outliers. The drill experiment comprised a 24 unreplicated factorial in order to investigate the response variable advance rate. This example was reanalyzed by using a generalized linear model developed by Lewis and Montgomery [14]. As for the GLM analysis, the gamma distribution was selected with a log link function. Table 1 displays the design matrix and the response data. In order to observe the effect of outliers, the original response data had been modified. For the single outlier case, the 16th observation was replaced with a value of 36.30 instead of 16.30. Meanwhile, for the case of two outliers, the 13th and 16th observations were replaced with values 27.77 and 36.30, respectively. The results of the Pearson residuals, corrected Pearson residuals, and GSCPR measures are shown in Table 2.

Table 2 and index plots of Pearson residuals, as given in Figures 1(a) and 1(b), show that the Pearson residuals method failed in identifying the outliers. For data set with an outlier, as observed in Figures 1(c) and 1(e), the values for both corrected Pearson residuals and GSCPR for observation 13 had been remarkably high and, therefore, could be revealed as outlier. Meanwhile, for the data set with two outliers, as depicted in Figure 1(d), the corrected Pearson residuals method correctly identified observations 13 and 16 as outliers but swapped a good observation (observation 15) as an outlier. Nevertheless, Figure 1(f) shows that the proposed GSCPR correctly identified observations 13 and 16 as outliers in the data set.

5. Simulation Results

In this section, simulation studies based on factorial design data and explanatory variable data were conducted to verify the conclusion of the numerical data set that the proposed method had been indeed capable of correctly identifying multiple outliers. The measures of Pearson residuals, corrected Pearson residuals, and GSCPR measures were reported. Besides, only the model that adhered to Gamma distribution and log link function had been considered. Furthermore, two different scenarios were considered in this study. In the first scenario (Scenario 1), a factorial design where with replication of 8, 16, 32, and 64 runs had been applied. For each experimental design condition, a model was set up with known parameters and a known true mean, as suggested by Lewis et al. [15]. On top of that, the true linear predictor, , was created as , where the true linear predictor was taken as , for 22 factorial design [7]. The true mean was defined as , where is the link function and . The actual observation was acquired by attaching an error drawn at arbitrary from a specified distribution to the linear predictor; that is, , where , while residuals were induced from a Gamma distribution, with a shape equivalent to 0.1 and a scale that equaled one. Furthermore, in order to generate 8 run designs’ matrix, two replicates of the 22 factorial design were used, four replicates gave 16 run designs matrix, and so on. The outlying observations were generated from the model , where and represents the distance of the outliers that were placed away from good observations. In this study, was taken as 10. The residual outliers were created in the first observation, where is the total number of the outlying observations. The 5,000 replications of the simulation study are summarized in Table 3. This table presents the percentage of correct detection of outliers, masking rates, and swamping rates for 22 factorial design. In the second scenario (Scenario 2), a simulation study was designed for a set of explanatory variables data, that is, two explanatory variables. The true linear predictor was taken as , , by adhering to that suggested by Cordeiro and Simas [7]. The true mean and the actual observation were obtained as described in Scenario 1. The explanatory variables were generated as Uniform (0,1). In addition, the sample sizes were considered as equivalent to 20, 40, 60, 100, and 200 with different contamination levels , , , and of outliers with varying weights. The initial 100α% observations were further constructed as outliers in the data set. Thus, in order to induce the outlying values of the varying weights, the first outlier point was unchanged and remained at value 10, whereas those successive values increased as much as value two. The results of the 5,000 replications are summarized in Table 4.

Additionally, for Scenario 1, Table 3 clearly shows that the Pearson residuals method was excellent in detecting single outlier only when the size of the experimental run had been large, but its performance became extremely poor for multiple outliers detection. However, the proposed GSCPR method consistently displayed higher rate of detection of outliers with almost negligible masking rates and swamping rates regardless of the size of the experimental run and the existing number of outliers. The performance of the corrected Pearson residuals method was also encouraging, but it did not outperform the proposed method. Moreover, the rate of correct detection for the corrected Pearson residuals method had been found to become lower as the amount of outliers increased.

Other than that, Table 4 portrays the eminence of the proposed method for Scenario 2. The Pearson residuals method had been discovered to be excellent only when the contamination level was low, but it exhibited an extremely poor performance when the contamination level was more than 5% in the cases. In general, it generated less percentage in correct detection, high masking, and low swamping effects. However, the GSCPR method that outperformed the other methods revealed higher rate of detection of outliers and almost negligible masking rates regardless of the sample size, , and contamination levels, . This method also had a tendency of swamping the inliers as outliers when the contamination level had been low, such as 5% and 10% cases. Moreover, the performance of the corrected Pearson residuals method was better than that of the Pearson residuals method since the rate of detection of outlier was higher when the contamination level was low. Although the swamping effects for the corrected Pearson residuals method were low, its masking effects increased as the contamination level increased. Hence, the outcomes of the study indicated that the GSCPR method had the best performance, followed by the corrected Pearson residuals method and the Pearson residuals method.

6. Conclusion

This paper proposed a diagnostic method for the identification of multiple outliers in GLM, where traditionally used outlier detection methods are effortless as they undergo masking or swamping dilemma. Thus, an investigation had been conducted to determine the capability of the proposed GSCPR method. The findings obtained from the numerical examples indicated that the performance of the proposed method was satisfactory in identifying multiple outliers. Other than that, in the simulation study, two scenarios were considered to assess the validity of the GSCPR method. The results retrieved from the simulation study exhibited the superiority of the proposed method under wide assortment of conditions. The proposed method consistently displayed higher percentage of correct detection, as well as lower rates of swamping and masking, regardless of the sample size, , and the contamination level, .

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported by the Research University Grant of the University Putra Malaysia, Malaysia.