Abstract

A bewilderingly large number of test statistics have been proposed for testing for the presence of an outlier in multiple linear regression models. Exact critical values of these test statistics are not available, and approximate ones are usually obtained from the first-order Bonferroni upper bound or from large-scale simulations. In this paper, we show that the upper bound values of two of these test statistics are algebraically the same. An application to real data for multiple linear regression is used to demonstrate the procedure.

1. Introduction

An outlier is a discordant observation: one that does not fit the pattern of the remaining observations. It differs markedly not only from the other members of the set in which it occurs but also from its fitted value. Such an observation usually has a large residual. Data analysts routinely encounter outliers in data analysis and data mining. Reference [1] pointed out that there are various causes of outliers, such as human error, erroneous operation of computer systems, sampling errors, or standardization failures. Excellent books on outliers include [2–4].

Outliers usually have a major influence on the resulting parameter estimates, and their presence adversely affects the results of statistical inference concerning the models. They can reduce the power of statistical tests during analysis. Reference [5] advised that the analyst needs to identify outliers, if they exist, so that appropriate measures can be taken.

Outliers need to be identified and corrected or eliminated. The process of identifying and correcting outliers is not straightforward; rather, it requires marked ability, competence, circumspection, and strict adherence to scientific objectivity (impartiality). If identified outliers cannot be remedied, they need to be removed because they contaminate the information contained in the remainder of the data set (see [1, 6]).

Testing for an outlying observation in the response variable is usually based on test statistics that depend on the standardized residuals. Different test statistics have been developed for testing for an outlier in a least squares analysis of linear regression models. However, exact critical values of some of these test statistics are not available and are not easy to obtain. The available approximate ones are based on the first-order Bonferroni upper bound or on large-scale simulations.

Upper bounds for the critical values of test statistics for detecting the presence of a single outlier in linear regression have been developed in [7, 8]. Although formal distinctions exist in the principles invoked by [7] and [8] in deriving these upper bounds, we show in this paper that the two upper bounds are algebraically the same.

The multiple linear regression model is

y = Xβ + ε, (1)

where y is the n × 1 observation vector, X is an n × p matrix of constants, β is a p × 1 vector of unknown parameters to be estimated, and ε is an n × 1 vector of normally distributed errors. Assuming that E(ε) = 0 and Var(ε) = σ²Iₙ, the least squares estimator of β in (1) is given by

β̂ = (X′X)⁻¹X′y, (2)

and the vector of residuals is

e = y − Xβ̂ = (I − H)y, where H = X(X′X)⁻¹X′. (3)
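As a concrete illustration of (1)–(3), the sketch below fits a two-predictor model by solving the normal equations X′Xβ̂ = X′y directly. The data and all helper names are ours, not from the paper; this is a minimal sketch, not a production routine.

```python
# Minimal sketch of the least squares fit in (2)-(3); data are hypothetical.

def solve(A, b):
    """Solve A m = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# Hypothetical data: a column of ones (intercept) plus two predictors, so p = 3.
X = [[1.0, 1.0, 2.0], [1.0, 2.0, 1.0], [1.0, 3.0, 4.0],
     [1.0, 4.0, 3.0], [1.0, 5.0, 6.0], [1.0, 6.0, 5.0]]
y = [4.1, 4.9, 9.2, 9.8, 14.1, 14.9]

p = len(X[0])
XtX = [[sum(X[i][a] * X[i][b] for i in range(len(X))) for b in range(p)] for a in range(p)]
Xty = [sum(X[i][a] * y[i] for i in range(len(X))) for a in range(p)]

beta_hat = solve(XtX, Xty)                               # equation (2)
e = [yi - sum(xa * ba for xa, ba in zip(xi, beta_hat))   # equation (3)
     for xi, yi in zip(X, y)]
print("beta-hat:", [round(b, 3) for b in beta_hat])
```

A quick sanity check is that the residual vector is orthogonal to the columns of X, i.e., X′e = 0, which is exactly the normal-equations condition.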

The variance-covariance matrix of e is

Var(e) = σ²(I − H). (4)

If σ² is estimated using σ̂² = e′e/(n − p), then the estimated variance-covariance matrix of e becomes

V̂ar(e) = σ̂²(I − H). (5)

Residuals are important diagnostic tools in regression analysis; no regression analysis is complete without a thorough examination of them. They are versatile, as most regression diagnostics are written in terms of them. They are used in checking model adequacy and the validity of model assumptions. A thorough examination of the residuals therefore provides valuable information concerning the appropriateness of the assumptions that underlie statistical models and helps in pinpointing an appropriate model. Different types of graphical plots of residuals are used for diagnostic purposes.

Ordinary residuals are not entirely suitable for diagnostic purposes, and a standardized version of them is usually preferred. This is because the variances of the residuals are not homogeneous, which makes them intractable. A standardized residual has the form

r_i = e_i / (σ̂ √(1 − h_ii)), i = 1, 2, …, n, (6)

where e_i = y_i − ŷ_i, ŷ_i is the predicted value of y_i, and h_ii is the ith diagonal element of the matrix H = X(X′X)⁻¹X′, called the hat matrix. The ith transformed residual r_i is often called an internally studentized residual. Internally studentized residuals are tractable and more versatile, and they are used in place of the ordinary residuals in regression diagnostics. Numerous graphical and numerical techniques for checking model assumptions using standardized residuals can be found in the regression literature. They are also fundamental building blocks for most of the known test statistics studied in the literature for outlier detection in linear models (see [9, 10]).

The test statistic

R = max_{1 ≤ i ≤ n} |r_i| (7)

is called the maximum absolute internally studentized residual. Reference [11], following the suggestion of [12], used a large-scale simulation study involving many thousands of sampling experiments to obtain approximate critical values of (7) for a simple linear regression. The approximate values obtained by [11] are almost the same as the values obtained by [13].
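For a simple linear regression, where h_ii = 1/n + (x_i − x̄)²/Sxx in closed form, the statistic (7) can be computed as in the sketch below. The data are hypothetical (the last observation is a planted outlier), and the variable names are ours.

```python
import math

# Hypothetical data for a simple linear regression (p = 2 parameters);
# the last response value is a planted outlier.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 25.0]

n, p = len(x), 2
xbar = sum(x) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)

# Least squares fit y = b0 + b1*x.
b1 = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / Sxx
b0 = sum(y) / n - b1 * xbar

e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]   # ordinary residuals
s2 = sum(ei ** 2 for ei in e) / (n - p)             # sigma-hat^2 = e'e/(n-p)

# Leverages for simple linear regression: h_ii = 1/n + (x_i - xbar)^2 / Sxx.
h = [1.0 / n + (xi - xbar) ** 2 / Sxx for xi in x]

# Internally studentized residuals, equation (6).
r = [ei / math.sqrt(s2 * (1.0 - hi)) for ei, hi in zip(e, h)]

R = max(abs(ri) for ri in r)   # test statistic (7)
worst = max(range(n), key=lambda i: abs(r[i])) + 1
print(f"max |r_i| = {R:.3f} at observation {worst}")
```

Note that internally studentized residuals satisfy the deterministic bound |r_i| ≤ √(n − p), which is why their critical values must be tabulated rather than read from a Student-t table.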

Reference [7] considered the test statistic

R* = max_{1 ≤ i ≤ n} |e_i| / s̄, (8)

where s̄² is the estimated average variance of the ordinary residuals. Reference [9] showed that the variance of the residuals is Var(e_i) = σ²(1 − h_ii), so that the estimated variance of the ordinary residuals is σ̂²(1 − h_ii).

Therefore,

s̄² = (1/n) Σᵢ σ̂²(1 − h_ii) = σ̂²(n − p)/n = e′e/n. (9)

Reference [7] showed that the corresponding percentage point of (8) is bounded above by

B₁ = √[ (n − p)F / (n − p − 1 + F) ], (10)

where F is the upper α/n percentage point of the F distribution with degrees of freedom 1 and n − p − 1, n is the number of observations, and p is the number of parameters estimated. The results of [7] for simple linear regression were found to be almost identical to those of [11]. Reference [7] also suggested the use of (10) to obtain other critical values not in his table. However, the results of [7] were not as elaborate and extensive as those in [8], because the needed percentage points of the F-distribution were not available (see [8]).
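Given the upper α/n percentage point of the F distribution with 1 and n − p − 1 degrees of freedom (from a table or a statistics library), the bound in (10) is a one-line computation. The sketch below is ours; the F value passed in is a placeholder, not a value from any published table.

```python
import math

def prescott_upper_bound(F, n, p):
    """Upper bound (10) on the critical value, given the upper alpha/n
    percentage point F of the F distribution with 1 and n-p-1 df."""
    return math.sqrt((n - p) * F / (n - p - 1 + F))

# Illustration with a placeholder F value (not from any table):
print(prescott_upper_bound(19.8, 18, 3))
```

Because the map F ↦ (n − p)F/(n − p − 1 + F) is increasing and bounded by n − p, the resulting bound always respects the deterministic limit √(n − p) on an internally studentized residual.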

Define

t_i = r_i / √(n − p), i = 1, 2, …, n. (11)

Reference [14] showed that the joint distribution of (t₁, t₂, …, t_n) is a multivariate inverted Student distribution and that the probability density function of any t_i is a univariate inverted Student density given by

f(t_i) = K (1 − t_i²)^((n−p−3)/2), −1 ≤ t_i ≤ 1, (12)

where

K = Γ((n − p)/2) / [Γ(1/2) Γ((n − p − 1)/2)]. (13)

Reference [8], following the suggestion of [7], made use of the results of [14] and applied the first-order Bonferroni inequality to obtain upper bounds B₂ of the critical values of (7). Reference [8] obtained t_α from

α/(2n) = ∫_{t_α}^{1} f(t) dt, (14)

and then obtained B₂ using the relationship between r_i and t_i given by the equation

B₂ = √(n − p) t_α, (15)

for a wide range of sample sizes, numbers of regression parameters, and α = 0.10, 0.05, and 0.01. With that, [8] produced what is arguably the most extensive table of upper bound values. Where the two sets of computations overlap, the upper bounds computed by [7] using (10) and those computed by [8] using (14) and (15) are extremely close.
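The univariate inverted Student density f(t) ∝ (1 − t²)^((n−p−3)/2) on [−1, 1] can be checked numerically: it should integrate to one and be symmetric about zero. The sketch below does this with Simpson's rule using only the standard library; the function names are ours.

```python
import math

def inv_student_pdf(t, q):
    """Inverted Student density for q = n - p: K * (1 - t^2)^((q-3)/2) on [-1, 1]."""
    K = math.exp(math.lgamma(q / 2) - math.lgamma(0.5) - math.lgamma((q - 1) / 2))
    return K * max(0.0, 1.0 - t * t) ** ((q - 3) / 2)

def simpson(f, a, b, m=2000):
    """Composite Simpson rule with m (even) subintervals."""
    h = (b - a) / m
    s = f(a) + f(b) + sum((4 if k % 2 else 2) * f(a + k * h) for k in range(1, m))
    return s * h / 3.0

q = 15  # n - p for, e.g., n = 18, p = 3
total = simpson(lambda t: inv_student_pdf(t, q), -1.0, 1.0)
print(f"integral of the density over [-1, 1] = {total:.6f}")
```

The `math.lgamma` call keeps the normalizing constant stable for moderate q without needing any third-party library.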

2. Demonstration of the Sameness of Upper Bounds

In this section, we show that the upper bounds B₁ and B₂ are algebraically identical. From (10), we let

V = (n − p)F / (n − p − 1 + F), where F ~ F(1, n − p − 1). (16)

We determine the distribution of V as follows:

F = (n − p − 1)V / (n − p − V), dF/dV = (n − p)(n − p − 1) / (n − p − V)², (17)

so that

f(v) = f_F((n − p − 1)v / (n − p − v)) · (n − p)(n − p − 1) / (n − p − v)², (18)

with distribution domain or range given by 0 ≤ v ≤ n − p. Explicitly, we have

f(v) = K₁ (n − p)^(−1/2) v^(−1/2) (1 − v/(n − p))^((n−p−3)/2), 0 ≤ v ≤ n − p, (19)

where

K₁ = Γ((n − p)/2) / [Γ(1/2) Γ((n − p − 1)/2)]. (20)

Then, since V is a monotone increasing function of F, P(V > B₁²) = P(F > F_{α/n}) = α/n, and using the first-order Bonferroni inequality one can obtain the upper bound B₁ by solving

α = n ∫_{B₁²}^{n−p} f(v) dv. (21)

Now, from (11), we have

r_i = √(n − p) t_i, (22)

so that

P(|r_i| > B₂) = P(|t_i| > B₂ / √(n − p)), (23)

with distribution domain or range given by −1 ≤ t_i ≤ 1. Explicitly, we have

f(t) = K₂ (1 − t²)^((n−p−3)/2), −1 ≤ t ≤ 1, (24)

where

K₂ = Γ((n − p)/2) / [Γ(1/2) Γ((n − p − 1)/2)]. (25)

Let

U = t_i². (26)

Because of the symmetry of the distribution of t_i in (24), we obtain the distribution of U as follows:

f(u) = 2 f(√u) · (1 / (2√u)) = f(√u) / √u, 0 ≤ u ≤ 1. (27)

Explicitly, we have

f(u) = K₂ u^(−1/2) (1 − u)^((n−p−3)/2), 0 ≤ u ≤ 1. (28)

Then, using the first-order Bonferroni inequality, one can obtain B₂ by solving

α = n ∫_{B₂²/(n−p)}^{1} f(u) du. (29)

Substituting u = v/(n − p) in (21) reduces it exactly to (29). The sameness of (21) and (29) means that

B₁ = B₂, (30)

implying that the two upper bounds coincide for every n, p, and α. This also means that (7) and (8) have distributions that are bounded above by the same distribution.

Using (21) to obtain the upper bounds averts the problem encountered by [7]. This is because (10) depends on tabulated percentage points of the F-distribution, while (21) does not. Reference [8] remarked that implementing the suggestion made by [7] was very difficult because the needed percentage points of the F-distribution were not available. Therefore, for any value of α, B₁ can easily be obtained using (21) without recourse to a tabulated value of the F-distribution. It is also preferable to use (29) to obtain B₂ instead of (14). This is because using (14) involves a transformation from the solution of (14) to B₂, as indicated in (15), whereas using (29) does not. The use of approximate critical values for detecting a single outlier in linear regression can be found in [7, 8].
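The identity of the two bounds can also be verified numerically. The sketch below computes the bound both ways using only the standard library: route 1 finds the α/n point of F(1, n − p − 1) through the Student-t tail and applies (10); route 2 solves Lund's integral equation (14) for the inverted Student density and applies (15). All function names are ours, and the quadrature/bisection tolerances are illustrative choices.

```python
import math

def simpson(f, a, b, m=4000):
    """Composite Simpson rule with m (even) subintervals."""
    h = (b - a) / m
    s = f(a) + f(b) + sum((4 if k % 2 else 2) * f(a + k * h) for k in range(1, m))
    return s * h / 3.0

def bisect(g, lo, hi, tol=1e-12):
    """Find x in [lo, hi] with g(x) = 0, assuming g(lo) > 0 > g(hi)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

def bound_via_F(n, p, alpha):
    """Route 1: bound (10), with the F point obtained from the Student-t tail."""
    m = n - p - 1
    c = math.exp(math.lgamma((m + 1) / 2) - math.lgamma(m / 2)) / math.sqrt(m * math.pi)
    t_pdf = lambda x: c * (1.0 + x * x / m) ** (-(m + 1) / 2)
    tail = lambda x: 0.5 - simpson(t_pdf, 0.0, x)          # P(T > x) by symmetry
    t_star = bisect(lambda x: tail(x) - alpha / (2 * n), 0.0, 50.0)
    F = t_star * t_star                                     # upper alpha/n point of F(1, m)
    return math.sqrt((n - p) * F / (n - p - 1 + F))

def bound_via_lund(n, p, alpha):
    """Route 2: solve (14) for the tail point, then apply (15)."""
    q = n - p
    K = math.exp(math.lgamma(q / 2) - math.lgamma(0.5) - math.lgamma((q - 1) / 2))
    pdf = lambda t: K * max(0.0, 1.0 - t * t) ** ((q - 3) / 2)
    tail = lambda t: simpson(pdf, t, 1.0)
    t_alpha = bisect(lambda t: tail(t) - alpha / (2 * n), 0.0, 1.0)
    return math.sqrt(q) * t_alpha

b1 = bound_via_F(18, 3, 0.01)
b2 = bound_via_lund(18, 3, 0.01)
print(f"B1 = {b1:.4f}, B2 = {b2:.4f}")
```

The two routes should agree to within quadrature error, which is the numerical counterpart of (30).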

3. Table Construction

We use the Bonferroni inequality to obtain upper bound values for the 10 percent, 5 percent, and 1 percent critical values of the test statistic (7). A table of the upper bounds of the critical values is presented in Table 1 for a simple linear regression and sample sizes up to 60. These were obtained by solving (29) using the Mathematica software. This demonstrates numerically that the upper bound values of the two test statistics (7) and (8) are the same, as (30) shows. Equation (29) produces precise and accurate upper bound values for these test statistics. The values in Table 1 compare favorably with the values obtained by [7] by solving (10), the values obtained by [8] by solving (14), and the approximate values obtained by [11] via simulation.
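A few rows of such a table can be reproduced with elementary numerics (standard library only): we solve (29) by bisection rather than with Mathematica. The values printed are our own computation, not copied from Table 1, and the routine names are ours.

```python
import math

def upper_bound(n, p, alpha, m=2000):
    """Solve (29), alpha = n * P(U > B^2/(n-p)), for the Bonferroni upper bound B."""
    q = n - p
    K = math.exp(math.lgamma(q / 2) - math.lgamma(0.5) - math.lgamma((q - 1) / 2))
    pdf = lambda t: K * max(0.0, 1.0 - t * t) ** ((q - 3) / 2)

    def tail(t):  # one-sided tail of the inverted Student density, Simpson's rule
        h = (1.0 - t) / m
        s = pdf(t) + pdf(1.0) + sum((4 if k % 2 else 2) * pdf(t + k * h) for k in range(1, m))
        return s * h / 3.0

    lo, hi = 0.0, 1.0
    for _ in range(60):                  # bisection on the two-sided tail probability
        mid = 0.5 * (lo + hi)
        if 2 * n * tail(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return math.sqrt(q) * 0.5 * (lo + hi)

print(" n   10%    5%    1%   (simple linear regression, p = 2)")
for n in (10, 20, 30, 40, 50, 60):
    row = [upper_bound(n, 2, a) for a in (0.10, 0.05, 0.01)]
    print(f"{n:3d}  " + "  ".join(f"{b:.2f}" for b in row))
```

As expected for Bonferroni bounds, the values increase with n for fixed α and decrease as α increases.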

4. Application to Real Data

We now show, with an application to a real data set, that the upper bounds of the two test statistics are the same. The data in Table 2 are from [15]. Reference [15] carried out an investigation of the source from which corn plants obtain their phosphorus. The concentrations of inorganic (X₁) and organic (X₂) phosphorus in the soils were determined chemically. Eighteen soil samples were used in the experiment, and the phosphorus content of the corn grown on Iowa soils was measured. The phosphorus content was used as the dependent variable in a multiple regression analysis with X₁ and X₂ as the independent variables. The phosphorus content of the corn in soil sample number 17 was found to be considerably larger than that of the corn grown in the other soil samples (no explanation was given for its size) and produced a standardized residual of 3.18. Multiple linear regression analysis of the data set produced the result in Table 2.

We now show that the upper bound values B₁ and B₂ are the same for these data. We compute B₁ using equation (10), which is given by

B₁ = √[ (n − p)F / (n − p − 1 + F) ],

where F is the upper α/n percentage point of the F distribution with degrees of freedom 1 and n − p − 1, n is the number of observations, and p is the number of parameters estimated. For n = 18, p = 3, and α = 0.01, equation (10) yields the upper bound B₁ for the critical value of (7).

To obtain the upper bound B₂ for the critical value of (8), we make use of equation (14). For n = 18, p = 3, and α = 0.01, we apply equation (14) using the Mathematica software and obtain exactly the same value. Thus, the upper bound values of the two test statistics are the same. Equation (21) or (29) also gives the same value.

Finally, the observed value of 3.18 is found to be significant at the one percent level, since it exceeds the 0.01 upper bound. Thus, the phosphorus content of the corn grown in soil sample number 17 should be regarded as an outlier, and the null hypothesis of no outlier in the data set is rejected at the 0.01 level.
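The significance check for the phosphorus data can be reproduced as follows, again with the standard library only: we solve (29) numerically instead of applying (10) or (14) by hand, and the routine names are ours.

```python
import math

def upper_bound(n, p, alpha, m=2000):
    """Bonferroni upper bound for the critical value of (7), solving (29) by bisection."""
    q = n - p
    K = math.exp(math.lgamma(q / 2) - math.lgamma(0.5) - math.lgamma((q - 1) / 2))
    pdf = lambda t: K * max(0.0, 1.0 - t * t) ** ((q - 3) / 2)

    def tail(t):  # one-sided tail of the inverted Student density, Simpson's rule
        h = (1.0 - t) / m
        s = pdf(t) + pdf(1.0) + sum((4 if k % 2 else 2) * pdf(t + k * h) for k in range(1, m))
        return s * h / 3.0

    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if 2 * n * tail(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return math.sqrt(q) * 0.5 * (lo + hi)

observed = 3.18                              # standardized residual of soil sample 17
bound = upper_bound(n=18, p=3, alpha=0.01)   # n = 18 samples, p = 3 parameters
print(f"upper bound = {bound:.3f}; outlier declared: {observed > bound}")
```

Since the Bonferroni bound is an upper bound on the true critical value, an observed statistic that exceeds it is significant a fortiori at the stated level.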

5. Conclusions

In this article, we have shown that the upper bound values B₁ of the test statistic (7) and the upper bound values B₂ of the test statistic (8) are identical. Although formal distinctions exist in the principles used by [7] in deriving B₁ and those employed by [8] in deriving B₂, we have shown that the two are algebraically the same. Consequently, we recommend the use of (29) to compute the upper bounds of the critical values of (7) or (8); it is more tractable than (10) and (14). Since (14) involves a transformation and (10) makes use of tabulated values of the F-distribution, accuracy and precision may be lost when using them.

Data Availability

A real data set on regression with a single outlier has been analyzed and included in the paper.

Conflicts of Interest

The authors declare that they have no conflicts of interest.