#### Abstract

Using highly precise and accurate Monte Carlo simulations of 20,000,000 replications and 102 independent simulation experiments with extremely low simulation errors and total uncertainties, we evaluated the performance of four single outlier discordancy tests (Grubbs test N2, Dixon test N8, skewness test N14, and kurtosis test N15) for normal samples of sizes 5 to 20. Statistical contamination of a single observation was simulated with a contaminant parameter ranging from ±0.1 up to ±20 for modeling the slippage of central tendency, or from ±1.1 up to ±200 for the slippage of dispersion; uncontaminated samples (parameter values of 0 and ±1, respectively) were simulated as well. Because of the use of precise and accurate random and normally distributed simulated data, very large replications, and a large number of independent experiments, this paper presents a novel approach for precise and accurate estimation of the power functions of four popular discordancy tests and, therefore, should not be considered a simple simulation exercise unrelated to probability and statistics. According to both the Power of Test criterion proposed by Hayes and Kinsella and the Test Performance Criterion of Barnett and Lewis, Dixon test N8 performs less well than the other three tests. The overall performance of these four tests could be summarized as N2 ≅ N15 > N14 > N8.

#### 1. Introduction

As summarized by Barnett and Lewis [1], a large number of discordancy tests are available for determining an outlier as an extreme (i.e., legitimate) or a discordant (i.e., contaminant) observation in normal samples at a given confidence or significance level. These discordancy tests are likely to be characterized by different power or performance. Numerous researchers [1–6] have commented on the properties of these tests under the slippage of location or central tendency and slippage of scale or dispersion by one or more observations, but very few studies have been reported on the use of Monte Carlo simulation for precise and accurate performance measures of these tests. Relatively more recently using Monte Carlo simulation of replications or runs, Hayes and Kinsella [7] evaluated the performance criteria of two discordancy tests (Grubbs single outlier test N2 and Grubbs multiple outlier test N4k2; the nomenclature is after Barnett and Lewis [1]) and discussed their spurious and nonspurious components of type II error and power function. However, four single extreme outlier type discordancy tests, also called two-sided discordancy tests by Barnett and Lewis [1], are available, which are Grubbs type N2, Dixon type N8, skewness N14, and kurtosis N15. Their relative performance measures should be useful for choosing among the different tests for specific applications.

Monte Carlo simulation methods have been extensively used in numerous simulation studies [8–18]. Some of the relatively recent papers are by Efstathiou [12], Gottardo et al. [13], Khedhiri and Montasser [14], P. A. Patel and J. S. Patel [15], Noughabi and Arghami [16], Krishnamoorthy and Lian [17], and Verma [18]. For example, Noughabi and Arghami [16] compared seven normality tests (Kolmogorov-Smirnov, Anderson-Darling, Kuiper, Jarque-Bera, Cramer von Mises, Shapiro-Wilk, and Vasicek) for sample sizes of 10, 20, 30, and 50 and under different circumstances recommended the use of Jarque-Bera, Anderson-Darling, Shapiro-Wilk, and Vasicek tests.

We used Monte Carlo simulations to evaluate comparative efficiency of four extreme outlier type discordancy tests (N2, N8, N14, and N15, the nomenclature after Barnett and Lewis [1]) for sample sizes of 5 to 20. Our approach to the statistical problem of test performance is novel because, instead of using commercial or freely available software, we programmed and generated extremely precise and accurate random numbers and normally distributed data, used very large replications of 20,000,000, performed 102 independent experiments, and reduced the simulation errors to such an extent that the differences in test performance are far greater than the total uncertainties expressed as 99% confidence intervals of the mean. This is an approach hitherto practiced by none (see, e.g., [8–18]) except by our group [19–23]. This work, therefore, supersedes the approximate simulation results of test performance reported by the statisticians Hayes and Kinsella [7].

#### 2. Discordancy Tests

For a data array of n observations, or the corresponding ordered array, with its sample mean and standard deviation, four test statistics were objectively evaluated in this work. For a statistically contaminated sample of size n = 5 to 20, n − 1 observations of this data array were obtained from a normal distribution and the remaining observation was taken from a distribution shifted either in central tendency or in dispersion, where the contaminant parameter (for the slippage of central tendency or for the slippage of dispersion) can be either positive or negative. For an uncontaminated sample, the simulations were done with the central tendency parameter set to 0 and the dispersion parameter set to ±1. In order to achieve an unbiased comparison, the tests were always applied to the upper outlier for positive values of the contaminant parameter and to the lower outlier for negative values.

Thus, the first test was the Grubbs test N2 [24] for an extreme outlier or , for which the test statistic is as follows:

The second test was the Dixon test N8 [2] as follows:

The third test was sample skewness N14 as (note that the absolute value is used for evaluation):

Finally, the fourth test was the sample kurtosis N15 as follows:

All tests were applied at a strict 99% confidence level using the new precise and accurate critical values (CV_{99}) simulated using Monte Carlo procedure by Verma et al. [19] for N2, N8, and N15 and Verma and Quiroz-Ruiz [20] for N14, which permitted an objective comparison of their performance.
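As an illustration of the four statistics, the following minimal sketch computes them for one sample. It assumes the standard textbook forms of these tests in the nomenclature of Barnett and Lewis [1] (standard deviation with n − 1 in the denominator, Dixon ratio taken over the full range); the function name `discordancy_statistics` and the `side` argument are hypothetical and only mimic the forced upper or lower application described above.

```python
import math

def discordancy_statistics(sample, side="upper"):
    """Return the N2, N8, N14 and N15 statistics for one sample.

    Assumed textbook forms; `side` selects the upper or lower outlier
    for N2 and N8 (hypothetical argument, for illustration only).
    """
    x = sorted(sample)
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    ss = sum(d * d for d in dev)          # sum of squared deviations
    s = math.sqrt(ss / (n - 1))           # sample standard deviation
    spread = x[-1] - x[0]                 # range of the ordered array
    if side == "upper":
        tn2 = (x[-1] - mean) / s          # Grubbs N2, upper outlier
        tn8 = (x[-1] - x[-2]) / spread    # Dixon N8, upper outlier
    else:
        tn2 = (mean - x[0]) / s           # Grubbs N2, lower outlier
        tn8 = (x[1] - x[0]) / spread      # Dixon N8, lower outlier
    # Skewness N14 (absolute value) and kurtosis N15
    tn14 = abs(math.sqrt(n) * sum(d ** 3 for d in dev) / ss ** 1.5)
    tn15 = n * sum(d ** 4 for d in dev) / ss ** 2
    return {"N2": tn2, "N8": tn8, "N14": tn14, "N15": tn15}
```

Each returned value would then be compared with the corresponding critical value at the chosen confidence level.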

#### 3. Monte Carlo Simulations

Uniformly distributed random numbers and normal random variates were generated by the method summarized by Verma and Quiroz-Ruiz [21]. However, instead of only the 10 series or streams used by Verma and Quiroz-Ruiz [21], a total of 102 different streams were simulated. Similarly, the replications were far more numerous than those used by Verma and Quiroz-Ruiz [21] for generating precise and accurate critical values.

For a data array of size n, n − 1 observations were drawn from one stream of normal random variates and the contaminant observation was added from a different stream, either shifted in central tendency, with the contaminant parameter varied from ±0.1 to ±20, or shifted in dispersion, with the contaminant parameter varied from ±1.1 to ±200. The simulation experiments were also carried out for uncontaminated distributions, in which n − 1 observations were taken from one stream of normal random variates and an additional observation was incorporated from a different stream with no contamination; that is, with the contaminant parameters equal to 0 and ±1, respectively.
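The construction of one simulated sample can be sketched as follows. This is only an illustration under stated assumptions: the base distribution is taken as standard normal, the contaminant is drawn with mean `delta` and standard deviation `abs(epsilon)`, and Python's built-in generator stands in for the authors' own random number streams. The names `contaminated_sample`, `delta`, and `epsilon` are hypothetical.

```python
import random

def contaminated_sample(n, rng_main, rng_contam, delta=0.0, epsilon=1.0):
    """Draw n - 1 regular observations and one contaminant observation.

    Assumptions (not taken from the source): the regular observations are
    standard normal; the contaminant has mean `delta` and standard
    deviation abs(epsilon). delta = 0 and epsilon = +/-1 give no contamination.
    """
    base = [rng_main.gauss(0.0, 1.0) for _ in range(n - 1)]
    contaminant = rng_contam.gauss(delta, abs(epsilon))
    return base, contaminant

# Two independent streams, one for the regular observations and one for
# the contaminant (seeds are arbitrary illustrative values).
main_stream = random.Random(12345)
contaminant_stream = random.Random(67890)
base, contaminant = contaminated_sample(10, main_stream, contaminant_stream,
                                        delta=20.0)
```

Keeping the contaminant in a separate stream mirrors the description above, in which the additional observation always comes from a different stream than the other n − 1 observations.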

Now, if we were to arrange the complete array from the lowest to the highest observations, the ordered array could be called after Barnett and Lewis [1]. All four tests under evaluation could then be applied to the resulting data array.

If , , , or (the contaminant present), two possibilities would arise for the ordered array as follows (Table 1): (i) the contaminant occupies an inner position in the ordered array; that is, if or or if , or ; this array is called a type event and the contaminant was not used in the test statistic; and (ii) the contaminant occupies the extreme position; that is, if or , or if , or ; this array was called a type event and the contaminant was used in the test statistic.

To an event of type when any of these four tests (N2, N8, N14, or N15) was applied, the outcome was called either a spurious type II error probability () if the test was not significant or a spurious power () if it was significant (Table 1). For this decision, the calculated test statistic TN (TN2, TN8, TN14, or TN15) for a sample was compared with the respective CV_{99} [19, 20]. If TN ≤ CV_{99}, the outcome of the test was considered as not significant; else when TN > CV_{99}, the outcome of the test was considered as significant (Table 1).

Similarly, to an event of type, when a discordancy test was applied, the outcome was either a nonspurious type II error probability () if the test was not significant or a nonspurious power () if the test was significant (Table 1).

If or (the contaminant absent) and a discordancy test was applied to the ordered array to evaluate the extreme observation or , the outcome would either be a true negative (the respective probability ) if the test was not significant, that is, if it failed to detect or as discordant, or a type I error (probability ) if the test was significant; that is, it succeeded in detecting or as discordant (Table 1).
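The six possible outcomes described in this section can be summarized in a short sketch. The decision rule (significant only when the statistic exceeds the critical value) follows the text; the function name, the outcome labels, and the use of a `side` argument are hypothetical choices made for illustration.

```python
def classify_outcome(base, contaminant, statistic, critical_value,
                     side="upper", contaminated=True):
    """Label one simulated sample with its outcome category.

    `base` holds the n - 1 regular observations and `contaminant` the
    additional one; `statistic` is the computed test statistic and
    `critical_value` the value it is compared with.
    """
    # Not significant if statistic <= critical value, significant otherwise
    significant = statistic > critical_value
    if not contaminated:
        return "type_I_error" if significant else "true_negative"
    # Is the contaminant the extreme observation on the tested side?
    if side == "upper":
        at_extreme = contaminant > max(base)
    else:
        at_extreme = contaminant < min(base)
    if at_extreme:
        return ("nonspurious_power" if significant
                else "nonspurious_type_II_error")
    return "spurious_power" if significant else "spurious_type_II_error"
```

Counting these labels over all replications and dividing by the number of replications gives estimates of the corresponding probabilities.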

#### 4. Test Performance Criteria

Hayes and Kinsella [7] documented that a good discordancy test would be characterized by a high nonspurious power probability (high ), a low spurious power probability (low ), and a low nonspurious type II error probability (low ).

Hayes and Kinsella [7] defined the Power of Test () as

Similarly, they also defined the Test Performance Criterion (which is equivalent to the probability P5 of Barnett and Lewis [1]) or the Conditional Power as
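A minimal sketch of how the two criteria might be computed from tallied outcomes is given below. It rests on two assumptions: that the Power of Test is the total probability of a significant outcome in contaminated samples (spurious plus nonspurious power), and that the Conditional Power is the nonspurious power divided by the probability that the contaminant is the extreme observation. The dictionary keys and the counts in the usage lines are hypothetical.

```python
def performance_criteria(counts):
    """Compute (power_of_test, conditional_power) from outcome counts.

    `counts` maps the four contaminated-sample outcomes to the number of
    replications in which each occurred (assumed key names).
    """
    total = (counts["spurious_type_II_error"] + counts["spurious_power"]
             + counts["nonspurious_type_II_error"]
             + counts["nonspurious_power"])
    # Assumed definition: all significant outcomes over all replications
    power_of_test = (counts["spurious_power"]
                     + counts["nonspurious_power"]) / total
    # Assumed definition: significant outcomes among the replications in
    # which the contaminant is the extreme observation
    extreme = counts["nonspurious_type_II_error"] + counts["nonspurious_power"]
    conditional_power = counts["nonspurious_power"] / extreme
    return power_of_test, conditional_power

# Hypothetical tallies for 1000 replications
example = {"spurious_type_II_error": 300, "spurious_power": 5,
           "nonspurious_type_II_error": 195, "nonspurious_power": 500}
power, conditional = performance_criteria(example)
```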

#### 5. Optimum Replications

The optimum replications required for minimizing the errors of Monte Carlo simulations were decided from representative results summarized in Figures 1 and 2, in which the vertical error bar represents total uncertainty at 99% confidence level (, equivalent to 99% confidence interval of the mean) for 102 simulation experiments. For example, for and , Power of Test is plotted in Figures 1(a)–1(d) as a function of the replications ( to 20,000,000) for N2, N8, N14, and N15. Although mean values remain practically constant (within the confidence limits of the mean) for replications of about 8,000,000, still higher replications of 20,000,000 (Figures 1 and 2) were used in all simulation experiments.

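Figure 1: Power of Test as a function of the number of replications for tests N2, N8, N14, and N15; panels (a)–(d).

Figure 2: Performance parameter of all four tests as a function of the number of replications, for different sample sizes and contaminant parameter values; panels (a)–(d).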

Similarly, for all four tests as a function of replications is also shown in Figures 2(a)–2(d), which allows a visual comparison of this performance parameter for different sample sizes and values. Error bars () for the 102 simulation experiments are not shown for simplicity, but, for replications larger than 10,000,000, they were certainly within the size of the symbols. The replications of 20,000,000 routinely used for comparing the performance of discordancy tests clearly show that the differences among values (Figures 2(a)–2(d)) are statistically significant at a high confidence level; that is, these differences are much greater than the simulation errors.

Alternatively, following Krishnamoorthy and Lian [17] the simulation error for the replications of 20,000,000 used routinely in our work can be estimated approximately as .
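As a rough indication of the order of magnitude involved, the binomial standard error of a probability estimated from N replications can be evaluated as in the sketch below. This generic expression, sqrt(p(1 − p)/N), which is largest at p = 0.5, is an assumption used here for illustration and is not necessarily the exact expression of Krishnamoorthy and Lian [17]; the function name is hypothetical.

```python
import math

def binomial_standard_error(p, replications):
    """Standard error of an estimated probability p from N replications."""
    return math.sqrt(p * (1.0 - p) / replications)

# Worst case (p = 0.5) for the 20,000,000 replications used in this work
worst_case = binomial_standard_error(0.5, 20_000_000)
```

Under this assumption the worst-case standard error is of the order of 0.0001.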

Because we carried out 102 independent simulation experiments, each with 20,000,000 replications, our simulation errors were even less than the above value. Thus, the Monte Carlo simulations can be considered highly precise. They can also be said to be highly accurate, because our procedure was modified after the highly precise and accurate method of Verma and Quiroz-Ruiz [21]. These authors had shown high precision and accuracy of each and experiments and had also applied all kinds of simulated data quality tests suggested by Law and Kelton [25]. Besides, in the present work a large number of such experiments (204 streams of and 102 streams of ) have been carried out. Therefore, as an innovation in Monte Carlo simulations we present the mean values as well as the total uncertainty () of 102 independent experiments in terms of the confidence interval of the mean at the strict 99% confidence level.
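The mean and its total uncertainty over the independent experiments can be sketched as follows. The sketch uses the normal quantile as an approximation to the Student t quantile, which is an assumption made here to stay within the Python standard library; with about 100 degrees of freedom it gives a slightly narrower interval than the t quantile. The function name is hypothetical.

```python
import math
import statistics

def mean_with_uncertainty(values, confidence=0.99):
    """Return the mean and the half-width of its confidence interval.

    Normal approximation to the Student t quantile (assumption).
    """
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)               # sample standard deviation
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return mean, z * sd / math.sqrt(n)
```

For the present design, `values` would hold the 102 estimates of a given probability or performance criterion, one per independent experiment.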

Finally, in order to evaluate the test performance, test N2 was used as a reference and differences in mean () values of the other three tests were calculated as where the subscript stands for N8, N14, or N15.
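The percent difference relative to the reference test can be written as a one-line function. The sketch assumes the usual form, 100 × (value − reference) / reference, which matches the sign convention used in this paper (a negative difference means a lower value than that of N2); the function name and the numbers in the usage line are hypothetical.

```python
def percent_difference(value, reference):
    """Percent difference of `value` with respect to `reference`."""
    return 100.0 * (value - reference) / reference

# Hypothetical mean values: a test at 0.45 against a reference at 0.50
example_difference = percent_difference(0.45, 0.50)
```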

#### 6. Results and Discussion

##### 6.1. Type and Contaminant-Absent Events

According to Barnett and Lewis [1] this type of events is of no major concern, because the contaminant occupies an inner position in the ordered array and the extreme observation or under evaluation from discordancy tests is a legitimate observation. An inner position of the contaminant would affect much less the sample mean and standard deviation [1]. For small values of or close to 0 or ±1, respectively, most events generated from the Monte Carlo simulation are of type. The and values for to as a function of are presented in Figures 3(a)–3(d) and Figures 4(a)–4(d), respectively. For , these parameters behave very similarly and, therefore, the corresponding diagrams are not presented.

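Figure 3: Probability of a nonsignificant test outcome as a function of the central tendency contaminant parameter; panels (a)–(d).

Figure 4: Probability of a significant test outcome as a function of the central tendency contaminant parameter; panels (a)–(d).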

When the contaminant is absent ( or ), the and values are close to the expected values of 0.99 and 0.01, respectively, because the discordancy tests were applied at the 99% confidence level (open circles in Figures 3(a)–3(d) and Figures 4(a)–4(d)). As changes from 0 to about ±2.5, the values slightly increase from 0.99 to about 0.996 for (Figure 3(a)), 0.996 for (Figure 3(b)), 0.994-0.995 for (Figure 3(c)), and 0.993-0.994 for (Figure 3(d)). The values show the complementary behavior (Figures 4(a)–4(d)). Because in this type of events , a legitimate extreme observation is being tested, our best desire is that the and values remain close to the theoretical values of 0.99 and 0.01, respectively, for contaminant-absent events. This is actually observed in Figures 3 and 4.

##### 6.2. Type and Contaminant-Absent Events

The type events are of major consequence for sample statistical parameters. In such events, because the contaminant occupies an extreme outlying position ( or ) in an ordered data array, it is desirable that the discordancy tests detect this contaminant observation as discordant. The and values for to as a function of are presented in Figures 5(a)–5(d) and Figures 6(a)–6(d), respectively. Similarly, these values as a function of are shown in Figures 7(a)–7(d) and Figures 8(a)–8(d), respectively.

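Figure 5: Probability of a nonsignificant test outcome, for events with the contaminant at the extreme position, as a function of the central tendency contaminant parameter; panels (a)–(d).

Figure 6: Probability of a significant test outcome, for events with the contaminant at the extreme position, as a function of the central tendency contaminant parameter; panels (a)–(d).

Figure 7: Probability of a nonsignificant test outcome, for events with the contaminant at the extreme position, as a function of the dispersion contaminant parameter; panels (a)–(d).

Figure 8: Probability of a significant test outcome, for events with the contaminant at the extreme position, as a function of the dispersion contaminant parameter; panels (a)–(d).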

For uncontaminated samples ( in Figures 5(a)–5(d) and Table 2, or in Figures 7(a)–7(d)) the probability values were close to the theoretical value of 0.99 (which corresponds to the confidence level used for each test). Similarly, for such samples, values for all sample sizes were close to the theoretical value of 0.01 (complement of 0.99 is 0.01; Figures 6 and 8).

A complementary behavior of and exists for all other or values as well (Figures 5 and 7 or Figures 6 and 8). Thus, for all tests decreases sharply from 0.99 for to very small values of about 0.03 for and , to about 0.01–0.03 for and , to about 0.006–0.02 for and , and to about 0.001–0.01 for and (Table 2; Figures 5(a)–5(d)). On the contrary, increases very rapidly from very small values of 0.01 to close to the maximum theoretical value of 0.99 (see the complementary behavior in Figures 6(a)–6(d) and Figures 5(a)–5(d)). These probability ( and ) values show a similar behavior for larger values of than for (compare Figures 7 and 8 with Figures 5 and 6, resp.). There are some differences in these probability values among the different tests (Table 2; Figures 5–8), but they will be better discussed for the test performance criteria.

##### 6.3. Test Performance Criteria ( and )

These two parameters are plotted as a function of and in Figures 9, 10, 11, and 12 and the most important results are summarized in Tables 3–6. For a good test, both (; (5)) and (6) should be large [1, 7]. Values of both performance criteria ( and ) increase as or values depart from the uncontaminated values of or (Figures 9–12; Tables 3–6). However, and increase less rapidly for smaller than for larger . For , even for or , none of the two parameters truly reaches the maximum theoretical value of 0.99 (Figure 9(a) to Figure 12(a)). For larger (10–20), however, both and get close to this value for all tests and for much smaller values of or than the maximum values of 20 and 200, respectively (Figures 9(b)–9(d) to Figures 12(b)–12(d); Tables 3–6).

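Figures 9–12: The two test performance criteria (Power of Test and Test Performance Criterion) as functions of the central tendency and dispersion contaminant parameters; panels (a)–(d) in each figure.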

The performance differences of the four tests are now briefly discussed in terms of both and as well as . The total uncertainty values of the simulations are extremely small (the error is at the fifth or even sixth decimal place; Tables 3–6). Therefore, most differences among the tests ( for test N8, for test N14, and for test N15; all percent differences are with respect to test N2; see (7)) are statistically significant (Tables 3–6). A negative value of (where stands for N8, N14, or N15) means that or value for a given test (N8, N14, or N15) is less than that of test N2, implying a worse performance of the given test as compared to test N2, whereas a positive value of signifies just the opposite. Note that test N2 is chosen as a reference test, because it shows generally the best performance (values of are mostly negative in Tables 3–6). Additional fine-scale simulations were also carried out for which both and become about 0.5 for the reference test N2 (0.5 is about the half of the maximum value of one for or ). Hence, the values of and can be visually compared in Tables 3–6 (see the rows in italic font).

For , all tests show rather similar performance, because the maximum difference () is only about −1.1% for N8 (as compared to N2) and <−0.1% for N14 and N15 (see the first set of rows corresponding to in Tables 3–6). Test N2 shows for , whereas tests N8, N14, and N15 have values of 0.49503, 0.50014, and 0.50015, respectively, (Table 3). The respective values are about −1.1%, −0.06%, and −0.06% (Table 3). Practically the same results are valid for as well (see the row in italic font in Table 4). Similar results were documented for and as a function of (rows for or in Tables 5 and 6, resp.).

For , Dixon test N8 becomes considerably less efficient than Grubbs test N2, because the values become as low as −7.8% for or −6.4% for (Tables 3–6). Skewness test N14 also shows slightly lower and than N2 (% for , or % for ; Tables 3–6). Kurtosis test N15 shows a similar performance as test N2; the maximum difference is about 0.7 (Tables 3–6). For test N2 shows (or ) for ; for this case, the other three tests (N8, N14, and N15) show values of about −7.8%, −1.8%, and −0.7% (Tables 3 and 4). Similarly, for such cases, and show , , and values of about −4.3%, −1.0%, and −0.4%, respectively.

For and , test N8 shows the worst performance and the values become as large as −12.2% to −15.5% for (Tables 3 and 4) or −9.8% to −11.5% for (Tables 5 and 6). For these sample sizes, test N14 also shows a worse performance as compared to N2, because the maximum differences represented by values are about −6.3% to −10.9% for (Tables 3 and 4) or −4.5% to −7.0% for (Tables 5 and 6). Test N15 shows a comparable performance, because the maximum differences ( values) are about −1.5% to −2.4% for (Tables 3 and 4) or −1.1% to −1.5% for (Tables 5 and 6). For and , when test N2 shows or , the , , and values range from about −6.9% to −15.0%, −3.0% to −9.2%, and −0.6% to −1.9%, respectively.

The significantly lower and values of the Dixon test N8 as compared to the Grubbs test N2, skewness test N14, and kurtosis test N15 may be related to the masking effect of the penultimate observation on or of on as documented by Barnett and Lewis [1]. The masking effect may also be responsible for a somewhat worse performance of N14 as compared to N2.

##### 6.4. Final Remarks

The two performance criteria (Power of Test and Test Performance Criterion) [1, 7] used in this work provide similar estimates (Tables 3–6) and, more importantly, similar conclusions. Therefore, either of them can be used to evaluate numerous other discordancy tests for single or multiple outliers [1, 26–28]. The main result of the Monte Carlo simulations concerning the performance of the single extreme outlier discordancy tests could be stated as follows: N2 ≅ N15 > N14 > N8.

Additional simulation work is required to evaluate other discordancy tests, such as the single upper or lower outlier tests, as well as more complex statistical contamination involving two or more discordant outliers and the comparison of consecutive application of single outlier discordancy tests with multiple outlier tests [1, 7, 26–28]. Then, the multiple test method, initially proposed by Verma [29] and used by many researchers [30–35], would be substantially improved for subsequent applications. These performance results could then be incorporated in new versions of the computer programs DODESSYS [36], TecD [37], and UDASYS [38].

#### 7. Conclusions

Our simulation study clearly shows that Dixon test N8 performs less well than the other three extreme single outlier tests (Grubbs N2, skewness N14, and kurtosis N15). Both performance parameters (the Power of Test and the Test Performance Criterion) are up to about 16% lower for N8 than for N2. Test N8, therefore, shows the worst performance for outlier detection. For certain values of the contaminant parameters, test N14 also shows lower values of both criteria than N2, which means that N14 is also somewhat worse than N2. The other two tests (N2 and N15) could be considered comparable in their performance.

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

The computing facilities for this work were partly from the DGAPA-PAPIIT project IN104813. The second author (Lorena Díaz-González) acknowledges PROMEP support to the project “Estadística computacional para el tratamiento de datos experimentales” (PROMEP/103-5/10/7332). The third author (Mauricio Rosales-Rivera) thanks the Sistema Nacional de Investigadores (Mexico) for a scholarship that enabled him to participate in this research as Ayudantes de Investigador Nacional Nivel III of the first author (Surendra P. Verma).