Data Transformation Technique to Improve the Outlier Detection Power of Grubbs’ Test for Data Expected to Follow Linear Relation
Grubbs test (extreme studentized deviate test, maximum normed residual test) is used in various fields to identify outliers in a data set, which are ranked in the order of $x_1 \le x_2 \le \cdots \le x_n$. However, ranking the data eliminates the actual sequence of a data series, which is an important factor for determining outliers in some cases (e.g., time series). Thus, in such a data set, Grubbs test will not identify outliers correctly. This paper introduces a technique for transforming data from sequence bound linear form to sequence unbound form $y = c$. Applying Grubbs test to the transformed data set detects outliers more accurately. In addition, the new technique improves the outlier detection capability of Grubbs test. Results show that Grubbs test was capable of identifying outliers at significance level 0.01 after transformation, while it was unable to identify them at significance level 0.05 prior to the transformation.
Grubbs test [1] is a statistical test used to detect outliers; it was introduced in 1950 and extended in 1969 [2] and 1972 [3] by the same author. Grubbs test locates outliers in a univariate data set using the mean, the standard deviation, and a tabulated criterion. Grubbs test is also known as the maximum normed residual test or “extreme studentized deviate” (ESD) test, and the data set is assumed to be normally distributed. The test is defined as $$G = \frac{\max_{i = 1, \ldots, n} \left| x_i - \bar{x} \right|}{s}, \qquad (1)$$ where $s$ is the standard deviation and $\bar{x}$ is the sample mean. If the maximum $G$ related to the $i$th element is greater than the relevant tabulated criterion, then the element is considered an outlier. The testing procedure continues until no more outliers are detected. However, Grubbs test is not recommended for detecting outliers in samples of size six or less. When the sample size is six or less, Grubbs test frequently identifies nonoutliers as outliers [4].
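The statistic described above (the largest absolute deviation from the sample mean, divided by the standard deviation) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: the function name `grubbs_statistic`, the sample values, and the use of the sample standard deviation with $n - 1$ in the denominator are assumptions. The critical value is passed in by the caller, since the paper relies on a tabulated criterion.

```python
import math


def grubbs_statistic(values):
    """Return (G, index): the Grubbs statistic and the index of the most extreme value."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    mean = sum(values) / n
    # Sample standard deviation with n - 1 in the denominator (an assumption here).
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if s == 0:
        return 0.0, 0  # all values identical: nothing can be flagged
    deviations = [abs(v - mean) for v in values]
    index = max(range(n), key=deviations.__getitem__)
    return deviations[index] / s, index


# Hypothetical sample of ten readings with one suspicious value at the end.
sample = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 15.0]
# 2.290 is the commonly tabulated two-sided critical value for n = 10, alpha = 0.05.
critical_value = 2.290
g, suspect = grubbs_statistic(sample)
is_outlier = g > critical_value
print(f"G = {g:.3f}, suspect index = {suspect}, outlier = {is_outlier}")
```

For this hypothetical sample the last reading is the most extreme element and its statistic exceeds the tabulated value, so it would be flagged.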
During the last decades, Grubbs test has been used to identify outliers in different disciplines [5–9]. Over the same period, the strengths and weaknesses of Grubbs test were identified and the test was improved. In 1975, Rosner showed that Grubbs test (ESD) performs much better than studentized range methods and performs as well as kurtosis and R-statistic methods [10]. In 1983, Rosner introduced an improved version of ESD, the generalized extreme studentized deviate (GESD) test [11]. However, GESD does not work well when the sample size is less than 25. Brant stated in 1990 that the combination of ESD rules and boxplot provides comparable performance [12]. On the other hand, it was shown that when the standard deviation and mean are affected by two or more outliers, Grubbs test does not detect outliers correctly. Also, if the standard deviation of the data set is too small, the test tends to detect false outliers, and if it is too large, the test tends to miss real ones. This was overcome by setting a threshold value for the standard deviation for the specific data domain considered [13]. Meanwhile, some publications show that Grubbs test is robust against the effect of intraclass correlation structure [14] and for data that have Baldessari’s structure [15, 16].
From the definition of Grubbs test, it locates outliers in a data set whose elements are ranked in the order of $x_1 \le x_2 \le \cdots \le x_n$. This implies that Grubbs test considers only the values of the data points but not their real order. In other words, Grubbs test treats sorted and unsorted series in the same manner. Thus, Grubbs test is valid only for data domains where the occurrence order is of no importance. However, with respect to outliers, the order of the data points is a very important factor for data series that are expected to increase or decrease gradually over time. Thus, applying Grubbs test to data that are related to the occurrence order will not give a correct output. This is particularly important in process control, where the order of the data points has a very high impact on data interpretation. Therefore, Grubbs test is not a reliable method for detecting outliers in time series, where the occurrence order is a critical factor.
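The order independence described above can be checked directly. The following sketch uses a hypothetical increasing series in which one reading breaks the trend; the helper name, the data, and the choice of the sample standard deviation are assumptions made for illustration only.

```python
import math


def grubbs_statistic(values):
    """Return (G, extreme_value) using the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    extreme = max(values, key=lambda v: abs(v - mean))
    return abs(extreme - mean) / s, extreme


# Hypothetical increasing series in which the third reading (85) breaks the trend.
in_sequence = [10, 20, 85, 40, 50, 60, 70, 80, 90, 100]
ranked = sorted(in_sequence)

g_sequence, extreme_sequence = grubbs_statistic(in_sequence)
g_ranked, extreme_ranked = grubbs_statistic(ranked)

# The statistic is the same whether or not the readings are kept in sequence.
same_statistic = math.isclose(g_sequence, g_ranked)
# 2.290: commonly tabulated two-sided critical value for n = 10, alpha = 0.05.
detected = g_sequence > 2.290
```

In this hypothetical series the most suspected element is the minimum (10), not the misplaced reading (85), and the statistic stays below the tabulated value, so the reading that is abnormal for its position goes unnoticed.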
Grubbs test is capable of checking whether a certain suspected data point is an outlier. By default suspected points are the minimum and the maximum of the data set. If the most suspected data points are not outliers, Grubbs test does not identify other data points as outliers. Figure 1 shows two artificial data sets (data sets 1 and 2) with one outlier in each data set. Table 1 shows results of Grubbs test for two significance levels (α) of 0.05 and 0.01. In both data sets, the test does not detect outliers for any considered significance level. In data set 1, the outlier is not significant enough to be identified by Grubbs test. Although the outlier in data set 2 (190) deviated significantly in relation to its position, after ranking it moves to the end of the series and becomes an insignificant outlier.
The aim of this paper is to introduce a method for transforming “sequence bound” data into a “sequence unbound” form. Since the transformed series is independent of the sequence, applying Grubbs test to it can produce more robust results. Furthermore, the transformation increases the outlier detection capability of Grubbs test for data which are expected to have a linear or nearly linear relation.
Data transformation techniques are used to convert data into a state that is closer to the requirements of the technique or method to be applied [17]. The transformation process converts each data point $x_i$ of a data set into the transformed value $x_i'$ by means of a function $T$, where $x_i' = T(x_i)$. Since Grubbs test is not suitable for detecting outliers in sequence bound series such as time series, one solution is to transform the sequence bound series into a sequence unbound series. In the domain of linear regression, for any curve of the form $y = c$, the value of any data point is always constant. Therefore, any curve of the form $y = c$ is independent of the sequence.
Lemma 1. If it is possible to find a proper reference curve $f_r(x)$ for any curve $f_o(x)$ which has the same domain as $f_r(x)$, it is possible to transform $f_o(x)$ into a constant.
Proof. If $f_o(x)$ represents the curve of the actual data and $c$ is a constant, the function $f_o(x) - c$ has the same domain. However, $f_o(x) - c$ has a different range than $f_o(x)$. If $f_r(x) = f_o(x) - c$, then $f_o(x) - f_r(x) = c$. Then, $f_t(x) = f_o(x) - f_r(x)$ can be considered as a transformed form of $f_o(x)$ which is equal to a constant, and $f_r(x) = f_o(x) - c$ can be considered as the reference curve.
The curve $f_t(x)$, which is the transformed form of $f_o(x)$, has a simpler form than $f_o(x)$. Also, $f_t(x)$ can be used to describe the behaviour of $f_o(x)$. Because $f_t(x) = c$, $f_t(x)$ is independent of the sequence of the data. Therefore, because Grubbs test gives correct detections with data sets that are independent of the occurrence sequence, $f_t(x)$ is a suitable data set for Grubbs test. Figures 2 and 3 illustrate the use of this concept for outlier detection.
In the real world, it is not always possible to find the exact $f_r(x)$ for a certain data set in advance. Thus, if $f_r(x)$ has an approximate relation to the behaviour of the real data, then $f_t(x) \approx c$ for all data elements (Figure 2). If the actual curve contains abnormal data (an outlier), $f_t(x)$ shows a higher deviation from $c$ at that element (Figure 3). By applying Grubbs test to $f_t(x)$, the suspected element can be checked for an outlier.
When $f_r(x)$ is known in advance, it is possible to apply this method to any data set of any form. $f_r(x)$ can be known in advance theoretically or from prior knowledge of the data. If these two options are not available, one possibility is to derive $f_r(x)$ from the existing original data ($f_o(x)$). This paper shows a method of deriving $f_r(x)$ for a data set that is expected to have linear form, using the original data ($f_o(x)$).
For any $y = mx + c$, the curve $y = mx$ is a curve which has the same gradient as $y = mx + c$, where $m$ is the gradient of $y = mx + c$. Then, $$y - mx = c. \qquad (2)$$ According to (2), the curve $y - mx$ is a constant. Therefore, for any linear function $y = mx + c$, the function $y = mx$ can be considered as the reference function. In other words, $y = mx$ is the curve which goes through the origin with the same gradient as $y = mx + c$. As shown in Figure 4, the form $y = c$ can be considered as the transformation of the form $y = mx + c$. Since $m$ is the gradient of $y = mx + c$, it can be calculated either by using known theoretical and/or practical information or by deriving it from existing data. We focus on deriving the gradient of $y = mx + c$ using a part of the original data. When deriving any information from existing original data, the influence of outliers distorts the derived value. Outlier detection methods are normally used to remove such data points. However, when the goal is itself to detect outliers, this is not a feasible solution. Therefore, when detecting outliers, the best solution is to exclude all suspected data points from the derivation in order to minimize the influence of outliers on the result.
Unlike most outlier detection methods, Grubbs test always considers the maximum and the minimum as the most suspected data points. Thus, we excluded the maximum and the minimum from the calculations. After removing the maximum and the minimum, the original series splits into a maximum of three smaller series (Figure 4: Segment 1, Segment 2, and Segment 3 of the original series). If several data points share the maximum value or the minimum value, then for an increasing data series the one with the lowest index is considered as the maximum and the one with the highest index is considered as the minimum. For a decreasing data series, the one with the highest index is considered as the maximum and the one with the lowest index is considered as the minimum.
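The exclusion and splitting step can be sketched as follows. The function names, the `increasing` flag used to select the tie rule, and the sample data are illustrative assumptions; segments are returned as inclusive index ranges into the original series.

```python
def split_segments(values, increasing=True):
    """Exclude the maximum and minimum and return (i_max, i_min, segments).

    Each segment is an inclusive (start, end) index range of the original series.
    """
    n = len(values)
    hi, lo = max(values), min(values)
    max_candidates = [i for i, v in enumerate(values) if v == hi]
    min_candidates = [i for i, v in enumerate(values) if v == lo]
    if increasing:
        # Ties: lowest index is the maximum, highest index is the minimum.
        i_max, i_min = max_candidates[0], min_candidates[-1]
    else:
        # Ties: highest index is the maximum, lowest index is the minimum.
        i_max, i_min = max_candidates[-1], min_candidates[0]
    segments = []
    start = 0
    for cut in sorted({i_max, i_min}) + [n]:
        if cut > start:
            segments.append((start, cut - 1))
        start = cut + 1
    return i_max, i_min, segments


def longest_segment(segments):
    """Return the segment with the most consecutive items (first one on ties)."""
    return max(segments, key=lambda seg: seg[1] - seg[0])


# Hypothetical series: the minimum (5) and the maximum (40) split it into three parts.
series = [12, 14, 5, 18, 20, 22, 24, 40, 28, 30]
i_max, i_min, segments = split_segments(series)
longest = longest_segment(segments)
```

For this hypothetical series the three segments cover indices 0 to 1, 3 to 6, and 8 to 9, and the middle one is the longest.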
Then the series with the highest number of consecutive items (the longest segment) was used for calculating the gradient of $y = mx + c$. Let $L$ denote this longest segment of the original series, with $s$ the starting index of $L$ and $e$ the end index of $L$.
For any point $y_i$ of $L$, let $m_i$ denote the gradient of $L$ at point $y_i$.
None of the $y_i$’s on $L$ are suspected data points, and all of them are candidates for calculating the gradient of $L$. However, among all the data points of $L$, it is still not possible to determine the most suitable point for calculating the gradient. If the selected $y_i$ is a biased data point (e.g., the deviated point in Figure 4), it may distort the calculated gradient even though the point is not suspected. Therefore, a more reliable method for calculating the gradient of $L$ is necessary. The average of the gradients at all $y_i$’s provides a much better approximation of the gradient than a gradient derived from a single point. Therefore, if the resultant gradient of $L$ is $m_L$, then $m_L$ can be defined as the mean of all $m_i$’s. Because $L$ is the longest segment of the original series, $m_L$ (the gradient of $L$) is taken as the gradient $m$ of $y = mx + c$. For the linear relation $y = mx + c$, the form $y = mx$ is the reference function. Therefore, the gradient of the reference function is the same as the gradient of $L$, and the reference value is $m_L x_i$ for every data point $(x_i, y_i)$ of the original series. According to Lemma 1, $y - m_L x$ is a constant and has the form $y = c$, where $c$ is a constant. Then the function $y - m_L x$ is the final transformed form, which is suitable for applying Grubbs test.
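The gradient estimate and the transformation can be sketched as below. Two points are assumptions rather than statements of the paper's method: the gradient at a point is taken here as the difference between consecutive values of the segment (unit index spacing), and $x$ is taken as the 1-based index of the data point. The function names and data are hypothetical.

```python
def mean_gradient(values, start, end):
    """Mean of the gradients between consecutive points of values[start..end] (inclusive).

    Assumes unit spacing between indices, so each point gradient is a simple difference.
    """
    if end <= start:
        raise ValueError("segment needs at least two points")
    gradients = [values[i + 1] - values[i] for i in range(start, end)]
    return sum(gradients) / len(gradients)


def transform(values, m):
    """Subtract the reference line y = m * x, with x taken as the 1-based index."""
    return [y - m * (i + 1) for i, y in enumerate(values)]


# Hypothetical series; indices 3..6 form the longest segment once the
# maximum (40) and the minimum (5) are excluded.
series = [12, 14, 5, 18, 20, 22, 24, 40, 28, 30]
m = mean_gradient(series, 3, 6)
transformed = transform(series, m)
```

With consecutive differences and unit spacing, the mean gradient equals $(y_e - y_s)/(e - s)$; a different definition of the point gradient would give a different estimate. In this hypothetical example every regular point maps to the same constant, while the two abnormal readings stand out clearly in the transformed series.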
Finally, we applied Grubbs test to the transformed series and checked it for outliers. The existence of outliers in the transformed series confirms the existence of outliers in the original series. If the $i$th item of the transformed series is identified as an outlier, the $i$th item of the original series is considered an outlier. Since the reference value $mx$ depends on $x$, which is the index of the data point, the transformed value $y - mx$ is also a function of $x$. This modification establishes a relation between data points and their index and eliminates the major problem identified for Grubbs test. After transformation, Grubbs test can be applied repeatedly to the transformed series until no outliers are detected.
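The repeated application of Grubbs test to a transformed series can be sketched as follows. The critical values are the commonly tabulated two-sided values for a significance level of 0.05; whether the paper used this particular table is an assumption, as are the function names and the sample transformed series.

```python
import math

# Commonly tabulated two-sided critical values for alpha = 0.05 (assumed table).
CRITICAL_005 = {4: 1.481, 5: 1.715, 6: 1.887, 7: 2.020, 8: 2.127, 9: 2.215, 10: 2.290}


def grubbs_statistic(values):
    """Return (G, position) of the most extreme value; G is 0 if all values are equal."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if s == 0:
        return 0.0, 0
    deviations = [abs(v - mean) for v in values]
    position = max(range(n), key=deviations.__getitem__)
    return deviations[position] / s, position


def repeated_grubbs(transformed, critical_values):
    """Apply Grubbs test repeatedly; return original indices of detected outliers."""
    remaining = list(enumerate(transformed))  # keep the original index of each value
    outliers = []
    while len(remaining) in critical_values:
        g, position = grubbs_statistic([v for _, v in remaining])
        if g <= critical_values[len(remaining)]:
            break
        outliers.append(remaining.pop(position)[0])
    return outliers


# Hypothetical transformed series: constant apart from two deviating items.
transformed = [10.0, 10.0, -1.0, 10.0, 10.0, 10.0, 10.0, 24.0, 10.0, 10.0]
outlier_indices = repeated_grubbs(transformed, CRITICAL_005)
```

Because each removed value carries its original index, a detection in the transformed series maps directly back to the corresponding item of the original series.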
2.1. Evaluation Using Artificial Data
Four artificial data sets with one outlier in each data set (which cannot be identified by Grubbs test) were tested with the new method. Each data set consists of 10 elements with an outlier of different type, as mentioned in Table 2. Data sets 1 and 2 are the same data sets as in Table 1.
2.2. Evaluation Using Real Data
Real data sets collected from a biogas plant over a period of 60 days with a frequency of one data point per day were tested using both our transformation technique and standard Grubbs test. Among the different parameters, the counter reading of the electricity generator and the volumetric percentage of methane (CH4) in the biogas were selected for testing. During the stable situation, the counter reading of the electricity generator (operating hours) is continuously increasing, while the percentage of CH4 is fluctuating around a certain value. Both data sets were tested with the new technique and standard Grubbs test for the significance level of 0.05 and window sizes of 4, 5, 6, and 10 without overlapping. Also, Grubbs test was repeatedly applied until there were no outliers detected.
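The windowing without overlapping mentioned above can be sketched as a simple chunking step. The function name and the placeholder data are assumptions; how the paper handled a final window shorter than the window size is not stated, and this sketch simply keeps it.

```python
def non_overlapping_windows(values, size):
    """Yield (start_index, window) for consecutive windows that do not overlap."""
    if size < 1:
        raise ValueError("window size must be positive")
    for start in range(0, len(values), size):
        yield start, values[start:start + size]


# Placeholder for 60 daily readings (one data point per day).
readings = list(range(60))
window_counts = {
    size: len(list(non_overlapping_windows(readings, size)))
    for size in (4, 5, 6, 10)
}
```

Each window would then be transformed and tested on its own, and the start index allows a detection inside a window to be mapped back to its position in the full series.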
3. Results and Discussion
The results from the test with artificial data show that applying Grubbs test on the transformed data set using our proposed method is capable of locating outliers at a significance level of 0.01 (Tables 3, 4, 5, and 6). When applied on the original data set, Grubbs test was unable to locate the outliers even with significance level of 0.05. The outlier in Table 5 deviates very little and is also neither the maximum nor the minimum, which is the worst case situation for single outlier domain. However, after transformation, Grubbs test identifies the outlier with a high level of confidence. Even though the data set in Table 6 has no continuous increment or decrement, Grubbs test located the outlier after transformation.
For the real data sets, the results show that our transformation technique is capable of identifying the outliers depending on the selected window size (Figures 5 and 6). The data points in most of the selected windows of Figure 5 consist of values that slightly deviated from the actual value. After transformation, Grubbs test was able to locate those data points. However, standard Grubbs test was unable to locate any of those points as outliers from both data sets for the same window sizes.
The data series shown in Figure 6 is not a linear series. Nevertheless, the windowing technique allowed outliers to be located in each window using the new transformation method, whereas standard Grubbs test was unable to locate the outliers in the same windows. Furthermore, the data points in the different windows of Figure 6 have different forms (increment, decrement, or constant). After transformation, Grubbs test located the outliers regardless of the behaviour of the data in the selected window. Another important fact is that the located outliers are outliers in relation to the selected window size and the linear relation. Finally, the results show that Grubbs test can be applied after transforming the series with the new transformation and a suitable windowing technique. Thus, the method can be used for locating outliers in time series regardless of whether the series is linear or nonlinear. However, each window is still treated as containing a linear segment of the curve.
According to the generally accepted idea, Grubbs test is not suitable for locating outliers in a data set with six or fewer terms [4]. However, the results show that after transforming with the new method, Grubbs test was capable of locating outliers in data sets with four and six terms (window sizes four and six). This is in disagreement with the generally accepted idea. In particular, when applying Grubbs test to a nonlinear data series, it is necessary to apply a suitable windowing technique to obtain data windows that better approximate linearity (Figure 6). Therefore, we can state that the new transformation technique eliminates one of the major drawbacks that prevent applying Grubbs test to small windows.
The accuracy and the reliability of the transformation depend entirely on the gradient $m$ of the reference function $y = mx$. Therefore, a better method of deriving it could give a much better approximation for $m$. We considered other statistical properties such as the mode and the median of the series as well as the longest segment for deriving $m$. We excluded the median because it is a single data point. The problem with any single data point is that it is not reliable as a reference data point. Even though the considered single point is neither the maximum nor the minimum, it can be a deviated data point such as the one shown in Figure 4. Unlike the median, the mode of a series represents multiple data points. Therefore, the mode can be considered as a good alternative for a data set expected to follow the form $y = c$. Unfortunately, the mode cannot be used for a data series expected to follow the form $y = mx + c$ (increasing or decreasing), because in such a data series it is not possible to expect multiple equal or nearly equal values. Finally, in general we decided to use the mean for deriving $m$. However, if the considered domain ensures reliability, it is possible to use any other method for deriving $m$ rather than the method we used. For example, if the accuracy of a certain data point is guaranteed, even a single data point (such as the first data point of the series) can be used for deriving $m$.
In this paper, we used the longest segment of the data set for deriving the gradient $m$. However, if there are considerable numbers of data points in the other segments, it is possible to calculate the gradients of those segments as well and use the average gradient of the considered segments as $m$. On the other hand, if the outliers are clustered and located in the longest segment, the method described in this paper will not give a good approximation for $m$ due to the influence of the outliers. However, if the considered domain has or is expected to have clustered outliers, then excluding the whole cluster before calculating $m$ will give a better approximation for $m$. One possibility is to remove the maximum and the minimum together with their nearest neighbours. This will provide a much better data set for deriving $m$.
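The suggestion of removing the maximum and the minimum together with their nearest neighbours can be sketched as follows. Interpreting "nearest neighbours" as the `k` adjacent positions on each side is an assumption, as are the function name and the data.

```python
def keep_without_neighbours(values, k=1):
    """Return the indices kept after dropping the max, the min, and k neighbours on each side."""
    n = len(values)
    i_max = max(range(n), key=values.__getitem__)
    i_min = min(range(n), key=values.__getitem__)
    excluded = set()
    for centre in (i_max, i_min):
        excluded.update(range(max(0, centre - k), min(n, centre + k + 1)))
    return [i for i in range(n) if i not in excluded]


# Hypothetical series with two adjacent abnormal readings (40 and 41).
series = [12, 14, 5, 18, 20, 22, 24, 40, 41, 30]
kept = keep_without_neighbours(series, k=1)
```

In this hypothetical series the reading next to the maximum is dropped as well, so a small cluster of abnormal values does not enter the gradient estimate.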
The results for artificial and real data show that our new transformation technique improves the outlier detection power of Grubbs test. The transformation does not depend on an already existing reference data set; it derives the reference set from a part of the original data set. This is the main advantage of the new method. After transformation, Grubbs test was capable of detecting outliers at the significance level 0.01 which were not identified without transformation, even at the significance level 0.05. Also, after transformation, Grubbs test was capable of locating outliers in a data set that is not in ranked order, since the new technique transforms data from the form $y = mx + c$ to the form $y = c$, which is independent of the sequence.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
The University of Ruhuna, Sri Lanka, provided the article processing charges of this paper. The German Academic Exchange Service (German: Deutscher Akademischer Austauschdienst) financed this work.
[1] F. E. Grubbs, “Sample criteria for testing outlying observations,” The Annals of Mathematical Statistics, vol. 21, no. 1, pp. 27–58, 1950.
[2] F. E. Grubbs, “Procedures for detecting outlying observations in samples,” Technometrics, vol. 11, no. 1, pp. 1–21, 1969.
[3] F. E. Grubbs and G. Beck, “Extension of sample sizes and percentage points for significance tests of outlying observations,” Technometrics, vol. 14, pp. 847–854, 1972.
[4] M. Thompson and P. J. Lowthian, Notes on Statistics and Data Quality for Analytical Chemists, Imperial College Press, 2011.
[5] S. Geisser, “Influential observations, diagnostics and discovery tests,” Journal of Applied Statistics, vol. 14, no. 2, pp. 133–142, 1987.
[6] W.-K. Fung, “A statistical-test-complemented graphical method for detecting multiple outliers in two-way tables,” Journal of Applied Statistics, vol. 18, no. 2, pp. 265–274, 1991.
[7] B. M. Colosimo, R. Pan, and E. del Castillo, “A sequential Markov chain Monte Carlo approach to set-up adjustment of a process over a set of lots,” Journal of Applied Statistics, vol. 31, no. 5, pp. 499–520, 2004.
[8] M. K. Solak, “Detection of multiple outliers in univariate data sets,” Paper SP06-2009, Schering, 2009.
[9] R. B. Jain, “A recursive version of Grubbs' test for detecting multiple outliers in environmental and chemical data,” Clinical Biochemistry, vol. 43, no. 12, pp. 1030–1033, 2010.
[10] B. Rosner, “On the detection of many outliers,” Technometrics, vol. 17, pp. 221–227, 1975.
[11] B. Rosner, “Percentage points for a generalized ESD many-outlier procedure,” Technometrics, vol. 25, no. 2, pp. 165–172, 1983.
[12] R. Brant, “Comparing classical and resistant outlier rules,” Journal of the American Statistical Association, vol. 85, no. 412, pp. 1083–1090, 1990.
[13] L. Xu, P. Zhang, J. Xu, S. Wu, G. Han, and D. Xu, “Conflict analysis of multi-source SST distribution,” in High Performance Computing and Applications, W. Zhang, Z. Chen, C. C. Douglas, and W. Tong, Eds., pp. 479–484, Springer, Berlin, Germany, 2010.
[14] M. S. Srivastava, “Effect of equicorrelation in detecting a spurious observation,” The Canadian Journal of Statistics, vol. 8, no. 2, pp. 249–251, 1980.
[15] D. M. Young, R. Pavur, and V. R. Marco, “On the effect of correlation and unequal variances in detecting a spurious observation,” The Canadian Journal of Statistics, vol. 17, no. 1, pp. 103–105, 1989.
[16] J. K. Baksalary and S. Puntanen, “A complete solution to the problem of robustness of Grubbs's test,” The Canadian Journal of Statistics, vol. 18, no. 3, pp. 285–287, 1990.
[17] O. H. J. Christie and K. H. Alfsen, “Data transformation as a means to obtain reliable consensus values for reference materials,” Geostandards and Geoanalytical Research, vol. 1, no. 1, pp. 47–49, 1977.