Journal of Applied Mathematics

Volume 2015, Article ID 708948, 9 pages

http://dx.doi.org/10.1155/2015/708948

## Data Transformation Technique to Improve the Outlier Detection Power of Grubbs’ Test for Data Expected to Follow Linear Relation

^{1}Group Bio-Process Analysis Technology, Technische Universität München, Weihenstephaner Steig 20, 85354 Freising, Germany

^{2}Institut für Landtechnik und Tierhaltung, Vöttinger Straße 36, 85354 Freising, Germany

^{3}Computer Unit, Faculty of Agriculture, University of Ruhuna, Mapalana, 81100 Kamburupitiya, Sri Lanka

Received 9 September 2014; Revised 9 December 2014; Accepted 10 December 2014

Academic Editor: Carlos Conca

Copyright © 2015 K. K. L. B. Adikaram et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Grubbs test (extreme studentized deviate test, maximum normed residual test) is used in various fields to identify outliers in a data set whose values are ranked in the order $x_1 \le x_2 \le \cdots \le x_n$. However, ranking the data discards the actual sequence of the data series, which is an important factor for determining outliers in some cases (e.g., time series). Thus, in such a data set, Grubbs test will not identify outliers correctly. This paper introduces a technique for transforming data from sequence bound linear form to sequence unbound form. Applying Grubbs test to the transformed data set detects outliers more accurately. In addition, the new technique improves the outlier detection capability of Grubbs test. Results show that Grubbs test was capable of identifying outliers at significance level 0.01 after transformation, whereas it was unable to identify them at significance level 0.05 before transformation.

#### 1. Introduction

Grubbs test [1] is a statistical test for detecting outliers; it was introduced in 1950 and extended in 1969 [2] and 1972 [3] by the same author. Grubbs test locates outliers in a univariate data set using the mean, the standard deviation, and a tabulated criterion. It is also known as the maximum normed residual test or “extreme studentized deviate” (ESD) test, and the data set is assumed to be normally distributed. The test is defined as

$$G_i = \frac{\lvert x_i - \bar{x} \rvert}{s},$$

where $s$ is the standard deviation and $\bar{x}$ is the sample mean. If the maximum $G_i$, related to the $i$th element, is greater than the relevant tabulated criterion, then that element is considered an outlier. The testing procedure is repeated until no more outliers are detected. However, Grubbs test is not recommended for sample sizes of six or less, because in such samples it identifies nonoutliers as outliers most of the time [4].
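The procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the `min_size` parameter, and the sample values are hypothetical, and the tabulated criterion is not reproduced here. Instead, the caller supplies it through `critical_value`, a function assumed to return the tabulated value for a given sample size at the chosen significance level.

```python
import math
from typing import Callable, List, Sequence, Tuple


def grubbs_statistic(data: Sequence[float]) -> Tuple[int, float]:
    """Return (index, G) for the point with the largest |x_i - mean| / s."""
    n = len(data)
    if n < 3:
        raise ValueError("need at least three observations")
    mean = sum(data) / n
    # Sample standard deviation (n - 1 in the denominator).
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    if s == 0:
        raise ValueError("standard deviation is zero")
    scores = [abs(x - mean) / s for x in data]
    idx = max(range(n), key=scores.__getitem__)
    return idx, scores[idx]


def grubbs_outliers(
    data: Sequence[float],
    critical_value: Callable[[int], float],
    min_size: int = 7,  # assumption: stop once the sample has six or fewer points
) -> List[float]:
    """Repeat the test, removing one point per pass, until none is flagged."""
    remaining = list(data)
    removed: List[float] = []
    while len(remaining) >= min_size:
        idx, g = grubbs_statistic(remaining)
        if g <= critical_value(len(remaining)):
            break
        removed.append(remaining.pop(idx))
    return removed


# Hypothetical measurements with one clearly deviating value.
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.1, 14.0]
index, g = grubbs_statistic(sample)

# Placeholder threshold for illustration only; a real analysis would look up
# the tabulated criterion for each sample size and significance level.
flagged = grubbs_outliers(sample, critical_value=lambda n: 2.0)
```

Passing the criterion in as a function keeps the sketch independent of any particular table or significance level, since the criterion changes each time a point is removed and the sample size shrinks.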

During the last decades, Grubbs test has been used to identify outliers in different disciplines [5–9]. Over the same period, its strengths and weaknesses were identified, and improved versions were proposed. In 1975, Rosner showed that Grubbs test (ESD) performs much better than studentized range methods and as well as the kurtosis and R-statistic methods [10]. In 1983, Rosner introduced an improved version of ESD, the generalized extreme studentized deviate (GESD) test [11]. However, GESD does not work well when the sample size is less than 25 [11]. In 1990, Brant stated that the combination of ESD rules and boxplots provides comparable performance [12]. On the other hand, it was shown that when the standard deviation and mean are affected by two or more outliers, Grubbs test does not detect outliers correctly [13]. Also, if the standard deviation of the data set is too large or too small, the test tends to detect false outliers, and vice versa. This was overcome by setting a threshold value for the standard deviation in the specific data domain considered [13]. Meanwhile, some publications show that Grubbs test is robust against the effect of intraclass correlation structure [14] and for data that have Baldessari’s structure [15, 16].

From its definition, Grubbs test locates outliers in a data set whose values are ranked in the order $x_1 \le x_2 \le \cdots \le x_n$. This implies that Grubbs test considers only the values of the data points and not the real order of the data. In other words, Grubbs test treats sorted and unsorted series in the same manner. Thus, Grubbs test is valid only for data domains in which the occurrence order is of no importance. However, with respect to outliers, the order of the data points is a very important factor for data series that are expected to increase or decrease gradually over time. Thus, applying Grubbs test to data that are related to the occurrence order will not give a correct output. This is particularly important in process control, where the order of the data points has a very high impact on data interpretation. Therefore, Grubbs test is not a reliable method for detecting outliers in time series, because in time series the occurrence order is a critical factor.
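The order independence noted above is easy to check numerically. The following sketch uses a hypothetical linear series and a hypothetical helper, `max_normed_residual`, to show that the largest normed residual is the same for the series in its original order and for a shuffled copy, because the mean and standard deviation depend only on the values.

```python
import math
import random
import statistics


def max_normed_residual(values):
    """Largest |x_i - mean| / s over the series (s is the sample std. dev.)."""
    mean = statistics.fmean(values)
    s = statistics.stdev(values)
    return max(abs(v - mean) for v in values) / s


# Hypothetical series with a steady increase over time.
series = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Same values in a different occurrence order (fixed seed for repeatability).
shuffled = series[:]
random.Random(0).shuffle(shuffled)

g_ordered = max_normed_residual(series)
g_shuffled = max_normed_residual(shuffled)

# The statistic ignores position entirely.
same_statistic = math.isclose(g_ordered, g_shuffled)
```

Here `same_statistic` is `True` for any permutation of the series, which is why the test by itself cannot use the position of a point as evidence for or against it being an outlier.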

Grubbs test is capable of checking whether a certain suspected data point is an outlier. By default, the suspected points are the minimum and the maximum of the data set. If these most suspected data points are not outliers, Grubbs test does not identify any other data points as outliers. Figure 1 shows two artificial data sets (data sets 1 and 2) with one outlier in each. Table 1 shows the results of Grubbs test at two significance levels (*α*), 0.05 and 0.01. In both data sets, the test does not detect the outlier at either significance level. In data set 1, the outlier is not significant enough to be identified by Grubbs test. Although the outlier in data set 2 (190) deviates significantly in relation to its position, after ranking it moves to the end of the series and becomes an insignificant outlier.
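The masking effect described in this paragraph can be reproduced with a small sketch. The series below is hypothetical and is not the data of Figure 1: it follows a linear trend except for one point that is far from the value expected at its position. The helper `normed_residuals` and all variable names are likewise illustrative.

```python
import statistics


def normed_residuals(values):
    """Return |x_i - mean| / s for every point (s is the sample std. dev.)."""
    mean = statistics.fmean(values)
    s = statistics.stdev(values)
    return [abs(v - mean) / s for v in values]


# Hypothetical linear trend: 10, 20, ..., 120.
trend = [10 * i for i in range(1, 13)]

# Replace the third value (expected 30) with 95, a clear positional anomaly.
series = trend[:]
anomaly_index = 2
series[anomaly_index] = 95

scores = normed_residuals(series)

# The point Grubbs test would examine first: the largest normed residual.
suspect_index = max(range(len(series)), key=scores.__getitem__)

# Position of the anomaly when all points are ranked by normed residual
# (1 = most suspect).
rank_of_anomaly = sorted(scores, reverse=True).index(scores[anomaly_index]) + 1
```

In this example the most suspect point is the first element of the series, an ordinary value at the low end of the trend, while the anomalous value lies well inside the overall range and receives a small normed residual. If the end points are not flagged, the anomaly is never examined.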