Abstract

The standard boxplot is one of the most popular nonparametric tools for detecting outliers in univariate datasets. For Gaussian data, the chance of an observation falling outside the standard boxplot fence is only 0.7%. However, for skewed data, such as telemetric rain observations in a real-time flood forecasting system, the probability is significantly higher. To overcome this problem, the medcouple (MC), a measure of skewness that is robust to outliers and sensitive to skewness, was introduced to construct a new robust skewed boxplot fence. Three types of boxplot fences related to MC were analyzed and compared, and the exponential function boxplot fence was selected. When applied to uncontaminated as well as simulated contaminated data, the proposed method produced a lower swamping rate and higher accuracy than the standard boxplot and the semi-interquartile range boxplot. The outcomes of this study demonstrate that it is reasonable to use the new robust skewed boxplot method to detect outliers in skewed rain distributions.

1. Introduction

A growing number of real-time flood forecasting systems in China, particularly in large basins with many remote gauge stations, now use telemetry to transmit rainfall signals because telemetry systems provide timely, dense, and labor-saving hydrological information from remote rainfall stations [1]. However, telemetric rainfall information inevitably includes outliers caused by instrument malfunction, human-related errors, and/or signal acquisition errors resulting from signal leaks, collisions, or disturbances during signal transmission, in addition to random errors that are normally distributed with zero mean and a small variance [1–3]. The outliers have an unknown distribution with a much greater variance, appear to be inconsistent with the remainder of the dataset, and are relatively large in magnitude [4, 5]. Therefore, outliers should be treated differently [6]. In this paper, observations containing outliers are called abnormal data.

In real-time flood forecasting systems, rainfall observations represent the main input and determine the accuracy of the forecasting results. The presence of abnormal data can lead to unreliable forecasts. The need to increase the accuracy and reliability of telemetric rainfall data has prompted research on robust methods that efficiently detect these abnormal data before they are entered into the hydrologic models [7].

One of the most frequently used nonparametric tools for detecting outliers in a univariate dataset is the boxplot. The method, suggested by Tukey [8], has come into common use and has been studied extensively (see, for example, [9–16]). An observation is considered “potential” abnormal data when its value does not belong to the interval (the fence) (q1 − 1.5 ∗ IQR, q3 + 1.5 ∗ IQR), where q1 and q3 are the first and third quartiles, respectively, and IQR is the interquartile range, i.e., IQR = q3 − q1. The standard boxplot is best suited to normal or symmetric distributions. For Gaussian data, the probability of lying below (q1 − 1.5 ∗ IQR) or above (q3 + 1.5 ∗ IQR) is 0.0035 (0.35%) each, i.e., 0.7% in total.
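The fence rule and the 0.7% figure can be checked numerically. The following is a minimal sketch, assuming Python's standard `statistics` module and its "inclusive" quartile convention (other quartile conventions give slightly different fences). The function names are illustrative and do not come from the original study.

```python
from statistics import NormalDist, quantiles


def standard_fence(values):
    """Return the standard boxplot fence (q1 - 1.5*IQR, q3 + 1.5*IQR)."""
    # Assumption: quartiles use the "inclusive" (linear interpolation) rule.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def gaussian_outside_probability():
    """Probability that a standard normal value falls outside the fence."""
    z = NormalDist()  # mean 0, standard deviation 1
    q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Lower tail plus upper tail; each is about 0.0035 for Gaussian data.
    return z.cdf(lower) + (1.0 - z.cdf(upper))


if __name__ == "__main__":
    print(standard_fence([1, 2, 3, 4, 5, 6, 7, 8, 9]))
    print(round(gaussian_outside_probability(), 4))
```

For a right-skewed sample, the same fence flags far more than 0.7% of the observations, which is the problem addressed in the rest of the paper.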

However, real hourly rainfall observations are often neither normal nor symmetric. When the standard boxplot is applied to such observations, the percentage of data outside the fence becomes excessively high. As an example, we selected the real hourly rainfall data of flood events from 1988 to 2008 measured at the Wuyigong rain gauge in the Qilijie basin in Southeastern China. The standard boxplot of the dataset is shown in Figure 1. The figures for other rainfall stations are not included because they are nearly identical to that for the Wuyigong Station.

It is clear that the underlying distribution of the rainfall dataset was skewed to the right. Up to 13.04% of the observations were above q3 + 1.5 ∗ IQR. Clearly, it would not be correct to classify them all as real abnormal data.

To cope with this, several adjusted boxplot methods have been proposed for skewed data. Kimber [17] suggested the use of the semi-interquartile range (SIQR) rather than IQR, i.e., the fence of the SIQR boxplot is defined as (q1 − 3 ∗ (q2 − q1), q3 + 3 ∗ (q3 − q2)), where q2 is the median. The SIQR boxplot was also applied to the real hourly rain observations from the Wuyigong rain gauge. The boxplots of the two methods are shown in Figure 2.
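As a small illustration of the SIQR rule, the sketch below computes the fence for a made-up right-skewed sample. It assumes the "inclusive" quartile convention of Python's `statistics` module; the sample values and the function name are hypothetical.

```python
from statistics import quantiles


def siqr_fence(values):
    """Return the SIQR fence (q1 - 3*(q2 - q1), q3 + 3*(q3 - q2))."""
    # Assumption: quartiles use the "inclusive" (linear interpolation) rule.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q1 - 3 * (q2 - q1), q3 + 3 * (q3 - q2)


if __name__ == "__main__":
    # Hypothetical right-skewed sample: q3 - q2 is larger than q2 - q1,
    # so the upper fence moves further out than the lower fence.
    sample = [0, 1, 1, 2, 2, 3, 5, 9, 20]
    print(siqr_fence(sample))
```

Because the two half-ranges are scaled separately, the fence is asymmetric whenever the quartiles are asymmetric about the median.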

The SIQR boxplot adjusts itself to the right skewness, compared with the standard method (Figure 2). The SIQR method expands the upper boundary slightly, and consequently fewer data are detected as outliers. However, the adjustment is not sufficient [18]. The proportion of observations lying outside the SIQR fence was 9.29%. It would still be highly risky to identify all of these values as real abnormal data. The poor performance of the SIQR is further analyzed in Section 3.

Carling [19], Schwertman et al. [14], and Schwertman and de Silva [20] suggested replacing q1 and q3 in the fences with the median q2. They also suggested replacing the constant 1.5 with functions that combine sample size with skewness. These approaches can achieve a prespecified swamping rate, i.e., the probability of misclassifying an uncontaminated observation as real abnormal data [21]. The methods perform well when the distribution is a lambda distribution, but it is not clear how they perform with other distributions. Finally, the functions in these methods depend on the sample size, and the procedures require some characteristics of the uncontaminated distribution, which are often difficult to estimate for real-time hourly rainfall datasets.

The aim of this paper is to construct a new boxplot method that is robust to outliers and sensitive to skewness. The new method is independent of the sample size and performs well with rainfall distributions. It can reduce swamping and rapidly detect abnormal telemetric rainfall data before they are entered into the real-time flood forecasting model. This paper is organized as follows. In Section 2, we detail the proposed procedure, which includes a robust measure of skewness in the construction of the fences. Section 3 illustrates the differences between the standard, SIQR, and robust boxplots for uncontaminated as well as abnormal datasets. Finally, we provide conclusions in Section 4.

2. Materials and Methods

2.1. Study Catchments

Three catchments located in southeastern and southern China were selected: the Qilijie catchment in Fujian Province, the Lushui reservoir catchment in Hubei Province, and the Yitang reservoir catchment in Guangdong Province. The three catchments share similar hydrological characteristics, such as abundant precipitation and a humid climate.

The three catchments differ in area and thus represent basins of different sizes. The Qilijie catchment, representing a large basin, covers 14,787 km² with 43 telemetric rain gauges; the Lushui catchment, representing a medium-sized basin, covers 3,960 km² with 13 telemetric rain gauges; and the Yitang reservoir catchment, representing a small watershed, covers 251 km² with 6 telemetric rain gauges.

2.2. Data

Historic hourly rainfall observations from 1988 to 1998 from all the rain stations were compiled and used. The datasets were treated as outlier-free (normal) data. The total number of rainfall records was greater than 200,000.

2.3. Methods
2.3.1. Robust Skewness

The asymmetry of a distribution can be described by the skewness coefficient. The classical skewness coefficient depends on the second and third empirical moments of the dataset. However, these moments are sensitive to outliers, so the classical skewness coefficient can be strongly affected by them. Even a single outlier can distort it and make it difficult to interpret [22]. To overcome this problem, the medcouple (MC), introduced by Brys et al. [22], was chosen to estimate the skewness of rainfall observations.

The observations were sorted in ascending order, i.e., $x_1 \le x_2 \le \cdots \le x_s$, where $s$ is the number of observations.

The MC of the dataset is defined as
$$\mathrm{MC} = \operatorname*{med}_{x_i \le q_2 \le x_j} h(x_i, x_j), \qquad (1)$$
where $q_2$ is the median of the observation samples, and for all $x_i \ne x_j$, the kernel function $h$ is given by
$$h(x_i, x_j) = \frac{(x_j - q_2) - (q_2 - x_i)}{x_j - x_i}. \qquad (2)$$

For the special case $x_i = x_j = q_2$, the kernel function is defined as follows. Let $m_1 < m_2 < \cdots < m_k$ denote the indices of the observations that are tied to the median $q_2$, i.e., $x_{m_l} = q_2$ for all $l = 1, \ldots, k$. Then
$$h(x_{m_i}, x_{m_j}) = \begin{cases} -1, & i + j - 1 < k, \\ 0, & i + j - 1 = k, \\ +1, & i + j - 1 > k. \end{cases} \qquad (3)$$

The MC equals the median of all $h(x_i, x_j)$ values for which $x_i \le q_2 \le x_j$. According to the definitions in equations (1) and (2), MC is based on quantiles and is therefore not as vulnerable to outliers as the classical skewness coefficient. It is clear that MC always lies between −1 and 1. A distribution that is skewed to the right has a positive MC, whereas a left-skewed distribution has a negative MC. Finally, a symmetric distribution has a zero MC. As shown in Brys et al. [22], MC is robust to outliers and sensitive to skewness. It has a bounded influence function and a breakdown value of 25%, which means that MC can resist up to 25% of outliers in the data.
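To make the definition concrete, the following is a naive sketch of the medcouple that evaluates the kernel for every pair on either side of the median. It is an illustration of the standard definition from Brys et al. [22], not the implementation used in this study; it runs in quadratic time and memory, so it is only suitable for small samples, and the function name is illustrative.

```python
from statistics import median


def medcouple(values):
    """Naive O(n^2) medcouple: median of the kernel over all pairs
    (x_i, x_j) with x_i <= median <= x_j."""
    x = sorted(values)
    if not x:
        raise ValueError("medcouple requires at least one observation")
    q2 = median(x)
    below = [v for v in x if v < q2]
    above = [v for v in x if v > q2]
    k = len(x) - len(below) - len(above)  # observations tied to the median
    lower = below + [q2] * k   # all x_i <= q2
    upper = [q2] * k + above   # all x_j >= q2
    kernels = []
    for i, xi in enumerate(lower):
        for j, xj in enumerate(upper):
            if xi != xj:
                kernels.append(((xj - q2) - (q2 - xi)) / (xj - xi))
            else:
                # Both points are tied to the median: use the sign rule
                # with 1-based tie indices.
                tie_i = i - len(below) + 1
                tie_j = j + 1
                s = tie_i + tie_j - 1 - k
                kernels.append((s > 0) - (s < 0))
    return median(kernels)


if __name__ == "__main__":
    print(medcouple([1, 2, 3, 4, 5]))    # symmetric sample
    print(medcouple([1, 2, 3, 7, 20]))   # right-skewed sample
```

For large rainfall series, a faster algorithm or an existing library routine would be preferable to this pairwise loop.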

2.3.2. Combining the Robust Skewness with the Boxplot Fence

To construct the boxplot fences for skewed data, we propose to insert the medcouple (MC) into the boxplot method. The constant 1.5 in the standard boxplot fence is replaced by functions of MC, denoted $h_l(\mathrm{MC})$ and $h_u(\mathrm{MC})$. The new robust skewed fence is defined by
$$\left[\, q_1 - h_l(\mathrm{MC})\,\mathrm{IQR},\;\; q_3 + h_u(\mathrm{MC})\,\mathrm{IQR} \,\right]. \qquad (4)$$

Let $h_l(0) = h_u(0) = 1.5$ so that the fence equals the standard boxplot fence for symmetric distributions. When distributions are asymmetric, $h_l(\mathrm{MC})$ and $h_u(\mathrm{MC})$ adjust the fence to fit the skewness. Different choices of $h_l$ and $h_u$ produce different adjustments. In Section 2.3.3, we compare the differences between the candidate functions.

2.3.3. Comparison of Different Functions

Three types of simple functions with only a few parameters were selected, because simplicity is important for operational real-time flood forecasting systems.

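The following three functions were considered:
(1) Linear function:
$$h_l(\mathrm{MC}) = 1.5 + a\,\mathrm{MC}, \qquad h_u(\mathrm{MC}) = 1.5 + b\,\mathrm{MC}. \qquad (5)$$
(2) Quadratic function:
$$h_l(\mathrm{MC}) = 1.5 + c_1\,\mathrm{MC} + c_2\,\mathrm{MC}^2, \qquad h_u(\mathrm{MC}) = 1.5 + d_1\,\mathrm{MC} + d_2\,\mathrm{MC}^2. \qquad (6)$$
(3) Exponential function:
$$h_l(\mathrm{MC}) = 1.5\,\exp(e\,\mathrm{MC}), \qquad h_u(\mathrm{MC}) = 1.5\,\exp(f\,\mathrm{MC}), \qquad (7)$$
with $a, b, c_1, c_2, d_1, d_2, e, f \in \mathbb{R}$ (here $e$ denotes a fitted parameter, not Euler's number).

To determine the values of the parameters, the expected percentage of observations beyond the robust skewed fence (equation (4)) was set to 0.7%, which matches the rule of the standard boxplot for the Gaussian distribution. According to this rule, the fence boundaries must satisfy
$$q_1 - h_l(\mathrm{MC})\,\mathrm{IQR} \approx Q_\alpha, \qquad q_3 + h_u(\mathrm{MC})\,\mathrm{IQR} \approx Q_\beta, \qquad (8)$$
where $Q_\alpha$ and $Q_\beta$ are the $\alpha$th and $\beta$th quantiles of the distribution, respectively, with $\alpha = 0.0035$ and $\beta = 0.9965$.

Combining equations (5)–(7) into equation (8), we can obtain
$$\frac{q_1 - Q_\alpha}{\mathrm{IQR}} - 1.5 \approx a\,\mathrm{MC}, \qquad \frac{Q_\beta - q_3}{\mathrm{IQR}} - 1.5 \approx b\,\mathrm{MC}, \qquad (9)$$
$$\frac{q_1 - Q_\alpha}{\mathrm{IQR}} - 1.5 \approx c_1\,\mathrm{MC} + c_2\,\mathrm{MC}^2, \qquad \frac{Q_\beta - q_3}{\mathrm{IQR}} - 1.5 \approx d_1\,\mathrm{MC} + d_2\,\mathrm{MC}^2, \qquad (10)$$
$$\ln\!\left(\frac{2}{3}\,\frac{q_1 - Q_\alpha}{\mathrm{IQR}}\right) \approx e\,\mathrm{MC}, \qquad \ln\!\left(\frac{2}{3}\,\frac{Q_\beta - q_3}{\mathrm{IQR}}\right) \approx f\,\mathrm{MC}. \qquad (11)$$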

Based on the historic hourly rainfall records, equations (9)–(11) were constructed for each rain gauge. The parameter values of the three functions could be derived using linear least squares estimation.
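The least squares step can be sketched for the exponential function. The sketch assumes that the upper-fence multiplier has the form 1.5·exp(f·MC), so that ln((2/3)·(Q_β − q3)/IQR) is linear in MC with no intercept and the slope is Σ(MC·y)/Σ(MC²). The function name, argument names, and the synthetic inputs are hypothetical and are not taken from the study.

```python
import math


def fit_exponent(mc_values, quantile_gaps, iqrs):
    """Least-squares slope through the origin of y = ln((2/3)*gap/IQR)
    against MC, where gap is Q_beta - q3 (upper) or q1 - Q_alpha (lower)."""
    num = 0.0
    den = 0.0
    for mc, gap, iqr in zip(mc_values, quantile_gaps, iqrs):
        y = math.log(2.0 / 3.0 * gap / iqr)
        num += mc * y
        den += mc * mc
    if den == 0.0:
        raise ValueError("at least one nonzero MC value is required")
    return num / den


if __name__ == "__main__":
    # Synthetic check (not real gauge data): gaps generated with exponent 3.
    mcs = [0.1, 0.2, 0.3, 0.4]
    iqrs = [2.0, 1.5, 3.0, 2.5]
    gaps = [1.5 * math.exp(3.0 * m) * q for m, q in zip(mcs, iqrs)]
    print(fit_exponent(mcs, gaps, iqrs))
```

Fitting without an intercept keeps the constraint that the multiplier equals 1.5 when MC is zero.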

3. Results and Discussion

3.1. Data Preprocessing

In humid catchments, the gap between the maximum and minimum of the hourly rainfall records is large, and analyzing all records together makes it difficult to detect outliers correctly. To overcome this issue, the data records were preprocessed. First, the hourly areal mean rainfall (HAMP) was calculated using the Thiessen polygon method. Then, the HAMP values were divided into three groups: (0, 1 mm] (Group 1), (1 mm, 2 mm] (Group 2), and (2 mm, +∞) (Group 3). Finally, the rainfall records for each station were divided into the three groups based on the HAMP. The analysis was repeated for each group.
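The grouping step can be sketched as follows. The sketch assumes that the Thiessen polygon weights of the stations are already known (for example, as polygon areas) and that hours with zero areal rainfall are left ungrouped; both function names are illustrative.

```python
def areal_mean(station_rain, weights):
    """Weighted areal mean rainfall for one hour.

    station_rain: rainfall at each gauge (mm); weights: Thiessen polygon
    weights for the same gauges (assumed to be given, e.g., polygon areas).
    """
    total_weight = sum(weights)
    return sum(w * p for w, p in zip(weights, station_rain)) / total_weight


def hamp_group(hamp_mm):
    """Map an hourly areal mean rainfall value to Group 1, 2, or 3.

    Returns None when the value is not positive (assumption: such hours
    fall outside the three groups).
    """
    if hamp_mm <= 0:
        return None
    if hamp_mm <= 1:
        return 1   # (0, 1 mm]
    if hamp_mm <= 2:
        return 2   # (1 mm, 2 mm]
    return 3       # (2 mm, +inf)


if __name__ == "__main__":
    h = areal_mean([2.0, 4.0], [1.0, 3.0])
    print(h, hamp_group(h))
```

Each station record for that hour would then be analyzed within the group returned by `hamp_group`.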

3.2. MC Results

The average results for MC for the three groups are listed in Table 1.

The average MCs are all greater than zero, i.e., the rainfall datasets of the three groups are right skewed. The average MCs are less than 0.5, which indicates that the distributions are not extremely skewed. Nevertheless, it is risky to use the standard and SIQR boxplots to detect outliers in skewed rainfall distributions.

3.3. Results of the Function Comparisons

To compare the behavior of the three functions, they were fitted to the lower and upper boundaries of the fence. The fits of the linear, quadratic, and exponential functions are displayed in Figures 3 and 4. The quantity $\ln\!\left(\frac{2}{3}\,\frac{q_1 - Q_\alpha}{\mathrm{IQR}}\right)$ is the vertical axis in Figure 3, and $\ln\!\left(\frac{2}{3}\,\frac{Q_\beta - q_3}{\mathrm{IQR}}\right)$ is the vertical axis in Figure 4. These vertical axes are the same quantities as in equation (11), so the exponential functions appear as straight lines.

When MC was greater than 0.43, the linear function decreased abruptly and separated fully from the observation samples (Figure 3). In contrast, the quadratic and exponential functions fit the samples better. The fit of the exponential function was slightly better than that of the quadratic function (Figure 4), and the exponential function also requires fewer parameters than the quadratic function. As a result, the exponential function was selected to construct the robust boxplot fence. The new robust skewed boxplot fence is defined by
$$\left[\, q_1 - 1.5\,\exp(e\,\mathrm{MC})\,\mathrm{IQR},\;\; q_3 + 1.5\,\exp(f\,\mathrm{MC})\,\mathrm{IQR} \,\right]. \qquad (12)$$

The parameters in the new fence were estimated using the rainfall records from the three basins mentioned above. To simplify the practical application of the new method, the estimated exponents, −3.96 for the lower fence and f = 3.35 for the upper fence, were rounded to −4 and 3, respectively. Note that rounding both values down to the next smaller integer yields a narrower fence and consequently a more robust model. The new robust skewed fence is
$$\left[\, q_1 - 1.5\,e^{-4\,\mathrm{MC}}\,\mathrm{IQR},\;\; q_3 + 1.5\,e^{3\,\mathrm{MC}}\,\mathrm{IQR} \,\right]. \qquad (13)$$

Note that, although MC is a robust estimator, it can be affected by outliers, particularly at high percentages of outliers. To reduce the effects of a high percentage of outliers on MC and new boxplot fences, a low percentage of outliers (≤5%) is considered. This low percentage of outliers coincides with the characteristics of the telemetric rainfall observations.

3.4. Performance of Noncontaminated Data

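In this section, we compare the performance of the standard boxplot, the SIQR boxplot [17], and the proposed robust skewed boxplot using real data without outliers, namely, the real rain observations from the three basins. The differences among the three boxplots were analyzed by calculating the swamping rate (SR) (the proportion of “good” data identified as outliers) [23] and the accuracy (the proportion of outliers and “good” data identified correctly) [23]. The SR and accuracy are defined as
$$\mathrm{SR} = \frac{\text{number of good data identified as outliers}}{\text{total number of good data}}, \qquad (14)$$
$$\mathrm{accuracy} = \frac{\text{number of outliers and good data identified correctly}}{\text{total number of data}}. \qquad (15)$$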

When there are no outliers, the total number of good data equals the total number of data. In this case, SR plus accuracy equals 1.0. Table 2 lists the average SR and accuracy for the three methods. For the purpose of clarity, Figure 5 displays the average SR of every rain gauge for the standard boxplot and the robust skewed boxplot only.
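The two measures follow directly from their verbal definitions. The sketch below takes two Boolean lists, the true outlier labels and the labels assigned by a boxplot method; the function and argument names are illustrative.

```python
def swamping_rate_and_accuracy(is_true_outlier, is_flagged):
    """Return (SR, accuracy) from true labels and flagged labels.

    SR = good data flagged as outliers / total good data.
    accuracy = correctly classified observations / all observations.
    """
    pairs = list(zip(is_true_outlier, is_flagged))
    if not pairs:
        raise ValueError("at least one observation is required")
    good = sum(1 for truth, _ in pairs if not truth)
    swamped = sum(1 for truth, flag in pairs if not truth and flag)
    correct = sum(1 for truth, flag in pairs if truth == flag)
    sr = swamped / good if good else 0.0
    return sr, correct / len(pairs)


if __name__ == "__main__":
    # Ten good observations, two of which are wrongly flagged.
    truth = [False] * 10
    flagged = [True, True] + [False] * 8
    sr, acc = swamping_rate_and_accuracy(truth, flagged)
    print(sr, acc, sr + acc)
```

When no true outliers are present, every error is a swamped observation, which is why SR and accuracy sum to 1.0 in that case.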

Based on equation (14), SR is an estimator of method risk. Table 2 and Figure 5 clearly show that, without outliers, the robust skewed boxplot had a lower SR than the standard boxplot and SIQR. The SR of the proposed boxplot was 1.3%, which is close to the proportion outside the standard boxplot fence for Gaussian data (0.7%). SIQR was slightly superior to the standard boxplot; however, its SR was still far greater than 0.7%. The accuracy of the robust boxplot was the highest among the three methods. The standard boxplot performed much worse than the SIQR.

We also observed that the proposed robust skewed boxplots for each rain gauge yielded much better SR values on the skewed uncontaminated observations than the other boxplots. This was due to the fact that the robust skewed boxplot used MC to adjust the fences to the skewed data. The detailed information on the boxplots is shown in Figure 6.

The rain observations (HAMP ∈ (1, 2]) at Wuyigong Station were selected to demonstrate that the proposed method was able to adjust the fence to the skewed data. The results are presented in Figure 6. MC equaled 0.52, and the distribution was skewed to the right.

It is clear that the robust skewed boxplot yielded a larger upper boundary than the standard boxplot and SIQR and was thus adjusted to better reflect the right-skewed data. The proposed boxplot identified fewer large good data points as upper outliers. At the same time, the lower fence of the new method lies closer to the data than those of the other methods, which may result in the smallest good data being marked as lower outliers (Figure 6). However, for the flood forecasting system, the effect on the forecasting results of large good data being identified as outliers is far greater than that of small good data being identified as outliers. The robust skewed boxplot therefore has practical value.

3.5. Performance under Contamination

We now compare the robustness of the three boxplots using contaminated data. To control the properties of the outliers, synthetic datasets were generated by superimposing upper outliers on the real rain observations in Section 2.2. Each outlier depends on a random number and on a constant T that controls its maximum. L is the frequency of outliers; for example, L = 20 means that the outlier percentage is 5% (one outlier per 20 observations). By adjusting T and L, outliers of different magnitude and frequency could be generated.

We generated outlier samples and ran the experiment 1000 times for every T (T = 5 mm, 10 mm, 20 mm, 30 mm, 40 mm, 50 mm, and 100 mm) and L = 20.
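One way to set up such a contamination experiment is sketched below. The exact outlier formula of the study is not reproduced here; the sketch assumes, purely for illustration, that one observation in every L is chosen at random and increased by r·T, where r is uniform on [0, 1). The function name and the seed argument are hypothetical.

```python
import random


def contaminate(observations, T, L, seed=0):
    """Superimpose upper outliers on a copy of `observations`.

    Assumption (not from the source): len(observations) // L positions are
    drawn at random, and each selected value is increased by r*T with r
    uniform on [0, 1). Returns (contaminated values, outlier flags).
    """
    rng = random.Random(seed)
    n_outliers = len(observations) // L
    positions = set(rng.sample(range(len(observations)), n_outliers))
    contaminated = []
    is_outlier = []
    for i, value in enumerate(observations):
        if i in positions:
            contaminated.append(value + rng.random() * T)
            is_outlier.append(True)
        else:
            contaminated.append(value)
            is_outlier.append(False)
    return contaminated, is_outlier


if __name__ == "__main__":
    clean = [1.0] * 100
    # Repeat with different seeds to average results over many runs.
    for seed in range(3):
        noisy, flags = contaminate(clean, T=40.0, L=20, seed=seed)
        print(sum(flags), max(noisy))
```

The returned flags serve as the ground truth when SR and accuracy are computed for each boxplot method.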

The MC of different T values at Wuyigong Station in the Qilijie basin is listed in Table 3. We obtained comparable results for the other rain gauge stations.

The results in Table 3 show that MC changed little as T increased. This demonstrates that MC was not noticeably influenced by the outliers. The average performance of the three boxplots with outliers is shown in Table 4 for T = 40 mm.

Under contamination, the total number of good data did not equal the total number of the datasets, and SR plus accuracy did not equal 1.0.

Because they are based on quantiles, all three boxplots were able to resist outliers and maintained robust results for noncontaminated and contaminated data (Tables 2 and 4). In addition to the quantiles, the new boxplot uses the MC, which is robust to outliers and sensitive to skewness, to construct the new fences. By moving the upper boundary up, the proposed method achieved a much lower SR and higher accuracy than the standard boxplot and SIQR. This illustrates again that the proposed boxplot accounts sufficiently for skewness.

To further analyze the effects of different values of L on performance, we ran the experiment 1000 times for L = 10. The average performance for L = 10 is listed in Table 5 (T = 40 mm).

When the outlier frequency changed from 5% to 10%, the performance of the three boxplots varied only slightly. The size and the frequency of the outliers had only a small effect on SR and accuracy. However, these results hold only under some restrictive conditions; for example, the proportion of outliers cannot be too high.

4. Conclusions

The standard boxplot is a popular nonparametric method for detecting outliers in data series. Unfortunately, when it is used on skewed data, such as hourly rainfall series, the probability of identifying good data as outliers is high. Therefore, the MC, which is not only robust to outliers but also sensitive to skewness, was combined with several simple functions to adjust the standard boxplot fence to the skewed hourly rainfall distributions. The exponential function was then selected based on comparisons.

The comparison of the results using uncontaminated and abnormal data showed that the proposed method had robust performance and a lower risk of identifying good data as outliers, compared with the standard boxplot and SIQR.

In flood forecasting systems, the decision to eliminate data as outliers is a serious matter and should not be taken lightly. The unusual good observations often provide valuable information. Therefore, a more conservative approach (less risk of identifying good data as outliers), such as the robust skewed boxplot method, is reliable for practical applications.

Data Availability

The rainfall observation data used to support the findings of this study were supplied by the branch of hydrology and water resources investigation bureau of Fujian Province under license and so cannot be made freely available. Requests for access to these data should be made to the branch of hydrology and water resources investigation bureau of Fujian Province, http://www.fjsw.gov.cn/.

Conflicts of Interest

The authors declare no conflicts of interest.

Authors’ Contributions

C. Z. proposed the research ideas and methods in the manuscript and was responsible for data collection and writing. H. L. and J. Y. suggested revisions for the paper.

Acknowledgments

This study was funded by the National Natural Science Foundation of China (grant no. 50909084) and the Natural Science Foundation of Fujian Province (grant nos. 2017J01721 and 2018J01525).