Abstract

Extreme events in phenomena such as earthquakes and wind speed are rare but may have catastrophic effects on humans and the environment. The primary parameter in the estimation of such rare events is the tail index, which measures the tail heaviness of an underlying distribution. Since extreme events are rare, the presence of missing observations may further lead to flawed estimates. In view of this, there is a growing effort by researchers to address this problem. However, the existing methods of estimating the tail index use only the available nonmissing data. Thus, if the missing observations are influential values, ignoring them could introduce more bias and higher mean square error (MSE) in the tail index estimation and, subsequently, in other estimators of extreme events such as high quantiles and small exceedance probabilities. In this study, we propose imputation of the missing observations before applying some standard estimators (Hill and geometric-type) to estimate the tail index. Through a simulation study, we assess the performance of the standard estimators under the proposed data enhancement method and the existing modified estimators of the tail index. The results show that the enhanced estimators have relatively lower bias and MSE. The estimation method is illustrated with a practical dataset on wind speed with missing values. Therefore, we recommend imputation as a viable mechanism for enhancing the performance of tail index estimators when there is missingness.

1. Introduction

Statistics of extremes is a branch of statistics that deals with the estimation of parameters of rare events. It enables the assessment of the probability of events that are more extreme than any previous observation from a sample of random variables (Coles [1]). According to Gomes and Guillou [2, 3], the occurrence of rare events in phenomena such as earthquakes, hurricanes, wind speed, sea waves, and floods can have a catastrophic impact on human beings and the environment. Modelling the occurrence of such events aids in planning to reduce or prevent their impact.

Recent developments in this area focus on modelling and predicting rare events to mitigate their negative impact on humans and property. The primary parameter of interest is the tail index or extreme value index (EVI), which measures the tail heaviness of an underlying distribution. One key challenge researchers encounter in their quest to model rare events or estimate the tail index is that the number of observations available is usually very small, or even zero, because such events occur so rarely. Therefore, having missing observations can affect the tail index estimates computed from the sample, thereby leading to unreliable estimates of exceedance probabilities, high quantiles, and return periods, which are the goals of extreme value analysis. In this study, we propose an enhanced method of estimating the tail index of underlying distributions, where the missing values are estimated and replaced in the data via an imputation method.

Extreme value theory (EVT) was developed to model the tails of probability distributions. According to Fisher and Tippett [4], there are three broad families of limiting distributions in EVT, namely, Gumbel, Fréchet (Pareto or Fréchet-Pareto), and Weibull. These are also referred to as the extreme value distributions (EVD). These three families of extreme value distributions were unified by Jenkinson [5] as the generalised extreme value (GEV) distribution. The GEV distribution function for a random variable $X$ is given by

$$G(x) = \begin{cases} \exp\left\{-\left[1+\gamma\left(\dfrac{x-\mu}{\sigma}\right)\right]^{-1/\gamma}\right\}, & 1+\gamma\left(\dfrac{x-\mu}{\sigma}\right) > 0,\ \gamma \neq 0, \\[2ex] \exp\left\{-\exp\left[-\left(\dfrac{x-\mu}{\sigma}\right)\right]\right\}, & x \in \mathbb{R},\ \gamma = 0, \end{cases} \tag{1}$$

where $\mu \in \mathbb{R}$, $\sigma > 0$, and $\gamma \in \mathbb{R}$ are the location, scale, and the tail index (or extreme value index (EVI)), respectively. According to Coles [1], $\gamma$ in (1) determines the most suitable type of tail behaviour for a dataset. The tail distribution is classified as Gumbel when $\gamma = 0$, Fréchet-Pareto when $\gamma > 0$, and Weibull when $\gamma < 0$.
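The role of the tail index in the GEV distribution function can be checked numerically. The following is a minimal sketch, assuming the standard GEV parameterisation with location `mu`, scale `sigma`, and tail index `gamma`; the function names `gev_cdf` and `tail_family` are hypothetical and are not taken from the study.

```python
import math


def gev_cdf(x, mu=0.0, sigma=1.0, gamma=0.0):
    """Evaluate the GEV distribution function at x.

    mu is the location, sigma > 0 the scale and gamma the tail index.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    try:
        if abs(gamma) < 1e-12:
            # Gumbel case (gamma = 0).
            s = math.exp(-z)
        else:
            t = 1.0 + gamma * z
            if t <= 0:
                # Outside the support: below the lower endpoint when
                # gamma > 0, above the upper endpoint when gamma < 0.
                return 0.0 if gamma > 0 else 1.0
            s = t ** (-1.0 / gamma)
    except OverflowError:
        # s would be astronomically large, so exp(-s) is numerically 0.
        return 0.0
    return math.exp(-s)


def tail_family(gamma):
    """Name the tail class implied by the sign of the tail index."""
    if gamma > 0:
        return "Frechet-Pareto"
    if gamma < 0:
        return "Weibull"
    return "Gumbel"


# Compare the three tail classes at the same point x = 2.
for g in (-0.5, 0.0, 0.5):
    print(tail_family(g), round(gev_cdf(2.0, gamma=g), 4))
```

With a negative tail index the distribution has a finite upper endpoint, so the function returns 1 beyond it, whereas a positive tail index leaves noticeable probability mass far to the right.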

Much of the research in statistics of extremes has been limited to estimating the tail index and other parameters of extremes with complete datasets (Beirlant et al., Ameraoui et al., Minkah et al. [6–9]). Gomes and Pestana [10] and Beirlant et al. [11] proposed reduced-bias estimators for the tail index on complete data. In the presence of censoring, Gomes and Neves [12], Ameraoui et al. [6], and Minkah et al. [13] provide techniques for incorporating censoring in the estimation of the tail index. Li and Qi [14] employed an existing adjusted empirical likelihood method to construct confidence intervals for the tail index with better accuracy. Through a simulation study, they found their method to be superior in terms of coverage probability and length of confidence intervals. In the presence of covariate information, Ma et al. [15] propose empirical likelihood-based statistics to construct confidence regions for the regression coefficient of the parametric tail index regression model. Also, Minkah et al. [16] study several estimators of the tail index in the presence of both censoring and covariate information. However, none of these estimators takes missing data into account, and hence they are somewhat challenged in the presence of missing observations.

Since missing data is a common problem in statistics, some authors such as Mladenovic and Piterbarg [17] worked on the estimation of the exponent of regular variation using incomplete data samples. They proved the asymptotic consistency of the Hill estimator. Ilić and Mladenovic [18] extended the works of Hsing [19] and Mladenovic and Piterbarg [17], where the authors relied on available observations (incomplete samples). Under the assumption of weak dependency, they proved the consistency of their proposed Hill-type estimator of the tail index based on an incomplete sample.

In addition, Zou et al. [20] focused on extreme value analysis without the largest values. The study revealed that the presence of missing extremes makes the choice of threshold for the top order statistics problematic. They considered the number of missing extremes simultaneously with the tail index and other parameters and proposed a functional version of the Hill estimator, named the Hill Estimator Without Extremes (HEWE). The estimator was found to be robust to missing extremes on light-tailed datasets.

Furthermore, Ilić and Veličković [21] considered simple tail index estimation in the case of heterogeneous and dependent data samples with missing values. Their study was on the asymptotic behaviour of the median estimator and its robustness against deviations of the slowly varying function. Although the proposed method provided a reliable tail index estimate under small deviations from the assumed parametric model, the top values of the sample were not considered.

However, the existing methods for estimating the tail index use reduced sample size since portions of the dataset within the order statistics that are missing are ignored. Using only portions of the data may result in estimators with large bias and/or variance especially if the missing observations are influential in the top order statistics which are of interest in statistics of extremes.

Therefore, in the quest to reduce bias and variance in tail index estimation in the presence of missing observations, we propose imputation of the missing observations before applying standard tail index estimators (such as Hill and geometric-type), instead of using the modified estimators in the literature.

The rest of the paper is organised as follows. In Section 2, we present the materials and methods including the existing and the proposed method for estimating tail index. Section 3 presents the results of the simulation study and a discussion of the results. Lastly, in Section 4, we provide concluding remarks, areas for future research, and recommendations.

2. Materials and Methods

Let $X_1, X_2, \ldots, X_n$ be a sample of independent and identically distributed observations from some process with underlying distribution $F$. Assume $X_{1,n} \le X_{2,n} \le \cdots \le X_{n,n}$ to be the order statistics associated with the sample. The so-called semiparametric estimators of the tail index (see, e.g., Beirlant et al. [7, 22], de Haan and Resnick [23]) in the literature rely on the exceedances over a particular threshold, $X_{n-k,n}$. Thus, the dependence of the tail index on the $k$ largest order statistics makes the selection of $k$ critical. A careful choice of $k$ is needed, as a small value leads to a large $X_{n-k,n}$ and hence few observations for estimation. This may lead to a tail index estimator with smaller bias but larger variance. On the contrary, a large $k$ leads to a small $X_{n-k,n}$, which may result in the inclusion of observations with smaller magnitude, leading to a larger sample. Although having more data will reduce the variance, it is at the expense of bias. The problem is addressed by choosing $k$ such that as $n$ increases, $k$ increases but at a slower rate. Formally, $k = k_n$ is chosen such that

$$k \to \infty \quad \text{as } n \to \infty \tag{2}$$

and

$$\frac{k}{n} \to 0 \quad \text{as } n \to \infty. \tag{3}$$

Equations (2) and (3) define sequences of positive integers $k$, which are referred to as intermediate sequences. Next, we present some standard estimators for tail index estimation under a complete dataset.
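As an illustration that is not taken from the study, a power-type choice of $k$ satisfies the requirement that $k$ grows with $n$ but at a slower rate:

```latex
k_n = \lfloor n^{\delta} \rfloor, \quad 0 < \delta < 1
\;\Longrightarrow\;
k_n \to \infty
\quad \text{and} \quad
\frac{k_n}{n} \le n^{\delta - 1} \to 0
\quad (n \to \infty).
```

For example, with $\delta = 1/2$ and $n = 10000$, this gives $k_n = 100$, that is, 1% of the sample.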

2.1. Hill Estimator

The Hill estimator (Hill [24]) is the most popular semiparametric estimator in the Fréchet-Pareto family (Gomes and Guillou [2, 3]). The Hill estimator is valid for $\gamma > 0$ and is given by

$$\hat{\gamma}^{H}_{k,n} = \frac{1}{k}\sum_{i=1}^{k} \log X_{n-i+1,n} - \log X_{n-k,n},$$

where $X_{1,n} \le \cdots \le X_{n,n}$ are the sample order statistics and $k$ is an intermediate sequence of integers defined in (2) and (3). Some desirable properties of the Hill estimator are its consistency and asymptotic normality (de Haan and Resnick [23]). However, it is known for its dependence on $k$ and exhibits large bias for large $k$ values.
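The computation behind the Hill estimator is short enough to sketch directly. The example below is a minimal illustration, assuming the usual form of the estimator (the mean of the logarithms of the $k$ largest observations minus the logarithm of the $(k+1)$-th largest); the function name `hill_estimator`, the seed, and the strict Pareto test sample are assumptions made for illustration only.

```python
import math
import random


def hill_estimator(sample, k):
    """Hill estimate of the tail index from the k largest observations."""
    x = sorted(sample)
    n = len(x)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    threshold = x[n - k - 1]  # the (k+1)-th largest value, X_{n-k,n}
    if threshold <= 0:
        raise ValueError("the threshold must be positive to take logarithms")
    # x[n - i] is the i-th largest value, X_{n-i+1,n}.
    top_logs = sum(math.log(x[n - i]) for i in range(1, k + 1))
    return top_logs / k - math.log(threshold)


rng = random.Random(2024)  # arbitrary seed, for reproducibility only
gamma_true = 0.5
# Inverse transform: if U is uniform on (0, 1], U**(-gamma) is strict Pareto
# with tail index gamma.
pareto = [(1.0 - rng.random()) ** (-gamma_true) for _ in range(5000)]

for k in (50, 200, 500):
    print(k, round(hill_estimator(pareto, k), 3))
```

For a strict Pareto sample the estimates should stay close to the true tail index across $k$; for other heavy-tailed distributions the dependence on $k$ noted above becomes visible.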

2.2. Geometric-Type Estimator

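The geometric-type estimator, proposed by Brito and Freitas [25], is a geometrical estimator motivated by the fact that, for large $x$, $\log(1 - F(x))$ is approximately linear in $\log x$ with slope $-1/\gamma$, that is, $1 - F(x) \approx C x^{-1/\gamma}$, where $C$ is a positive constant. The estimator is given by

$$\hat{\gamma}^{GT}_{k,n} = \sqrt{\frac{\sum_{i=1}^{k}\log^{2} X_{n-i+1,n} - \frac{1}{k}\left(\sum_{i=1}^{k}\log X_{n-i+1,n}\right)^{2}}{\sum_{i=1}^{k}\log^{2}(n/i) - \frac{1}{k}\left(\sum_{i=1}^{k}\log(n/i)\right)^{2}}},$$

where $n$ is the sample size (number of random variables) and $k$ is a sequence of positive integers satisfying $k < n$, as well as Equations (2) and (3). Brito and Freitas [25] investigated the weak asymptotic properties of the geometric-type estimator and showed that its distribution is asymptotically normal under general conditions.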
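A minimal sketch of a geometric-type estimator follows. It assumes the estimator is written for the tail index itself, as the square root of the ratio of the centred sum of squares of the log upper order statistics to that of $\log(n/i)$, $i = 1, \ldots, k$; this form, the function name `geometric_type_estimator`, and the test data are assumptions and may differ in detail from the expression used by Brito and Freitas [25].

```python
import math


def centred_sum_of_squares(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def geometric_type_estimator(sample, k):
    """Geometric-type estimate of the tail index from the k largest values."""
    x = sorted(sample)
    n = len(x)
    if not 2 <= k < n:
        raise ValueError("k must satisfy 2 <= k < n")
    if x[n - k] <= 0:
        raise ValueError("the k largest values must be positive")
    # Logs of the k largest observations, largest first.
    log_x = [math.log(x[n - i]) for i in range(1, k + 1)]
    # Matching logs of n / i, which approximate -log of the tail probability.
    log_rank = [math.log(n / i) for i in range(1, k + 1)]
    return math.sqrt(centred_sum_of_squares(log_x)
                     / centred_sum_of_squares(log_rank))


# Idealised Pareto-type data: the i-th largest value is exactly (n / i)**gamma,
# so the log-log plot is a straight line and the estimator recovers gamma.
n, gamma_true = 1000, 0.5
ideal = [(n / i) ** gamma_true for i in range(1, n + 1)]
print(geometric_type_estimator(ideal, 100))
```

The idealised data make the log-log relationship exactly linear, so the estimate equals the assumed tail index up to floating-point error; on real samples the estimate varies with $k$.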

2.3. Estimators of Tail Index under Missing Observations

We now discuss existing modifications of the Hill and geometric-type estimators of the tail index when there are missing observations. For a given sample $X_1, \ldots, X_n$ with some missing observations, we consider the observed portion of the sample to be $\tilde{X}_1, \ldots, \tilde{X}_{S_n}$, where $S_n$ is the number of observed random variables among the first $n$ terms of the sequence and $S_n \le n$. The order statistics of the observed sample are $\tilde{X}_{1,S_n} \le \cdots \le \tilde{X}_{S_n,S_n}$, such that the associated maximum is $\tilde{X}_{S_n,S_n}$.

According to Ilić [26], a sample with missing observations may be obtained on condition that every term is observed with probability $p$, independently of the other terms. Hence, the number of observed terms, $S_n$, is a binomial random variable with parameters $n$ and $p$. To obtain the tail index estimator, the following assumption must be satisfied:

The random variable is independent of , and there exists a sequence of real numbers such that and

Let be a sequence of real numbers such that and . Also, let and

Then, the inequalities and will hold for , the largest integer not greater than .

The modified Hill and geometric-type estimators are respectively given as and where is the indicator function such that for , , and if .
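The missingness mechanism described in this subsection, in which every term is observed independently with the same probability, can be simulated directly. The sketch below is illustrative only: the function name `observe_with_probability`, the seed, and the values of `n` and `p` are assumptions. It checks empirically that the number of observed terms behaves like a binomial random variable.

```python
import random


def observe_with_probability(sample, p, rng):
    """Keep each term independently with probability p; return observed terms."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return [x for x in sample if rng.random() < p]


rng = random.Random(7)  # arbitrary seed
n, p = 1000, 0.7
full_sample = list(range(n))

# Repeat the thinning and record how many terms are observed each time.
counts = [len(observe_with_probability(full_sample, p, rng)) for _ in range(500)]
mean_count = sum(counts) / len(counts)
var_count = sum((c - mean_count) ** 2 for c in counts) / (len(counts) - 1)

print("mean observed:", mean_count, "binomial mean:", n * p)
print("variance:", round(var_count, 1), "binomial variance:", n * p * (1 - p))
```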

2.4. Multivariate Imputation by Chained Equations (MICE)

MICE is one of the most widely used imputation methods for filling missing observations in data. MICE, also known as sequential regression or fully conditional specification multiple imputation, is a very flexible method because it can handle different variable types such as discrete and continuous. It uses fully conditional specification to preserve unique features such as bounds, skip patterns, interactions, and bracketed responses in the data (Van Buuren [27]).

MICE operates under the assumption of Missing at Random (MAR), which implies that the probability that a value is missing does not depend on the unobserved values but may depend on the observed values (Schafer and Graham [28]). MICE has three phases, which are similar to those of any other multiple imputation method: imputation, analysis, and pooling. It creates multiple imputations to overcome the limitations of single imputation. MICE can handle large datasets through the use of chained equations, in contrast to other imputation methods that use joint models (He et al. [29]). This makes it a flexible multiple imputation method that uses a number of regression algorithms. In this study, we use the MICE algorithm to impute missing observations.

2.5. Proposed Data Enhancement Method for Tail Index When Observations Have Missingness

This method uses the MICE algorithm to impute the missing observations before applying the standard estimators of the tail index such as the Hill and geometric-type. Again, consider the sample $X_1, \ldots, X_n$, where $n$ is the sample size of a dataset which is not fully realized due to missing observation(s). That is, the dataset available is $\tilde{X}_1, \ldots, \tilde{X}_{S_n}$, with $S_n < n$ observed values. We propose the following data enhancement and estimation method of the tail index parameter:
(1) Apply MICE on the incomplete data, $\tilde{X}_1, \ldots, \tilde{X}_{S_n}$, to generate the missing observations
(2) Combine the imputed observations and the available observations to obtain a sample of size $n$ with observations $X^{*}_1, \ldots, X^{*}_n$, hereinafter referred to as an imputed dataset
(3) Obtain the order statistics $X^{*}_{1,n} \le \cdots \le X^{*}_{n,n}$ associated with the imputed dataset, such that the associated maximum is $X^{*}_{n,n}$
(4) Assume $F$ is in the Maximum Domain of Attraction (MDA) for a suitable tail index $\gamma$, as is the case in a semiparametric framework, and select the $k$ upper order statistics in the imputed dataset
(5) Estimate $\gamma$ using the standard estimators (Hill and the geometric-type without any modification) based on the $k$ upper order statistics, where $k$ is an intermediate sequence
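The steps above can be sketched end to end. The example below is not the MICE algorithm: as a loudly labelled stand-in, each missing value is replaced by a random draw from the observed values, simply so that the sketch runs with the standard library alone. The function names, the seed, the 30% missingness rate, and the Pareto test data are all assumptions made for illustration.

```python
import math
import random


def impute_by_random_draw(data, rng):
    """Stand-in imputation (NOT MICE): fill each None with a random observed value."""
    observed = [x for x in data if x is not None]
    if not observed:
        raise ValueError("at least one observed value is required")
    return [x if x is not None else rng.choice(observed) for x in data]


def hill_estimator(sample, k):
    """Standard Hill estimate from the k largest values of a complete sample."""
    x = sorted(sample)
    n = len(x)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    threshold = x[n - k - 1]
    if threshold <= 0:
        raise ValueError("the threshold must be positive to take logarithms")
    return sum(math.log(x[n - i]) for i in range(1, k + 1)) / k - math.log(threshold)


rng = random.Random(1)  # arbitrary seed
gamma_true = 0.5
complete = [(1.0 - rng.random()) ** (-gamma_true) for _ in range(2000)]

# Remove roughly 30% of the values at random to mimic an incomplete dataset.
incomplete = [x if rng.random() < 0.7 else None for x in complete]

# Steps (1)-(2): impute and combine into a dataset of the original size.
imputed = impute_by_random_draw(incomplete, rng)

# Steps (3)-(5): order the imputed data and apply the unmodified estimator.
gamma_hat = hill_estimator(imputed, k=200)
print("missing:", incomplete.count(None), "estimate:", round(gamma_hat, 3))
```

In practice the imputation step would be carried out with a MICE implementation; only the surrounding workflow (impute, recombine, order, estimate) is what this sketch is meant to convey.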

2.6. Simulation Design

In this section, we present a simulation study to compare the results of the data enhancement method to some existing methods, such as Mladenovic and Piterbarg [17], for estimating the tail index as discussed in Section 2.3. We generate samples from distributions from the Pareto domain of attraction for the simulation study. Table 1 contains the distribution functions used for the simulation study and their characteristics.

A step-by-step procedure for the simulation study is outlined in the flowchart in Figure 1.

In the next section, we present and discuss the simulation results. However, for brevity and ease of presentation, we provide the simulation results and discussion of the performance of the estimators for samples generated from the Burr distribution only. The results from the other distributions did not differ significantly from the Burr distribution and are available upon request from the authors.
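The evaluation of bias and MSE in such a simulation can be sketched as follows. This is a minimal illustration on complete data only, not a reproduction of the procedure in Figure 1. It assumes a Burr distribution with survival function $(\eta / (\eta + x^{\tau}))^{\lambda}$, for which the tail index is $1/(\lambda\tau)$; this parameterisation, the parameter values, the sample size, the number of replications, and the function names are all assumptions rather than the settings of Table 1.

```python
import math
import random


def burr_sample(n, eta, tau, lam, rng):
    """Draw n values with survival function (eta / (eta + x**tau))**lam."""
    values = []
    for _ in range(n):
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids division by zero
        values.append((eta * (u ** (-1.0 / lam) - 1.0)) ** (1.0 / tau))
    return values


def hill_estimator(sample, k):
    """Hill estimate of the tail index from the k largest observations."""
    x = sorted(sample)
    n = len(x)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    threshold = x[n - k - 1]
    if threshold <= 0:
        raise ValueError("the threshold must be positive to take logarithms")
    return sum(math.log(x[n - i]) for i in range(1, k + 1)) / k - math.log(threshold)


def bias_and_mse(estimates, truth):
    """Monte Carlo bias and mean square error of a list of estimates."""
    m = len(estimates)
    bias = sum(estimates) / m - truth
    mse = sum((e - truth) ** 2 for e in estimates) / m
    return bias, mse


rng = random.Random(123)  # arbitrary seed
eta, tau, lam = 1.0, 2.0, 1.0  # assumed Burr parameters
gamma_true = 1.0 / (lam * tau)
n, k, reps = 500, 50, 200  # assumed simulation settings

estimates = [hill_estimator(burr_sample(n, eta, tau, lam, rng), k)
             for _ in range(reps)]
bias, mse = bias_and_mse(estimates, gamma_true)
print("bias:", round(bias, 4), "MSE:", round(mse, 4))
```

Repeating the last step over a grid of $k$ values, and over datasets with missing or imputed values, would give bias and MSE curves of the kind compared in the next section.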

3. Results and Discussion

Generally, an estimator with relatively smaller bias and MSE is preferred. In addition, we require such an estimator to be stable as $k$ increases. An estimator that is less sensitive to changing values of $k$ maintains a stable path throughout the range of $k$. Such an estimator is deemed the most appropriate for the estimation of the tail index, as it maintains a better balance between bias and variance.

Since extreme value analysis for the right tail concerns larger observations, we assess the estimators’ performances for $k$ up to 60% of the sample size. This enables the inclusion of smaller order statistics and the assessment of the estimators’ performance across a broad spectrum of $k$ where bias is expected to be prevalent.

Table 2 contains the notations of estimators used in the study.

Also, all the simulation and practical application (i.e., Section 3.2) results were obtained using the R statistical software, and the code is available upon request from the authors.

3.1. Results from the Burr Distribution

For each figure discussed in this subsection, the left and the right panels show the absolute bias and the MSE of an estimator, respectively. Also, the top, middle, and bottom panels depict 10%, 30%, and 60% missing observations, respectively, relative to the sample size.

Figure 2 shows the absolute bias and MSE of the estimators of the tail index for samples generated from the Burr distribution at the first sample size considered. It is evident from the left panel (i.e., consisting of 10%, 30%, and 60% missingness) of Figure 2 that the Hill estimator on the MICE-imputed dataset (Im.Hill) has the smallest absolute bias for small values of $k$. Generally, the bias of Im.Hill increases as $k$ increases regardless of the percentage of missing observations. Also, Im.Hill is not stable as $k$ increases above the first five top order statistics.

Although the M_Hill estimator is the most biased estimator for small values of $k$, it has the least bias for larger values of $k$. The Im.Hill estimator is competitive with M_Hill for larger values of $k$.

In terms of MSE, the right panel of Figure 2 shows that all the estimators diverge as $k$ increases. However, the MSE of Im.Hill is closer to zero for smaller values of $k$ at all the percentages of missingness, i.e., 10%, 30%, and 60%. Also, for large $k$, the Im.Hill estimator generally has a smaller MSE, comparable to that of M_Hill, in the case of samples with high percentages of missingness.

Therefore, in terms of bias and MSE, the proposed Im.Hill can be considered the most appropriate across a large spectrum of $k$.

In addition, the simulation results for the second sample size considered are shown in Figure 3. In the case of bias, the Im.Hill estimator exhibits the least bias and is relatively stable for $k$ less than 40% of the sample size. Beyond this value of $k$, M_Hill has the smallest bias but is unstable, as it diverges as $k$ increases. For MSE, Im.Hill has the least MSE for $k$ less than approximately 50% of the sample size. Again, M_Hill has better MSE values than all the other estimators for $k$ over 50% of the sample size. However, for a high percentage of missingness, the proposed Im.Hill outperforms the M_Hill estimator. Thus, overall, Im.Hill provides a better estimator of the tail index in terms of bias and MSE across the values of $k$.

Furthermore, in the case of the third sample size considered, the estimators exhibit similar performance characteristics to the two preceding cases discussed. More importantly, a closer look at the corresponding graphs for all the sample sizes indicates that the MSE generally decreases as $n$ increases. This agrees empirically with the consistency property of estimators of the tail index.

Moreover, the right panel, which shows the estimators’ performance in terms of MSE, indicates that the Im.Hill, Geom, and Im.Geom estimators are more stable between the 50 and 200 upper order statistics, whereas M_Hill is not stable in this range. Here again, the Hill estimator on the MICE-imputed dataset (Im.Hill) has the smallest MSE within the 200 upper order statistics. Hence, Im.Hill is the preferred estimator for estimating the tail index under missing observations when a sample of this size is drawn from the Burr distribution.

The results of the tail index estimators for a further sample size drawn from the Burr distribution are presented in Figure 4. The results in the left panel (subgraphs (a), (c), and (e), representing absolute bias for 10%, 30%, and 60% missingness) indicate that Geom, Im.Geom, and M_Hill are not stable within the first 200 upper order statistics.

Specifically, the absolute bias of M_Hill decreases as $k$ increases between 200 and 1000 upper order statistics, whereas the absolute bias of Im.Hill, Geom, and Im.Geom increases as $k$ increases. Comparatively, Im.Hill is stable within the first 200 upper order statistics, with the smallest absolute bias. Therefore, Im.Hill can be said to be the preferred estimator in terms of low bias. Figures 4(b), 4(d), and 4(f) present the MSE of the tail index estimators for the same sample size. Between 200 and 800 upper order statistics, the MSE of M_Hill decreases as $k$ increases. The MSE of Im.Hill, Geom, and Im.Geom is more stable as $k$ increases. Im.Hill has the smallest MSE within the first 400 upper order statistics and hence is selected as the most appropriate estimator for estimating the tail index under missing observations using a sample of this size drawn from the Burr distribution.

3.2. Application to Real-Life Data

In this section, we illustrate the enhanced and existing methods of estimating the tail index under missing observations on real-life wind speed data obtained from the extremefit package in R (Durrieu et al. [30]). The wind speed data contains the average wind speed in meters per second (m/s) per day in Brest (France) from 1976 to 2005. The data contains 10903 observations with a minimum wind speed of 0.700 m/s, a maximum of 27.400 m/s, and an average daily mean wind speed of 8.553 m/s.

High wind speeds are known to cause collapse of buildings, ships, difficulties in aircraft takeoff and landing, among others (see, e.g., Marchigiani et al. [31]). Therefore, modelling the tail behaviour of an underlying distribution of wind speed will help in planning and mitigating the effects of extreme wind speeds. The presence of missing values in historic wind speed data needs to be taken care of in the modelling process. In the case of the present wind speed data, there are 6 missing values. In addition, in order to further assess the suitability of the proposed data enhancement method for tail index estimation, we introduced missingness of up to 45% (i.e., 10%, 25%, and 45% to represent small, medium, and large percentages of missingness, respectively) of the sample size.

The application of the proposed data enhancement method of estimating the tail index begins with a search for the domain of attraction of the underlying distribution of the wind speed dataset. Figure 5 (the scatter plot of the wind speed data) shows that a few observations are detached from the majority of the data values. Thus, the detached values (large values) may be outliers or extreme observations.

It is evident from the histogram that the wind speed data is positively skewed, which suggests that the data has a heavy tail. Also, the general increasing trend of the mean excesses as the threshold increases indicates that the wind speed data has an underlying distribution that is heavier-tailed than the exponential. Again, the QQ-plots in Figures 6(c) and 6(d) compare the sample quantiles of the wind speed data to the theoretical quantiles of the exponential and Pareto distributions. Both plots support the assertion from the previous graphs that the underlying distribution has a Pareto-type tail.
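The mean excess diagnostic used here can be sketched numerically. The example below is illustrative and does not use the wind speed data: it assumes the empirical mean excess at the threshold $X_{n-k,n}$ is the average of the $k$ largest observations minus that threshold, and it uses idealised exponential and Pareto quantiles (both assumptions) to show the contrast between an exponential tail and a Pareto-type tail.

```python
import math


def mean_excess(sample, k):
    """Return the threshold X_{n-k,n} and the mean excess of the k values above it."""
    x = sorted(sample)
    n = len(x)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    threshold = x[n - k - 1]
    excess = sum(x[n - i] - threshold for i in range(1, k + 1)) / k
    return threshold, excess


# Idealised samples built from evenly spaced probabilities, so the output
# is deterministic and free of sampling noise.
n = 1000
probs = [(i - 0.5) / n for i in range(1, n + 1)]
exponential = [-math.log(1.0 - p) for p in probs]
pareto = [(1.0 - p) ** (-0.5) for p in probs]  # tail index 0.5

# Smaller k means a higher threshold.
for k in (500, 200, 50):
    u_e, e_e = mean_excess(exponential, k)
    u_p, e_p = mean_excess(pareto, k)
    print(k, "exponential:", round(u_e, 2), round(e_e, 2),
          "Pareto:", round(u_p, 2), round(e_p, 2))
```

For the exponential sample the mean excess stays roughly constant as the threshold rises, whereas for the Pareto sample it grows with the threshold, which is the pattern associated with a tail heavier than the exponential.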

Next, we apply the geometric-type estimator to estimate the tail index of the available data and call this the “gold standard.” Subsequently, datasets with 10%, 25%, and 45% of the observations missing, relative to the sample size, were created randomly using the ampute() function in the R package mice. The existing modified Hill and geometric-type estimators in the literature are used to estimate the tail index of the underlying distribution of the wind speed datasets with the missing observations. However, in the application of our method, we impute the missing observations using the mice() function in each of the three datasets containing the various percentages of missing observations. Thereafter, we use the standard Hill and geometric-type estimators to estimate the tail index of the underlying distribution from each sample.

Figure 7 presents the results of the tail index estimators on the wind speed data as a function of the number of top order statistics, $k$.

For each of the estimators of the tail index, the performance is assessed by its closeness to the standard geometric-type estimator (i.e., C_Geom, for which the estimation is done on the complete wind speed dataset) at different values of $k$.

It can be seen from Figure 7 that, as $k$ increases, almost all the estimators deviate from the standard geometric-type estimator on the complete dataset (C_Geom). From Figure 7(a) (i.e., 10% missingness), the estimates of Geom and Im.Geom are almost the same as those of C_Geom, whereas the estimates of Im.Hill and M_Hill are farther away from C_Geom. Also, in Figure 7(b) (with 25% missingness), the Im.Geom estimator is closer to C_Geom but diverges as the number of top order statistics increases. Also, the proposed Im.Hill is closer to C_Geom than the rest of the estimators over a limited range of $k$ and diverges beyond this range. In the case of a high percentage of missingness (i.e., 45%), Figure 7(c) shows that the estimates of Im.Hill are almost the same as those of C_Geom and quite stable compared with the other estimators of the tail index of the underlying distribution of the wind speed data.

In all the cases considered, it is evident that M_Hill deviates more from the standard than Geom, Im.Geom, and Im.Hill. Thus, it can be ruled out as unsuitable for estimating the tail index of the wind speed data.

Generally, Im.Hill and Im.Geom are relatively closer to the standard (C_Geom), whereas M_Hill deviates from the standard in all the scenarios. Therefore, the estimators of the tail index that are based on our proposed data enhancement method can be considered appropriate for estimating the tail index of the underlying distribution of the wind speed data. With these estimates, other parameters of extreme events such as exceedance probabilities, extreme quantiles, and return periods for certain wind speeds, which are the focus of extreme value analysis, can be obtained more readily.

4. Conclusion

In this paper, a data enhancement method is proposed for the estimation of the tail index of an underlying distribution of a dataset when some observations are missing. This method involves imputing the missing data with an appropriate imputation method and thereafter applying standard tail index estimators such as the Hill and geometric-type. This method is contrary to the existing approach, where standard estimators are modified to use only the nonmissing part of a dataset to estimate the tail index.

The estimators based on the data enhancement method are compared with the existing estimators of the tail index in the presence of missingness using a simulation study. The results of the simulation study show that no estimator is universally the best across a broad spectrum of the number of top order statistics and percentages of missingness. However, generally, the proposed estimators based on the data enhancement method exhibit smaller bias and MSE across a larger spectrum of top order statistics. More importantly, in the presence of a high percentage of missingness, the estimators based on the proposed data enhancement method show smaller bias and MSE and can thus be considered appropriate for estimating the tail index under missing observations.

In addition, the proposed data enhancement method of estimating the tail index, together with the existing estimators, was illustrated with a practical dataset on wind speed. The results show that the estimators based on the data enhancement method are competitive when there are few missing observations and are more suitable when there is a high percentage of missing observations. Therefore, the imputation of missing data to obtain a semblance of the complete data offers a good approach in tail index estimation. In this regard, the MICE algorithm is recommended as a suitable imputation mechanism for enhancing the performance of tail index estimators under missingness.

Data Availability

The wind speed data used in this study is publicly available in the R package extremefit under the name dataWind.

Conflicts of Interest

The authors declare that there is no conflict of interest.