Abstract

Many definitions exist for sample quantiles and are included in statistical software. The need to adopt a standard definition of sample quantiles has been recognized and different definitions have been compared in terms of satisfying some desirable properties, but no consensus has been found. We outline here that comparisons of the sample quantile definitions are irrelevant because the probabilities associated with order-ranked sample values are known exactly. Accordingly, the standard definition for sample quantiles should be based on the true rank probabilities. We show that this allows more accurate inference of the tails of the distribution, and thus improves estimation of the probability of extreme events.

1. Introduction

The quantile $Q(p)$ of a continuous, strictly monotonic distribution function $F$ is defined as $Q(p) = F^{-1}(p)$, where $p$ is the probability of nonexceedance of a variable value. When the distribution is unknown, sample quantiles provide estimators of their population counterparts based on a set of $n$ independent order-ranked observations $x_1 \le x_2 \le \dots \le x_n$. The associated sample probabilities are then $p_m = P(x \le x_m)$, $m = 1, \dots, n$, where $P(x \le x_m)$ is the probability of a new sampled value $x$ being less than or equal to $x_m$. These nonexceedance probabilities are those defining the cumulative distribution function (CDF).

Many different formulas for defining sample quantiles have been used in literature and statistical software. This has caused considerable confusion, in particular when performing extreme value analysis for various applications where probabilities of rare events need to be estimated. In a widely cited article Hyndman and Fan [1] identified this problem and emphasized that there is a need to adopt a standard definition for sample quantiles. The same problem was discussed again by Langford [2] who identified twelve different sample quantile definitions that are used in statistical software.

Hyndman and Fan [1] analysed nine different sample quantile definitions. They selected six “desirable properties” for an estimator of a sample quantile and considered how well the different definitions satisfy them. This approach is similar to judging plotting position estimators by five “postulates,” as done by Gumbel [3], and by three “purposes,” as done by Kimball [4]. Hyndman and Fan [1] proposed that $p_m = (m - 1/3)/(n + 1/3)$ be used as the basis of the standard definition.

However, the definition of the quantile has not yet been standardized. Modern statistical software, such as Matlab, Excel, SciPy, and STATA, includes different definitions and offers user-selected options for the formulation of the quantile function, as well as for plotting positions in quantile plots and quantile-quantile plots; see, for example, Castillo-Gutiérrez et al. [5]. The inability to agree on a standard definition has arisen from the many proposals [6–8] and the subjective nature of the “criteria” and “desired properties.”

Since the quantile function is the inverse of the cumulative distribution function, the quality of its definition must be judged by how close the probabilities defined by it are to the true probabilities of the cumulative distribution function. Thus, the definition of a sample quantile function should be based on the true nonexceedance probabilities. It was pointed out by Makkonen [9] that, for order-ranked data, these probabilities are known exactly. We outline here two rigorous proofs of this conclusion and show how the appropriate definition for the sample quantile function follows from it.

2. Sample Probabilities

In the following, we present two deductions of the probability $P(x \le x_m)$.

Consider in Figure 1 an order-ranked sample (a) of $n$ random observations $x_1 \le \dots \le x_n$ (white circles) and a new observation $x$ (grey circle) sampled randomly from the population, the distribution of which is unknown. In the new sample (b), obtained by including the new observation, the new value $x$ may fall in any interval of the original sample or be smaller than $x_1$ or larger than $x_n$. In the new sample of $n + 1$ values, each observation has the same probability of being the smallest one. In particular, $P(x \le x_1) = 1/(n+1)$.

In the same way, each observation has the same probability of being the $m$th in order, where $m = 1, \dots, n+1$. In particular, the probability that $x$ is the $m$th in order is $1/(n+1)$. Consequently, the probability of the value $x$ being smaller than or equal to the $m$th value $x_m$ of the original sample equals $m/(n+1)$.
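The counting argument above can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes an exponential parent distribution (any continuous distribution should behave the same way), and the function name `rank_probability` is hypothetical.

```python
import random


def rank_probability(n, m, trials=50_000, seed=1):
    """Estimate P(x <= x_m): a new value x against the m-th smallest of n observations."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Order-ranked sample of n observations from the (assumed) exponential parent.
        sample = sorted(rng.expovariate(1.0) for _ in range(n))
        # One additional, independent observation from the same parent.
        x_new = rng.expovariate(1.0)
        if x_new <= sample[m - 1]:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    n = 9
    for m in (1, 5, 9):
        # Simulated frequency next to the value m / (n + 1) deduced in the text.
        print(m, rank_probability(n, m), m / (n + 1))
```

The simulated frequencies should settle near $m/(n+1)$ regardless of the parent distribution chosen, which is the point of the deduction.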

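The result deduced above can also be derived by formal mathematics [10]. Consider a variate $x$ with cumulative distribution function $F(x)$ and a sample of $n$ observations ranked in ascending order. The values $x_m$ in different samples of size $n$ are random values of a variate $x_m$, for which the probability density function in terms of $F$ and its derivative $f$ is given by [3, 11]

$$f_m(x_m) = m \binom{n}{m} F(x_m)^{m-1} \left[1 - F(x_m)\right]^{n-m} f(x_m). \tag{1}$$

We wish to associate a probability $P(x \le x_m)$ with each observed rank $m$. The precise meaning of $P(x \le x_m)$ is illustrated in Figure 2. Since the probability of the event $x \le x_m$ is controlled by two variates, $x$ and $x_m$, the probability is obtained by integrating the joint density function of $x$ and $x_m$ over the area where $x \le x_m$.

Due to the mutual independence of $x$ and $x_m$, their joint density function equals $f(x) f_m(x_m)$, where $f$ and $f_m$ are the density functions of $x$ and $x_m$, respectively. The nonexceedance probability is, therefore, obtained by integrating the joint density function over the zone where $x \le x_m$:

$$P(x \le x_m) = \int_{-\infty}^{\infty} \int_{-\infty}^{x_m} f(x)\, f_m(x_m)\, dx\, dx_m = \int_{-\infty}^{\infty} F(x_m)\, f_m(x_m)\, dx_m, \tag{2}$$

$$P(x \le x_m) = m \binom{n}{m} \int_0^1 u^{m} (1-u)^{n-m}\, du = m \binom{n}{m} B(m+1,\, n-m+1) = \frac{m}{n+1}, \tag{3}$$

where the substitution $u = F(x_m)$ has been used. The last step in the deduction above is based on direct application of Euler’s $B$-function, $B(m+1,\, n-m+1) = m!\,(n-m)!/(n+1)!$.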
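The final step of the formal deduction can be verified in exact arithmetic. The sketch below assumes the standard density of the $m$th order statistic, so that after substituting $u = F(x_m)$ the probability becomes $m \binom{n}{m} \int_0^1 u^m (1-u)^{n-m}\,du = m \binom{n}{m} B(m+1, n-m+1)$. The helper names `beta_int` and `rank_probability_exact` are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def beta_int(a, b):
    """Euler beta function B(a, b) for positive integers a and b, as an exact fraction."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def rank_probability_exact(n, m):
    """m * C(n, m) * B(m + 1, n - m + 1), the integral of u^m (1 - u)^(n - m) scaled by the density constant."""
    return m * comb(n, m) * beta_int(m + 1, n - m + 1)


if __name__ == "__main__":
    n = 10
    for m in range(1, n + 1):
        # Each value should reduce exactly to m / (n + 1).
        print(m, rank_probability_exact(n, m), Fraction(m, n + 1))
```

Because the computation uses exact fractions, the agreement with $m/(n+1)$ is an identity for the tested values rather than a numerical approximation.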

3. Plotting Positions

One might expect that, based on (3) and the definition of the CDF, $m/(n+1)$ would have been the sample probability used by everybody. Unfortunately, this is not so, as discussed, for example, in the reviews [12, 13]. Even though $m/(n+1)$ was already recommended by Weibull [7] and has been used by numerous researchers since the 1950s, a great deal of research has aimed at “improving” the probability by using probability estimates called plotting positions of the form $(m - a)/(n + b)$, where $a$ and $b$ are constants depending on the type of parent distribution, the size of the sample, and so forth. These attempts include, for example, Benard and Bos-Levenbach [14], Blom [6], Langbein [15], Gringorten [8], Wilk and Gnanadesikan [16], Barnett [17], Cunnane [18], Guo [19], Jones [20], Yu and Huang [21], and Folland and Anderson [22]. Adding to this complexity, even numerical methods to calculate plotting positions have been proposed [23].
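To make the family of plotting positions concrete, the following sketch evaluates $(m - a)/(n + b)$ for a few of the proposals named in this paper. The constants are the values commonly quoted for each proposal and are an assumption here, since the text does not list them; the names `PLOTTING_CONSTANTS` and `plotting_positions` are hypothetical.

```python
from fractions import Fraction

# Constants (a, b) in (m - a) / (n + b); commonly quoted values, assumed here.
PLOTTING_CONSTANTS = {
    "Weibull": (Fraction(0), Fraction(1)),
    "Blom": (Fraction(3, 8), Fraction(1, 4)),
    "Gringorten": (Fraction(44, 100), Fraction(12, 100)),
    "Cunnane": (Fraction(2, 5), Fraction(1, 5)),
    "Hyndman-Fan": (Fraction(1, 3), Fraction(1, 3)),
}


def plotting_positions(n, name):
    """Return the plotting positions for ranks m = 1..n as exact fractions."""
    a, b = PLOTTING_CONSTANTS[name]
    return [(m - a) / (n + b) for m in range(1, n + 1)]


if __name__ == "__main__":
    n = 10
    for name in PLOTTING_CONSTANTS:
        pos = plotting_positions(n, name)
        # Lowest and highest rank, where the proposals differ most.
        print(f"{name:12s} m=1: {float(pos[0]):.4f}  m=n: {float(pos[-1]):.4f}")
```

Only the Weibull row reproduces $m/(n+1)$; the other rows shift the lowest rank down and the highest rank up.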

The main reason for this trend appears to be a misunderstanding of the role of the mean value, both in theoretical considerations and when interpreting the results of Monte Carlo simulations. This is shown in the Appendix, which also demonstrates the ideal performance of the Weibull positions in one special case using the bin frequency criterion introduced by Makkonen et al. [24]. According to the bin frequency criterion, the method of least squares (MLS) using the Weibull probabilities gives a better distribution function than MLS using any of the plotting positions of Cunnane, Gringorten, and Blom.

4. Sample Quantiles

It is apparent that even if there may be some reasons, for example, bias in the MLS, to use linear transformations of $m/(n+1)$ in curve fitting, that is, plotting positions of the type of Cunnane, Blom, Gringorten, Hyndman and Fan, and so forth, none of them is valid when the sample probabilities are plotted. The definition of a sample quantile does not need to be based on any “estimator” of $P(x \le x_m)$. It can be defined objectively based on the true nonexceedance probability of a new value drawn from the population. For variable values in between the observed values, one may interpolate linearly. Accordingly, we define the following.

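For $m/(n+1) \le p \le (m+1)/(n+1)$, $m = 1, \dots, n-1$, linear interpolation between the observed values gives

$$Q(p) = x_m + \left[(n+1)p - m\right]\left(x_{m+1} - x_m\right). \tag{5}$$

For $p < 1/(n+1)$ and $p > n/(n+1)$, which lie outside the range of the sample probabilities, $Q(p)$ is defined separately. The definition of the quantile function in (5) is illustrated in Figure 3.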
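The interpolation between observed values can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' reference implementation: the observed value $x_m$ is placed at probability $m/(n+1)$ and values in between are interpolated linearly. The behaviour for $p$ below $1/(n+1)$ or above $n/(n+1)$ is an assumption of this sketch, which simply refuses to extrapolate. The function name `sample_quantile` is hypothetical.

```python
def sample_quantile(data, p):
    """Quantile with x_m placed at m / (n + 1) and linear interpolation in between."""
    xs = sorted(data)
    n = len(xs)
    h = (n + 1) * p  # fractional rank corresponding to probability p
    eps = 1e-12      # tolerance for floating-point round-off at the end points
    if h < 1 - eps or h > n + eps:
        # Assumption of this sketch: no extrapolation beyond the observed ranks.
        raise ValueError("p must lie in [1/(n+1), n/(n+1)]")
    h = min(max(h, 1.0), float(n))
    m = int(h)  # integer part of the fractional rank, 1 <= m <= n
    if m == n:
        return xs[-1]
    return xs[m - 1] + (h - m) * (xs[m] - xs[m - 1])


if __name__ == "__main__":
    data = [40, 10, 30, 20]
    for p in (0.2, 0.5, 0.8):
        print(p, sample_quantile(data, p))
```

With four observations the admissible range is $0.2 \le p \le 0.8$, and the median falls midway between the two central values.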

The definition in (5) has all the desired properties of Hyndman and Fan [1]. The main advantage of (5) over all other suggestions for the sample quantile definition is that it is based on the true probabilities. This makes it possible to standardize the sample quantile function in statistical software on a sound theoretical basis instead of on subjective criteria.

Using any other definition, for example, the one promoted by Langford [2], as an estimate of the probability $P(x \le x_m)$ results in significant relative errors, particularly at the tails of the distribution. This is illustrated in Figure 4, which shows the relative errors for different ranks $m$. The relative error in the exceedance probability is defined as the deviation of the estimated exceedance probability from the true exceedance probability $1 - m/(n+1)$, divided by the true value. Note that, at the upper tail of the distribution, the relative error remains large even when the sample size goes to infinity. The error of using an improper estimate of the probability as the plotting position is particularly persistent in extreme value analysis methods [13]. This confusion alone justifies the definition as proposed here.
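The persistence of the tail error can be illustrated with exact arithmetic. In the sketch below the relative error is taken as (estimated exceedance probability minus true exceedance probability) divided by the true exceedance probability $1 - m/(n+1)$; the sign convention and the function name `relative_error_exceedance` are assumptions. The constants $a = b = 1/3$ correspond to the definition proposed by Hyndman and Fan.

```python
from fractions import Fraction


def relative_error_exceedance(n, m, a, b):
    """Relative error of the exceedance probability when (m - a)/(n + b) replaces m/(n + 1)."""
    true_exceedance = 1 - Fraction(m, n + 1)
    estimated_exceedance = 1 - (m - a) / (n + b)
    return (estimated_exceedance - true_exceedance) / true_exceedance


if __name__ == "__main__":
    third = Fraction(1, 3)
    for n in (10, 100, 10_000):
        # Highest rank m = n: the error does not vanish as n grows.
        err = relative_error_exceedance(n, n, third, third)
        print(n, float(err))
```

For the highest rank the error equals $(1 - n)/(3n + 1)$ with these constants, so it tends to $-1/3$ instead of zero as $n$ increases.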

5. Conclusions

The sample quantiles can be defined by the true nonexceedance probabilities of the order-ranked sample values. Following the basic principles of the probability calculus, the sample quantile function can, therefore, be defined by (5).

This definition removes the methodological uncertainty related to calculating sample quantiles and should be adopted as a standard in statistical software.

The claim that the true nonexceedance probabilities, so-called Weibull plotting positions, result in a biased estimate of a CDF is shown to be false and founded on a misunderstanding in theoretical considerations and when interpreting the results of Monte Carlo simulations.

The definition of the quantile function proposed here should, of course, be applied to its inverse function, the empirical distribution function (EDF), as well, and used in inference from data. This is particularly important in extreme value analysis, where probabilities of rare events need to be estimated.

Appendix

Evaluating Plotting Positions

Consider a normally distributed variate $x$ with mean 0, standard deviation 1, and distribution function $F$. Take the values $x_m = F^{-1}(m/(n+1))$, $m = 1, \dots, n$. Assume that $F^{-1}$ represents the function which transforms the probabilities to the probability paper; that is, all points $(x_m, F^{-1}(m/(n+1)))$ fall on the same straight line, which represents the cumulative distribution function (CDF). The slope and intercept of the straight line are independent of $n$.

Plot next the points $(x_m, F^{-1}(p'_m))$ on the same probability paper. Here, the values $x_m$ are the same as those above, but the probabilities $p'_m$ represent Blom’s [6] plotting positions $(m - 3/8)/(n + 1/4)$, which were developed for the normal distribution. With increasing $n$ the resulting curve approaches the correct straight line, but the two never coincide, as illustrated in Figure 5. In this way we have detected samples which are correctly represented by the Weibull plotting positions and incorrectly by any other. Vice versa, by choosing the constants of the plotting positions suitably, we get a curve which a linear regression can force to compensate for an error somewhere else. This is exactly what has been done in the history of plotting positions. The reason for and the nature of the error are characterized in the following.
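The construction above can be reproduced numerically. The sketch below assumes that the sample values are taken at the Weibull probabilities, $x_m = F^{-1}(m/(n+1))$, and that normal probability paper corresponds to the transform $F^{-1}$, so the Weibull points lie on the line $y = x$. It then reports how far the points plotted at Blom's positions $(m - 3/8)/(n + 1/4)$ deviate from that line. The function name `blom_deviation` is hypothetical.

```python
from statistics import NormalDist


def blom_deviation(n):
    """Vertical distance of the Blom points from the straight line y = x on normal paper."""
    nd = NormalDist()  # standard normal: mean 0, standard deviation 1
    deviations = []
    for m in range(1, n + 1):
        x_m = nd.inv_cdf(m / (n + 1))                 # value placed at the Weibull probability
        y_blom = nd.inv_cdf((m - 0.375) / (n + 0.25))  # ordinate given by Blom's position
        deviations.append(y_blom - x_m)
    return deviations


if __name__ == "__main__":
    for n in (10, 100, 1000):
        d = blom_deviation(n)
        # Deviation at the lowest and the highest rank.
        print(n, round(d[0], 4), round(d[-1], 4))
```

The deviation is antisymmetric about the median and largest at the extreme ranks; it shrinks slowly as $n$ grows but does not vanish for any finite $n$.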

First, the reasons why the Weibull [7] plotting positions have been objected to in the literature are discussed.

(1) Gumbel showed that the expected value of the probability $F(x_m)$ is $m/(n+1)$, and, for example, Langbein [15] argued that the probability of the next observation falling in the interval $(x_m, x_{m+1})$ is $1/(n+1)$, but these observations have not been regarded as rigorous justifications for using the Weibull positions. It is likely that the terminology used, for example, by Chow [25] (mean number of exceedances in future trials) and by Langbein [15] (mean value of exceedance probabilities), has not been understood as giving an ordinate on the CDF. A rigorous proof of $P(x \le x_m) = m/(n+1)$ was presented in the textbook by Madsen et al. [11], but it has not received much attention in later research.

(2) Some researchers have thought of the sample values as given and the probabilities associated with them as random, whereas the correct view is the reverse: the sample probabilities are exact and the sample values associated with them are variates. Some others, for example, Benson [26], appear to have fully understood that plotting positions different from those of Weibull are not probabilities. Nevertheless, for example, Cunnane [18] claimed that it is not necessary to use the true probabilities because the final result, that is, the resulting regression line, is decisive. Cunnane [18] tried to estimate the probability $F(E(x_m))$ and used it as the plotting position when determining the CDF. In other words, he used $F(E(x_m))$ to represent $P(x \le x_m)$, although they are different concepts. It is not at all surprising that the Weibull positions corresponding to the latter probability are not representative of the former.

(3) In Monte Carlo simulations using the Weibull positions, conventional curve fitting procedures, like the MLS, tend to result in parameter estimates whose means do not coincide with the parameters of the distribution from which the samples are taken.

It has been observed that the difference can be reduced by transforming the Weibull points linearly, that is, by replacing the probabilities $m/(n+1)$ by $(m - a)/(n + b)$. Geometrically, this simply means that, to improve the fit, the straight line on the probability paper resulting from linear regression is replaced by another straight line. In more detail, when the points plotted at $m/(n+1)$ are replaced by points plotted at $(m - a)/(n + b)$, the linearity is also affected, as illustrated in Figure 5. This effect remains hidden, however, because the linear regression forces the fitted curve to be linear. The behaviour described above has often been explained by stating that the Weibull probabilities are “biased” or more “biased” than some other probabilities. This is a misuse of terminology. The bias is defined as $E(\hat{\theta}) - \theta$, where $E(\hat{\theta})$ is the expected value of an estimator determined from a sample and $\theta$ is the correct parameter value. The Weibull probabilities are exact values in the same way as $1/2$ is the probability of heads when tossing a coin. There is no need for estimation here.

In contrast to the abovementioned arguments against the Weibull positions, the bias in the parameters resulting from traditional Monte Carlo simulations using the Weibull positions is, in fact, attributable to taking the mean of the parameter estimates and to the curve fitting method. From mathematics we know that if $y$ and $x$ are nonlinearly related as $y = h(x)$, it follows that, in general, $E(y) \neq h(E(x))$. From elementary statistics we know that the sample variance is a biased estimate of the population variance. Why should we then believe that the mean of standard deviations, or of any other distribution parameters obtained from successive samples, would approach the parameter of the population? On probability paper, the slope of the regression line represents $\Delta g(P)/\Delta x$, where $g$ stands for a proper nonlinear transformation of the probability axis. Thus, there is no a priori reason to expect that a mean of sample slopes in MC simulations represents something relevant in the probabilistic sense.
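The effect of averaging a nonlinearly transformed estimate can be seen in a small simulation. This is a minimal sketch under assumed settings (standard normal parent, samples of size 5, the usual $n - 1$ variance estimator); the function name `mean_of_estimates` is hypothetical. The mean of the sample variances approaches the population variance, while the mean of their square roots does not approach the population standard deviation.

```python
import random


def mean_of_estimates(n=5, trials=20_000, seed=3):
    """Return (mean of sample variances, mean of sample standard deviations)."""
    rng = random.Random(seed)
    total_var = 0.0
    total_sd = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        # Variance estimate with the n - 1 denominator.
        var = sum((x - mean) ** 2 for x in sample) / (n - 1)
        total_var += var
        total_sd += var ** 0.5  # nonlinear transformation of the variance estimate
    return total_var / trials, total_sd / trials


if __name__ == "__main__":
    mean_var, mean_sd = mean_of_estimates()
    print("mean of variances:          ", round(mean_var, 3))
    print("mean of standard deviations:", round(mean_sd, 3))
```

The population variance and standard deviation are both 1 here, yet only the first average lands near 1, which is the kind of discrepancy that arises purely from taking means.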

Consequently, the convergence of the mean of successive parameter estimates to the correct parameter value cannot be regarded as a goodness criterion for plotting positions. We should use a criterion based on the bin frequency instead, because it is the frequency by which probability is defined. From a parent distribution with given parameters, take a sample of size $n$, find the estimated straight line, take from the parent distribution one additional random value $x$, record the bin to which $x$ belongs, and repeat the steps. A uniform distribution of hits over the bins means that the method has been successful. The fit on each bin can be considered separately using criterion (A.1), or the fit of the whole distribution can be considered using criterion (A.2), where $N$ is the number of simulations and $N_i$ is the number of hits to bin $i$.

Such an analysis was made by Makkonen et al. [24]. The simulations verified that the Weibull positions give the most accurate estimate in the sense of criterion (A.2) for all considered distributions, that is, for the Gumbel, Weibull, normal, and lognormal distributions. Another simulation for the Gumbel distribution with mean = 5 and standard deviation = 2 was carried out using the Weibull and Gringorten plotting positions, and its results are presented in Figure 6. Sample size 2 was chosen to eliminate the possible bias due to the linear regression. The results confirm the performance of the Weibull positions in the same way as a nearly uniform distribution of outcomes confirms that a die is fair. The erroneous deduction based on calculating the mean of the estimated distribution parameters results in the traditional (and incorrect) conclusion that the Weibull positions are worse than those proposed by Gringorten [8]. The source of this misunderstanding is demonstrated in Table 1.
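A simulation of this kind can be sketched as follows. Several details are assumptions of the sketch, not taken from the paper: the bins are taken as equal-probability intervals of the fitted distribution, three bins are used, and the Gringorten constants are the commonly quoted $(m - 0.44)/(n + 0.12)$. With sample size 2 the straight line on Gumbel probability paper passes exactly through both plotted points, so no regression is needed. The names `gumbel_params` and `bin_hits` are hypothetical.

```python
import bisect
import math
import random

EULER_GAMMA = 0.5772156649015329


def gumbel_params(mean, std):
    """Location and scale of a Gumbel distribution with the given mean and standard deviation."""
    scale = std * math.sqrt(6) / math.pi
    return mean - EULER_GAMMA * scale, scale


def bin_hits(p1, p2, bins=3, trials=30_000, seed=7):
    """Fraction of new values falling in each equal-probability bin of the fitted CDF."""
    rng = random.Random(seed)
    loc, scale = gumbel_params(5.0, 2.0)

    def draw():
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        return loc - scale * math.log(-math.log(u))

    def paper(p):
        return -math.log(-math.log(p))  # Gumbel probability paper transform

    y1, y2 = paper(p1), paper(p2)
    edges = [paper(k / bins) for k in range(1, bins)]  # bin limits on the paper scale
    hits = [0] * bins
    for _ in range(trials):
        x1, x2 = sorted((draw(), draw()))
        x_new = draw()
        # Ordinate of the new value on the line through (x1, y1) and (x2, y2).
        y_new = y1 + (x_new - x1) * (y2 - y1) / (x2 - x1)
        hits[bisect.bisect_right(edges, y_new)] += 1
    return [h / trials for h in hits]


if __name__ == "__main__":
    print("Weibull:   ", bin_hits(1 / 3, 2 / 3))
    print("Gringorten:", bin_hits(0.56 / 2.12, 1.56 / 2.12))
```

Under these assumptions the Weibull positions should give nearly equal hit frequencies in all bins, whereas the Gringorten positions leave the middle bin underpopulated.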

The discussion above shows that all the claims presented against the Weibull plotting positions are unfounded. In particular, the recent Monte Carlo simulations supporting these claims [27–33] are based on a misunderstood role of the mean of the sample parameters. The performance of a single fitted curve should always be compared with the Weibull probabilities plotted against the observed order-ranked values. Plotting positions different from those of Weibull should not be used to eliminate the error observed when a mean of the sample estimates is taken in Monte Carlo simulations, because such an error never occurs in a practical situation where we have only one sample and one estimate for each parameter.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work was supported by the Academy of Finland via the FICCA programme.