We provide bounds for tail probabilities of the sample variance. The bounds are expressed in terms of Hoeffding functions and are the sharpest known. They are designed with applications in auditing and in the processing of environmental data in mind.
1. Introduction and Results
Let be a random sample of independent identically distributed observations. Throughout we write
for the mean, variance, and the fourth central moment of , and assume that . Some of our results hold only for bounded random variables. In such cases without loss of generality we assume that . Note that is a natural condition in audit applications.
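The sample variance $s^2$ of the sample $X_1, \dots, X_n$ is defined as
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2, \tag{1.2}$$
where $\bar{X}$ is the sample mean, $\bar{X} = (X_1 + \dots + X_n)/n$. We can rewrite (1.2) as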
We are interested in deviations of the statistic from its mean , that is, in bounds for the tail probabilities of the statistic ,
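The definition of the sample variance and its rewriting can be checked numerically. The sketch below assumes the usual unbiased normalization $1/(n-1)$ and shows two standard equivalent forms (averaged squared pairwise differences, and squares minus cross products); which of these forms the rewriting of (1.2) uses is an assumption, and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def sample_variance(xs):
    """Unbiased sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_pairwise(xs):
    """Same statistic as a sum of squared pairwise differences over n(n - 1)."""
    n = len(xs)
    return sum((a - b) ** 2 for a, b in combinations(xs, 2)) / (n * (n - 1))


def sample_variance_products(xs):
    """Same statistic via squares and cross products x_i * x_k with i != k."""
    n = len(xs)
    squares = sum(x * x for x in xs)
    total = sum(xs)
    cross = total * total - squares  # sum over ordered pairs i != k
    return squares / n - cross / (n * (n - 1))


# Exact rational data in [0, 1], so the three forms can be compared with ==.
data = [Fraction(1, 10), Fraction(2, 5), Fraction(2, 5), Fraction(9, 10)]
print(sample_variance(data), sample_variance_pairwise(data), sample_variance_products(data))
```

The pairwise form makes it visible that the statistic is a U-statistic of degree two, which is what makes a reduction to sums of independent random variables possible.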
The paper is organized as follows. In the introduction we give a description of bounds, some comments, and references. In Section 2 we obtain sharp upper bounds for the fourth moment. In Section 3 we give proofs of all facts and results from the introduction.
If , then the range of interest in (1.5) is , where
The restriction on the range of in (1.4) (resp., in (1.5) in cases where the condition is fulfilled) is natural. Indeed, for , due to the obvious inequality . Furthermore, in the case of we have for since (see Proposition 2.3 for a proof of the latter inequality).
The asymptotic (as ) properties of (see Section 3 for proofs of (1.7) and (1.8)) can be used to test the quality of bounds for tail probabilities. Under the condition the statistic is asymptotically normal provided that is not a Bernoulli random variable symmetric around its mean. Namely, if , then
If (which happens if and only if is a Bernoulli random variable symmetric around its mean), then asymptotically has type distribution, that is,
where is a standard normal random variable, and is the standard normal distribution function.
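The dichotomy between the normal and the degenerate case can be illustrated with exact arithmetic. The sketch assumes the classical limit theorem for the sample variance, in which $\sqrt{n}\,(s^2 - \sigma^2)$ is asymptotically normal with variance $\omega - \sigma^4$, where $\omega$ denotes the fourth central moment (the symbols and function names are assumptions made for this illustration). The difference $\omega - \sigma^4$ vanishes exactly when $(X - \mu)^2$ is almost surely constant, that is, for a two-point law symmetric around its mean.

```python
from fractions import Fraction


def central_moments(values, probs):
    """Return (mean, variance, fourth central moment) of a discrete law."""
    mean = sum(v * p for v, p in zip(values, probs))
    var = sum((v - mean) ** 2 * p for v, p in zip(values, probs))
    fourth = sum((v - mean) ** 4 * p for v, p in zip(values, probs))
    return mean, var, fourth


def limiting_variance(values, probs):
    """omega - sigma^4: variance in the classical normal limit of sqrt(n) (s^2 - sigma^2)."""
    _, var, fourth = central_moments(values, probs)
    return fourth - var ** 2


half = Fraction(1, 2)
# Symmetric Bernoulli law: the normal limit degenerates.
symmetric = limiting_variance([0, 1], [half, half])
# Non-symmetric Bernoulli law with P(X = 1) = 1/4: strictly positive limiting variance.
skewed = limiting_variance([0, 1], [Fraction(3, 4), Fraction(1, 4)])
print(symmetric, skewed)
```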
Let us recall the known bounds for the tail probabilities of the sample variance (see (1.19)–(1.21)). We need notation related to certain functions going back to Hoeffding [1]. Let and . Write
For we define . For we set . Note that our notation for the function is slightly different from the traditional one. Let . Introduce as well the function
and for . One can check that
All our bounds are expressed in terms of the function . Using (1.11), it is easy to replace them by bounds expressed in terms of the function , and we omit related formulations.
Let and . Assume that
Let be a Bernoulli random variable such that and . Then and . The function is related to the generating function (the Laplace transform) of binomial distributions since
where are independent copies of . Note that (1.14) is an obvious corollary of (1.13). We omit elementary calculations leading to (1.13). In a similar way
where is a Poisson random variable with parameter .
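The link between the Hoeffding function and the Laplace transform of a two-point law can be verified numerically. The sketch below is written under explicit assumptions, since the displayed formulas are not reproduced here: it takes the classical form $H(x; p) = (1 + qx/p)^{-p - qx} (1 - x)^{-q + qx}$ with $q = 1 - p$, and a centered two-point variable taking the value $1$ with probability $p$ and $-p/q$ with probability $q$. The function names are hypothetical. By independence, the bound for a sum of $n$ copies is the $n$-th power of the one-variable infimum computed here.

```python
import math


def hoeffding_H(x, p):
    """Assumed classical Hoeffding function; equals 1 for x <= 0 and 0 for x > 1."""
    q = 1.0 - p
    if x <= 0:
        return 1.0
    if x > 1:
        return 0.0
    if x == 1:
        return p  # limiting value: probability of the atom at 1
    return (1 + q * x / p) ** (-p - q * x) * (1 - x) ** (-q + q * x)


def log_objective(h, x, p):
    """log of exp(-h x) * E exp(h eps) for the assumed two-point variable eps."""
    q = 1.0 - p
    return -h * x + math.log(p * math.exp(h) + q * math.exp(-h * p / q))


def chernoff_infimum(x, p, h_max=50.0, iterations=200):
    """Minimize the convex log-objective over 0 <= h <= h_max by ternary search."""
    lo, hi = 0.0, h_max
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if log_objective(m1, x, p) < log_objective(m2, x, p):
            hi = m2
        else:
            lo = m1
    return math.exp(log_objective((lo + hi) / 2, x, p))


for x, p in [(0.5, 0.5), (0.1, 0.3), (0.3, 0.05)]:
    print(x, p, hoeffding_H(x, p), chernoff_infimum(x, p))
```

The ternary search is used only as an independent numerical check; the minimizing $h$ is also available in closed form, $h = q \log\bigl((p + qx)/(p(1 - x))\bigr)$.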
The functions and satisfy a kind of Central Limit Theorem. Namely, for given and we have
(we omit elementary calculations leading to (1.16)). Furthermore, we have [1]
and we also have [2]
Using the introduced notation, we can recall the known results (see [2, Lemma ]). Let be the integer part of . Assume that . If is known, then
The right-hand side of (1.19) is an increasing function of (see Section 3 for a short proof of (1.19) as a corollary of Theorem 1.1). If is unknown but is known, then
Using the obvious estimate , the bound (1.20) is implied by (1.19). In cases where both and are not known, we have
as follows from (1.19) using the obvious bound .
Note that the known bounds (1.19)–(1.21) are the best possible within an approach based on analysis of the variance, the use of exponential functions, and an inequality of Hoeffding (see (3.3)), which allows one to reduce the problem to the estimation of tail probabilities for sums of independent random variables. Our improvement is due to a careful analysis of the fourth moment, which turns out to be quite complicated; see Section 2. Briefly, the results of this paper are the following: we prove a general bound involving , , and the fourth moment ; this general bound implies all other bounds, in particular a new precise bound involving and ; we also provide bounds for lower tails ; we compare the bounds analytically, mostly when is sufficiently large.
From the mathematical point of view the sample variance is one of the simplest nonlinear statistics. Known bounds for tail probabilities were designed with linear statistics in mind, possibly also for dependent observations. See the seminal paper of Hoeffding [1] published in JASA. For further developments see Talagrand [3], Pinelis [4, 5], Bentkus [6, 7], Bentkus et al. [8, 9], and so forth. Our intention is to develop tools useful in the setting of nonlinear statistics, using the sample variance as a test statistic.
Theorem 1.1 extends and improves the known bounds (1.19)–(1.21). We can derive (1.19)–(1.21) from this theorem since we can estimate the fourth moment via various combinations of and using the boundedness assumption .
Theorem 1.1. Let and .
If and , then
with
If and , then
with
Both bounds and are increasing functions of , and .
Remark 1.2. In order to derive upper confidence bounds we need only estimates of the upper tail (see [2]). To estimate the upper tail the condition is sufficient. The lower tail has a different type of behavior since to estimate it we indeed need the assumption that is a bounded random variable.
For Theorem 1.1 implies the known bounds (1.19)–(1.21) for the upper tail of . It also implies the bounds (1.26)–(1.29) for the lower tail. The lower tail has a somewhat more complicated structure (cf. (1.26)–(1.29) with their counterparts (1.19)–(1.21) for the upper tail).
If is known, then
One can show (we omit details) that the bound is not an increasing function of . A slightly rougher inequality
has the monotonicity property since is an increasing function of . If is known, then using the obvious inequality , the bound (1.27) yields
If we have no information about and , then using , the bound (1.27) implies
The bounds above do not cover the situation where both and are known. To formulate a related result we need additional notation. In case of we use the notation
In view of the well-known upper bound for the variance of , we can partition the set
of possible values of and into a union of three subsets
and ; see Figure 1.
Figure 1: .
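The set of admissible pairs of mean and variance can be explored with a small exact computation. The sketch assumes that the well-known variance bound referred to above is $\sigma^2 \le \mu (1 - \mu)$ for a random variable with values in $[0, 1]$, with equality exactly for Bernoulli laws on $\{0, 1\}$; the function names are hypothetical.

```python
from fractions import Fraction


def mean_and_variance(values, probs):
    """Mean and variance of a discrete law with the given atoms and probabilities."""
    mean = sum(v * p for v, p in zip(values, probs))
    var = sum((v - mean) ** 2 * p for v, p in zip(values, probs))
    return mean, var


def is_admissible(mean, var):
    """Check 0 <= var <= mean * (1 - mean), the assumed shape of the admissible set."""
    return 0 <= var <= mean * (1 - mean)


# A three-point law on {0, 1/2, 1}: strictly inside the admissible set.
inner = mean_and_variance([0, Fraction(1, 2), 1],
                          [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)])
# A Bernoulli law on {0, 1}: on the boundary, where var = mean * (1 - mean).
boundary = mean_and_variance([0, 1], [Fraction(2, 3), Fraction(1, 3)])
print(inner, is_admissible(*inner))
print(boundary, is_admissible(*boundary))
```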
Theorem 1.3. Write . Assume that .
The upper tail of the statistic satisfies
with , where
and where one can write
The lower tail of satisfies
with , where , and is defined by (1.34).
The bounds above are obtained using the classical transform ,
of survival functions (cf. definitions (1.13) and (1.14) of the related Hoeffding functions). The bounds expressed in terms of Hoeffding functions have a simple analytical structure and are easily numerically computable.
All our upper and lower bounds satisfy a kind of Central Limit Theorem. Namely, if we consider an upper bound, say (resp., a lower bound ) as a function of , then there exist limits
with some positive and . The values of and can be used to compare the bounds: the larger these constants, the better the bound. To prove (1.38) it suffices to note that with
The Central Limit Theorem in the form of (1.7) restricts the ranges of possible values of and . Namely, using (1.7) it is easy to see that and have to satisfy
We provide the values of these constants for all our bounds and give the numerical values of them in the following two cases.
(i) is a random variable uniformly distributed in the interval . The moments of this random variable satisfy
For defined by (1.41), the constants and we give as .
(ii) is uniformly distributed in , and in this case
For the constants and with defined by (1.42) we give as .
We have
while calculating the constants in (1.44) and (1.46) we choose . The quantity in (1.43) and (1.45) is defined by (1.34).
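The moments of a uniform law that enter such numerical comparisons are easy to reproduce exactly. The sketch below uses the standard closed form for the central moments of the uniform distribution on an interval $[a, b]$: odd central moments vanish, and the $k$-th central moment for even $k$ equals $(b - a)^k / (2^k (k + 1))$. The endpoints used in the example calls are illustrative assumptions, and the function name is hypothetical.

```python
from fractions import Fraction


def uniform_central_moment(a, b, k):
    """k-th central moment of the uniform law on [a, b], computed exactly."""
    if k % 2 == 1:
        return Fraction(0)
    return Fraction(b - a) ** k / (2 ** k * (k + 1))


# Illustrative interval [0, 1]: variance 1/12 and fourth central moment 1/80.
var_unit = uniform_central_moment(0, 1, 2)
fourth_unit = uniform_central_moment(0, 1, 4)

# Illustrative interval [0, 1/2]: both moments scale with powers of the length.
var_half = uniform_central_moment(0, Fraction(1, 2), 2)
fourth_half = uniform_central_moment(0, Fraction(1, 2), 4)

print(var_unit, fourth_unit, fourth_unit - var_unit ** 2)
print(var_half, fourth_half, fourth_half - var_half ** 2)
```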
Conclusions
Our new bounds provide a substantial improvement of the known bounds. However, from the asymptotic point of view these bounds still seem rather crude. To improve the bounds further one needs new methods and approaches. Some preliminary computer simulations show that in applications where is finite and random variables have small means and variances (as in auditing, where a typical value of is ), the asymptotic behavior has little to do with the behavior for small . Therefore bounds specially designed to cover the case of finite have to be developed.
2. Sharp Upper Bounds for the Fourth Moment
Recall that we consider bounded random variables such that , and that we write and . In Lemma 2.1 we provide an optimal upper bound for the fourth moment of given a shift , a mean , and a variance . The maximizers of the fourth moment are either Bernoulli or trinomial random variables. It turns out that their distributions, say , are of the following three types (i)–(iii):
(i) a two-point distribution such that
(ii) a family of three-point distributions depending on such that
where we write
notice that (2.4) supplies a three-point probability distribution only in cases where the inequalities and hold;
(iii) a two-point distribution such that
Note that the point in (2.2)–(2.7) satisfies and that the probability distribution has mean and variance .
Introduce the set
Using the well-known bound valid for , it is easy to see that
Let . We represent the set as a union of three subsets setting
and , where and are given in (2.5). Let us mention the following properties of the regions.
(a) If , then since for such obviously for all . The set is a one-point set. The set is empty.
(b) If , then since for such clearly for all . The set is a one-point set. The set is empty.
For all three regions , , are nonempty sets. The sets and have only one common point , that is, .
Lemma 2.1. Let . Assume that a random variable satisfies
Then
with a random variable satisfying (2.11) and defined as follows:
(i) if , then is a Bernoulli random variable with distribution (2.2);
(ii) if , then is a trinomial random variable with distribution (2.4);
(iii) if , then is a Bernoulli random variable with distribution (2.7).
Proof. Writing , we have to prove that if
then
with . Henceforth we write , so that can assume only the values , , with probabilities , , defined in (2.2)–(2.7), respectively. The distribution is related to the distribution as for all .
Formally in our proof we do not need the description (2.17) of measures satisfying (2.15). However, the description helps to understand the idea of the proof. Let and . Assume that a signed measure of subsets of is such that the total variation measure is a discrete measure concentrated in a three-point set and
Then is a uniquely defined measure such that
satisfy
We omit the elementary calculations leading to (2.17). The calculations are related to solving systems of linear equations.
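The linear system behind such a description can be solved in closed form. The sketch below treats a generic version of the problem: given three distinct points and prescribed values of the total mass (taken to be 1), the first moment, and the second moment, it returns the unique weights via Lagrange interpolation. The particular constraint values are hypothetical and need not coincide with the normalization used in (2.15)–(2.17); the function name is also hypothetical. A negative weight signals that the solution is a signed measure rather than a probability distribution.

```python
from fractions import Fraction


def three_point_weights(points, m1, m2):
    """Weights w on three distinct points with sum w = 1, sum w*x = m1, sum w*x^2 = m2."""
    a, b, c = (Fraction(x) for x in points)
    m1, m2 = Fraction(m1), Fraction(m2)
    # The weight at a equals the integral of the Lagrange basis polynomial
    # (x - b)(x - c) / ((a - b)(a - c)); similarly for b and c.
    wa = (m2 - (b + c) * m1 + b * c) / ((a - b) * (a - c))
    wb = (m2 - (a + c) * m1 + a * c) / ((b - a) * (b - c))
    wc = (m2 - (a + b) * m1 + a * b) / ((c - a) * (c - b))
    return wa, wb, wc


pts = (0, Fraction(1, 2), 1)
# Moments of a genuine probability law on the three points.
proper = three_point_weights(pts, Fraction(1, 2), Fraction(3, 8))
# Second moment too large for a law on [0, 1] with mean 1/2: a weight turns negative.
signed = three_point_weights(pts, Fraction(1, 2), Fraction(3, 5))
print(proper)
print(signed)
```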
Let . Consider the polynomial
It is easy to check that
The proofs of (i)–(iii) differ only in technical details. In all cases we find , , and (depending on , and ) such that the polynomial defined by (2.18) satisfies for , and such that the coefficient in (2.18) vanishes, . Using , the inequality is equivalent to , which obviously leads to . We note that the random variable assumes the values from the set
Therefore we have
which proves the lemma.
(i) Now . We choose and . In order to ensure (cf. (2.19)) we have to take
If , then for all . The inequality is equivalent to
To complete the proof we note that the random variable with defined by (2.2) assumes its values in the set . To find the distribution of we use (2.17). Setting in (2.17) we obtain and , as in (2.2).
(ii) Now or, equivalently, and . Moreover, we can assume that since only for such the region is nonempty. We choose and . Then