Abstract

Fix a base $B > 1$ and let $\zeta$ have the standard exponential distribution; the distribution of digits of $\zeta$ base $B$ is known to be very close to Benford's law. If there exists a $C$ such that the distribution of digits of $C$ times the elements of some set is the same as that of $\zeta$, we say that set exhibits shifted exponential behavior base $B$. Let $X_1, \ldots, X_N$ be i.i.d.r.v. If the $X_i$'s are Unif$(0, L)$, then as $N \to \infty$ the distribution of the digits of the differences between adjacent order statistics converges to shifted exponential behavior. If instead the $X_i$'s come from a compactly supported distribution with uniformly bounded first and second derivatives and a second-order Taylor series expansion at each point, then the distribution of digits of any $N^{\delta}$ consecutive differences and all $N - 1$ normalized differences of the order statistics exhibit shifted exponential behavior. We derive conditions on the probability density which determine whether the distribution of the digits of all the unnormalized differences converges to Benford's law, converges to shifted exponential behavior, or oscillates between the two, and show that the Pareto distribution leads to oscillating behavior.

1. Introduction

Benford's law gives the expected frequencies of the digits in many tabulated data. It was first observed by Newcomb in the 1880s, who noticed that pages of numbers starting with a 1 in logarithm tables were significantly more worn than those starting with a 9. In 1938 Benford [1] observed the same digit bias in a variety of phenomena. From his observations, he postulated that in many datasets, more numbers began with a 1 than with a 9; his investigations (with 20,229 observations) supported his belief. See [2, 3] for a description and history, and [4] for an extensive bibliography.

For any base $B > 1$, we may uniquely write a positive $x \in \mathbb{R}$ as $x = M_B(x) \cdot B^k$, where $k \in \mathbb{Z}$ and $M_B(x)$ (called the mantissa) is in $[1, B)$. A sequence of positive numbers $\{a_n\}$ is Benford base $B$ if the probability of observing a mantissa base $B$ of at most $s$ is $\log_B s$. More precisely, for $s \in [1, B]$, we have
$$\lim_{N \to \infty} \frac{\#\{n \le N : M_B(a_n) \le s\}}{N} \;=\; \log_B s. \tag{1.1}$$
Benford behavior for continuous functions is defined analogously. (If the functions are not positive, we study the distribution of the digits of the absolute value of the function.) Thus, working base 10, we find the probability of observing a first digit of $d$ is $\log_{10}(1 + 1/d)$, implying that about $30\%$ of the time the first digit is a 1.
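To make the digit bias concrete, the following minimal Python sketch tabulates the Benford probabilities base 10 next to empirical frequencies; the helper name and the product-of-uniforms test dataset (a system known to be close to Benford, cf. [13]) are illustrative choices, not constructions from this paper.

    import math
    import random

    def leading_digit(x, base=10):
        # The mantissa M_B(x) lies in [1, B); its integer part is the leading digit.
        mantissa = x / base ** math.floor(math.log(x, base))
        return int(mantissa)

    # Benford's law: Prob(first digit = d) = log_B(1 + 1/d).
    benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

    random.seed(0)
    data = [random.random() * random.random() * random.random()
            for _ in range(100_000)]
    counts = {d: 0 for d in range(1, 10)}
    for x in data:
        counts[leading_digit(x)] += 1

    for d in range(1, 10):
        print(d, round(benford[d], 4), round(counts[d] / len(data), 4))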

We can prove many mathematical systems follow Benford's law, ranging from recurrence relations [5] to $n!$ [6] to iterates of power, exponential, and rational maps, as well as Newton's method [7–9]; to chains of random variables and hierarchical Bayesian models [10]; to values of $L$-functions near the critical line; to characteristic polynomials of random matrix ensembles and iterates of the $3x + 1$ map [11, 12]; as well as to products of random variables [13]. We also see Benford's law in a variety of natural systems, such as atomic physics [14], biology [15], and geology [16]. Applications of Benford's law range from rounding errors in computer calculations (see [17, page 255]) to detecting tax fraud (see [18, 19]) and voter fraud (see [20]).

This work is motivated by two observations (see Remark 1.9 for more details). First, since Benford's seminal paper, many investigations have shown that amalgamating data from different sources leads to Benford behavior; second, many standard probability distributions are close to Benford behavior. We investigate the distribution of digits of differences of adjacent ordered random variables. For any $\delta \in (0, 1)$, if we study at most $N^{\delta}$ consecutive differences of a dataset of size $N$, the resulting distribution of leading digits depends very weakly on the underlying distribution of the data, and closely approximates Benford's law. We then investigate whether or not studying all the differences leads to Benford behavior; this question is inspired by the first observation above, and has led to new tests for data integrity (see [21]). These tests are quick and easy to apply, and have successfully detected problems with some datasets, thus providing a practical application of our main results.

Proving our results requires analyzing the distribution of digits of independent random variables drawn from the standard exponential, and quantifying how close the distribution of digits of a random variable with the standard exponential distribution is to Benford's law. Leemis et al. [22] observed that the standard exponential is quite close to Benford's law; this was proved by Engel and Leuenberger [23], who showed that the maximum difference in the cumulative distribution function from Benford's law (base 10) is at least .029 and at most .03. We provide an alternate proof of this result in the appendix using a different technique, as well as showing that there is no base $B$ such that the standard exponential distribution is Benford base $B$ (Corollary A.2).
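The size of this deviation is easy to probe empirically. The following Monte Carlo sketch (sample size and seed are arbitrary; this illustrates, but of course does not prove, the bound) estimates the maximum gap between the cumulative distribution function of $\log_{10} \zeta \bmod 1$ and the Benford (uniform) one.

    import numpy as np

    rng = np.random.default_rng(0)
    zeta = rng.exponential(scale=1.0, size=2_000_000)
    y = np.sort(np.log10(zeta) % 1.0)       # base-10 logarithms modulo 1

    # Empirical CDF at the sorted points versus the Benford CDF F(z) = z.
    ecdf = np.arange(1, y.size + 1) / y.size
    print("max |F_10(z) - z| ~", np.abs(ecdf - y).max())   # about 0.03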

Both proofs apply Fourier analysis to periodic functions. In [23, equation (5)], the main step is interchanging an integration and a limit. Our proof is based on applying Poisson summation to the derivative of the cumulative distribution function of the logarithms modulo 1, $F_B'(z)$. Benford's law is equivalent to $F_B(z) = z$, which by calculus is the same as $F_B'(z) = 1$ and $F_B(0) = 0$. Thus, studying the deviation of $F_B'(z)$ from $1$ is a natural way to investigate the deviations from Benford behavior. We hope the details of these calculations may be of use to others in investigating related problems (Poisson summation has been fruitfully used by Kontorovich and Miller [11] and Jang et al. [10] in proving many systems are Benford; see also [24]).

1.1. Definitions

A sequence $\{a_n\}_{n=1}^{\infty} \subset [0, 1]$ is equidistributed if
$$\lim_{N \to \infty} \frac{\#\{n \le N : a_n \in [a, b]\}}{N} \;=\; b - a \quad \text{for all } [a, b] \subset [0, 1]. \tag{1.2}$$
Similarly, a continuous random variable with values in $[0, \infty)$, whose probability density function is $p$, is equidistributed modulo $1$ if
$$\lim_{T \to \infty} \frac{\int_0^T \chi_{a,b}(x)\, p(x)\, dx}{\int_0^T p(x)\, dx} \;=\; b - a \tag{1.3}$$
for any $[a, b] \subset [0, 1]$, where $\chi_{a,b}(x) = 1$ for $x \bmod 1 \in [a, b]$ and $0$ otherwise.

A positive sequence (or the values of a function) is Benford base $B$ if and only if its base $B$ logarithms are equidistributed modulo $1$; this equivalence is at the heart of many investigations of Benford's law (see [6, 25] for a proof).

We use the following notations for the various error terms.

(1) Let $E_{\alpha}(x)$ denote an error of at most $\alpha$ in absolute value; thus, $f(x) = g(x) + E_{\alpha}(x)$ means $|f(x) - g(x)| \le \alpha$.
(2) Big-Oh notation: for $g(x)$ a nonnegative function, we say $f(x) = O(g(x))$ if there exist an $x_0$ and a $C > 0$ such that, for all $x \ge x_0$, $|f(x)| \le C g(x)$.

The following theorem is the starting point for investigating the distribution of digits of order statistics.

Theorem 1.1. Let $\zeta$ have the standard (unit) exponential distribution,
$$\mathrm{Prob}(\zeta \in [\alpha, \beta]) \;=\; \int_{\alpha}^{\beta} e^{-t}\, dt, \qquad 0 \le \alpha \le \beta < \infty. \tag{1.4}$$
For $z \in [0, 1]$, let $F_B(z)$ be the cumulative distribution function of $\log_B \zeta \bmod 1$; thus $F_B(z) := \mathrm{Prob}(\log_B \zeta \bmod 1 \in [0, z])$. Then, for all $z \in [0, 1]$,
$$F_B'(z) \;=\; 1 + 2 \sum_{m=1}^{\infty} \mathrm{Re}\!\left[ e^{-2\pi i m z}\, \Gamma\!\left(1 + \frac{2\pi i m}{\ln B}\right) \right], \tag{1.5}$$
where the coefficients decay rapidly: by (A.9), $|\Gamma(1 + 2\pi i m/\ln B)|^2 = \frac{2\pi^2 m/\ln B}{\sinh(2\pi^2 m/\ln B)}$, so the $m$th term is exponentially small in $m$, and truncating the series after $M$ terms introduces an error that decays exponentially in $M$. The one- and two-term approximations for specific bases, with explicit error bounds, are worked out in Appendix A.

The above theorem was proved in [23]; we provide an alternate proof in Appendix A. As remarked earlier, our technique consists of applying Poisson summation to the derivative of the cumulative distribution function of the logarithms modulo 1; it is then very natural and easy to compare deviations from the resulting distribution and the uniform distribution (if a dataset satisfies Benford's law, then the distribution of its logarithms is uniform). Our series expansions are obtained by applying properties of the Gamma function.
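The rapid decay of the coefficients is easy to quantify. A short sketch, using only the modulus formula $|\Gamma(1 + ix)|^2 = \pi x / \sinh(\pi x)$ quoted in (A.9), tabulates the amplitude $2\,|\Gamma(1 + 2\pi i m/\ln B)|$ of the $m$th term of (1.5) base 10.

    import math

    def term_amplitude(m, B=10.0):
        # 2 |Gamma(1 + 2*pi*i*m / ln B)|, the size of the m-th term in F_B'(z).
        x = 2 * math.pi * m / math.log(B)
        return 2 * math.sqrt(math.pi * x / math.sinh(math.pi * x))

    for m in range(1, 5):
        print(m, term_amplitude(m))
    # Base 10: ~1.1e-1, ~2.2e-3, ~3.7e-5, ~5.9e-7 -- roughly two orders of
    # magnitude gained per additional term.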

Definition 1.2 (exponential behavior, shifted exponential behavior). Let $\zeta$ have the standard exponential distribution, and fix a base $B$. If the distribution of the digits (base $B$) of a set is the same as the distribution of the digits of $\zeta$, then one says that the set exhibits exponential behavior (base $B$). If there is a constant $C > 0$ such that the distribution of digits of all elements multiplied by $C$ is exponential behavior, then one says that the system exhibits shifted exponential behavior (with shift of $\log_B C \bmod 1$).

We briefly describe the reasons behind this notation. One important property of Benford's law is that it is invariant under rescaling; many authors have used this property to characterize Benford behavior. Thus, if a dataset is Benford base $B$, and we fix a positive number $C$, so is the dataset obtained by multiplying each element by $C$. This is clear if, instead of looking at the distribution of the digits, we study the distribution of the base $B$ logarithms modulo $1$. Benford's law is equivalent to the logarithms modulo $1$ being uniformly distributed (see, e.g., [6, 25]); the effect of multiplying all entries by a fixed constant is simply to translate the uniform distribution modulo $1$, which is again the uniform distribution.

The situation is different for exponential behavior. Multiplying all elements by a fixed constant $C$ (where $C \ne B^k$ for any $k \in \mathbb{Z}$) does not preserve exponential behavior; however, the effect is easy to describe. Again looking at the logarithms, exponential behavior is equivalent to the base $B$ logarithms modulo $1$ having a specific distribution which is almost equal to the uniform distribution (at least if the base $B$ is not too large). Multiplying by a fixed constant $C$ shifts the logarithm distribution by $\log_B C \bmod 1$.
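A few lines of Python make the shift visible (the constant $C = 3$ and the sample size are arbitrary choices): multiplying the data by $C$ translates the base-10 logarithms modulo 1 by exactly $\log_{10} C$.

    import numpy as np

    rng = np.random.default_rng(1)
    zeta = rng.exponential(size=500_000)
    C = 3.0                                   # any C that is not a power of 10

    logs = np.log10(zeta) % 1.0               # logs mod 1 of the original data
    logs_scaled = np.log10(C * zeta) % 1.0    # logs mod 1 after rescaling by C
    shift = np.log10(C) % 1.0

    # The rescaled logs are a rigid translation (mod 1) of the originals:
    diff = np.abs((logs + shift) % 1.0 - logs_scaled)
    print(np.max(np.minimum(diff, 1.0 - diff)))   # ~1e-15: a pure shift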

1.2. Results for Differences of Order Statistics

We consider a simple case first, and show how the more general case follows. Let $X_1, \ldots, X_N$ be independent identically distributed from the uniform distribution on $[0, L]$. We consider $L$ fixed and study the limit as $N \to \infty$. Let $X_{1;N}, \ldots, X_{N;N}$ be the $X_i$'s in increasing order. The $X_{i;N}$ are called the order statistics, and satisfy $0 \le X_{1;N} \le X_{2;N} \le \cdots \le X_{N;N} \le L$. We investigate the distribution of the leading digits of the differences between adjacent $X_{i;N}$'s, $X_{i+1;N} - X_{i;N}$. For convenience, we periodically continue the data and set $X_{N+1;N} = X_{1;N} + L$. As we have $N$ differences in an interval of size $L$, the average value of $X_{i+1;N} - X_{i;N}$ is of size $L/N$, and it is sometimes easier to study the normalized differences
$$Z_{i;N} \;=\; \frac{N}{L}\left(X_{i+1;N} - X_{i;N}\right), \qquad 1 \le i \le N. \tag{1.10}$$
As the $X_i$'s are drawn from a uniform distribution, it is a standard result that as $N \to \infty$, the $Z_{i;N}$'s become independent random variables, each having the standard exponential distribution. Thus, as $N \to \infty$, the probability that $Z_{i;N} \in [a, b]$ tends to $\int_a^b e^{-t}\, dt$ (see [26, 27] for proofs).
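This convergence is easy to see numerically; a sketch (with arbitrary $N$, $L$, and test intervals) compares the empirical distribution of the normalized differences with the standard exponential.

    import numpy as np

    rng = np.random.default_rng(2)
    N, L = 200_000, 5.0
    x = np.sort(rng.uniform(0.0, L, size=N))
    z = np.diff(x) * N / L                    # normalized differences Z_{i;N}

    # P(Z in [a, b)) should approach exp(-a) - exp(-b) as N grows.
    for a, b in [(0.0, 0.5), (0.5, 1.0), (1.0, 2.0)]:
        print((a, b), np.mean((z >= a) & (z < b)), np.exp(-a) - np.exp(-b))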

For uniformly distributed random variables, if we know the distribution of $Z_{i;N}$, then we can immediately determine the distribution of the digits of $X_{i+1;N} - X_{i;N}$ base $B$, because
$$\log_B\!\left(X_{i+1;N} - X_{i;N}\right) \;=\; \log_B Z_{i;N} + \log_B \frac{L}{N} \pmod 1. \tag{1.11}$$

As the $Z_{i;N}$'s become independent random variables with the standard exponential distribution as $N \to \infty$, if the $X_i$'s are independent uniformly distributed, the behavior of the digits of the differences is an immediate consequence of Theorem 1.1. Theorem 1.3 (Shifted exponential behavior of differences of independent uniformly distributed random variables). Let $X_1, \ldots, X_N$ be independently distributed from the uniform distribution on $[0, L]$, and let $X_{1;N}, \ldots, X_{N;N}$ be the $X_i$'s in increasing order. As $N \to \infty$, the distribution of the digits (base $B$) of the differences $X_{i+1;N} - X_{i;N}$ converges to shifted exponential behavior, with a shift of $\log_B(N/L) \bmod 1$.

A similar result holds for other distributions. Theorem 1.4 (Shifted exponential behavior of subsets of differences of independent random variables). Let $X_1, \ldots, X_N$ be independent, identically distributed random variables whose density $f$ has a second-order Taylor series at each point with first and second derivatives uniformly bounded, and let the $X_{i;N}$'s be the $X_i$'s in increasing order. Fix a $\delta \in (0, 1)$. Then, as $N \to \infty$, the distribution of the digits (base $B$) of $N^{\delta}$ consecutive differences $X_{i+1;N} - X_{i;N}$ converges to shifted exponential behavior, provided that the $X_{i;N}$'s are from a region where $f$ is nonzero.

The key ingredient in this generalization is that the techniques that show the differences between uniformly distributed random variables become independent exponentially distributed random variables can be modified to handle more general distributions.

We restricted ourselves to a subset of all consecutive spacings because the normalization factor changes throughout the domain. The shift in the shifted exponential behavior depends on which set of differences we study, coming from the variations in the normalizing factors. Within a bin of $N^{\delta}$ consecutive differences, the normalization factor is basically constant, and we may approximate our density with a uniform distribution. It is possible for these variations to cancel and yield Benford behavior for the digits of all the unnormalized differences. Such a result is consistent with the belief that amalgamation of data from many different distributions becomes Benford; however, this is not always the case (see Remark 1.6). From Theorems 1.1 and 1.4, we obtain the following theorem. Theorem 1.5 (Benford behavior for all the differences of independent random variables). Let $X_1, \ldots, X_N$ be independent identically distributed random variables whose density $f$ is compactly supported and has a second-order Taylor series at each point with first and second derivatives uniformly bounded. Let the $X_{i;N}$'s be the $X_i$'s in increasing order, let $F$ be the cumulative distribution function for $f$, and fix a $\delta \in (1/3, 1)$. Let $a_{k;N} = F^{-1}(k N^{-(1-\delta)})$ be the endpoints of the bins $I_k$ of Section 3. For each fixed $N$, assume that
(i) $f$ is not too small on the bins (condition (1.12)): for every $\varepsilon > 0$ there is a $c_{\varepsilon} > 0$ such that $f(a_{k;N}) \ge c_{\varepsilon}$ for all but at most $\varepsilon N^{1-\delta}$ values of $k$; (ii) the sequence of bin shifts is equidistributed (condition (1.13)): for all $[a, b] \subset [0, 1]$,
$$\lim_{N \to \infty} \frac{\#\{k \le N^{1-\delta} : \log_B f(a_{k;N}) \bmod 1 \in [a, b]\}}{N^{1-\delta}} \;=\; b - a. \tag{1.13}$$
Then the distribution of the digits of all the differences $X_{i+1;N} - X_{i;N}$ converges to Benford's law (base $B$) as $N \to \infty$.

Remark 1.6. The conditions of Theorem 1.5 are usually not satisfied. We are unaware of any situation where (1.13) holds; we have included Theorem 1.5 to give a sufficient condition for having Benford's law satisfied exactly, and not just approximately. Example 3.3 shows that the conditions fail for the Pareto distribution, and the limiting behavior oscillates between Benford and a sum of shifted exponential behaviors. (If several datasets each exhibit shifted exponential behavior but with distinct shifts, then the amalgamated dataset is closer to Benford's law than any of the original datasets. This is apparent by studying the logarithms modulo $1$. The differences between these densities and Benford's law will look like Figure 1(b) (except, of course, that different shifts will result in shifting the plot modulo $1$). The key observation is that the unequal shifts mean that we do not have reinforcement from the peaks of the modulo $1$ densities being aligned, and thus the amalgamation will decrease the maximum deviations.) The arguments generalize to many densities whose cumulative distribution functions have tractable closed-form expressions (e.g., the exponential or Weibull distributions).

The situation is very different if instead we study the normalized differences
$$\tilde{Z}_{i;N} \;=\; N f(X_{i;N})\left(X_{i+1;N} - X_{i;N}\right); \tag{1.14}$$
note if $f$ is the uniform distribution on $[0, L]$ (so $f \equiv 1/L$), then (1.14) reduces to (1.10). Theorem 1.7 (Shifted exponential behavior for all the normalized differences of independent random variables). Assume the probability density $f$ satisfies the regularity conditions of Theorem 1.5 and condition (1.12), and $\tilde{Z}_{i;N}$ is as in (1.14). Then, as $N \to \infty$, the distribution of the digits of $\tilde{Z}_{i;N}$ converges to shifted exponential behavior.
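The content of Theorem 1.7 can be checked in a few lines. The sketch below uses a Beta(2, 2) density (an arbitrary smooth, compactly supported choice): the leading digits of the normalized differences (1.14) match the leading digits of standard exponential draws.

    import numpy as np

    rng = np.random.default_rng(3)
    N = 500_000
    x = np.sort(rng.beta(2, 2, size=N))      # density f(t) = 6 t (1 - t) on [0, 1]
    f_left = 6.0 * x[:-1] * (1.0 - x[:-1])   # f at the left order statistic
    z = N * f_left * np.diff(x)              # normalized differences, as in (1.14)
    z = z[z > 0]                             # guard against ties in float samples

    digits = (10.0 ** (np.log10(z) % 1.0)).astype(int)
    ref = (10.0 ** (np.log10(rng.exponential(size=z.size)) % 1.0)).astype(int)
    for d in range(1, 10):
        print(d, np.mean(digits == d), np.mean(ref == d))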

Remark 1.8. Appropriately scaled, the distribution of the digits of the differences is universal, and is the exponential behavior of Theorem 1.1. Thus, Theorem 1.7 implies that the natural quantity to study is the normalized differences of the order statistics, not the differences (see also Remark 3.5). With additional work, we could study densities with unbounded support and show that, through truncation, we can get arbitrarily close to shifted exponential behavior.

Remark 1.9. The main motivation for this work is the need for improved ways of assessing the authenticity and integrity of scientific and corporate data. Benford's law has been successfully applied to detecting income tax, corporate, and voter fraud (see [18–20]); in [21], we use these results to derive new statistical tests to examine data authenticity and integrity. Early applications of these tests to financial data showed that they could detect errors in data downloads, rounded data, and inaccurate ordering of data. These attributes are not easily observable from an analysis of descriptive statistics, and detecting these errors can help managers avoid costly decisions based on erroneous data.

The paper is organized as follows. We prove Theorem 1.1 in Appendix A by using Poisson summation to analyze $F_B'(z)$. Theorem 1.3 follows from the results on the order statistics of independent uniform random variables. The proof of Theorem 1.4 is similar, and is given in Section 2. In Section 3, we prove Theorems 1.5 and 1.7.

2. Proofs of Theorems 1.3 and 1.4

Theorem 1.3 is a consequence of the fact that the normalized differences between the order statistics drawn from the uniform distribution converge to being independent standard exponentials. The proof of Theorem 1.4 proceeds similarly. Specifically, over a short enough region, any distribution with a second-order Taylor series at each point with first and second derivatives uniformly bounded is well approximated by a uniform distribution.

To prove Theorem 1.4, it suffices to show that if $X_1, \ldots, X_N$ are drawn from a sufficiently nice distribution, then for any fixed $\delta \in (0, 1)$, the limiting behavior of the order statistics of $N^{\delta}$ adjacent $X_i$'s becomes Poissonian (i.e., the normalized differences converge to being independently distributed from the standard exponential). We prove this below for compactly supported distributions that have a second-order Taylor series at each point with the first and second derivatives uniformly bounded, and when the adjacent $X_i$'s are from a region where $f$ is bounded away from zero.

For each $N$, consider intervals $[a_N, b_N]$ such that $\int_{a_N}^{b_N} f(x)\, dx = N^{-(1-\delta)}$; thus, the proportion of the total mass in such intervals is $N^{-(1-\delta)}$. We fix such an interval for our arguments. For each $i \in \{1, \ldots, N\}$, let
$$\xi_i \;=\; \begin{cases} 1 & \text{if } X_i \in [a_N, b_N], \\ 0 & \text{otherwise}. \end{cases} \tag{2.1}$$
Note $\xi_i$ is $1$ with probability $N^{-(1-\delta)}$ and $0$ with probability $1 - N^{-(1-\delta)}$; $\xi_i$ is a binary indicator random variable, telling us whether or not $X_i \in [a_N, b_N]$. Thus,
$$\mathbb{E}\!\left[\sum_{i=1}^{N} \xi_i\right] \;=\; N \cdot N^{-(1-\delta)} \;=\; N^{\delta}. \tag{2.2}$$
Let $M_N$ be the number of $X_i$ in $[a_N, b_N]$, and let $g(N)$ be any nondecreasing sequence tending to infinity (in the course of the proof, we will find that we may take any such sequence with $g(N) = o(N^{\delta/2})$). By (2.2) and the central limit theorem (which we may use as the $\xi_i$'s satisfy the Lyapunov condition), with probability tending to $1$, we have
$$M_N \;=\; N^{\delta}\left(1 + O\!\left(g(N)\, N^{-\delta/2}\right)\right). \tag{2.3}$$

We assume that in the interval $[a_N, b_N]$ there exist constants $c$ and $C$ such that whenever $x \in [a_N, b_N]$, $0 < c \le f(x) \le C < \infty$; we assume that these constants hold for all regions investigated and for all $N$. (If our distribution has unbounded support, for any $\epsilon > 0$, we can truncate it on both sides so that the omitted probability is at most $\epsilon$. Our result is then trivially modified to be within $\epsilon$ of shifted exponential behavior.) Thus,
$$c\,(b_N - a_N) \;\le\; \int_{a_N}^{b_N} f(x)\, dx \;=\; N^{-(1-\delta)} \;\le\; C\,(b_N - a_N), \tag{2.4}$$
implying that $b_N - a_N$ is of size $N^{-(1-\delta)}$. If we assume that $f$ has at least a second-order Taylor expansion, then
$$f(x) \;=\; f(a_N) + f'(a_N)(x - a_N) + O\!\left((x - a_N)^2\right). \tag{2.5}$$
As we assume that the first and second derivatives are uniformly bounded, as well as $f$ being bounded away from zero in the intervals under consideration, all Big-Oh constants below are independent of $x$. Thus,
$$f(x) \;=\; f(a_N)\left(1 + O\!\left(N^{-(1-\delta)}\right)\right), \qquad x \in [a_N, b_N]. \tag{2.6}$$

We now investigate the order statistics of the $M_N$ of the $X_i$'s that lie in $[a_N, b_N]$. We know $\int_{a_N}^{b_N} f(x)\, dx = N^{-(1-\delta)}$; by setting $h(x) = N^{1-\delta} f(x)$, then $h$ is the conditional density function for $X_i$, given that $X_i \in [a_N, b_N]$. Thus, $h$ integrates to $1$ over $[a_N, b_N]$, and for $x \in [a_N, b_N]$, we have
$$h(x) \;=\; N^{1-\delta} f(a_N)\left(1 + O\!\left(N^{-(1-\delta)}\right)\right). \tag{2.7}$$

We have an interval of size of order $N^{-(1-\delta)}$, and $M_N$ of the $X_i$ lying in the interval (remember that $g(N)$ is any nondecreasing sequence tending to infinity). Thus, with probability tending to 1, the average spacing between adjacent ordered $X_i$'s is
$$\frac{b_N - a_N}{M_N} \;=\; \frac{1}{N f(a_N)}\left(1 + O\!\left(g(N)\, N^{-\delta/2}\right) + O\!\left(N^{-(1-\delta)}\right)\right); \tag{2.8}$$
in particular, we see that we must choose $g(N) = o(N^{\delta/2})$. As $N \to \infty$, if we fix an $X_i = x \in [a_N, b_N]$, then we expect the next $X_j$ to the right of $x$ to be about $\Delta_N$ units away, where $\Delta_N$ is of size $1/(N f(a_N))$. For a given $t > 0$, we can compute the conditional probability that the next $X_j$ is between $t/(N f(a_N))$ and $(t + \Delta t)/(N f(a_N))$ units to the right. It is simply the difference of the probability that all the other $M_N - 1$ of the $X_j$'s in $[a_N, b_N]$ are not in the interval $\big(x, x + \frac{t}{N f(a_N)}\big]$ and the probability that all other $X_j$'s in $[a_N, b_N]$ are not in the interval $\big(x, x + \frac{t + \Delta t}{N f(a_N)}\big]$; note that we are using the wrapped interval $[a_N, b_N]$.

Some care is required in these calculations. We have a conditional probability, as we assume that $X_i = x$ and that exactly $M_N$ of the $X_j$'s are in $[a_N, b_N]$. Thus, these probabilities depend on two random variables, namely, $X_i$ and $M_N$. This is not a problem in practice, however (e.g., $M_N$ is tightly concentrated about its mean value $N^{\delta}$).

Recalling our expansion for $h$ (and that $M_N \approx N^{\delta}$ and $1/(N f(a_N))$ is of size $N^{-1}$), after simple algebra, we find that with probability tending to 1, for a given $x$ and $M_N$, the first probability is
$$\left(1 - \int_{x}^{x + t/(N f(a_N))} h(u)\, du\right)^{M_N - 1}.$$
The above integral equals $t N^{-\delta}\left(1 + O\!\left(N^{-(1-\delta)}\right)\right)$ (use the Taylor series expansion in (2.7) and note that the interval is of size $N^{-1}$). Using (2.3), it is easy to see that this is a.s. equal to
$$\left(1 - \frac{t}{M_N}\left(1 + o(1)\right)\right)^{M_N - 1}.$$
We, therefore, find that as $N \to \infty$, the probability that none of the $X_j$'s ($j \ne i$) are in $\big(x, x + \frac{t}{N f(a_N)}\big]$, conditioned on $X_i = x$ and $M_N$, converges to $e^{-t}$. (Some care is required, as the exceptional set in our a.s. statement can depend on $t$. This can be surmounted by taking expectations with respect to our conditional probabilities and applying the dominated convergence theorem.)

The calculation of the second probability, the conditional probability that the other $M_N - 1$ $X_j$'s are not in the interval $\big(x, x + \frac{t + \Delta t}{N f(a_N)}\big]$, given $X_i = x$ and $M_N$, follows analogously by replacing $t$ with $t + \Delta t$ in the previous argument. We thus find that this probability is $e^{-(t + \Delta t)}$. As
$$e^{-t} - e^{-(t + \Delta t)} \;=\; e^{-t}\, \Delta t + O\!\left((\Delta t)^2\right),$$
we find that the density of the difference between adjacent order statistics tends to the standard (unit) exponential density; thus, the proof of Theorem 1.4 now follows from Theorem 1.3.
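The limit that drives the proof, $(1 - t N^{-\delta})^{M_N - 1} \to e^{-t}$ with $M_N \approx N^{\delta}$, can be sanity-checked in a few lines (the values of $t$ and $\delta$ are arbitrary):

    import math

    t, delta = 1.3, 0.6
    for N in (10**4, 10**6, 10**8):
        M = round(N ** delta)                     # about N^delta points in the window
        p = (1.0 - t * N ** (-delta)) ** (M - 1)  # probability of no point within t/(N f)
        print(N, p, math.exp(-t))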

3. Proofs of Theorems 1.5 and 1.7

We generalize the notation from Section 2. Let $f$ be any distribution with a second-order Taylor series at each point with first and second derivatives uniformly bounded, and let $X_{1;N} \le \cdots \le X_{N;N}$ be the order statistics. We fix a $\delta \in (1/3, 1)$, and for $k \in \{1, \ldots, N^{1-\delta}\}$, we consider bins $I_k$ such that
$$\int_{I_k} f(x)\, dx \;=\; N^{-(1-\delta)}; \tag{3.1}$$
there are $N^{1-\delta}$ such bins. By the central limit theorem (see (2.3)), if $M_{k;N}$ is the number of order statistics in $I_k$, then, provided that $\epsilon > (2 - 3\delta)/6$ (see Appendix B), with probability tending to $1$, we have
$$M_{k;N} \;=\; N^{\delta}\left(1 + O\!\left(N^{-\delta/2 + \epsilon}\right)\right) \quad \text{simultaneously for all } k; \tag{3.2}$$
of course, we also require $\epsilon < \delta/2$, as, otherwise, the error term is larger than the main term.

Remark 3.1. Before, we considered just one fixed interval; as we are studying $N^{1-\delta}$ intervals simultaneously, we need the extra $N^{\epsilon}$ in the exponent of the error term so that, with high probability, all intervals have, to first order, $N^{\delta}$ order statistics. For the arguments below, it would have sufficed to have an error of size $o(N^{\delta})$. We thank the referee for pointing out that (3.2) does hold simultaneously for all bins, and provide his argument in Appendix B.

Similar to (2.8), the average spacing between adjacent order statistics in $I_k$ is
$$\frac{|I_k|}{M_{k;N}} \;\approx\; \frac{1}{N f(a_{k;N})}, \tag{3.3}$$
where $a_{k;N}$ denotes the left endpoint of $I_k$. Note that (3.3) is the generalization of (1.11); if $f$ is the uniform distribution on $[0, L]$, then $1/(N f(a_{k;N})) = L/N$. By Theorem 1.4, as $N \to \infty$, the distribution of digits of the differences in each bin converges to shifted exponential behavior; however, the variation in the average spacing between bins leads to bin-dependent shifts in the shifted exponential behavior.

Similar to (1.11), we can study the distribution of digits of the differences of the normalized order statistics. If $X_{i;N}$ and $X_{i+1;N}$ are in $I_k$, then
$$\log_B\!\left(N f(a_{k;N})\left(X_{i+1;N} - X_{i;N}\right)\right) \;=\; \log_B\!\left(X_{i+1;N} - X_{i;N}\right) + \log_B N + \log_B f(a_{k;N}) \pmod 1. \tag{3.4}$$
Note we are using the same normalization factor $f(a_{k;N})$ for all differences between adjacent order statistics in a bin. Later, we show that we may replace $f(a_{k;N})$ with $f(X_{i;N})$. As we study all $i$ in the bin $I_k$, it is useful to rewrite the above as
$$\log_B\!\left(X_{i+1;N} - X_{i;N}\right) \;=\; \log_B Z^{(k)}_{i;N} - \log_B N - \log_B f(a_{k;N}) \pmod 1, \tag{3.5}$$
where $Z^{(k)}_{i;N} = N f(a_{k;N})(X_{i+1;N} - X_{i;N})$ denotes the bin-normalized difference. We have $N^{1-\delta}$ bins, so $k \in \{1, \ldots, N^{1-\delta}\}$. As we only care about the limiting behavior, we may safely ignore the first and last bins. We may, therefore, assume that each $a_{k;N}$ is finite and $f(a_{k;N}) > 0$. (Of course, we know that both quantities are finite as we assumed that our distribution has compact support. We remove the last bins to simplify generalizations to noncompactly supported distributions.)

Let $F$ be the cumulative distribution function for $f$. Then,
$$a_{k;N} \;=\; F^{-1}\!\left(k N^{-(1-\delta)}\right). \tag{3.6}$$
For notational convenience, we relabel the bins so that $1 - F(a_{k;N}) = k N^{-(1-\delta)}$; thus, $a_{k;N} = F^{-1}(1 - k N^{-(1-\delta)})$, and the index $k$ now counts the bins from the right edge of the support.

We now prove our theorems, which determine when these bin-dependent shifts cancel (yielding Benford behavior) or reinforce (yielding sums of shifted exponential behavior). Proof of Theorem 1.5. There are approximately $N^{\delta}$ differences in each bin $I_k$. By Theorem 1.4, the distribution of the digits of the differences in each bin converges to shifted exponential behavior. As we assume that the first and second derivatives of $f$ are uniformly bounded, the Big-Oh constants in Section 2 are independent of the bins. The shift in the shifted exponential behavior in each bin is controlled by the last two terms on the right-hand side of (3.5). The $\log_B N$ term shifts the shifted exponential behavior in each bin equally. The bin-dependent shift is controlled by the final term, which up to negligible errors is
$$\log_B f(a_{k;N}) + O\!\left(\frac{N^{-(1-\delta)}}{f(a_{k;N})}\right) \pmod 1. \tag{3.7}$$
Thus, each of the $N^{1-\delta}$ bins exhibits shifted exponential behavior, with a bin-dependent shift composed of the two terms in (3.7). By (1.12), the $f(a_{k;N})$ are not small compared to $N^{-(1-\delta)}$, and hence the second term in (3.7) is negligible. In particular, this factor depends only very weakly on the bin, and tends to zero as $N \to \infty$.
Thus, the bin-dependent shift in the shifted exponential behavior is approximately $\log_B f(a_{k;N}) \bmod 1$. If these shifts are equidistributed modulo 1 (condition (1.13)), then the deviations from Benford behavior cancel, and the shifted exponential behavior of each bin becomes Benford behavior for all the differences.
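In practice, condition (1.13) can be probed for a given density by computing the bin shifts and an equidistribution diagnostic. In the sketch below, the first Weyl sum $|\frac{1}{K}\sum_k e^{2\pi i s_k}|$ is my choice of diagnostic (near 0 for equidistributed shifts, near 1 for clustered ones); the Pareto inverse CDF anticipates Example 3.3.

    import numpy as np

    def bin_shifts(inv_cdf, pdf, N, delta, B=10.0):
        # a_{k;N} = F^{-1}(k N^{-(1-delta)}); shift_k = log_B f(a_{k;N}) mod 1.
        k = np.arange(1, int(N ** (1 - delta)))
        a = inv_cdf(k * N ** (delta - 1.0))
        return (np.log(pdf(a)) / np.log(B)) % 1.0

    # Pareto with minimum 1 and exponent a = 1: F^{-1}(u) = (1-u)^{-1}, f(x) = x^{-2}.
    shifts = bin_shifts(lambda u: (1.0 - u) ** -1.0,
                        lambda x: x ** -2.0, N=10**8, delta=0.5)
    print(abs(np.mean(np.exp(2j * np.pi * shifts))))   # ~0.18: far from equidistributed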

Remark 3.2. Consider the case when the density $f$ is a uniform distribution on some interval. Then, all the $f(a_{k;N})$ are equal, and each bin has the same shift in its shifted exponential behavior. These shifts, therefore, reinforce each other, and the distribution of all the differences is also shifted exponential behavior, with the same shift. This is observed in numerical experiments (see Theorem 1.3 for an alternate proof).

We now analyze the assumptions of Theorem 1.5. The condition from (1.12) is easy to check, and is often satisfied. For example, if the probability density is a finite union of monotonic pieces and is zero only finitely often, then (1.12) holds. This is because on each monotonic piece, $f$ is bounded away from zero once we are away from its (finitely many) zeros, so only a vanishing proportion of the bins can have a small $f(a_{k;N})$ (if $f$ vanishes finitely often, we need to remove small subintervals from the support, but the analysis proceeds similarly). The only real difficulty is a probability distribution with intervals of zero probability. Thus, (1.12) is a mild assumption.

If we choose any distribution other than a uniform distribution, then $f(a_{k;N})$ is not constant; however, (1.13) need not hold (i.e., the shifts $\log_B f(a_{k;N})$ need not become equidistributed modulo 1 as $N \to \infty$). For example, consider a Pareto distribution with minimum value $x_m > 0$ and exponent $a > 0$. The density is
$$f(x) \;=\; \begin{cases} a\, x_m^a\, x^{-(a+1)} & \text{if } x \ge x_m, \\ 0 & \text{otherwise}. \end{cases}$$
The Pareto distribution is known to be useful in modelling natural phenomena, and for appropriate choices of exponents, it yields approximately Benford behavior (see [16]).

Example 3.3. If $f$ is a Pareto distribution with minimum value $x_m > 0$ and exponent $a > 0$, then $f$ does not satisfy the second condition of Theorem 1.5, (1.13).
To see this, note that the cumulative distribution function of $f$ is $F(x) = 1 - (x_m/x)^a$ for $x \ge x_m$. As we only care about the limiting behavior, we need only study the bins away from the endpoints of the support. With the bins relabeled as above, $1 - F(a_{k;N}) = k N^{-(1-\delta)}$ implies that
$$a_{k;N} \;=\; x_m \left(k N^{-(1-\delta)}\right)^{-1/a}.$$
The condition from (1.12) is satisfied, namely,
$$f(a_{k;N}) \;=\; \frac{a}{x_m}\left(k N^{-(1-\delta)}\right)^{(a+1)/a},$$
as $k N^{-(1-\delta)}$ is of size 1 for all but a vanishing proportion of the bins.
Let $\alpha = (a+1)/a$. Then, the bin-dependent shifts are
$$\log_B f(a_{k;N}) \;=\; \alpha \log_B k + C_{a,N} \pmod 1,$$
where $C_{a,N} = \log_B(a/x_m) - \alpha (1-\delta) \log_B N$ does not depend on $k$. Thus, for a Pareto distribution with exponent $a$, the distribution of all the differences becomes Benford if and only if the sequence $\{k^{\alpha}\}$ is Benford. This follows from the fact that a sequence is Benford if and only if its base $B$ logarithms are equidistributed modulo 1. For fixed $a$, $\{k^{(a+1)/a}\}$ is not Benford (e.g., [6]), and thus the condition from (1.13) fails.
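The failure is visible numerically: the leading-digit frequencies of $k^{\alpha}$ drift and never settle at Benford's values. A sketch with $\alpha = 2$ (i.e., $a = 1$; both choices are for concreteness):

    import math

    alpha, K = 2.0, 10**6        # alpha = (a+1)/a with a = 1
    counts = [0] * 10
    for k in range(1, K + 1):
        frac = (alpha * math.log10(k)) % 1.0    # log of the mantissa of k^alpha
        counts[int(10.0 ** frac)] += 1

    for d in range(1, 10):
        print(d, counts[d] / K, math.log10(1 + 1 / d))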

Remark 3.4. We chose to study a Pareto distribution because the distribution of digits of a random variable drawn from a Pareto distribution is close to Benford behavior (base 10) for appropriate choices of the exponent; however, the digits of the differences do not tend to Benford (or shifted exponential) behavior. A similar analysis holds for many distributions with good closed-form expressions for the cumulative distribution function. In particular, if $f$ is the density of an exponential or Weibull distribution, then $f$ does not satisfy the second condition of Theorem 1.5, (1.13).

Modifying the proof of Theorem 1.5 yields our result on the distribution of digits of the normalized differences.

Proof of Theorem 1.7. If $f$ is the uniform distribution, there is nothing to prove. For general $f$, rescaling the differences eliminates the bin-dependent shifts. Let
$$\tilde{Z}_{i;N} \;=\; N f(X_{i;N})\left(X_{i+1;N} - X_{i;N}\right).$$
In Theorem 1.5, we use the same scale factor for all differences in a bin (see (3.4)). As we assume the first and second derivatives of $f$ are uniformly bounded, (2.5) and (2.6) imply that for $X_{i;N} \in I_k$,
$$f(X_{i;N}) \;=\; f(a_{k;N}) + O\!\left(\frac{N^{-(1-\delta)}}{f(a_{k;N})}\right),$$
and the Big-Oh constants are independent of $k$. As we assume that $f$ satisfies (1.12), the error term is negligible.
Thus, our assumptions on $f$ imply that $f$ is basically constant on each bin, and we may replace the local rescaling factor $f(X_{i;N})$ with the bin rescaling factor $f(a_{k;N})$. Thus, each bin of normalized differences has the same shift in its shifted exponential behavior. Therefore, all the shifts reinforce, and the digits of all the normalized differences exhibit shifted exponential behavior as $N \to \infty$.

As an example of Theorem 1.7, in Figure 1 we consider 500,000 independent random variables drawn from a Pareto distribution (with the exponent chosen so that the variance is normalized). We study the distribution of the digits of the normalized differences in base 10. The amplitude of the oscillation about the Benford density matches the amplitude of the shifted exponential behavior of Theorem 1.1 (see the equation in [23, Theorem 2] or (1.5) of Theorem 1.1).
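An experiment along the lines of Figure 1 can be reproduced as follows; this is a sketch, and the exponent $a = 3$ and the histogram grid are illustrative choices rather than the paper's exact settings.

    import numpy as np

    rng = np.random.default_rng(4)
    N, a = 500_000, 3.0
    x = np.sort((1.0 - rng.random(N)) ** (-1.0 / a))   # Pareto samples, minimum 1
    f_left = a * x[:-1] ** (-(a + 1.0))                # density at left order statistic
    z = N * f_left * np.diff(x)                        # normalized differences
    z = z[z > 0]

    # The density of log10(z) mod 1 oscillates about 1 with amplitude
    # 2 |Gamma(1 + 2 pi i / ln 10)| ~ 0.114, per (1.5).
    hist, _ = np.histogram(np.log10(z) % 1.0, bins=50, density=True)
    print(hist.max() - 1.0, hist.min() - 1.0)          # ~ +0.11 and -0.11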

Remark 3.5. The universal behavior of Theorem 1.7 suggests that if we are interested in the behavior of the digits of all the differences, the natural quantity to study is the normalized differences. For any distribution with uniformly bounded first and second derivatives and a second-order Taylor series expansion at each point, we obtain shifted exponential behavior.

Appendices

A. Proof of Theorem 1.1

To prove Theorem 1.1, it suffices to study the distribution of $\log_B \zeta \bmod 1$ when $\zeta$ has the standard exponential distribution (see (1.4)). We have the following useful chain of equalities. Let $[a, b] \subset [0, 1]$. Then,
$$\mathrm{Prob}\left(\log_B \zeta \bmod 1 \in [a, b]\right) \;=\; \sum_{k=-\infty}^{\infty} \mathrm{Prob}\left(\zeta \in \left[B^{a+k}, B^{b+k}\right]\right) \;=\; \sum_{k=-\infty}^{\infty} \left(e^{-B^{a+k}} - e^{-B^{b+k}}\right). \tag{A.1}$$
It suffices to investigate (A.1) in the special case when $a = 0$, as the probability of any interval $[a, b]$ can always be found by subtracting the probability of $[0, a]$ from that of $[0, b]$. We are, therefore, led to study, for $z \in [0, 1]$, the cumulative distribution function of $\log_B \zeta \bmod 1$,
$$F_B(z) \;:=\; \mathrm{Prob}\left(\log_B \zeta \bmod 1 \in [0, z]\right) \;=\; \sum_{k=-\infty}^{\infty} \left(e^{-B^{k}} - e^{-B^{z+k}}\right). \tag{A.2}$$
This series expansion converges rapidly, and Benford behavior for $\zeta$ is equivalent to the rapidly converging series in (A.2) equalling $z$ for all $z$.

As Benford behavior is equivalent to $F_B(z)$ equalling $z$ for all $z \in [0, 1]$, it is natural to compare the derivative $F_B'(z)$ to $1$. If the derivatives were identically equal, then $F_B(z)$ would equal $z$ plus some constant. However, (A.2) is zero when $z = 0$, which implies that this constant would be zero. It is hard to analyze the infinite sum for $F_B(z)$ directly. By studying the derivative $F_B'(z)$, we find a function with an easier Fourier transform than the Fourier transform of $F_B$, which we then analyze by applying Poisson summation.

We use the fact that the derivative of the infinite sum is the sum of the derivatives of the individual summands. This is justified by the rapid decay of the summands (see, e.g., [28, Corollary 7.3]). We find
$$F_B'(z) \;=\; \sum_{k=-\infty}^{\infty} B^{z+k} \ln(B)\, e^{-B^{z+k}}. \tag{A.3}$$

Let $h(t) = B^{z+t} \ln(B)\, e^{-B^{z+t}}$ (for fixed $z$); note $F_B'(z) = \sum_{k=-\infty}^{\infty} h(k)$. As $h$ is of rapid decay in $t$, we may apply Poisson summation (e.g., [29]). Thus,
$$\sum_{k=-\infty}^{\infty} h(k) \;=\; \sum_{m=-\infty}^{\infty} \widehat{h}(m), \tag{A.4}$$
where $\widehat{h}$ is the Fourier transform of $h$: $\widehat{h}(u) = \int_{-\infty}^{\infty} h(t)\, e^{-2\pi i t u}\, dt$. Therefore,
$$F_B'(z) \;=\; \sum_{m=-\infty}^{\infty} \int_{-\infty}^{\infty} B^{z+t} \ln(B)\, e^{-B^{z+t}}\, e^{-2\pi i m t}\, dt. \tag{A.5}$$
Let us change variables by taking $w = B^{z+t}$. Thus, $dw = B^{z+t} \ln(B)\, dt$, or $dt = \frac{dw}{w \ln B}$. As $t = \log_B w - z$, so that $e^{-2\pi i m t} = e^{2\pi i m z}\, w^{-2\pi i m/\ln B}$, we have
$$\widehat{h}(m) \;=\; e^{2\pi i m z} \int_{0}^{\infty} e^{-w}\, w^{-2\pi i m/\ln B}\, dw \;=\; e^{2\pi i m z}\, \Gamma\!\left(1 - \frac{2\pi i m}{\ln B}\right), \tag{A.6}$$
where we have used the definition of the $\Gamma$-function,
$$\Gamma(s) \;=\; \int_{0}^{\infty} e^{-w}\, w^{s-1}\, dw, \qquad \mathrm{Re}(s) > 0. \tag{A.7}$$
As $\Gamma(1) = 1$ (the $m = 0$ term), we have
$$F_B'(z) \;=\; 1 + \sum_{m=1}^{\infty} \left[ e^{2\pi i m z}\, \Gamma\!\left(1 - \frac{2\pi i m}{\ln B}\right) + e^{-2\pi i m z}\, \Gamma\!\left(1 + \frac{2\pi i m}{\ln B}\right) \right]. \tag{A.8}$$

Remark A.1. The above series expansion is rapidly convergent, and shows the deviations of $\log_B \zeta \bmod 1$ from being equidistributed as an infinite sum of special values of a standard function. As $F_B(0) = 0$, integrating (A.8) term by term, we have
$$F_B(z) - z \;=\; \sum_{m \ne 0} \Gamma\!\left(1 - \frac{2\pi i m}{\ln B}\right) \frac{e^{2\pi i m z} - 1}{2\pi i m},$$
which gives a Fourier series expansion for $F_B(z) - z$ with coefficients arising from special values of the $\Gamma$-function.

We can improve (A.8) by using additional properties of the $\Gamma$-function. If $x \in \mathbb{R}$, then from (A.7), we have $\Gamma(1 - ix) = \overline{\Gamma(1 + ix)}$ (where the bar denotes complex conjugation). Thus, the $m$th summand in (A.8) is the sum of a number and its complex conjugate, which is simply twice the real part. We have formulas for the absolute value of the $\Gamma$-function for large argument. We use (see [30, page 946, equation (8.332)]) that
$$|\Gamma(1 + ix)|^2 \;=\; \frac{\pi x}{\sinh(\pi x)}. \tag{A.9}$$
Writing the summands in (A.8) as $2\,\mathrm{Re}\big[e^{-2\pi i m z}\, \Gamma(1 + 2\pi i m/\ln B)\big]$, (A.8) becomes
$$F_B'(z) \;=\; 1 + 2 \sum_{m=1}^{\infty} \sqrt{\frac{\pi x_m}{\sinh(\pi x_m)}}\; \cos\!\left(2\pi m z - \theta_m\right), \qquad x_m = \frac{2\pi m}{\ln B}, \quad \theta_m = \arg \Gamma\!\left(1 + i x_m\right).$$
The rest of the claims of Theorem 1.1 follow from simple estimation, algebra, and trigonometry.
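These formulas make the deviation from Benford's law computable to essentially arbitrary accuracy. The sketch below uses mpmath for the complex $\Gamma$ values (the choice of library, truncation point, and evaluation grid are mine); it recovers the .029–.03 maximum deviation quoted in the introduction.

    import mpmath as mp

    def FB_prime(z, B=10, M=8):
        # (A.8): F_B'(z) = 1 + 2 sum_m Re[e^{-2 pi i m z} Gamma(1 + 2 pi i m/ln B)].
        s = mp.mpf(1)
        for m in range(1, M + 1):
            g = mp.gamma(1 + 2j * mp.pi * m / mp.log(B))
            s += 2 * mp.re(mp.exp(-2j * mp.pi * m * z) * g)
        return s

    # CDF deviation from Benford: F_B(z) - z = int_0^z (F_B'(u) - 1) du.
    dev = lambda z: mp.quad(lambda u: FB_prime(u) - 1, [0, z])
    print(max(abs(dev(j / 20.0)) for j in range(1, 21)))   # ~0.029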

With constants as in the theorem, truncating the series expansion after one term gives an error of at most .378 even in the worst of the bases we consider, while truncating after two terms brings this down to at most .006; for moderate bases such as $B = 10$, the errors are smaller by several orders of magnitude. Thus, just one term is enough to get approximately five digits of accuracy for small bases, and two terms give at least three digits of accuracy in every case considered. For many bases, we have reduced the problem to evaluating $\Gamma(1 + 2\pi i/\ln B)$. This example illustrates the power of Poisson summation, taking a slowly convergent series expansion and replacing it with a rapidly converging one.

Corollary A.2. Let $\zeta$ have the standard exponential distribution. There is no base $B$ such that $\zeta$ is Benford base $B$.

Proof. Consider the infinite series expansion in (1.5). As each summand is a sum of a cosine and a sine term, (1.5) gives a rapidly convergent Fourier series expansion of $F_B'(z)$. If $\zeta$ were Benford base $B$, then $F_B'(z)$ would have to be identically $1$; however, $\Gamma(1 + 2\pi i m/\ln B)$ is never zero for a positive integer $m$ because its modulus is nonzero (see (A.9)). As there is a unique rapidly convergent Fourier series equal to $1$ (namely, the constant function $1$; see [29] for a proof), our $F_B'(z)$ cannot identically equal $1$.

B. Analyzing Intervals Simultaneously

We show why, in addition to the single-interval estimate of Section 2, we also needed the extra $N^{\epsilon}$ factor in the exponent when we analyzed the $N^{1-\delta}$ intervals simultaneously in (3.2); we thank one of the referees for providing this detailed argument.

Let $\xi_1, \xi_2, \ldots$ be i.i.d.r.v. with $\mathbb{E}[\xi_i] = 0$, $\mathbb{E}[\xi_i^2] = \sigma^2 > 0$, $\mathbb{E}[|\xi_i|^3] = \rho < \infty$, and set $S_N = \xi_1 + \cdots + \xi_N$. Let $\Phi$ denote the cumulative distribution function of the standard normal. Using a (nonuniform) sharpening of the Berry-Esséen estimate (e.g., [31]), we find that for some constant $A > 0$,
$$\left|\mathrm{Prob}\!\left(\frac{S_N}{\sigma \sqrt{N}} \le x\right) - \Phi(x)\right| \;\le\; \frac{A\, \rho}{\sigma^3 \sqrt{N}\, \big(1 + |x|\big)^3}. \tag{B.1}$$
Taking $\xi_i = \eta_i - N^{-(1-\delta)}$, where $\eta_i$ is the centered version of the indicator defined by (2.1) for the bin $I_k$, yields
$$\sigma^2 \;=\; N^{-(1-\delta)}\left(1 - N^{-(1-\delta)}\right), \qquad \rho \;=\; O\!\left(N^{-(1-\delta)}\right). \tag{B.2}$$
Thus, (B.1) becomes
$$\left|\mathrm{Prob}\!\left(\frac{S_N}{\sigma \sqrt{N}} \le x\right) - \Phi(x)\right| \;\le\; \frac{A'}{N^{\delta/2}\big(1 + |x|\big)^3} \tag{B.3}$$
for all $x$ (for some $A'$ sufficiently large, depending on $\delta$).

For each $N$ and $k \in \{1, \ldots, N^{1-\delta}\}$, consider the event
$$E_{k;N} \;=\; \left\{\left|M_{k;N} - N^{\delta}\right| \;>\; N^{\delta/2 + \epsilon}\right\}. \tag{B.4}$$
Then, as $N \to \infty$, we have
$$\mathrm{Prob}\!\left(\bigcup_{k \le N^{1-\delta}} E_{k;N}\right) \;\longrightarrow\; 0, \tag{B.5}$$
provided that
$$\sum_{k \le N^{1-\delta}} \mathrm{Prob}\left(E_{k;N}\right) \;\longrightarrow\; 0 \tag{B.6}$$
as $N \to \infty$. Using (B.3) gives
$$\mathrm{Prob}\left(E_{k;N}\right) \;\le\; 2\left(1 - \Phi(N^{\epsilon})\right) + \frac{2A'}{N^{\delta/2}\, N^{3\epsilon}} \;\le\; 2 e^{-N^{2\epsilon}/2} + \frac{2A'}{N^{\delta/2 + 3\epsilon}}, \tag{B.7}$$
where we used $1 - \Phi(x) \le e^{-x^2/2}$ for $x \ge 1$ (e.g., [32]). Thus, the sum in (B.6) is at most
$$N^{1-\delta}\left(2 e^{-N^{2\epsilon}/2} + \frac{2A'}{N^{\delta/2 + 3\epsilon}}\right), \tag{B.8}$$
and this is $o(1)$ provided that $\epsilon > 0$ and $1 - \delta - \delta/2 - 3\epsilon < 0$, that is, provided that $\epsilon > (2 - 3\delta)/6$; recall from (3.2) that we also need $\epsilon < \delta/2$, which is possible once $\delta > 1/3$.
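The effect being controlled here is easy to see in simulation (a sketch with arbitrary parameters): over many bins, the worst count fluctuation exceeds the single-bin central limit scale by a slowly growing factor, which the $N^{\epsilon}$ in the exponent absorbs.

    import numpy as np

    rng = np.random.default_rng(5)
    N, delta = 10**6, 0.6
    bins = int(round(N ** (1 - delta)))     # N^{1-delta} equal-mass bins
    counts = np.bincount((rng.random(N) * bins).astype(int), minlength=bins)
    mean = N ** delta                       # N^delta expected points per bin

    worst = np.abs(counts - mean).max() / mean
    print(worst, mean ** -0.5)   # worst relative deviation vs. 1/sqrt(mean)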

Acknowledgments

The authors would like to thank Ted Hill, Christoph Leuenberger, Daniel Stone, and the referees for numerous helpful comments. S. J. Miller was partially supported by NSF (Grant no. DMS-0600848).