Research Article  Open Access
D. J. Best, J. C. W. Rayner, O. Thas, "Tests of Fit for the Logarithmic Distribution", Advances in Decision Sciences, vol. 2008, Article ID 463781, 8 pages, 2008. https://doi.org/10.1155/2008/463781
Tests of Fit for the Logarithmic Distribution
Abstract
Smooth tests for the logarithmic distribution are compared with three tests: the first is a test due to Epps and is based on a probability generating function, the second is the Anderson-Darling test, and the third is due to Klar and is based on the empirical integrated distribution function. These tests all have substantially better power than the traditional Pearson-Fisher test of fit for the logarithmic. These traditional chi-squared tests are the only logarithmic tests of fit commonly applied by ecologists and other scientists.
1. Introduction
Species diversity data can sometimes be modeled by a zero-truncated negative binomial distribution with index parameter near zero. Fisher et al. [1] examined the limit as the index parameter of this distribution approached zero and so derived the logarithmic distribution. A random variable X has this distribution if and only if
$$P(X = x) = \frac{\gamma \beta^{x}}{x}, \quad x = 1, 2, 3, \ldots,$$
in which $0 < \beta < 1$ and $\gamma = -1/\ln(1 - \beta)$. The logarithmic or log-series distribution is often applied to species diversity data.
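As a minimal sketch, the logarithmic probabilities can be evaluated directly, assuming the usual form P(X = x) = γβ^x/x with γ = −1/ln(1 − β). The function names are illustrative and do not come from the paper; the mean and variance expressions are the standard ones for this distribution.

```python
import math


def log_series_pmf(x: int, beta: float) -> float:
    """Logarithmic (log-series) probability P(X = x) = gamma * beta**x / x, x = 1, 2, ..."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    if x < 1:
        return 0.0  # the support starts at 1; there is no zero class
    gamma = -1.0 / math.log1p(-beta)  # log1p(-beta) equals ln(1 - beta)
    return gamma * beta**x / x


def log_series_mean(beta: float) -> float:
    """Mean gamma * beta / (1 - beta)."""
    gamma = -1.0 / math.log1p(-beta)
    return gamma * beta / (1.0 - beta)


def log_series_variance(beta: float) -> float:
    """Variance gamma * beta * (1 - gamma * beta) / (1 - beta)**2."""
    gamma = -1.0 / math.log1p(-beta)
    return gamma * beta * (1.0 - gamma * beta) / (1.0 - beta) ** 2
```

Using `math.log1p` rather than `math.log(1 - beta)` keeps γ accurate when β is close to zero.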
As an example of species diversity data which the logarithmic distribution may fit, consider the following data on insect catches from the Sierra Tarahumara, Mexico, reported by Aldrete [2]. Ten species were caught precisely once, three species were caught precisely twice, and so on according to Table 1. The expected line in Table 1 shows the expected counts on fitting a logarithmic distribution. For these data, the alpha index of diversity (AI) is 9.01, where $\mathrm{AI} = n(1 - \hat{\beta})/\hat{\beta}$, in which n is the total number of insects and $\hat{\beta}$ is the maximum likelihood estimator of $\beta$. The AI quoted here is defined in Krebs [3, 12.13]; for a discussion of the index of diversity see Krebs [3, Section 12.4.1]. Note that larger AI implies more diversity while smaller AI implies less.
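The alpha index is simple to compute once an estimate of β is available. The sketch below assumes Fisher's form AI = n(1 − β̂)/β̂; the function name and the numerical inputs are hypothetical and are not the insect data of Table 1.

```python
def alpha_index(n_individuals: int, beta_hat: float) -> float:
    """Alpha index of diversity, AI = n * (1 - beta_hat) / beta_hat."""
    if n_individuals <= 0:
        raise ValueError("n_individuals must be positive")
    if not 0.0 < beta_hat < 1.0:
        raise ValueError("beta_hat must lie strictly between 0 and 1")
    return n_individuals * (1.0 - beta_hat) / beta_hat


# Hypothetical inputs: for a fixed total n, a smaller beta_hat gives a larger AI.
for beta_hat in (0.99, 0.95, 0.90):
    print(beta_hat, round(alpha_index(500, beta_hat), 2))
```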

It would seem sensible to test the data for consistency with the logarithmic distribution before quoting an AI value. However, the only statistic that appears to be commonly used by ecologists as a test of fit for the logarithmic distribution is the so-called chi-squared test, which, as Krebs [3, Section 12.4.1] notes, may not always have good power.
"The goodness-of-fit of the logarithmic series ... can be tested by the usual chi-squared goodness-of-fit test ... this means low power .... Thus in most cases the decision to use the logarithmic series ... must be made on ecological grounds, rather than statistical goodness-of-fit criteria." [3, page 429].
In this paper, we will examine a number of statistical tests which are considerably more powerful than the traditional Pearson-Fisher test. These include tests of fit based on components of Neyman's smooth test statistic, the Anderson-Darling test discussed by Lockhart et al. [4], an empirical integrated distribution function test given by Klar [5], and a test due to Epps [6] based on a probability generating function (pgf). We suggest that these could be used to help make a decision as to whether or not to use the logarithmic series based on statistical as well as ecological criteria. In particular, the dispersion statistic, $\hat{V}_2$, defined subsequently, should be useful for identifying the not infrequent case of data for which the abundant species are more abundant than predicted by the logarithmic series.
Our second example is included for its somewhat curious interest and is not involved with conventional species diversity. Collins and Hand [7] have counted the number of times, in the period 1983 to 2000, that a Hollywood film won one Oscar, two Oscars, three Oscars, and so on, giving the data in Table 2. The film with 11 Oscars was "Titanic."

2. Tests of Fit for the Logarithmic
A discussion of smooth tests of fit and their components, particularly when testing for the logarithmic, is given in Appendix A. These tests may be derived as a routine application of Rayner and Best [8, Theorem 6.1.1]. The first-order component is identically zero when $\beta$ is estimated by maximum likelihood or, equivalently, by the method of moments. The test based on the component $\hat{V}_2$ suggests whether or not the data are consistent with the logarithmic variance, while the test based on $\hat{V}_3$ suggests whether or not the data are consistent with logarithmic moments up to the third. To find p-values for these tests, we suggest using the parametric bootstrap, because convergence to the asymptotic standard normal distribution is very slow. See Gürtler and Henze [9] and Appendix B for details of the parametric bootstrap in a goodness-of-fit context.
In Section 3, we give powers for the Anderson-Darling test based on the statistic in which , and O_{x} is the number of observations equal to x. Summation is halted when x is the minimum such that O_{x} = 0 and . We also give powers for a test given by Klar [5], based on the empirical integrated distribution function with test statistic in which M is the largest observation. Finally, for comparison purposes, we quote powers of the pgf and tests given by Epps [6].
3. Power Comparisons
Random deviates from the logarithmic (L), positive Poisson (P+), and positive geometric (G+) distributions were generated using IMSL [10] routines RNLGR, RNPOI, and RNGEO. Random zeta deviates (Z) and random Yule deviates (Y) were found using algorithms of Devroye [11, pages 551 and 553]. Table 3 gives powers for the same alternatives as used by Epps [6], but with the addition of two Yule alternatives. For convenience, we reproduce the powers given by Epps for his pgf and tests. The powers we give for , T_{n}, , , and + were found using parametric bootstrap with 1000 simulations both for the inner and the outer loops. Note that the calculation of can involve large numbers, and calculation of the pgf and statistics can involve small numbers. Care with rounding error may be needed. The statistics T_{n} and are less prone to rounding error. Klar [5] notes that the smooth tests, the test, and the pgf test are not consistent against all alternatives.

From Table 3 our powers for T_{n} are a little greater than those of Klar [5], and we observe that the power for the Z(1.0) alternative is 0.73, somewhat larger than the 0.40 reported by Klar [5]. Also from Table 3, we see that the test is not generally competitive with the other tests.
The test based on $\hat{V}_2^2 + \hat{V}_3^2$ performs reasonably well. The test based on the $T_{n}$ statistic has power a little less than that for the pgf and $A^2$-based tests. An advantage of the test based on $T_{n}$ is that Klar [5] showed it is consistent.
The test based on the dispersion statistic $\hat{V}_2$ has good power for the zeta and Yule alternatives, while the $A^2$ and pgf tests generally have competitive powers for all alternatives. Clearly, the test based on $\hat{V}_2$ will not have good power for alternatives with similar dispersion to the logarithmic distribution. If the test based on $\hat{V}_2$ is not significant but that based on $\hat{V}_3$ is, this suggests a skewness departure from the logarithmic distribution. However, if the test based on $\hat{V}_2$ is significant, then the test based on $\hat{V}_3$ may be significant due to either a dispersion or a skewness departure of the data from the logarithmic distribution. Notice that we say a test suggests how the data deviate from the logarithmic. We do not claim that the data actually do deviate in this manner. See the comments of Henze and Klar [12].
On the basis of Table 3 powers, we suggest that the tests based on $A^2$ and Klar's $T_{n}$ be considered as tests of fit for the logarithmic distribution. These tests have good power and are consistent. We recommend that the tests based on $T_{n}$ and $A^2$ be augmented by the use of $\hat{V}_2$ and $\hat{V}_3$ in a data analytic fashion.
4. Examples
In the following, parametric bootstrap p-values for the tests based on $\hat{V}_2$, $\hat{V}_3$, $A^2$, and $T_{n}$ are given. These use 1000 random samples of the logarithmic distribution with parameter $\hat{\beta}$ as given below. We give $\hat{V}_2$ and $\hat{V}_3$ values because they may suggest how the data deviate from the logarithmic. We give the $A^2$ and $T_{n}$ values because the tests based on these statistics are consistent and have good power.
4.1. Insect Data
From the data in Table 1, we find with p-value 0.52, with p-value 0.19, with p-value 0.82, and with p-value 0.33. It appears that the logarithmic distribution is a good fit. In agreement with this, the Pearson-Fisher statistic takes the value 4.56 on 11 degrees of freedom when data greater than 12 have been combined.
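For comparison, the traditional Pearson-Fisher calculation with a pooled upper tail can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name and the `counts` mapping (value x to observed count O_x) are hypothetical, and the degrees of freedom are taken as the number of classes minus two, which matches the 11 degrees of freedom quoted when values greater than 12 are combined.

```python
import math


def pearson_fisher(counts: dict, beta_hat: float, last_single: int):
    """Pearson chi-squared statistic for a fitted logarithmic(beta_hat).

    Values 1..last_single each form a class; all larger values are pooled.
    Returns (statistic, degrees_of_freedom).
    """
    n = sum(counts.values())
    gamma = -1.0 / math.log1p(-beta_hat)
    probs = [gamma * beta_hat**x / x for x in range(1, last_single + 1)]
    observed = [counts.get(x, 0) for x in range(1, last_single + 1)]
    # Pooled upper tail: every value greater than last_single.
    probs.append(1.0 - sum(probs))
    observed.append(n - sum(observed))
    statistic = sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))
    # Number of classes, minus one, minus one more for the estimated beta.
    dof = len(probs) - 2
    return statistic, dof
```

With `last_single=12` the function uses 13 classes and reports 11 degrees of freedom; with `last_single=8` it uses 9 classes and reports 7.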
4.2. Oscars Data
We find that and on 7 degrees of freedom if the classes greater than or equal to 9 are combined. The corresponding p-values are 0.58, 0.08, and 0.62. It appears that a logarithmic distribution with $\hat{\beta}$ = 0.7044 fits the data reasonably well. However, the p-value for $\hat{V}_3$ suggests that the data may not be quite as skewed as would be expected for the logarithmic distribution. Collins and Hand [7] suggest a Yule distribution fits the data well. In addition, we note that with p-value 0.12 and with p-value 0.26.
5. Conclusion
In this paper, we have examined a number of statistical tests which are considerably more powerful than the traditional Pearson-Fisher test. We suggest that these could be used to help make a decision as to whether or not to use the logarithmic series based on statistical as well as ecological criteria. A test of fit could be done before quoting the index of diversity. In particular, the dispersion statistic, $\hat{V}_2$, should be useful for identifying the not infrequent case of data for which the abundant species are more abundant than predicted by the logarithmic series.
Appendices
A. The Smooth Tests and Their Components
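For distributions from exponential families, the smooth tests can be derived as score statistics for testing $H_0: \theta = 0$ against $K: \theta \neq 0$ for observations from the model
$$g(x; \theta, \beta) = C(\theta, \beta) \exp\Big\{\sum_{i} \theta_i h_i(x; \beta)\Big\} f(x; \beta),$$
in which
(i) $f(x; \beta)$ is a probability density function that depends on a vector of nuisance parameters $\beta$ and for which we test;
(ii) $\{h_i(x; \beta)\}$ is a complete orthonormal set on $f(x; \beta)$;
(iii) $C(\theta, \beta)$ is a normalizing constant.
For details see Rayner and Best [8].
The score test statistic has a particularly appealing form
$$\hat{S} = \sum_{r} \hat{V}_r^2, \quad \text{where} \quad \hat{V}_r = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} h_r(x_j; \hat{\beta}).$$
Here, $\hat{\beta}$ is the maximum likelihood estimator of $\beta$ assuming that $H_0$ is true.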
To define $\hat{V}_r$, central moments up to order 2r are required. For example, to directly define components up to $\hat{V}_3$ to test for the logarithmic, we note that the equation to estimate $\beta$ is $\hat{V}_1 \equiv 0$, as discussed below. To define $\hat{V}_2$ and $\hat{V}_3$ requires $h_2$ and $h_3$, which in turn require central logarithmic moments up to order six. These are given by To calculate further orthonormal polynomials directly, we could use the result that for the logarithmic, generates cumulants and hence central moments, but it is more efficient to use recurrence as described in Rayner et al. [13]. Proceeding directly, the first six central moments can be used to calculate in which These formulas give the first three orthonormal polynomials for any univariate distribution.
The components can be called smooth components as they are analogous to the components of the smooth test for uniformity introduced by Neyman [14]. His smooth components also used orthonormal polynomials. When testing for distributions from exponential families these components are asymptotically independent and asymptotically have the standard normal distribution.
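One hedged way to obtain the components numerically, without closed-form moment formulas, is to orthonormalise 1, z, z², z³ (with z = x minus the mean) by Gram-Schmidt under the logarithmic probabilities on a truncated support, and then sum the resulting polynomials over the data. This is a numerical substitute and not the authors' derivation; the function name and the truncation rule are assumptions made for this sketch.

```python
import math
from collections import Counter


def smooth_components(sample, beta_hat, order=3):
    """Return [V_1, ..., V_order], where V_r = sum_j h_r(x_j) / sqrt(n).

    The h_r are polynomials orthonormal under the logarithmic(beta_hat)
    probabilities, built by Gram-Schmidt on a grid 1..x_max.
    """
    # Assumed truncation: beta_hat**x is below about e**-60 beyond this point.
    x_max = max(max(sample), int(60.0 / -math.log(beta_hat)) + 50)
    gamma = -1.0 / math.log1p(-beta_hat)
    xs = list(range(1, x_max + 1))
    probs = [gamma * beta_hat**x / x for x in xs]
    mean = sum(x * p for x, p in zip(xs, probs))

    def inner(u, v):
        # Inner product <u, v> = E[u(X) v(X)] under the null distribution.
        return sum(p * a * b for p, a, b in zip(probs, u, v))

    basis = []  # basis[r][x - 1] holds h_r(x)
    for r in range(order + 1):
        v = [(x - mean) ** r for x in xs]
        for h in basis:  # remove projections on lower-degree polynomials
            c = inner(v, h)
            v = [a - c * b for a, b in zip(v, h)]
        norm = math.sqrt(inner(v, v))
        basis.append([a / norm for a in v])

    counts = Counter(sample)
    n = len(sample)
    return [
        sum(cnt * basis[r][x - 1] for x, cnt in counts.items()) / math.sqrt(n)
        for r in range(1, order + 1)
    ]
```

When `beta_hat` solves the mean equation, the first returned value is zero up to rounding, so only the second and third components carry information.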
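For the logarithmic distribution, the maximum likelihood and method of moments estimators of $\beta$ coincide, given by $\hat{V}_1 \equiv 0$ or
$$\bar{X} = \frac{-\hat{\beta}}{(1 - \hat{\beta}) \ln(1 - \hat{\beta})}.$$
To solve this equation, the Newton-Raphson algorithm can be used. An initial estimate of $\hat{\beta}$ and other details helpful in the solution are given in Birch [15]. Note also that for the logarithmic, $\hat{V}_2$ is proportional to $m_2 - \hat{\mu}_2$, where $m_2 = \sum_{j=1}^{n} (X_j - \bar{X})^2 / n$ and $\hat{\mu}_2$ is the logarithmic variance evaluated at $\hat{\beta}$, so the test based on $\hat{V}_2$ tests for the dispersion of the logarithmic distribution. Similarly, if $m_3 = \sum_{j=1}^{n} (X_j - \bar{X})^3 / n$, then the numerator of $\hat{V}_3$ is a linear combination of $m_3$, $m_2$, and a constant depending on $\hat{\beta}$, so the test based on $\hat{V}_3$ assesses whether the data are consistent with moments of the logarithmic up to the third.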
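The estimating equation sets the sample mean equal to the logarithmic mean −β/{(1 − β) ln(1 − β)}. The sketch below solves it by bisection rather than Newton-Raphson; that is a deliberate substitution, chosen because the mean increases monotonically in β, so bisection needs no starting value. The function names are illustrative.

```python
import math


def log_series_mean(beta: float) -> float:
    """Mean of the logarithmic distribution: -beta / ((1 - beta) * ln(1 - beta))."""
    return -beta / ((1.0 - beta) * math.log1p(-beta))


def fit_beta(sample_mean: float, tol: float = 1e-12) -> float:
    """Solve log_series_mean(beta) = sample_mean for beta by bisection."""
    if sample_mean <= 1.0:
        # The logarithmic mean always exceeds 1, so no solution exists.
        raise ValueError("sample mean must exceed 1")
    lo, hi = 1e-12, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if log_series_mean(mid) < sample_mean:
            lo = mid  # mean too small: beta must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

A sample consisting only of ones has mean 1 and therefore no estimate; the function raises `ValueError` in that case rather than returning a boundary value.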
B. P-Values via the Parametric Bootstrap
Gürtler and Henze [9, page 223] suggest that p-values can be obtained using an analogue of the parametric bootstrap. If $W_n$ denotes a test statistic, calculate $w_n = W_n(x_1, \ldots, x_n)$, where $x_1, \ldots, x_n$ denote, as usual, the data. Find an estimate $\hat{\beta}$ from the data and, conditional on this estimate, generate, say, $B$ pseudorandom samples of size n, each having the logarithmic ($\hat{\beta}$) distribution. For $j = 1, \ldots, B$, compute the value $w_{n,j}^{*}$ of $W_n$ on the jth random sample. The parametric bootstrap p-value is then the proportion of the $w_{n,j}^{*}$ that are at least the observed $w_n$, namely, $\#\{j : w_{n,j}^{*} \geq w_n\}/B$.
The above requires random logarithmic ($\beta$) values. Devroye [11, page 547] outlines an algorithm for generating random logarithmic deviates. Alternatively, the routine RNLGR from IMSL [10] can be used. To obtain p-values for two-tailed tests, proceed as above and find the p-value, say P. Then if $P \leq 0.5$, the two-tailed p-value is $2P$, while if $P > 0.5$, the two-tailed p-value is $2(1 - P)$.
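The procedure in this appendix can be sketched in Python as follows. Everything here is illustrative: the function names are hypothetical, β is estimated by bisection on the mean equation, and deviates are drawn by simple sequential inversion of the distribution function, which is not necessarily the algorithm outlined by Devroye [11] or used by RNLGR.

```python
import math
import random


def log_series_mean(beta):
    return -beta / ((1.0 - beta) * math.log1p(-beta))


def fit_beta(sample):
    """ML / moment estimate of beta by bisection (sample mean must exceed 1)."""
    xbar = sum(sample) / len(sample)
    if xbar <= 1.0:
        raise ValueError("sample mean must exceed 1")
    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if log_series_mean(mid) < xbar:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def draw_log_series(beta, rng):
    """One logarithmic(beta) deviate by sequential inversion."""
    u = rng.random()
    x = 1
    p = -beta / math.log1p(-beta)  # P(X = 1)
    while u > p and p > 0.0:
        u -= p
        x += 1
        p *= beta * (x - 1) / x  # P(X = x) obtained from P(X = x - 1)
    return x


def bootstrap_p_value(sample, statistic, n_boot=1000, seed=0):
    """Proportion of bootstrap statistics at least as large as the observed one."""
    rng = random.Random(seed)
    n = len(sample)
    beta_hat = fit_beta(sample)
    w_obs = statistic(sample)
    hits = 0
    for _ in range(n_boot):
        pseudo = [draw_log_series(beta_hat, rng) for _ in range(n)]
        if statistic(pseudo) >= w_obs:
            hits += 1
    return hits / n_boot


def two_tailed(p):
    """Two-tailed p-value: 2P if P <= 0.5, otherwise 2(1 - P)."""
    return 2.0 * p if p <= 0.5 else 2.0 * (1.0 - p)
```

The `statistic` argument is any callable that maps a sample to a number. A statistic that re-estimates β inside each pseudo-sample has to decide how to treat a pseudo-sample consisting only of ones, for which no estimate exists.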
Acknowledgment
Research of Olivier Thas was supported by IAP research network Grant no. P6/03 of the Belgian Government (Belgian Science Policy).
References
[1] R. A. Fisher, A. S. Corbet, and C. B. Williams, “The relation between the number of species and the number of individuals in a random sample of an animal population,” Journal of Animal Ecology, vol. 12, no. 1, pp. 42–58, 1943.
[2] A. N. G. Aldrete, “Psocoptera (insecta) from the Sierra tarahumara, Chihuahua, Mexico,” Anales del Instituto de Biología, Universidad Nacional Autónoma de México, Serie Zoología, vol. 73, no. 2, pp. 145–156, 2002.
[3] C. J. Krebs, Ecological Methodology, Addison Wesley Longman, New York, NY, USA, 1998.
[4] R. A. Lockhart, J. J. Spinelli, and M. A. Stephens, “Cramér-von Mises statistics for discrete distributions with unknown parameters,” Canadian Journal of Statistics, vol. 35, no. 1, pp. 125–133, 2007.
[5] B. Klar, “Goodness-of-fit tests for discrete models based on the integrated distribution function,” Metrika, vol. 49, no. 1, pp. 53–69, 1999.
[6] T. W. Epps, “A test of fit for lattice distributions,” Communications in Statistics: Theory and Methods, vol. 24, no. 6, pp. 1455–1479, 1995.
[7] A. Collins and C. Hand, “Vote clustering in tournaments: what can Oscar tell us?” Creativity Research Journal, vol. 18, no. 4, pp. 427–434, 2006.
[8] J. C. W. Rayner and D. J. Best, Smooth Tests of Goodness of Fit, The Clarendon Press, Oxford University Press, New York, NY, USA, 1989.
[9] N. Gürtler and N. Henze, “Recent and classical goodness-of-fit tests for the Poisson distribution,” Journal of Statistical Planning and Inference, vol. 90, no. 2, pp. 207–225, 2000.
[10] IMSL, “Users' Manual,” IMSL, Houston, Tex, USA, 1995.
[11] L. Devroye, Non-Uniform Random Variate Generation, Springer, New York, NY, USA, 1986.
[12] N. Henze and B. Klar, “Properly rescaled components of smooth tests of fit are diagnostic,” Australian & New Zealand Journal of Statistics, vol. 38, no. 1, pp. 61–74, 1996.
[13] J. C. W. Rayner, O. Thas, and B. De Boeck, “A generalised Emerson recurrence relation,” to appear in Australian & New Zealand Journal of Statistics.
[14] J. Neyman, ““Smooth test” for goodness of fit,” Skandinavisk Aktuarietidskrift, vol. 20, pp. 149–199, 1937.
[15] M. W. Birch, “194 Note: an algorithm for the logarithmic series distribution,” Biometrics, vol. 19, no. 4, pp. 651–652, 1963.
Copyright
Copyright © 2008 D. J. Best et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.