Abstract

Anne Chao proposed a very popular nonparametric estimator of the species richness of a community, based on a sample of limited size drawn from that community. This expression was originally derived on a statistical basis as a lower-bound estimate of the number of species missing from the sample and accordingly provides a minimal threshold for the estimated total species richness of the community. Hereafter, we propose an alternative, algebraic derivation of Chao’s estimator, thereby demonstrating that Chao’s formulation may also provide centered estimates (and not only a lower-bound threshold), provided that the sampled communities satisfy a specific type of SAD (species abundance distribution). This particular SAD corresponds to the case in which the number of unrecorded species in the sample tends to decrease exponentially with increasing sample size. It turns out that the shape of this “ideal” SAD often conforms approximately to the types usually recorded in nature, such as “log-normal” or “broken-stick.” Accordingly, this may explain why Chao’s formulation is generally recognized as a particularly satisfying nonparametric estimator.

1. Introduction

Estimating the total species richness of large communities from samples of limited size is a common, long-standing challenge that has given rise to numerous estimation procedures.

For a few decades, a series of so-called nonparametric estimators have provided elegant and convenient solutions to this question. These estimators are usually simple formulations which, moreover, require no specific assumption regarding the statistical distribution of species abundances (hence the term “nonparametric”).

Within this category of formulations, Anne Chao proposed a very popular nonparametric expression, which actually stands among the most commonly used estimators of the total species richness of a sampled community.

Let $\Delta$ be the number of species that are missed by (i.e., unrecorded within) the limited sampling of a large community of species. Then, according to Chao’s formulation [1],

$$\Delta = \frac{f_1^{2}}{2 f_2} \qquad (1)$$

with $f_1$ and $f_2$ as the numbers of species encountered only once and twice, respectively, in the sample.

This expression was later generalised [2] in terms of $f_x$, the number of species recorded $x$ times in the sample.
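As an illustration, the singleton/doubleton form of the estimate, $f_1^2/(2 f_2)$, can be computed directly from per-species counts. The sketch below is a minimal Python example: the function name `chao1_missing`, the input format, and the sample counts are hypothetical, and the estimate is left undefined when no species is recorded exactly twice.

```python
from collections import Counter

def chao1_missing(counts):
    """Classic Chao estimate f1^2 / (2*f2) of the number of unrecorded species.

    counts: per-species numbers of individuals in the sample
    (hypothetical input format; zero counts are ignored).
    """
    freq = Counter(c for c in counts if c > 0)
    f1, f2 = freq[1], freq[2]  # singletons and doubletons
    if f2 == 0:
        raise ValueError("estimate undefined when there are no doubletons")
    return f1 * f1 / (2 * f2)

# Made-up sample: 4 singletons, 2 doubletons, 3 more abundant species
sample = [1, 1, 1, 1, 2, 2, 5, 9, 14]
missing = chao1_missing(sample)      # 4**2 / (2 * 2)
richness = len(sample) + missing     # recorded species + estimated missing
```

With these made-up counts, the nine recorded species are complemented by four estimated missing species.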

These expressions, derived on a statistical basis, provide a lower-bound estimate of the number of missing species in a sampled community [1, 2]. That is, Chao’s formulation is expected to provide only a minimal threshold (in the statistical sense) for the estimated species richness. Yet, several decades of practice (especially in ecological studies and biodiversity surveys) have placed Chao’s expression among the most valuable and reliable estimators [3–7] since, on many occasions, it appears to provide approximately centered rather than only lower-bound estimates. In short, although designed conceptually as a lower-bound evaluation, Chao’s formulation nonetheless fairly often provides rather centered estimates in common practice.

Hereafter, we address this apparent paradox and propose new insights and arguments, drawn from an alternative, algebraic derivation of the formulation that Chao originally derived statistically.

2. The Specific Condition Allowing an Alternative, Algebraic Derivation of the “Chao” Formulation

We demonstrate (see Appendix A for mathematical details) that the general expression of Chao’s estimator, originally established on a statistical basis, also admits an algebraic derivation leading to exactly the same expression as the statistically derived formulation.

Yet, while the statistical derivation of Chao’s formulation requires no particular restriction (“nonparametric”), the algebraic derivation implies a particular shape for the expected decrease of the proportion of missing (i.e., unrecorded) species when the sample size increases. In fact, as demonstrated in Appendix A, this asymptotic decrease should conform to a negative exponential:

$$\Delta = S \exp(-kN) \qquad (3)$$

where $\Delta$ is the number of missing species, with (i) $S$ as the species richness, that is, the total (unknown) number of species of the community, (ii) $k$ as a constant, (iii) $N$ as the sample size, that is, the number of individuals recorded in the sample.

In turn, this particular form of the decrease of the number of missing species with enlarging sample sizes constrains the shape of the species abundance distribution (the “SAD,” that is, the distribution of species abundances when species are conventionally ranked by decreasing order of abundance).

According to (3), the number $R$ of recorded species in a sample of size $N$ is

$$R = S - \Delta = S\,[1 - \exp(-kN)] \qquad (4)$$

where $S$ is the total species richness, $\Delta$ the number of missing species, and $k$ the constant of (3). Let $n$ be the number of individuals belonging to the least abundant species among the $R$ species recorded in the sample (i.e., the species of rank $R$ when species are ranked by decreasing order of abundance). Then, the relative abundance $a_R$ of the species of rank $R$ is expected to be inversely proportional to $N$, as

$$a_R = \frac{n}{N} \qquad (5)$$

According to (4) and (5),

$$a_R = \frac{-k\,n}{\ln(1 - R/S)} \qquad (6)$$

Depending on the sample size $N$, every species may in turn play the role of the least abundant species within the sample (since, by continuously decreasing the size of the sample, each species of the community [including, at last, the most common] would successively play the role of the least abundant species in the sample).

Therefore, (6) holds for any species of any rank $i$ in the SAD:

$$a_i = \frac{-k\,n}{\ln(1 - i/S)} \qquad (7)$$

This equation thus describes the shape of the species abundance distribution ($a_i$ as a function of rank $i$) when the proportion of unrecorded species in a sample decreases exponentially with the sample size, that is, the shape of the species abundance distribution that conditions the validity of the algebraic derivation of Chao’s estimator. Figure 1 provides examples of the corresponding shapes of the SAD.
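The shape of such a distribution can be explored numerically. The sketch below is only illustrative and rests on stated assumptions: it takes the relative abundance of the species of rank $i$ among $S$ species to be proportional to $-1/\ln(1 - i/S)$ (the form implied by an exponentially decreasing number of missing species), shifts each rank by one half so that the last rank does not produce $\ln(0)$, and normalises the abundances to sum to 1. The function names and the choice $S = 50$ are hypothetical.

```python
import math

def ideal_sad(S):
    """Relative abundances proportional to -1 / ln(1 - i/S), for ranks i = 1..S.

    Assumption: rank i is replaced by (i - 0.5) to avoid ln(0) at i = S,
    and the result is normalised so that the abundances sum to 1.
    """
    raw = [-1.0 / math.log(1.0 - (i - 0.5) / S) for i in range(1, S + 1)]
    total = sum(raw)
    return [r / total for r in raw]

def expected_missing(abundances, N):
    """Expected number of species absent from a random sample of N individuals."""
    return sum((1.0 - a) ** N for a in abundances)

sad = ideal_sad(50)  # ranked from most to least abundant
curve = [(N, expected_missing(sad, N)) for N in (50, 100, 200, 400)]
```

Plotting `sad` against rank on a logarithmic abundance axis, and `curve` against sample size, gives a quick visual impression of how closely the decrease in missing species follows an exponential for a given community size.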

3. The Resulting Restrictive Condition Which Allows the “Chao” Formulation to Become a Centered Estimator of Species Richness

As the algebraic derivation is deterministic in essence, it provides a centered estimate of the number of missing species and of the resulting total species richness, instead of only a lower-bound estimate, as is the case in the nonparametric context. As mentioned above, the algebraic derivation of Chao’s formulation requires that the sampled community satisfies, at least approximately, the particular type of SAD defined by (7) and illustrated in Figure 1. This restrictive condition is the “price” to be paid for the more accurate estimate, namely, the loss of the strict “nonparametric” character of the statistically based conception.

Yet, this condition on the shape of the distribution of species abundances might not be so restrictive in practice, at least as a first approximation. Reasons for this may be as follows: (a) an asymptotic decrease to zero of the number of missing species with increasing sample size (equation (3)) seems logical and intuitive, because aiming to estimate the total number of species in a community implicitly requires that this number exists and can be progressively reached as the sample size increases continuously; (b) among the different types of accumulation curves with such an asymptotic evolution, the negative exponential response to increasing sample size is, admittedly, among the simplest, most robust, and seemingly most common [8, 9]; (c) the sigmoidal shape of the prescribed SAD (equation (7) and Figure 1) is not so far from the most classically cited empirical types, the broken-stick and log-normal distributions [10]. Yet, strict conformity with any empirical model is not expected a priori. For example, the equation for the SAD corresponding to a broken-stick distribution is $a_i = \frac{1}{S}\sum_{j=i}^{S}\frac{1}{j}$ [10] (with $a_i$ the relative abundance of the species of rank $i$ among $S$ species), which is formally different from (7).

Accordingly, it is no real surprise that Chao’s formula often approaches a strictly centered estimate, in spite of being only a lower-bound estimate in general.

This would explain why, in ecological practice in particular, Chao’s formulation is nevertheless considered one of the more accurate and reliable estimators of the total species richness within partially sampled communities.

As mentioned in particular by Gotelli (personal communication), Chao’s estimates still tend to increase somewhat when a series of samples of increasing size is drawn from the same community, instead of ideally remaining stable on average. This, however, does not necessarily contradict the preceding arguments but should result from a residual discrepancy between the real SAD and the ideal model described by (7) and exemplified in Figure 1.
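The behaviour of the estimate across growing sample sizes can be examined by simulation. The following is a minimal sketch under assumptions that are not part of the original study: a hypothetical community of 100 species with geometrically decreasing abundances, nested samples obtained as prefixes of one random draw, and the bias-corrected variant $f_1(f_1-1)/(2(f_2+1))$ of Chao's term, used here only because it stays defined when no doubleton is present.

```python
import random
from collections import Counter

def chao1_richness(sample):
    """Observed richness plus the bias-corrected Chao term f1(f1-1)/(2(f2+1))."""
    per_species = Counter(sample)           # species id -> individuals recorded
    freq = Counter(per_species.values())    # count -> number of species
    f1, f2 = freq[1], freq[2]
    return len(per_species) + f1 * (f1 - 1) / (2 * (f2 + 1))

rng = random.Random(0)                      # fixed seed for repeatability
S = 100                                     # hypothetical true richness
weights = [0.95 ** i for i in range(S)]     # hypothetical abundance profile
individuals = rng.choices(range(S), weights=weights, k=3200)

# Nested samples of growing size drawn from the same community
sizes = (100, 200, 400, 800, 1600, 3200)
estimates = {n: chao1_richness(individuals[:n]) for n in sizes}
```

Comparing `estimates` with the known value of `S` for different abundance profiles shows how the drift of the estimate with sample size depends on the shape of the SAD.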

Appendix

A. An Alternative, Algebraic Derivation of the Chao 1 Estimator

Consider a community of species, actually containing an unknown total number $S$ of species.

Let $p_i$ be the probability of occurrence (assimilated to the relative abundance) of species $i$ within this community and let $N$ be the sample size, that is, the number of individuals recorded in the sample.

The estimated number of species that escape recording during sampling of the community is $\Delta$:

$$\Delta = \sum_i (1 - p_i)^N \qquad (A.1)$$

with $\sum_i$ as the summation extended to the totality of the $S$ species $i$ (either recorded or not).

The number $f_x$ of species recorded $x$ times in the sample is then, according to the binomial distribution:

$$f_x = \binom{N}{x} \sum_i p_i^{x} (1 - p_i)^{N - x} \qquad (A.2)$$
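These two sampling expectations can be evaluated numerically. The sketch below is a minimal example that assumes the relative abundances are known, which is never the case in practice; the function names and the four-species community are hypothetical. It sums the expected frequencies over every possible number of records, including the unrecorded class (zero records), which should give back the total number of species.

```python
from math import comb

def expected_missing(p, N):
    """Expected number of species absent from a sample of N individuals."""
    return sum((1 - pi) ** N for pi in p)

def expected_fx(p, N, x):
    """Expected number of species recorded exactly x times (binomial sampling)."""
    return comb(N, x) * sum(pi ** x * (1 - pi) ** (N - x) for pi in p)

# Hypothetical community of four species (relative abundances sum to 1)
p = [0.5, 0.3, 0.15, 0.05]
N = 20

delta = expected_missing(p, N)
f1 = expected_fx(p, N, 1)
total = sum(expected_fx(p, N, x) for x in range(N + 1))  # should equal len(p)
```

The expected number of singletons also equals the sample size multiplied by the drop in expected missing species between sample sizes N - 1 and N, an exact identity that underlies the finite-difference reasoning of this appendix.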

A.1. Deriving as a Function of

The number of species recorded times in the sample may be derived as a function of .

(i) Consider According to (A.1), it becomes as follows: .

As appropriate sampling requires a size , the difference () can be likened to the corresponding derivative (). Then, where is the first derivative of with respect to . Thus, Similarly,

(ii) consider where is the second derivative of with respect to . Thus,

(iii) Consider where is the number of singletons that would be recorded in a sample of size instead of .

According to (A.5) and (A.7), where is the th derivative of with respect to , at point . Then, where is the third derivative of with respect to . Thus,

(iv) Generalising this approach to calculate the number of species recorded times in the sample: with as the summation from to . It becomes as follows: with as the summation from to ; that is, where and is the number of species recorded times during a sampling of size (instead of size ).

The same demonstration, which previously yielded the expression of above (see (A.9)), applies for the (with up to ) and gives where is the th derivative of with respect to , at point . Then, which finally yields That is, where is the th derivative of with respect to , at point .

A relationship is thus derived between the series of the numbers of species recorded times and the series of derivatives at order for : These relations will serve as the basis for deriving the formulation of as a function of the .
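For orientation, the link between recorded frequencies and derivatives can be sketched compactly. The notation is assumed here for illustration ($\Delta(N)$ for the expected number of missing species at sample size $N$, $p_i$ for the relative abundance of species $i$, $f_x$ for the number of species recorded $x$ times). This is a shortcut route rather than the step-by-step finite-difference argument, and the approximations hold only for rare species ($p_i \ll 1$) and $x \ll N$.

```latex
\begin{align*}
\Delta(N) &= \sum_i (1-p_i)^N = \sum_i e^{N \ln(1-p_i)} \\
\frac{\partial^x \Delta}{\partial N^x}
  &= \sum_i \left[\ln(1-p_i)\right]^x (1-p_i)^N
  \approx (-1)^x \sum_i p_i^x (1-p_i)^{N-x} \\
f_x &= \binom{N}{x} \sum_i p_i^x (1-p_i)^{N-x}
  \approx \frac{N^x}{x!} \sum_i p_i^x (1-p_i)^{N-x} \\
\Rightarrow\quad f_x &\approx (-1)^x \, \frac{N^x}{x!} \,
  \frac{\partial^x \Delta}{\partial N^x}
\end{align*}
```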

A.2. Deriving a Generalised Expression for the “Chao” Formulation

Let us consider now the case when the number of missing species conforms (or is close) to a negative exponential with respect to the sampling size . Accordingly, would thus verify the following series of differential equations: that yields and then , where , , , and are the first, second, third, and derivatives of with respect to .

As and (see (A.5) and (A.18)), it follows for that Note that is since, in practice, remains, by far, quite smaller than . Accordingly, In particular,
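As a worked check of this last step, the sketch below assumes the large-sample relation $f_x \approx (-1)^x (N^x/x!)\,\partial^x\Delta/\partial N^x$ together with a purely exponential decrease $\Delta(N) = S e^{-kN}$. The symbols are chosen here for illustration, and the general-$x$ form shown is simply what this algebra gives under those assumptions, not necessarily the exact layout of the generalised expression in [2].

```latex
\begin{align*}
\frac{\partial^x \Delta}{\partial N^x} &= (-k)^x \, \Delta
  \quad\Rightarrow\quad f_x \approx \frac{(kN)^x}{x!}\,\Delta \\
\frac{f_x^{\,x+1}}{f_{x+1}^{\,x}}
  &\approx \frac{\left[(x+1)!\right]^{x}}{(x!)^{x+1}}\,\Delta
  = \frac{(x+1)^x}{x!}\,\Delta \\
\Rightarrow\quad \Delta &\approx \frac{x!}{(x+1)^x}\,
  \frac{f_x^{\,x+1}}{f_{x+1}^{\,x}},
  \qquad\text{and for } x = 1:\quad \Delta \approx \frac{f_1^{2}}{2 f_2}
\end{align*}
```

The unknown constants $k$ and $S$ cancel out of the ratio, which is why the estimate depends only on the recorded frequencies.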

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

Two anonymous reviewers are gratefully acknowledged for their useful suggestions on a previous version of the paper. Also, Anne Chao and Nick Gotelli provided encouraging comments on the original version of the paper.