Abstract

We derive some simple relations that demonstrate how the posterior convergence rate is driven by two factors: a “penalized divergence” of the prior, which measures the ability of the prior distribution to propose a nonnegligible set of working models to approximate the true model, and a “norm complexity” of the prior, which measures the complexity of the prior support, weighted by the prior probability masses. These formulas are explicit, involve essentially no assumptions, and are easy to apply. We apply this approach to the case of model averaging and derive some useful oracle inequalities that optimize the performance adaptively, without knowledge of the true model.

1. Introduction

In Jiang [1], there are some general results on the posterior convergence rate, which are very simple and easy to apply. The current paper is related to, and developed from, the ideas of Jiang [1]. There is no essentially new idea behind the proofs. However, the results are much simplified compared with the earlier groundbreaking works in this area, such as Ghosal et al. [2], Walker [3], and Ghosal et al. [4], so that the current work can be applied much more easily and displays more directly the intrinsic driving factors behind the convergence rate.

The current paper cannot be used to derive new convergence rates better than those achievable in the existing literature, except that the current convergence rate is described in a -divergence for any (defined later), which is more general than the squared Hellinger distance (corresponding to ). A recent work by Norets [5] has obtained convergence rates in the stronger Kullback-Leibler divergence (corresponding to the limit ); however, the rate in is then much worse than the rate in (e.g., about in parametric cases, instead of about ). Interestingly, although we allow any for , the rate stays at about , which is essentially as good as in the Hellinger case at and does not deteriorate to the Kullback-Leibler limit at .

Aside from this technical difference in the divergence measures used, the difference from the previous works is essentially aesthetic. The key difference is that the previous results are presented as bounds on the posterior probability (outside a neighborhood of the true density), while the current paper presents almost sure bounds on the distance (or divergence) from the true density directly. Applications of the previous works sometimes require guessing a convergence rate and then checking that it simultaneously satisfies several inequality conditions, while the current paper presents explicit formulas that are essentially assumption free.

2. Main Results

Let denote the observed data generated from a probability distribution . Consider a prior distribution supported on a set of densities . Define for a subset the posterior probability as . Denote . Let be a countable convex cover of , so that , is convex, and is a countable set.

Define the divergence between densities and , with respect to a suitable dominating measure, for . (The Kullback-Leibler divergence corresponds to the limiting case ; the squared Hellinger distance corresponds to the case ; the divergence corresponds to the case .) Then we have the following result.
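The explicit form of this divergence family did not survive in this copy; as a purely illustrative sketch (the grid, the two normal densities, and the 1/2 normalization convention are all assumptions, not the paper's definitions), the snippet below numerically evaluates one member that the text does name, the squared Hellinger distance.

```python
# Illustrative sketch (assumed example densities): numerically evaluate the squared
# Hellinger distance, one member of the divergence family named in the text, on a grid.
import numpy as np
from scipy.stats import norm

x = np.linspace(-10.0, 10.0, 20001)      # fine grid on the real line
dx = x[1] - x[0]

p0 = norm.pdf(x, loc=0.0, scale=1.0)     # a hypothetical "true" density p0
p1 = norm.pdf(x, loc=0.5, scale=1.2)     # a hypothetical candidate density p

# squared Hellinger distance (one common convention): 0.5 * int (sqrt(p0) - sqrt(p))^2
hellinger_sq = 0.5 * np.sum((np.sqrt(p0) - np.sqrt(p1)) ** 2) * dx
print(f"squared Hellinger distance ≈ {hellinger_sq:.4f}")
```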

Proposition 1. For any and any , for any , one has

This result requires essentially no assumptions. The relation displayed is explicit. In the result, , are regarded as probability densities of the entire data set .

Now consider the case with the iid assumption, so that the data , being iid (independent and identically distributed), are generated from a density for a single copy . Let be a prior distribution supported on a set of densities of . Consider any with any of its countable convex covers . Using relations such as and , for any real , the previous result becomes as follows.

Proposition 2. For iid data with sample size , for any , any , one has

The only essential assumption here is that the data are iid. The relation displayed is explicit.
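The specific relations invoked just before Proposition 2 are garbled in this copy; the display below is only a standard identity for iid products of the kind presumably meant (a sketch written under the iid factorization assumption, not the author's exact relations).

```latex
% A standard identity for iid data (an assumed illustration): the joint density
% factorizes, so power-type integrals factorize as well.
\[
  p^{(n)}\bigl(x^{(n)}\bigr) = \prod_{i=1}^{n} p(x_i),
  \qquad
  \int \Bigl(\tfrac{p^{(n)}}{p_0^{(n)}}\Bigr)^{t} p_0^{(n)} \, d\mu^{(n)}
  = \Bigl( \int \bigl(\tfrac{p}{p_0}\bigr)^{t} p_0 \, d\mu \Bigr)^{n},
  \quad t \in \mathbb{R}.
\]
```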

We will now consider a sequence of densities for iid data, which are generated from the posterior distributions based on iid data with increasing sample sizes , and study how they converge to the true density .

Condition 1 (“posterior sequence” of random densities for iid data). A “posterior sequence” (labeled by sample size ) of random density functions in , in a probability space, satisfies, for any subset , the displayed requirement. Here, is a set of density functions, is the prior distribution of , and is the fraction in the integrand, which can be regarded as the posterior distribution of based on the iid data .

At any fixed sample size , this probability law is equivalent to assuming that is sampled from the posterior given data , and is an iid sample of with density . We will often omit the superscript and write .

Suppose can be covered by a finite number of the 's, each being an ball with radius . Then the following result can be obtained.

Proposition 3. Consider a “posterior sequence” of densities for iid data satisfying Condition 1. For any and any , with probability 1, for all sufficiently large sample sizes , one has where and is the support of the prior .

Here we define a “penalized divergence” , for some divergence , of a prior from the true density .

Remark 4. The result can be extended to continuously valued . This is because the divergence is monotonically increasing in . For any such that is not an integer, we can use the more stringent divergence with being the next larger value in the integer range to bound the convergence rate in .

Remark 5. In this and other works, we notice that the convergence rate results often involve a quantity similar to the “penalized divergence” of the form , related to a prior . The first part describes the maximal divergence of a set (proposed by a prior ) from . We can understand this part as the approximation error of the prior when it is used to propose densities to approximate a true density . The second part penalizes an unlikely set with a small prior . Combining the two parts, we can interpret as the approximation error (away from ) of a not-too-unlikely set proposed by the prior . This “penalized divergence” is a critically important driving factor for determining the convergence rates in the previous results. It is noted that although this factor corresponds to the approximation ability of , it already has a complexity penalty built into it implicitly. This comes from the penalty against a small prior: the second part is , which is, roughly speaking, about , where is the number of parameters proposed by the prior (e.g., for a uniform prior , for a small -dimensional cube with volume , we have ).
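A minimal numerical sketch of the last point, under an assumed uniform prior on the unit cube (an example consistent with, but not quoted from, the text): the penalty term, minus the log prior mass of a small cube with side length delta, grows linearly in the dimension m, like m log(1/delta).

```python
# Sketch (assumed uniform prior on [0, 1]^m): the prior mass of a small cube B with
# side length delta is delta**m, so -log(prior mass of B) = m * log(1/delta).
import math

def penalty_uniform_cube(m: int, delta: float) -> float:
    """-log prior mass of an m-dimensional cube of side delta under Uniform([0, 1]^m)."""
    prior_mass = delta ** m
    return -math.log(prior_mass)

delta = 0.01
for m in (1, 5, 20):
    print(m, penalty_uniform_cube(m, delta), m * math.log(1.0 / delta))  # the two agree
```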

Remark 6. The other factor behind the convergence rate is related to the complexity of the model, which is proportional to , where is some number that increases with the number of small convex balls needed to cover the prior support of the model. Typically, this “complexity factor” is roughly about , up to some logarithmic factors, where is the dimension of the parameters involved in the prior. It is noted, however, that with model averaging the higher dimensional models can be downweighted by the model prior, so that effectively one can make of order for this complexity factor, and the convergence rate will then be controlled by the first factor (the “penalized divergence”) alone.
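A small sketch of how such a complexity factor scales, in an assumed parametric setup that is not taken from the paper: covering a bounded d-dimensional parameter cube by sup-norm balls of radius eps needs on the order of (L / (2 eps))^d balls, so the log covering number grows linearly in the dimension d, up to a logarithmic factor in 1/eps.

```python
# Sketch (assumed setup): log covering number of [0, L]^d by sup-norm balls of radius eps
# grows like d * log(L / (2 * eps)), i.e. proportionally to the parameter dimension d.
import math

def log_covering_number(d: int, L: float, eps: float) -> float:
    """log of a simple upper bound on the number of sup-norm eps-balls covering [0, L]^d."""
    n_per_axis = math.ceil(L / (2.0 * eps))   # intervals of length 2*eps along each axis
    return d * math.log(n_per_axis)

n = 10_000
eps = 1.0 / math.sqrt(n)                      # a radius shrinking with the sample size
for d in (1, 5, 20):
    print(d, log_covering_number(d, L=1.0, eps=eps) / n)   # roughly d * log(n) / (2 * n)
```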

The convergence rate result in Proposition 3 can be extended to the case of model averaging, when the prior is , jointly over a model index in a set of nonoverlapping models and a density (the support of the prior of model ), with posterior for an event . (We assume nonoverlapping models for simplicity, where for any two different model indexes and ; this is only a technical convention for defining the prior supports, which typically does not affect real applications; see, e.g., Section 3.1.) In this case, let be the number of balls of radius needed to cover the prior support under model . Then we have the following, under the iid assumption.

Proposition 7. Consider a “posterior sequence” of densities for iid data satisfying Condition 1. For any and any , with probability 1, for all sufficiently large sample sizes , one has where and are the supports of the mixing prior and the model- prior , respectively.

This is an oracle inequality: the bound on the right-hand side achieves the best performance over all the models . Again, the convergence rate is displayed explicitly, and we will explain the driving factors of the convergence rate later. This is unlike the previous works, where one has to conjecture a rate and check that it satisfies many conditions.

So far, we have assumed the existence of a finite covering number for the prior support, such as in Proposition 3 or in Proposition 7. These covering numbers determine the “complexity factor” commented on in Remark 6. A deeper analysis regards this “complexity factor” as an upper bound for a better complexity measure related to the prior itself, developing an idea pioneered by Walker [3].

Remark 8. The complexity in Remark 6 is not satisfactory when the prior support is unbounded and the covering number is infinite. However, the proofs of the propositions can be easily adapted to show that the covering number can be replaced by from Proposition 2, where we have relaxed to be a cover of the entire prior support of , and we have freedom in choosing the cover . Therefore, we can define a quantity that is related to the prior itself. Let be the infimum of the norm over all such covers of , where each is an ball of radius . We may name it the “-norm prior complexity” for covering the prior support. An unbounded prior support may still be coverable by infinitely many 's, so that is finite, even with an infinite covering number . Then we have a better way of formulating a bound corresponding to Proposition 3: where is the “-norm complexity” of this prior, as defined in this remark.

Remark 9. We now describe heuristically how to bound the “norm complexity” defined in the previous remark in parametric models, where the densities are parameterized by a -dimensional parameter , and a prior on induces a prior on the densities in . A more rigorous treatment is given in the example of Section 3.2. In typical situations with some smoothness conditions on the densities, we can relate the distance between two densities and to the maximal norm : for some constant . Then, to cover the parameter space, we can use balls 's in the parameter space with radius , so that the corresponding densities cover the -ball with the required radius . These sets , with small volumes , can be used to form a fine partition of the parameter space, so that the norm , where is the prior density function evaluated at some intermediate point in the set . The sum in the square brackets is a Riemann sum over a fine grid, which we will assume can be approximated by an integral under some regularity conditions, even if the domain may be unbounded. Therefore, we have an upper bound on the norm complexity of the form for all large enough . Assume that the prior density is integrable over the parameter space and that the norm scales as , as in the case of an iid prior . Then the complexity term in the bound of Remark 8 can be derived to be , which increases with the dimension .
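A tiny numerical sketch of the Riemann-sum step above, with assumed ingredients (a standard normal prior density on an unbounded one-dimensional parameter space and cells of width 2 eps): the sum of prior density times cell volume over a fine partition approaches the integral of the density, which is 1.

```python
# Sketch (assumed standard normal prior on an unbounded parameter space): a Riemann sum
# of the prior density over cells of width 2*eps approaches the integral (= 1) as eps -> 0.
import math

def riemann_sum_normal(eps: float, half_width: float = 10.0) -> float:
    cells_per_side = int(half_width / (2.0 * eps))      # cells on each side of the origin
    total = 0.0
    for k in range(-cells_per_side, cells_per_side):
        theta = (k + 0.5) * 2.0 * eps                   # midpoint of the cell
        density = math.exp(-0.5 * theta ** 2) / math.sqrt(2.0 * math.pi)
        total += density * 2.0 * eps                    # density * cell volume
    return total

for eps in (0.5, 0.1, 0.01):
    print(eps, riemann_sum_normal(eps))
```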

Remark 10. Similar to Remark 8, we have a better way of formulating a bound corresponding to Proposition 7: where is the “-norm complexity” of this prior , which in this case is the infimum of the norm over all such covers of , where each is an ball of radius , and, under each model , represents a cover of its prior support using possibly infinitely many balls. The defining expression of can also be related to the norm complexities of all the conditional priors given the model choices: . With model averaging using some suitable weights , this term and its effect on the convergence rate no longer diverge with the complexity of the model, in contrast to the conclusion of Remark 9. The convergence rate is then mainly determined by the penalized divergence . The example below (in its second part) illustrates this.

3. A Simple Example for Illustration

This is a simple binary regression example intended for illustration. We will see that model averaging can be used to derive nearly optimal convergence rates that adapt to the assumptions on the true model. In the first part, we illustrate how to bound the penalized divergence for a uniform prior with a bounded support. In the second part, we illustrate how to bound the norm complexity when the prior has an unbounded support.

3.1. When the Prior Has a Bounded Support

Consider a binary regression model , , where the true conditional mean function is denoted as (for some small positive ) and is bounded away from 0 and 1, for any value of . We consider an -piecewise constant working model for the mean function . Suppose the prior is , where indicates the -piecewise constant model . We consider an independent uniform prior . (To technically define the prior supports to be nonoverlapping for different models, one can further require the 's to be mutually distinct. The resulting prior is unchanged almost everywhere and does not affect the later discussion.)
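As a purely illustrative sketch of a prior of this kind (the uniform model prior over the number of pieces J, the equal-length pieces, and the Uniform(eps, 1 - eps) heights are assumptions filling in details that did not survive in this copy), one can sample a piecewise-constant mean function and binary responses as follows.

```python
# Sketch (assumed prior specification, not the paper's): sample a piecewise-constant
# mean function f from a model-averaging prior, then binary responses Y ~ Bernoulli(f(X)).
import numpy as np

rng = np.random.default_rng(0)
eps = 0.05                                          # heights kept away from 0 and 1

def sample_mean_function(max_J: int = 50):
    J = rng.integers(1, max_J + 1)                  # assumed model prior: uniform on 1..max_J
    heights = rng.uniform(eps, 1.0 - eps, size=J)   # uniform prior on each piece's level
    def f(x):                                       # piecewise constant on J equal pieces of [0, 1)
        return heights[np.minimum((x * J).astype(int), J - 1)]
    return J, f

J, f = sample_mean_function()
x = rng.uniform(0.0, 1.0, size=10)
y = rng.binomial(1, f(x))                           # binary regression data from this draw
print(J, np.round(f(x), 3), y)
```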

We will consider two different setups of the true model.

Setup (dense true model). In the first setup, the true has a continuous derivative bounded by . We call this a “dense” setup since we may need a large piecewise constant model (with large, increasing with the sample size ) to approximate this quite arbitrary true mean function .
To apply Proposition 7, we will use and . In the present case, we have , , if and . We will sometimes use the mean function 's to denote the corresponding distances as and so on.
The approximation property of the piecewise constants implies that, for any true , there exists a in the support of so that everywhere. In fact, we can take where . Now let be close to , in the sense that and , , for some small . Take to be the set of all such densities of . Then everywhere, since and for all .
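A quick numerical check of this approximation step, with assumed stand-ins for the unspecified quantities (a particular Lipschitz true mean function and midpoint values on J equal-length pieces): the sup-norm error of the piecewise-constant approximation decays like a constant over J.

```python
# Illustrative check (assumed example): a J-piece piecewise-constant approximation of a
# Lipschitz mean function f0 on [0, 1], using the value of f0 at each piece's midpoint,
# has sup-norm error bounded by roughly (Lipschitz constant) / (2 * J).
import numpy as np

def f0(x):                                     # an assumed smooth true mean in (0, 1)
    return 0.3 + 0.4 * np.sin(2 * np.pi * x) ** 2

x = np.linspace(0.0, 1.0, 100_001)
for J in (5, 20, 80):
    idx = np.minimum((x * J).astype(int), J - 1)       # which piece each x falls in
    midpoints = (idx + 0.5) / J
    f_J = f0(midpoints)                                 # piecewise-constant approximation
    print(J, np.max(np.abs(f_J - f0(x))))               # sup error shrinks like ~ 1/J
```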
Let and be the densities corresponding to and , respectively. Then , due to the triangle inequality. The prior probability over is . Therefore, the “penalized divergence” satisfies We will take for some large enough constant and apply Proposition 7. (It can be shown, by establishing , that this choice makes the “complexity” term negligible compared to ; we omit the tedious details here.)
We can take and to obtain an upper bound on the . Therefore, the “penalized divergence” and the resulting convergence rate are both of order , which is within a factor of the minimax optimal result. It is noted that model averaging automatically achieves this nearly optimal rate.
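The following toy calculation sketches the approximation/penalty tradeoff behind this kind of choice, under assumed scalings (a squared sup-norm approximation error of order 1/J² and a penalty of order J log n / n); the constants, and the exact rate claimed in the text, are not taken from the paper.

```python
# Toy tradeoff sketch (assumed scalings, not the paper's exact expressions): minimize an
# approximation term that shrinks in the number of pieces J against a prior-mass penalty
# that grows in J; the minimizer balances the two without knowing the true smoothness.
import math

n = 100_000

def bound(J: int) -> float:
    approx = (1.0 / J) ** 2            # assumed squared sup-norm approximation error
    penalty = J * math.log(n) / n      # assumed -log(prior mass of a good set) / n
    return approx + penalty

best_J = min(range(1, 1000), key=bound)
print(best_J, bound(best_J), (math.log(n) / n) ** (2.0 / 3.0))   # same order of magnitude
```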

Setup (sparse true model). Consider a second setup, where we assume that the true model is a -piecewise constant, where we do not know the value . We call this a sparse case since we only need an -piecewise constant model to approximate the true mean function perfectly, where can be much smaller than the choice of in Setup 1.
Then, modifying the above reasoning, we can bound the infimum in the “penalized divergence” by taking and , and obtain (using ) the following: This, and therefore also the resulting posterior convergence rate, is close to the parametric rate , as if we knew beforehand.

In summary, the prior is in the sense that in either the dense or the sparse case, the resulting posterior distribution works nearly optimally, even if we do not really know whether the true model is dense or sparse.

3.2. When the Prior Has an Unbounded Support

In the example above, we considered a uniform prior for with a bounded support. In this subsection, we consider a parametrization on the log-odds scale, with an unbounded prior support, to illustrate how to calculate the norm complexity described in Remarks 8, 9, and 10. The model is still , , where the true , denoted as , is bounded away from 0 and 1. We consider an -piecewise constant working model for the mean function . Suppose the prior is iid for each “log-odds” parameter supported on ; then

Consider two densities of : , with parameters 's and 's, respectively. Then one can easily derive the following relationship for the distance: . To cover all the densities in this working model with balls of radius (), we can use the densities with parameters in balls , with radius .
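The exact distance relation here did not survive extraction; as a hedged illustration of why parameter balls translate into density balls under this log-odds parametrization, the snippet below checks numerically that the logistic link is Lipschitz with constant 1/4, so that the mean functions (and hence, under mild conditions, the corresponding densities) move by at most a constant times the sup-norm change in the log-odds parameters.

```python
# Illustrative check (assumed ingredient, not the paper's displayed relation): the
# logistic link sigma(t) = 1 / (1 + exp(-t)) has derivative at most 1/4, so
# |sigma(a) - sigma(b)| <= |a - b| / 4.  Hence a sup-norm ball of radius 4*eps in the
# log-odds parameters maps into a sup-norm ball of radius eps for the mean functions.
import numpy as np

rng = np.random.default_rng(1)

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

a = rng.normal(scale=3.0, size=100_000)
b = a + rng.uniform(-0.5, 0.5, size=a.shape)          # perturbed log-odds parameters
ratio = np.abs(sigma(a) - sigma(b)) / np.abs(a - b)
print(ratio.max())                                     # never exceeds 0.25
```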

Then , since the priors of are iid.

Now assume that the prior density for each (given model ) is continuous, symmetric, and decreasing away from the origin, for some decreasing functions (which can be satisfied by, e.g., independent priors or double-exponential densities).

Then , where we used the decreasingness of in the last two steps. The integral exists since .

Then we have , which is finite despite the unbounded prior support. Then, according to Remark 10, will be if for some large enough constant . So the norm complexity term in Remark 10 is of order , which, compared with the last formula in Remark 9, behaves as if the dimension has been reduced to order by model averaging. Therefore, the norm complexity term does not affect the convergence rate significantly, due to model averaging, and the convergence rate is mainly determined by the penalized divergence . The bounding of the penalized divergence is similar to the example discussed in the previous subsection, and we omit the details. The resulting convergence rates are essentially the same as when the uniform priors (with bounded supports) are used, despite the fact that we now allow priors with unbounded supports (such as normal priors in the log-odds parametrization).

4. Proofs

Proof of Proposition 1. for any “test” as a function of valued in . Consider where represents the expectation under the true density .
due to Fubini’s Theorem, where represents the expectation under density .
Using Markov's inequality, for any , we have
All these combine to ()
Now apply a result that is a straightforward extension of Ghosal et al. ([4], Lemma 6.1). For any convex set , there exist such that for any , , , .
Therefore, we can find so that
Given any , we can choose so that in the above statement.
If we take to be a convex cover of and define a combined test and plug it into (), we have
Therefore, we have
Take and
We get , where . Then, noticing that , we obtain the proof of Proposition 1. (For notational convenience, the densities appearing in this proof, such as and , are the densities for the entire data set and , respectively, and we do not assume iid.)

Proof of Proposition 2. Use the fact that, for any ,
This leads to the proof by applying Proposition 1.

Proof of Proposition 3. This is a special case of Proposition 7, where focuses on only one model.

Proof of Proposition 7. Repeat the proofs of Propositions 1 and 2 for the case with model averaging, with the support of prior being , where is the support of .
Suppose the convex cover of is doubly indexed as , where has cardinality at most and is a convex cover of the support . Then the result in Proposition 2 holds with
In Proposition 2, let .
Suppose all the convex sets are such that . Then we have () where is an upper bound for the number of convex sets needed to cover .
Now we try to define the convex sets in more detail. They are used to cover , so without loss of generality, each contains a point in , say , which is not close to since . If is small, so that any two points are close together, then any point in (which may fall outside ) can also be made not close to , so that for some related to . This would be easy to establish by a triangle inequality, were it not for the difficulty that the divergence is not a true distance for . So we would not be able to say, for example, that should be a small -ball.
To resolve this difficulty, we derive the following inequalities: due to Hölder's inequality, and for for any , we have , so that
Therefore, if is an ball with small enough radius , then any point in has a not-too-small divergence . We can take . The balls are also convex, as required. We will take ; then the radius is . Therefore, we conclude that we take the 's to be small balls with radius .
Now we try to find , an upper bound on the number of such small balls needed to cover . We can use the upper bound , the number of balls of radius needed to cover the entire prior support under model .
Then () implies, for any , (), where and is the support of the prior.
Let the probability in () be bounded by under a choice of for the right-hand side of (). Then the event will happen for all large enough , almost surely, due to the Borel-Cantelli lemma.
Then
This leads to
The quantity , under a mixture prior , can be bounded by taking , for any , as .

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was based on Technical Report 14-01, Department of Statistics, Northwestern University. The author thanks the reviewers for their helpful comments. The author also thanks Qilu Securities Institute for Financial Studies, Shandong University, for the hospitality during his visit, when a revision of this work was done.