Hierarchical Models and Tuning of Random Walk Metropolis Algorithms
We obtain weak convergence and optimal scaling results for the random walk Metropolis algorithm with a Gaussian proposal distribution. The sampler is applied to hierarchical target distributions, which form the building block of many Bayesian analyses. The global asymptotically optimal proposal variance derived may be computed as a function of the specific target distribution considered. We also introduce the concept of locally optimal tunings, i.e., tunings that depend on the current position of the Markov chain. The theorems are proved by studying the generator of the first and second components of the algorithm and verifying their convergence to the generator of a modified RWM algorithm and a diffusion process, respectively. The rate at which the algorithm explores its state space is optimized by studying the speed measure of the limiting diffusion process. We illustrate the theory with two examples. Applications of these results on simulated and real data are also presented.
Random walk Metropolis (RWM) algorithms are widely used to sample from complex or multidimensional probability distributions [1, 2]. The simplicity and versatility of these samplers often make them the default option in the MCMC toolbox. Implementing a RWM algorithm involves a tuning step, to ensure that the process explores its state space as fast as possible and that the sample produced be representative of the probability distribution of interest (the target distribution). In this paper, we solve an aspect of the tuning problem for a large class of target distributions with correlated components. This issue has mainly been studied for product target densities, but attention has recently turned towards more complex target models [3, 4]. The specific type of target distribution considered here is formed of components which are related according to a hierarchical structure. These distributions are ubiquitous in several fields of research (finance, biostatistics, and physics, to name a few) and constitute the basis of many Bayesian inferences.
Bayesian hierarchical models comprise a likelihood function , which is the statistical model for the observed data . The parameters are then modeled using a prior distribution ; since this prior might not be easy to determine, it is common practice to assume that the hyperparameters are themselves distributed according to a noninformative prior distribution . The various models thus represent different levels of hierarchy and give rise to a posterior distribution , which is often quite complex. Most of the time, this distribution cannot be studied analytically or sampled directly, and thus simulation algorithms such as MCMC methods are required to perform a statistical analysis. Samplers such as the RWM, RWM-within-Gibbs, and Adaptive Metropolis (see ) are usually the default algorithms for such targets.
The idea behind RWM algorithms is to build a Markov chain having the Bayesian posterior (target) distribution as its stationary distribution. To implement this method, users must select a proposal distribution from which are generated candidates for the Markov chain. This distribution should ideally be similar to the target, while remaining accessible from a sampling viewpoint. A pragmatic choice is to let the proposed moves be normally distributed around the latest value of the sample. Tuning the variance of the normal proposal distribution () has a significant impact on the speed at which the sampler explores its state space (hereafter referred to as “efficiency”), with extremal variances leading to slow-mixing algorithms. In particular, large variances seldom induce suitable candidates and result in lazy processes; small variances yield hyperactive processes whose tiny steps lead to a time-consuming exploration of the state space. Seeking for an intermediate value that optimizes the efficiency of the RWM algorithm, i.e., a proposal variance offering sizable steps that are still accepted a reasonable proportion of the time, is called the optimal scaling problem.
The optimal scaling issue of the RWM algorithm with a Gaussian proposal has been addressed by many researchers over the last few decades. It has been determined in  that target densities formed of independent and identically distributed (i.i.d.) components correspond to an optimal proposal variance , where is the density of one target component and the number of target components. This optimal proposal variance has also been shown to correspond to an optimal expected acceptance rate of 23.4%, where the acceptance rate is defined as the proportion of candidates that are accepted by the algorithm. Generalizing this conclusion is an intricate task, and further research on the subject has mainly been restricted to the case of target distributions formed of independent components (see [7–12]). In the specific case of multivariate normal target distributions however, the optimal variance and acceptance rate may be easily determined (see [11, 13]). Lately, [3, 4] have also performed scaling analyses of nonproduct target densities. These advances are important, as MCMC methods are mainly used when dealing with complex models, which only rarely satisfy the independence assumption among target components. These results however assume that the correlation structure among target components is known and used in generating candidates for the chain. This is a restrictive assumption that leads, as expected, to an optimal acceptance rate of 23.4% (see  for an explanation).
In this paper, we focus on solving the optimal scaling problem for a wide class of models that include a dependence relationship, the hierarchical distributions. Weak convergence results are derived without explicitly characterizing the dependency among target components, and thus rely on a Gaussian proposal distribution with diagonal covariance matrix. The optimal proposal variance may then be obtained from these results, i.e., by maximizing the speed measure of the limiting diffusion process. This constitutes significant advances in understanding the theoretical underpinnings of the RWM sampler. More importantly in practice, the results theoretically support the use of RWM-within-Gibbs over RWM samplers and provide a convenient approach for obtaining a new type of proposal variances. These proposal variances are a function of the current state of the Markov chain; they thus evolve with the chain and lead to more appropriate candidates in the RWM-within-Gibbs algorithm.
In the next section, we describe the target distribution and introduce some notation related to the RWM sampler. The theoretical optimal scaling results are stated in Section 3 and then illustrated with two examples using RWM samplers in Section 4. In Section 5, the potential of RWM-within-Gibbs with local scalings is illustrated in Bayesian contexts through a simulation study and an application on real data. Extensions are briefly discussed in Section 6, while appendices contain proofs.
Consider an -dimensional target distribution consisting of a mixing component and conditionally i.i.d. components () given . Suppose that this distribution has a target density with respect to Lebesgue measure, where
To obtain a sample from the target density in (1), we rely on a RWM algorithm with a Gaussian proposal distribution. This sampler builds an -dimensional Markov chain having as its stationary density. Given , the time- state of the Markov chain, one iteration is performed according to the following steps:(1)Generate a candidate from a , where is a diagonal variance matrix with elements . In particular, set , where is a tuning parameter and the -dimensional identity matrix(2)Compute the acceptance probability (3)Generate (4)If , accept the candidate and set ; otherwise, the Markov chain remains at the same state for another time interval and
Optimal scaling results widely rely on the use of Gaussian proposal distributions which, due to their symmetry, lead to a simplified form of the acceptance probability. Although generally not emphasized in the literature, we note that the proposal variance could also be a function of , which would result in a nonhomogeneous random walk sampler. In that case, there would be no simplification in the Metropolis–Hastings acceptance probability and Step 2 would then replaced bywhere is the density of a .
In what follows, we work towards finding the optimal value of , i.e., leading to an optimally mixing chain. The proofs of the theoretical results rely on CLTs and LLNs; as such, the results are obtained by letting . This is a common approach in MCMC theory and does not prevent users from applying the asymptotically optimal value of in lower dimensional contexts (as small as or 15). Indeed, a particularity of optimal scaling results is that the asymptotic behaviour kicks in extremely rapidly, as shall be witnessed in the examples of Section 4.
The first thought of most MCMC users when facing a target density as in (1) would be to use a RWM-within-Gibbs algorithm, which consecutively updates subgroups of the components in a given iteration. The tuning of RWM-within-Gibbs algorithms has been addressed in , but only for target distributions with i.i.d. components and Gaussian targets with correlation. Focusing on RWM algorithms is thus a good starting point to understand the behaviour of samplers applied to hierarchical target distributions. The results expounded in this paper lead to the concept of local tunings, which is particularly appealing in the context of RWM-within-Gibbs. Incidentally, the proofs in appendices provide a theoretical justification for the use of locally optimal scalings in RWM-within-Gibbs (see ). These findings are illustrated in the examples of Section 5.
In Sections 2.1, 2.2, and 3, we expound how to obtain asymptotically optimal variances and for RWM and RWM-within-Gibbs, respectively. Section 2.1 describes the regularity conditions imposed on , while Section 2.2 explains why the proposal matrix is the optimal choice for obtaining the theoretical results that shall be presented in Section 3.
2.1. Assumptions on the Target Density
To characterize the asymptotic behaviour of the conditionally i.i.d. components (), we impose some regularity conditions on the densities and in (1). The density is assumed to be a continuous function on, with forming an open interval.
For all fixed , is a positive density on and is Lipschitz continuous with constant such that . Here, denotes the space of real-valued functions with continuous second derivative. For all fixed , is a function on and is Lipschitz continuous with constant such that . Furthermore,and, hereafter, the notation means that the expectation is computed with respect to conditionally on the other variables in the expression; the first expectation in (3) is thus obtained according to the conditional distribution of given . Where there is no confusion possible, shall be used to denote an expectation with respect to all random variables in the expression. The above regularity conditions constitute an extension of those stated in  for target distributions with independent components and are weaker than would be a Lipschitz continuity assumption on the bivariate function . They also imply that the Lipschitz constants and themselves satisfy a Lipschitz condition.
We now impose further conditions on to account for the movements of the coordinate when studying the asymptotic behaviour of a component (). These movements should not be too abrupt, so for almost all fixed , is Lipschitz continuous with constant such that and
Finally, in order to characterize the asymptotic behaviour of the mixing component , we introduce assumptions that are closely related to the Bernstein von Mises Theorem. Let , , and denote convergence in probability. Assume that , and denote such that as , with . Hereafter, we make a small abuse of notation by letting and sometimes denote the random variable or the realisation, depending on the context. Furthermore, define ; for almost all , the conditional density of given , , is assumed to converge almost surely to , a continuous density on with respect to Lebesgue measure. In fact, the information on increases linearly in , meaning that the limiting density of is degenerate, but that a standard rescaling leads to a nontrivial density on (normal distribution).
2.2. Form of the Proposal Variance Matrix
In Section 3, we focus on deriving weak convergence and optimal scaling results for the RWM algorithm with a Gaussian proposal by letting , the dimension of the target density in (1), approach . Traditionally, asymptotically optimal scaling results have been obtained by studying the limiting path of a given component ( say) as . In the case of target distributions with i.i.d. components (and some extensions), the components of the RWM algorithm are asymptotically independent of each other and their limiting behaviour is regimented by identical one-dimensional Markovian processes. In the current correlated framework, we expect the presence of an asymptotic dependence relationship among () and , in the spirit of (1). In the following section, we thus study the limiting behaviour of components and separately, on their respective conditional space. This approach allows us to quantify the mixing rate of each component conditionally on the others and to propose optimal scalings for the sampler.
To obtain nontrivial limiting processes describing the behaviour of the RWM sampler as , we need to fix the form of the proposal scalings . Whilst the proposals are independent, a single accept-reject step is used, which makes the paths of the components dependent. We aim to choose the maximal scalings that avoid a degenerate limit (of either 0 or 1) for this acceptance probability. Since the distribution of conditional on contracts at a rate of , if , the proposed jumps in will be too large. If , then the change in makes no contribution to the acceptance probability in the limit; to maximise movements, we, therefore, require . Now, the conditional distribution of given does not contract with . Nonetheless, when proposing jumps in using , the odds of rejecting an -dimensional candidate increase with and lead to a degenerate (null) acceptance probability. To overcome this problem, we then let the proposal variance be a decreasing function of the dimension. In fact, since Lipschitz conditions control the contribution to the accept-reject ratio coming from the movements of , a similar argument to that which leads to in the case of i.i.d. targets applies again here. We therefore set , where is a tuning parameter and the -dimensional identity matrix.
As , it becomes necessary to speed up time to compensate for the reduced movement along components . The time interval between each proposed candidate is thus set to , and we study the continuous-time, sped up version of the initial Markov chain defined as , where is the floor function. Similarly to the i.i.d. case, a limiting diffusion is obtained for the rescaled one-dimensional process related to (), but this time its behaviour is conditional on .
Since the first coordinate converges to a point , a transformation is required to obtain the limiting behaviour of this component. We thus study the continuous-time process ; in other words, we are now looking at a magnified, centered version of the path associated to . This transformation leads to proposal distributions and , with ; it thus cancels the effect of in . Without the speed up of time, the limiting process for is then a propose-accept-reject on the conditional density for , given the current values of ; this is made precise in Theorem 1. When considering the diffusion limit for with time sped-up, this effectively means that at every instant, is simply a sample from its conditional distribution given the current values of ; this is made precise in Theorem 2.
We note that an alternative scaling of could also be applied. The sped-up limiting process would then be a diffusion for all coordinates and would be easier to describe. However, this would also be a deliberate handicapping of the algorithm since the change in would make no contribution to the acceptance probability in the limit. A suboptimal , besides altering the movements of , would thus also indirectly affect the efficiency according to which explores its state space.
3. Asymptotics of the RWM Algorithm
In this section, we introduce results about the limiting behaviour (as ) of the time- and scale-adjusted univariate processes and (). From these results, we determine the asymptotically optimal scaling (AOS) values and acceptance rate (AOAR) that optimize the mixing of the algorithm.
Hereafter, we let denote weak convergence in the Skorokhod topology and a Brownian motion at time ; the cumulative distribution function of a standard normal random variable is denoted by .
Theorem 1. Consider a RWM algorithm with proposal distribution used to sample from a target density as in (1). Suppose that satisfies the conditions on and specified in Section 2.1 and that is distributed according to in (1).
Ifwiththen the magnified process . Here, and () are distributed according to the densities and , respectively, which implies that is distributed according to the density in Section 2.1. Given the time- state , the process evolves as the continuous-time version of a special RWM algorithm applied to the target density ; the proposal distribution of this algorithm is a , and the acceptance rule is defined as
Proof. See Appendix A.1.
This result describes the limiting path associated to the coordinate as , which is Markovian with respect to the history of the multidimensional chain . We recall that the conditional distribution of given contracts at a rate of and that . Conditionally on , the transformed thus mixes according to and explores its conditional state space much more efficiently than the other components, as shall be witnessed in Theorem 2. The asymptotic process found can be described as an atypical one-dimensional RWM algorithm, whose acceptance rule and target density both vary according to at every iteration. The acceptance function in (7) satisfies the reversibility condition with respect to (see  for more details about this acceptance function).
Theorem 1 is interesting from a theoretical perspective but cannot be used to optimize the global mixing of the algorithm. Although we could try to determine the value of leading to the optimal mixing of on its conditional space, it will be wiser to focus instead on optimizing the mixing rate of on its own conditional space given . Since the distribution of contracts about , the position of this coordinate heavily depends on the current state of . We shall also see in Theorem 2 that given , the coordinates () explore their conditional state space according to . Since these coordinates take more time exploring their conditional distribution and heavily affect the position of , the global performance of the sampler is subjected to the mixing of conditionally on .
Theorem 2. Consider a RWM algorithm with proposal distribution used to sample from a target density as in (1). Suppose that satisfies the conditions on and specified in Section 2.1 and that is distributed according to in (1).
For , we have , where is distributed according to , and according to . Conditionally on , the evolution of over an infinitesimal interval satisfieswith
Proof. See Appendix A.2.
Equation (8) describes the behaviour of the process at the next instant, (), given its position at . This expression should not come as a surprise: each rescaled component () asymptotically behaves according to a diffusion process that is Markovian with respect to . Examination of (8) also tells us that is invariant for this diffusion process (see , for instance). We finally recall that and therefore, conditionally on , the rescaled mixes according to . Each coordinate thus requires more iterations than were required by the coordinate to explore its conditional state space.
Since and use different time rescaling factors, the asymptotic behaviour of these coordinates cannot be expressed as a bivariate diffusion process. To obtain such a diffusion, we would have to rely on inhomogeneous proposal variances to ensure that also mixes in iterations; as mentioned at the end of Section 2, this would require setting , for . This framework would, of course, be suboptimal as it would restrain the movements. Proposed jumps for would then become insignificant, and so the first term in (11) would be null.
Remark 1. Studying the limiting behaviour of and () separately does not cause information loss. In fact, studying the paths of simultaneously would require letting the test function of the generator in (A.3) be a function of . Such a generator would however be developed as an expression in which cross-derivative terms (e.g., ) are null, which confirms that given the current state of the asymptotic process, one-dimensional moves are performed independently for each coordinate.
The limiting processes in Theorems 1 and 2 indicate that the component explores its conditional state space at a different (higher) rate than explores its own. Combined to the specific Markovian forms of the limiting processes obtained (with respect to and , respectively), this points towards the need for updating and separately, assessing the superiority of RWM-within-Gibbs samplers for sampling from hierarchical targets. These algorithms update blocks of components successively, a design that allows fully exploiting the characteristics of the target considered. To our knowledge, this is the first time that asymptotic results are used to theoretically validate the superiority of RWM-within-Gibbs over RWM samplers for hierarchical target distributions. This theoretical superiority is obviously tempered in practice by an increased computational effort; the extent of this computational overhead is however difficult to quantify in full generality. To this end, Section 5 presents two examples that illustrate the performance of the RWM-within-Gibbs and compare it to RWM and Adaptive Metropolis samplers.
3.1. Optimal Tuning of the RWM Algorithm
To be confident that the -dimensional chain has entirely explored its state space, we must be certain that every one-dimensional path has explored its own space. In the correlated framework considered, the overall mixing rate of the RWM sampler is only as fast as the slowest component. As explained in Section 3, optimal mixing of the algorithm shall be attained by optimizing the mixing of the coordinates , . In the limit, the only quantity that depends on the proposal variance (i.e., on ) is in (9). To optimize mixing, it thus suffices to find the diffusion process that goes the fastest; i.e., the value of for which the speed measure is optimized.
The speed measure in (9) is quite intuitive; it is in fact similar to that obtained when studying i.i.d. target densities. The main difference lies in the form of which, in the i.i.d. case, is given by the constant term . The second term in (11) is thus equivalent to and consists in a measure of roughness of the conditional density under a variation of (). In the case of hierarchical target distributions, we find an extra term that might be viewed as a measure of roughness of under a variation of . This term is weighted by , the square of the (standardized) candidate increment for the first component; in other words, the further the candidate is from the current , the greater is the weight attributed to the associated measure of roughness. Of course, in optimizing the speed measure function, we do not need to know in advance the exact value of the proposed standardized increment ; the speed measure averages over this quantity.
It is interesting to note that optimizing the speed measure leads to local proposal variances of the form . Such proposal variances would then be used for proposing a candidate at the next instant , given the position of the mixing coordinate at time . These local proposal variances thus vary from one iteration to another, by opposition to usual tunings in the literature that are fixed for the duration of the algorithm. Naturally, if both expectations in (11) are constant with respect to , then the proposal variance obtained by maximizing the speed measure also is constant.
Remark 2. It turns out that local proposal variances optimizing (9) are bounded above by , the asymptotically optimal scaling (AOS) values for targets with i.i.d. components given a fixed . Indeed, if is fixed across iterations, we find ourselves in an i.i.d. setting and the associated speed measure is expressed as . The mentioned upper bounds then follow from the fact that the function in (9) decreases faster in than in the above expression.
Relying on a local variance to propose a candidate for the next time interval is usually time-consuming, as it involves numerically solving for the appropriate local proposal variance at every iteration. Since the process is assumed to start in stationarity and explores its conditional state space faster than the other coordinates, we might determine a value that is fixed across iterations by integrating the speed measure over with respect to the marginal distribution . Hence, the global (unconditional) asymptotically optimal scaling value maximizes the functionwhere is the probability density function of a standard normal random variable.
Remark 3. The asymptotic process introduced in Theorem 2 naturally leads us to the concept of local proposal variances. It is however unclear whether the local tunings obtained by maximizing (9) really optimize the mixing rate of the algorithm. Indeed, the proof of Theorem 2 is carried out with constant; this allows, among other things, relying on the simplified form for the acceptance probability. In order to claim that the local proposal variances obtained are optimal, a weak convergence result would need to be proven using a general proposal variance of the form . This extension is not trivial, as the ratio of proposal densities would then need to be included in the acceptance probability. Since the concept of locally optimal proposal variances is numerically demanding in the current framework, we choose to focus on constant.
In RWM-within-Gibbs, the blocks and are updated consecutively and the situation is therefore different. In that case, local variances of the form obtained by maximizing (9) may be used to update the block . Since is updated separately, the first term in (11) is null, which makes local variances easier to compute. Furthermore, since local variances only depend on (which is updated separately), the ratio is equal to 1 and does not need to be included in the acceptance probability. Local variances are thus very appealing in that context and shall be studied in Section 5.
Rather than tuning the sampler using the global AOS value, one may instead monitor the acceptance rate in order to work with an optimally mixing version of the RWM algorithm. To express optimal scaling results in terms of acceptance rates, we introduce the expected acceptance rate of the -dimensional stationary RWM algorithm with a normal proposal:where denotes the probability density function of an -dimensional standard normal random variable. Optimal mixing results for the RWM sampler are summarized in the following corollary.
Corollary 1. In the settings of Theorem 2, the global asymptotically optimal scaling value maximizes
Furthermore, we have thatand the corresponding asymptotically optimal acceptance rate is given by .
In contrast to the i.i.d. case, the AOAR found is not independent of the densities and . Hence, there is not a huge advantage in choosing to tune the acceptance rate of the algorithm over the proposal variance; in fact, both approaches involve the same effort. Although it would also be possible to compute an overall acceptance rate associated to using local proposal variances, it could not be used to tune the algorithm. Building an optimal Markov chain based on local proposal variances would imply modifying the proposal variance at every iteration, which cannot be achieved by solely monitoring the acceptance rate.
For simplicity, the theoretical results expounded in this section attribute the same tuning constant to all components. In practice, when a RWM algorithm is used to sample from a hierarchical target, users will likely want to use a different proposal variance for the mixing component . In fact, the proofs of Theorems 1 and 2 easily generalize to the case of inhomogeneous proposal variances.
Corollary 2. Let with and , where are independent. Then, Theorems 1 and 2 hold as stated, except that the limiting proposal distribution in Theorem 1 is and the random variable in Theorem 2 is such that .
In this paper, we consider the simple, yet useful hierarchical model described in (1) and featuring a single mixing component . This is a natural starting point to study weak convergence of RWM algorithms for hierarchical targets, and even for correlated targets in general. There exist many generalizations of (1), just as there are many extensions of the proposal distribution considered. Some extensions of the hierarchical target are considered in the discussion, but we do not aim at presenting a detailed treatment of these cases.
4. Numerical Studies
To illustrate the theoretical results of Section 3, we consider two toy examples: the first target distribution considered is a normal-normal hierarchical model in which the components are related through their mean, while the second one is a gamma-normal hierarchical model in which are related through their variance. In both cases, we show how to compute the optimal variance . We also study the performance of RWM samplers and conclude that even in relatively low-dimensional settings, the samplers behave according to the asymptotic results previously detailed.
4.1. Normal-Normal Hierarchical Distribution
Consider an -dimensional hierarchical target such that and for . To sample from this distribution, we use a RWM algorithm with a proposal distribution. This simple target shall relate Theorem 2 to the theoretical results derived in .
Standard calculations lead to ; as , almost surely. If we let and , then . Furthermore, the term is reexpressed as and thus converges in probability to , where denote independent standard normal random variables. By Theorem 1, we can thus affirm that the component asymptotically behaves according to a one-dimensional RWM algorithm with a standard normal target and acceptance function as in (7); these do not, in the current case, depend on .
Evaluating the function in (11) is a simple task and leads to . The AOS value is then found by maximizingwith respect to , where . This yields an AOS of and a corresponding AOAR of . These values are naturally smaller than those obtained for a target with i.i.d. components (5.66 and 0.234, respectively); indeed, the proposal distribution is formed of i.i.d. components and accordingly better suited for similar targets. Relying on a proposal with correlated components would however require a certain understanding of the target correlation structure, which goes against the general framework we wish to consider.
It is worth pointing out that the speed measure of the limiting diffusion process does not depend on in the present case. This holds for arbitrary densities and satisfying the conditions in Section 2.1, provided that is a location parameter for (). Since a variation in the location parameter does not perturb the roughness of the distribution, the AOS and AOAR found are valid both locally and globally. This means that , which remains fixed across iterations, is the best possible proposal scaling conditionally on the last position of the component (i.e., ).
A second peculiarity of this example is that the target distribution is jointly normal with mean and covariance matrix given by , (), and (). Normal distributions being invariant under orthogonal transformations, we can find a transformation under which the target components become mutually independent. The covariance matrix is thus transformed into a diagonal matrix whose diagonal elements consist in the eigenvalues of . In moderate to large dimensions, the eigenvalues can be approximated by . It turns out that the optimal scaling problem for target distributions of this sort (i.e., formed of components that are i.i.d. up to a scaling term) has been studied in . Solving for the AOS value and AOAR of the transformed target using Theorem 1 and Corollary 2 in  leads to values that are consistent with those obtained using Theorem 2 in Section 3.
To illustrate these theoretical results, we consider the 20-dimensional normal-normal target described above and run 50 RWM algorithms that differ by their proposal variance only. For each sampler, we perform 100,000 iterations (sufficient for convergence according to the autocorrelation function) and measure efficiency by recording the average squared jumping distancewhere is the number of iterations and is the dimension of the target distribution. We also record the average acceptance rate of each algorithm, expressed as
We repeat these steps for 50- and 100-dimensional normal-normal targets and combine all three curves of efficiency versus acceptance rate on a graph along with the theoretical efficiency curve of versus the expected acceptance rate (Figure 1(b), bottom set of curves). To assess the limiting behaviour of the coordinate , we also plot the ASJD of this single component (for the 20-, 50-, and 100-dimensional cases) along with the ASJD for the limiting one-dimensional RWM sampler described in Theorem 1 (Figure 1(a), top set of curves).
We now repeat the numerical experiment by taking advantage of the available target variances in the tuning of the proposal distribution. Specifically, we let be independent of and run the RWM algorithm in dimensions 20, 50, and 100. The resulting simulated and theoretical efficiency curves are illustrated in Figure 1(a) (bottom set of curves) and Figure 1(b) (top set of curves). Although efficiency curves for are lower when using inhomogeneous proposal variances, this approach still results in a better overall performance (the curves in the right graph are higher than with homogeneous variances). The optimized theoretical efficiency is 0.974, which is related to an AOAR of 0.221. Despite the fact that Theorems 1 and 2 are valid asymptotically, the simulation study yields efficiency curves that are very close together; the theorems thus seem applicable in relatively low-dimensional settings.
Each set of curves in Figure 1(b) agrees about the optimal acceptance rates 0.205 and 0.221, respectively. These optimal rates have been obtained by running an homogeneous sampler with optimal variance and an inhomogeneous sampler with optimal variance , each optimizing (9). Any other proposal variance leads to a point that is lower on the efficiency curve.
According to the shape of these curves, tuning the acceptance rate anywhere between 0.15 and 0.3 would yield a loss of at most 10% in efficiency and would still result in a Markov chain that rapidly explores its state space; in particular, using the usual 0.234 for this target would yield an almost optimal algorithm. Beyond finding the exact AOAR for a specific target distribution, there is thus a need for understanding when and why AOARs significantly differ from 0.234. At the present time, the only way to answer this question is by solving the optimal scaling problem for target distributions of interest.
4.2. Gamma-Normal Hierarchical Distribution
As a second example, consider a gamma-normal hierarchical target such that and , . Although () are still normally distributed, the coordinate now acts through the variance of the normal variables. This results in a target that significantly differs from the distribution considered in the previous section, falling slightly outside the framework of Section 2 ( is now only locally Lipschitz continuous). We run the usual RWM algorithm to obtain a sample from this distribution.
Standard calculations lead to and as , . The WLLN-type expression in Theorem 1 may be reexpressed as = , where are independent standard normal random variables. The condition is thus satisfied as it converges in probability to . Using Stirling’s formula, it is not difficult to show that the density of converges almost surely to that of a . By Theorem 1, the coordinate asymptotically behaves according to an atypical one-dimensional RWM algorithm with a normal target; the target variance however varies from one iteration to the next, and so does the acceptance function in (7).
To optimize the efficiency of the algorithm, we analyze the speed measure in (9); in the present case, it is expressed aswhere . Maximizing the function in Corollary 1 with respect to leads to the global AOS value, which is fixed across iterations; when for instance, we find and .
The simulation study described in Section 4.1 has been performed for the gamma-normal target model with various and . Specifically, for fixed and , we consider a 10-dimensional gamma-normal target distribution and run 50 RWM algorithms possessing their own proposal variance. For each sampler, we perform 1,000,000 iterations (again sufficient for convergence according to the autocorrelation function) and measure efficiency by recording the ASJD of each chain. We then repeat these steps for 20- and 50-dimensional targets. Table 1 presents the optimal efficiency and acceptance rate for various and . Those results are compared to the theoretical optimal values obtained by maximizing .
Although the corresponding graphs are omitted here, they yield curves similar to those obtained in Figure 1 for the normal-normal target. We note that even if the gamma-normal departs from a jointly normal distribution assumption and does not yield as nice a target distribution as in the previous example, the AOAR obtained is not too far from the 0.234 found for i.i.d. targets. The AOAR however tends to decrease as increases (e.g., 0.152 for ).
In the current example, it also turns out that the agreement between theoretical and simulation results is altered for some values . As mentioned above, one of the Lipschitz conditions is only valid locally and so the change in becomes arbitrarily steep as . The amplitude of movements is, therefore, not adequately controlled for some choices of that yield a density assigning a significant probability close to 0. In cases where regularity assumptions are not all satisfied, the applicability of theoretical results may thus be affected by the choice of hyperparameters.
5. Applications in Bayesian Contexts
The theoretical results presented in this paper have wide applicability and may be used to improve not only RWM algorithms but other samplers as well (RWM-within-Gibbs, for instance). The examples below study the performance of optimally tuned samplers in the context of hierarchical Bayesian models. They show that the RWM-within-Gibbs sampler with local variances (i.e., variances that are a function of the current state of the chain) is superior to its counterpart with a fixed variance. It is also superior to traditional RWM algorithms and even Adaptive Metropolis (AM) samplers, which use the history of the chain to recursively update the covariance matrix of their proposal distribution (see ).
5.1. Scottish Secondary School Scores
The dataset ScotsSec in the package mlmRev in R contains the scores attained by 3,435 Scottish secondary school students on a standardized test taken at age 16. The primary schools attended by students are also recorded in this dataset; there are different primary schools, and the number of students per primary school varies between 1 and 72. We use the following multilevel Bayesian framework to model these data:
In this model, the variables represent the observed scores obtained by the students having attended primary school , . These observations are modeled according to a normal distribution with mean and variance . The group sizes range from to . The variables , which represent the mean scores of the standardized test for students having attended each of the 148 primary schools, are modeled using a Student distribution with degrees of freedom. A translated and scaled Student distribution has a density proportional to . The mean and precision of the Student distribution, along with the precision of the normally distributed data, are attributed noninformative priors: , , and .
This model leads to the -dimensional posterior density:
The posterior density is too complex for analytic computation, and numerical integration must be ruled out due to the dimensionality of the problem. This distribution is best sampled with MCMC methods, although a classical Gibbs sampler must be ruled out, as the Student distribution destroys conjugacy. In the current setting, we propose to use a RWM-within-Gibbs with four blocks of variables: , , , and . We are also interested in assessing the performance of full-dimensional RWM and AM algorithms in which , , , and are updated at once.
The RWM-within-Gibbs performs one-dimensional updates of , , and using target densities , , and . It then performs an -dimensional update of with respect to the conditional density . Since each block of variables is updated individually using a RWM sampler, we may compute local proposal variances for the fourth block using (9) and (11) in Theorem 2. The proposal variances maximizing (9) are adjusted according to the roughness of their corresponding target component’s distribution and should offer a better performance than a fixed proposal variance.
The target distribution of the fourth block satisfiesand, hence, the partial derivative of the one-dimensional log density with respect to iswhere , . Since the variables are updated separately, the first term in (11) is null, leading to
Optimizing (9) leads to local, inhomogeneous proposal variances of the form .
The terms in the proposal variances are not easy to obtain explicitly as the expectation in (24) must be computed with respect to the conditional distribution of given , which is not a Student distribution anymore. However, the terms may be averaged over the random variables . Squaring (23) and computing the expectation first with respect to and then with respect to easily leads to
These terms yield local proposal variances that have been averaged over all possible datasets; these are the best local variances for the model under study when no information about the observations is available.
The RWM-within-Gibbs is then implemented using Gaussian proposal distributions with , , and for , , and . This yields acceptance rates in the range 35%–50% for each subalgorithm, as prescribed in the literature for one-dimensional target distributions (see ). We update using a Gaussian proposal with local variances .
These steps are then repeated by running a RWM-within-Gibbs in which is updated using a fixed proposal variance of . We also run a -dimensional RWM sampler with a proposal distribution, and an AM algorithm in which the tuning factor of the proposal covariance matrix is .
The ASJD of the chain in (17) offers a reliable way of comparing the four samplers; it is reported in the first column of Table 2. A large value of this measure (relative to other samplers) is indicative of a process that rapidly explores its space, and is equivalent to ordering samplers according to their lag-1 autocorrelations. We also compare the relative efficiency of these samplers by calculating the effective sample size (ESS) of the variables , , , and . The effective sample size represents the number of uncorrelated samples that are produced from the output of the sampler. It is also used as a convergence diagnostic: when its value is too small (), we may have reasonable doubts that the chain really has converged. It is computed aswhere is the number of samples and is the sum of lag- sample autocorrelations. An ESS is produced for each variable; since we want to measure the number of samples that are uncorrelated over all variables, we report the minimum ESS (2nd column of Table 2). The ASJD and minimum ESS values are averaged over 10 runs of 100,000 iterations each, with a burn-in period of 1,000. These quantities are then normalized relative to the average running time of samplers (3rd column); this, respectively, yields the average square jumping distance per second (4th column), and the number of uncorrelated samples generated every second (5th column).
According to these results, the RWM-within-Gibbs with local variances is 1.3 times more efficient than the one with a fixed variance; the efficiency gain is even greater (1.7) if we consider the minimum ESS instead of the ASJD. Although the RWM sampler offers a slight improvement in terms of running time, it still results in efficiency measures that are significantly smaller than those of the RWM-within-Gibbs. The Adaptive Metropolis sampler could be an interesting alternative to the RWM-within-Gibbs, if it were not as expensive in terms of computational resources. Indeed, even if its ASJD is smaller than that of the RWM-within-Gibbs, its minimum ESS is greater. This sampler however requires significantly more time than the other samplers to complete its 100,000 iterations. When correcting for computational effort, it thus badly loses ground to its competitors. The results in Table 2 thus illustrate that there is an important efficiency gain that is available from preferring a local RWM-within-Gibbs over its constant counterpart. Given that running times for both approaches are equivalent, we should clearly use local proposal variances whenever possible.
5.2. Stochastic Volatility Model
As a second example, we wish to study the performance of MCMC samplers in the context of a Bayesian hierarchical model that does not respect the regularity assumptions imposed by the theory of Section 3. We consider a stochastic volatility model in which the latent volatilities form an order-1 autoregressive process. The model, similar to those studied in [16, 17], expresses the mean corrected returns and log volatilites , for , as
The variables and are uncorrelated white noises, and we set . Priors for the parameters and are and , where is the inverse gamma distribution with density proportional to . This model leads to an -dimensional posterior density .
Before pursuing the analysis, we note that and are constrained to subsets of the real line; since the target density is rather sensitive to changes in these parameters, this will potentially affect the performance of MCMC approaches. To ensure fluidity in the samplers implemented, we apply the transformations and . The new variables and take values in and the resulting -dimensional posterior density is given by
Using a 100-dimensional dataset exhibiting low correlation (obtained from the stochastic volatility model with and ), we sample this posterior density using RWM-within-Gibbs (local and fixed variances), traditional RWM, and AM algorithms. Hyperparameters are set to , , , and .
For the RWM-within-Gibbs, we propose to divide the variables into 3 blocks: , , and . The proposal standard deviations associated to and are set to and , respectively; each subalgorithm thus accepts candidates according to a rate of . The -dimensional update of is performed according to the conditional target density . In the case of the RWM-within-Gibbs with local variances, the termsin (11) are not easy to obtain as the full conditional distribution (given the data) is not normally distributed anymore. As before, we solve this problem by computing the expectation above with respect to first, and then with respect to . The resulting proposal variances are thus averaged over all possible datasets; they are the best local proposal variances, independently of the specific dataset considered. Optimizing (9) for yields the -dimensional vector:
For the RWM-within-Gibbs with a fixed proposal variance, the proposal standard deviations associated to and are still and . We then use the theory of Section 3 to obtain an approximately optimal acceptance rate of 0.2 for the block . We reach a similar conclusion for the traditional RWM sampler. Naturally, we have to keep in mind that regularity assumptions are violated in the current context; the theoretical results might not be robust to a departure from those assumptions. In fact, given that the ’s are correlated, we expect the Adaptive Metropolis sampler to better capture this design and to outdo its competitors.
The initial covariance matrix of the Adaptive Metropolis algorithm is the -dimensional identity matrix. We tune its acceptance rate as close as possible to 0.234, as suggested in the literature. For each sampler, we average the ASJD and minimum ESS over 10 runs of 200,000 iterations each, from which the first 10,000 iterations are discarded as burn-in. Time-adjusted ESJD and minimum ESS are again used a measures of efficiency; their values are reported in Table 3.
In terms of ASJD, the RWM-within-Gibbs with local variances is the best option, although its competitors also offer decent performances. The AM sampler does better, in absolute, for the minimum ESS; when accounting for computational effort however, the AM ends up outdone by the RWM-within-Gibbs (local and fixed). As before, we notice a net efficiency gain when preferring local variances to a fixed one in the RWM-within-Gibbs (net gain between 5% and 13%, depending on the efficiency measure). This modest gain is explained by the fact that, for the specific model studied, variations in and do not have a huge impact on the value of the local variances in (30). In spite of this, the impact of using local variances remains positive; generally, there does not seem to be a risk associated to using such local variances. Furthermore, the theoretical results seem applicable to contexts where regularity assumptions are violated (to some extent).
In this paper, we have studied the tuning of RWM algorithms applied to single-level hierarchical target distributions. The optimal variance of the Gaussian proposal distribution has been found to depend on a measure of roughness of the density with respect to as before, but also with respect to the mixing coordinate . This leads to local proposal variances that are a function of the mixing parameter . It is however possible to average over the random variable to find a globally optimal proposal variance. In the case where is a location parameter, it does not affect the roughness of the density , and the optimal proposal scaling is valid both locally and globally.
Higher-level hierarchies could be studied using a similar approach. A target featuring mixing components, expressed aswith would lead to a result similar to Theorem 2, but with the functionwhere come from independent random variables. For a target whose mixing component ( say) depends itself on higher-level mixing components , expressed as , the conclusions of Theorem 2 are still valid. These generalizations also hold for Corollary 2, with obvious adjustments (). Similar extensions may be derived for other hierarchical models.
In the simulation study of Section 4, we found that the optimal acceptance rate most often lies around 0.2. In the gamma-normal example, there were some values of and that led to significantly lower optimal acceptance rates (0.15 when ). The usual 0.234 is thus quite robust and, if preferred, should lead to an efficient version of the sampler. In the case of correlated targets, it would however be wiser to settle for an acceptance rate slightly below 0.234. Since we investigate correlated targets with a proposal distribution featuring a diagonal covariance matrix, it is not surprising to find an AOAR lower than 0.234; the latter is the AOAR for exploring a target distribution with independent components, which is an ideal situation when relying on a proposal distribution with independent components.
We conclude by outlining that the concept of locally optimal proposal variances reveals itself to be of interest with other types of samplers, such as RWM-within-Gibbs algorithms. Indeed, the asymptotic results of Section 3 are proof of the theoretical superiority of RWM-within-Gibbs over RWM when sampling from hierarchical targets. The examples of Section 5 illustrate the efficiency gain from using a RWM-within-Gibbs with local variances over some competitors, including an adaptive sampler. Similar ideas may also be applied to different samplers such as Metropolis-adjusted Langevin algorithms (MALA), but this goes beyond the scope of this paper.
A. Proofs of Theorems
We now proceed to prove Theorems 1 and 2. To assess weak convergence of the processes and in the Skorokhod topology, we first verify weak convergence of finite-dimensional distributions. Whereas these processes are not themselves Markovian, they are -progressive and -progressive -valued processes, respectively, and the aim of this section is to establish their convergence to some Markov processes. According to Theorem 8.2 of Chapter 4 in , we thus look at the pseudogenerator of (resp., ), the univariate process associated to the component (resp., ) in the rescaled RWM algorithm introduced at the end of Section 2. We then verify -convergence to the generator of the special RWM sampler with acceptance rule (7) (resp., the generator of the diffusion in (8)).
To complete the proofs, Theorem 7.8 of Chapter 3 in  says that we must also assess the relative compactness of and for , as well as the existence of a countable dense set on which the finite-dimensional distributions weakly converge. This is achieved by using Corollary 8.6 of Chapter 4 in ; in the setting of Theorem 1, the satisfaction of applicability conditions is immediate; in the setting of Theorem 2, the satisfaction of the first condition is immediate, while the verification of the second condition is briefly discussed in Section A.2.
A.1. Proof of Theorem 1
In Theorem 1, it is assumed that is the component of interest in . Define the pseudogenerator of aswhere is an arbitrary test function. By setting and , conditions in part (c) of Theorem 8.2 (Chap. 4 in ) reduce to as for (the space of continuous and bounded functions on ), where is the generator of the special RWM sampler described in Theorem 1.
The above may be reexpressed as as , wherewith and similarly for . The density of is with as in (1), and thus ; hereafter, . Furthermore,with as in (7). Note that there is a slight abuse of notation as, although is a function of only, the generator is a function of ; a similar remark holds for . We now proceed to verify this condition. Hereafter, we use , , and to denote convergence almost surely, in probability, and in distribution.
In the current context where there is no time-rescaling factor, the limiting process shall remain a RWM algorithm. For and some , the triangle inequality implieswhere the function shall be defined in Lemma B.1 and . Here,with representing the conditional density of given .
By Lemmas B.1 and B.2, the first and second terms in (A.1), respectively, converge to 0 as ; in the sequel, we thus study the last term. Since , the second and third terms on the right of (A.2) are normally distributed with mean and variance , where .
By assumption, this variance term converges in probability to ; hence, the last two terms on the right of (A.2) converge in probability to a