Colin Wei and Iain Murray arXived a new version of their paper on doubly-intractable distributions, which is to be presented at AISTATS. It builds upon the Russian roulette estimator of Lyne et al. (2015), which itself exploits the debiasing technique of McLeish et al. (2011) [found earlier in the physics literature as in Carter and Cashwell, 1975, according to the current paper]. Such an unbiased estimator of the inverse of the normalising constant can be used for pseudo-marginal MCMC, except that the estimator is sometimes negative and has to be so as proved by Pierre Jacob and co-authors. As I discussed in my post on the Russian roulette estimator, replacing the negative estimate with its absolute value does not seem right because a negative value indicates that the quantity is close to zero, hence replacing it with zero would sound more appropriate. Wei and Murray start from the property that, while the expectation of the importance weight is equal to the normalising constant, the expectation of the inverse of the importance weight converges to the inverse of the weight for an MCMC chain. This however sounds like an harmonic mean estimate because the property would also stand for any substitute to the importance density, as it only requires the density to integrate to one… As noted in the paper, the variance of the resulting Roulette estimator “will be high” or even infinite. Following Glynn et al. (2014), the authors build a coupled version of that solution, which key feature is to cut the higher order terms in the debiasing estimator. This does not guarantee finite variance or positivity of the estimate, though. In order to decrease the variance (assuming it is finite), backward coupling is introduced, with a Rao-Blackwellisation step using our 1996 Biometrika derivation. Which happens to be of lower cost than the standard Rao-Blackwellisation in that special case, O(N) versus O(N²), N being the stopping rule used in the debiasing estimator. Under the assumption that the inverse importance weight has finite expectation [wrt the importance density], the resulting backward-coupling Russian roulette estimator can be proven to be unbiased, as it enjoys a finite expectation. (As in the generalised harmonic mean case, the constraint imposes thinner tails on the importance function, which then hampers the convergence of the MCMC chain.) No mention is made of achieving finite variance for those estimators, which again is a serious concern due to the similarity with harmonic means…
Archive for Rao-Blackwellisation
[Here is a reply sent to me by Luca Martino, Victor Elvira, and Gustau Camp-Vallis, after my earlier comments on their paper.]
We provide our contribution to the discussion, reporting our experience with the application of Metropolis-within-Gibbs schemes. Since in literature there are miscellaneous opinions, we want to point out the following considerations:
– according to our experience, the use of M>1 steps of the Metropolis-Hastings (MH) method for drawing from each full-conditional (with or without recycling), decreases the MSE of the estimation (see code Ex1-Ex2 and related Figure 7(b) and Figures 8). If the corresponding full conditional is very concentrated, one possible solution is to applied an adaptive or automatic MH for drawing from this full-conditional (it can require the use of M internal steps; see references in Section 3.2).
– Fixing the number of evaluations of the posterior, the comparison between a longer Gibbs chain with a single step of MH and a shorter Gibbs chain with M>1 steps of MH per each full-conditional, is required. Generally, there is no clear winner. The better performance depends on different aspects: the specific scenario, if and adaptive MH is employed or not, if the recycling is applied or not (see Figure 10(a) and the corresponding code Ex2).
The previous considerations are supported/endorsed by several authors (see the references in Section 3.2). In order to highlight the number of controversial opinions about the MH-within-Gibbs implementation, we report a last observation:
– If it is possible to draw directly from the full-conditionals, of course this is the best scenario (this is our belief). Remarkably, as also reported in Chapter 1, page 393 of the book “Monte Carlo Statistical Methods”, C. Robert and Casella, 2004, some authors have found that a “bad” choice of the proposal function in the MH step (i.e., different from the full conditional, or a poor approximation of it) can improve the performance of the MH-within-Gibbs sampler. Namely, they assert that a more “precise” approximation of the full-conditional does not necessarily improve the overall performance. In our opinion, this is possibly due to the fact that the acceptance rate in the MH step (lower than 1) induces an “accidental” random scan of the components of the target pdf in the Gibbs sampler, which can improve the performance in some cases. In our work, for the simplicity, we only focus on the deterministic scan. However, a random scan could be also considered.
Luca Martino, Victor Elvira and Gustau Camps-Valls have arXived a paper on recycling for Gibbs sampling. The argument therein is to take advantage of all simulations induced by MCMC simulation for one full conditional, towards improving estimation if not convergence. The context is thus one when Metropolis-within-Gibbs operates, with several (M) iterations of the corresponding Metropolis being run instead of only one (which is still valid from a theoretical perspective). While there are arguments in augmenting those iterations, as recalled in the paper, I am not a big fan of running a fixed number of M of iterations as this does not approximate better the simulation from the exact full conditional and even if this approximation was perfect, the goal remains simulating from the joint distribution. As such, multiplying the number of Metropolis iterations does not necessarily impact the convergence rate, only brings it closer to the standard Gibbs rate. Moreover, the improvement does varies with the chosen component, meaning that the different full conditionals have different characteristics that produce various levels of variance reduction:
- if the targeted expectation only depends on one component of the Markov chain, multiplying the number of simulations for the other components has no clear impact, except in increasing time;
- if the corresponding full conditional is very concentrated, repeating simulations should produce quasi-repetitions, and no gain.
The only advantage in computing time that I can see at this stage is when constructing the MCMC sampler for the full proposal is much more costly than repeating MCMC iterations, which are then almost free and contribute to the reduction of the variance of the estimator.
This analysis of MCMC-withing-Gibbs strategies reminds me of a recent X validated question, which was about the proper degree of splitting simulations from a marginal and from a corresponding conditional in the chain rule, the optimal balance being in my opinion dependent on the relative variances of the conditional expectations.
A last point is that recycling in the context of simulation and Monte Carlo methodology makes me immediately think of Rao-Blackwellisation, which is surprisingly absent from the current paper. Rao-Blackwellisation was introduced in the MCMC literature and to the MCMC community in the first papers of Alan Gelfand and Adrian Smith, in 1990. While this is not always producing a major gain in Monte Carlo variability, it remains a generic way of recycling auxiliary variables as shown, e.g., in the recycling paper we wrote with George Casella in 1996, one of my favourite papers.
The recent arXival by Takashi Goda of Computing the variance of a conditional expectation via non-nested Monte Carlo led me to read it as I could not be certain of the contents from only reading the title! The short paper considers the issue of estimating the variance of a conditional expectation when able to simulate the joint distribution behind the quantity of interest. The second moment E(E[f(X)|Y]²) can be written as a triple integral with two versions of x given y and one marginal y, which means that it can approximated in an unbiased manner by simulating a realisation of y then conditionally two realisations of x. The variance requires a third simulation of x, which the author seems to deem too costly and that he hence replaces with another unbiased version based on two conditional generations only. (He notes that a faster biased version is available with bias going down faster than the Monte Carlo error, which makes the alternative somewhat irrelevant, as it is also costly to derive.) An open question after reading the paper stands with the optimal version of the generic estimator (5), although finding the optimum may require more computing time than it is worth spending. Another one is whether or not this version of the expected conditional variance is more interesting (computation-wise) that the difference between the variance and the expected conditional variance as reproduced in (3) given that both quantities can equally be approximated by unbiased Monte Carlo…
Thanks to a very early start from Paris, and despite horrendous traffic jams in Marseilles, I managed to reach CIRM with ten minutes to spare before my course. After my one-hour class, I was suddenly made aware of the (simplistic) idea that the slice sampling uniforms are simply auxiliary, meaning they can be used in many different ways.
I noticed Natesh Pillai just arXived an extension of his earlier Cauchy paper with XL. He proves that the result on the Cauchy distribution of any convex combination of normal ratios still holds when the pair of vectors is distributed from a product of elliptically symmetric functions. Some of Natesh’s remarks reminded me of the 1970 Sankhyã paper by Kelker on spherically symmetric variables. Especially because of Kelker’s characterisation of elliptically symmetric functions as scale mixtures of normals, which makes perfect sense since the scale cancels.
As I skimmed through my slides yesterday, fearing everyone knew about the MCMC basics, I decided to present today the Rao-Blackwellisation slides I gave in Warwick a few months ago.
Last week, on arXiv, Espen Bernton, Shihao Yang, Yang Chen, Neil Shephard, and Jun Liu (all from Harvard) proposed a weighting scheme to associated MCMC simulations, in connection with the parallel MCMC of Ben Calderhead discussed earlier on the ‘Og. The weight attached to each proposal is either the acceptance probability itself (with the rejection probability being attached to the current value of the MCMC chain) or a renormalised version of the joint target x proposal, either forward or backward. Both solutions are unbiased in that they have the same expectation as the original MCMC average, being some sort of conditional expectation. The proof of domination in the paper builds upon Calderhead’s formalism.
This work reminded me of several reweighting proposals we made over the years, from the global Rao-Blackwellisation strategy with George Casella, to the vanilla Rao-Blackwellisation solution we wrote with Randal Douc a few years ago, both of whom also are demonstrably improving upon the standard MCMC average. By similarly recycling proposed but rejected values. Or by diminishing the variability due to the uniform draw. The slightly parallel nature of the approach also connects with our parallel MCM version with Pierre Jacob (now Harvard as well!) and Murray Smith (who now leaves in Melbourne, hence the otherwise unrelated picture).
As posted a few days ago, Mathieu Gerber and Nicolas Chopin will read this afternoon a Paper to the Royal Statistical Society on their sequential quasi-Monte Carlo sampling paper. Here are some comments on the paper that are preliminaries to my written discussion (to be sent before the slightly awkward deadline of Jan 2, 2015).
Quasi-Monte Carlo methods are definitely not popular within the (mainstream) statistical community, despite regular attempts by respected researchers like Art Owen and Pierre L’Écuyer to induce more use of those methods. It is thus to be hoped that the current attempt will be more successful, it being Read to the Royal Statistical Society being a major step towards a wide diffusion. I am looking forward to the collection of discussions that will result from the incoming afternoon (and bemoan once again having to miss it!).
“It is also the resampling step that makes the introduction of QMC into SMC sampling non-trivial.” (p.3)
At a mathematical level, the fact that randomised low discrepancy sequences produce both unbiased estimators and error rates of order
means that randomised quasi-Monte Carlo methods should always be used, instead of regular Monte Carlo methods! So why is it not always used?! The difficulty stands [I think] in expressing the Monte Carlo estimators in terms of a deterministic function of a fixed number of uniforms (and possibly of past simulated values). At least this is why I never attempted at crossing the Rubicon into the quasi-Monte Carlo realm… And maybe also why the step had to appear in connection with particle filters, which can be seen as dynamic importance sampling methods and hence enjoy a local iid-ness that relates better to quasi-Monte Carlo integrators than single-chain MCMC algorithms. For instance, each resampling step in a particle filter consists in a repeated multinomial generation, hence should have been turned into quasi-Monte Carlo ages ago. (However, rather than the basic solution drafted in Table 2, lower variance solutions like systematic and residual sampling have been proposed in the particle literature and I wonder if any of these is a special form of quasi-Monte Carlo.) In the present setting, the authors move further and apply quasi-Monte Carlo to the particles themselves. However, they still assume the deterministic transform
which the q-block on which I stumbled each time I contemplated quasi-Monte Carlo… So the fundamental difficulty with the whole proposal is that the generation from the Markov proposal
has to be of the above form. Is the strength of this assumption discussed anywhere in the paper? All baseline distributions there are normal. And in the case it does not easily apply, what would the gain bw in only using the second step (i.e., quasi-Monte Carlo-ing the multinomial simulation from the empirical cdf)? In a sequential setting with unknown parameters θ, the transform is modified each time θ is modified and I wonder at the impact on computing cost if the inverse cdf is not available analytically. And I presume simulating the θ’s cannot benefit from quasi-Monte Carlo improvements.
The paper obviously cannot get into every detail, obviously, but I would also welcome indications on the cost of deriving the Hilbert curve, in particular in connection with the dimension d as it has to separate all of the N particles, and on the stopping rule on m that means only Hm is used.
Another question stands with the multiplicity of low discrepancy sequences and their impact on the overall convergence. If Art Owen’s (1997) nested scrambling leads to the best rate, as implied by Theorem 7, why should we ever consider another choice?
In connection with Lemma 1 and the sequential quasi-Monte Carlo approximation of the evidence, I wonder at any possible Rao-Blackwellisation using all proposed moves rather than only those accepted. I mean, from a quasi-Monte Carlo viewpoint, is Rao-Blackwellisation easier and is it of any significant interest?
What are the computing costs and gains for forward and backward sampling? They are not discussed there. I also fail to understand the trick at the end of 4.2.1, using SQMC on a single vector instead of (t+1) of them. Again assuming inverse cdfs are available? Any connection with the Polson et al.’s particle learning literature?
Last questions: what is the (learning) effort for lazy me to move to SQMC? Any hope of stepping outside particle filtering?