Archive for Rao-Blackwellisation

an interesting identity

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , on March 1, 2018 by xi'an

Another interesting X validated question, another remembrance of past discussions on that issue. Discussions that took place in the Institut d’Astrophysique de Paris, nearby this painting of Laplace, when working on our cosmostats project. Namely the potential appeal of recycling multidimensional simulations by permuting the individual components in nearly independent settings. As shown by the variance decomposition in my answer, when opposing N iid pairs (X,Y) to the N combinations of √N simulations of X and √N simulations of Y, the comparison

\text{var} \hat{\mathfrak{h}}^2_N=\text{var} (\hat{\mathfrak{h}}^1_N)+\frac{mn(n-1)}{N^2}\,\text{var}^Y\left\{ \mathbb{E}^{X}\left\{\mathfrak{h}(X,Y)\right\}\right\}

+\frac{m(m-1)n}{N^2}\,\text{var}^X\left[\mathbb{E}^Y\left\{\mathfrak{h}(X,Y)\right\}\right]

unsurprisingly gives the upper hand to the iid sequence. A sort of converse to Rao-Blackwellisation…. Unless the production of N simulations gets much more costly when compared with the N function evaluations. No wonder we never see this proposal in Monte Carlo textbooks!

Russian roulette still rolling

Posted in Statistics with tags , , , , , , , , , , , , on March 22, 2017 by xi'an

Colin Wei and Iain Murray arXived a new version of their paper on doubly-intractable distributions, which is to be presented at AISTATS. It builds upon the Russian roulette estimator of Lyne et al. (2015), which itself exploits the debiasing technique of McLeish et al. (2011) [found earlier in the physics literature as in Carter and Cashwell, 1975, according to the current paper]. Such an unbiased estimator of the inverse of the normalising constant can be used for pseudo-marginal MCMC, except that the estimator is sometimes negative and has to be so as proved by Pierre Jacob and co-authors. As I discussed in my post on the Russian roulette estimator, replacing the negative estimate with its absolute value does not seem right because a negative value indicates that the quantity is close to zero, hence replacing it with zero would sound more appropriate. Wei and Murray start from the property that, while the expectation of the importance weight is equal to the normalising constant, the expectation of the inverse of the importance weight converges to the inverse of the weight for an MCMC chain. This however sounds like an harmonic mean estimate because the property would also stand for any substitute to the importance density, as it only requires the density to integrate to one… As noted in the paper, the variance of the resulting Roulette estimator “will be high” or even infinite. Following Glynn et al. (2014), the authors build a coupled version of that solution, which key feature is to cut the higher order terms in the debiasing estimator. This does not guarantee finite variance or positivity of the estimate, though. In order to decrease the variance (assuming it is finite), backward coupling is introduced, with a Rao-Blackwellisation step using our 1996 Biometrika derivation. Which happens to be of lower cost than the standard Rao-Blackwellisation in that special case, O(N) versus O(N²), N being the stopping rule used in the debiasing estimator. Under the assumption that the inverse importance weight has finite expectation [wrt the importance density], the resulting backward-coupling Russian roulette estimator can be proven to be unbiased, as it enjoys a finite expectation. (As in the generalised harmonic mean case, the constraint imposes thinner tails on the importance function, which then hampers the convergence of the MCMC chain.) No mention is made of achieving finite variance for those estimators, which again is a serious concern due to the similarity with harmonic means…

recycling Gibbs auxiliaries [a reply]

Posted in Books, pictures, Statistics, University life with tags , , , , , , , on January 3, 2017 by xi'an

[Here is a reply sent to me by Luca Martino, Victor Elvira, and Gustau Camp-Vallis, after my earlier comments on their paper.]

We provide our contribution to the discussion, reporting our experience with the application of Metropolis-within-Gibbs schemes. Since in literature there are miscellaneous opinions, we want to point out the following considerations:

– according to our experience, the use of M>1 steps of the Metropolis-Hastings (MH) method for drawing from each full-conditional (with or without recycling), decreases the MSE of the estimation (see code Ex1-Ex2 and related Figure 7(b) and Figures 8). If the corresponding full conditional is very concentrated, one possible solution is to applied an adaptive or automatic MH for drawing from this full-conditional (it can require the use of M internal steps; see references in Section 3.2).

– Fixing the number of evaluations of the posterior, the comparison between a longer Gibbs chain with a single step of MH and a shorter Gibbs chain with M>1 steps of MH per each full-conditional, is required. Generally, there is no clear winner. The better performance depends on different aspects: the specific scenario, if and adaptive MH is employed or not, if the recycling is applied or not (see Figure 10(a) and the corresponding code Ex2).

The previous considerations are supported/endorsed by several authors (see the references in Section 3.2). In order to highlight the number of controversial opinions about the MH-within-Gibbs implementation, we report a last observation:

– If it is possible to draw directly from the full-conditionals, of course this is the best scenario (this is our belief). Remarkably, as also reported in Chapter 1, page 393 of the book “Monte Carlo Statistical Methods”, C. Robert and Casella, 2004, some authors have found that a “bad” choice of the proposal function in the MH step (i.e., different from the full conditional, or a poor approximation of it) can improve the performance of the MH-within-Gibbs sampler. Namely, they assert that a more “precise” approximation of the full-conditional does not necessarily improve the overall performance. In our opinion, this is possibly due to the fact that the acceptance rate in the MH step (lower than 1) induces an “accidental” random scan of the components of the target pdf in the Gibbs sampler, which can improve the performance in some cases. In our work, for the simplicity, we only focus on the deterministic scan. However, a random scan could be also considered.

recycling Gibbs auxiliaries

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , on December 6, 2016 by xi'an

wreck of the S.S. Dicky, Caloundra beach, Qld, Australia, Aug. 19, 2012Luca Martino, Victor Elvira and Gustau Camps-Valls have arXived a paper on recycling for Gibbs sampling. The argument therein is to take advantage of all simulations induced by MCMC simulation for one full conditional, towards improving estimation if not convergence. The context is thus one when Metropolis-within-Gibbs operates, with several (M) iterations of the corresponding Metropolis being run instead of only one (which is still valid from a theoretical perspective). While there are arguments in augmenting those iterations, as recalled in the paper, I am not a big fan of running a fixed number of M of iterations as this does not approximate better the simulation from the exact full conditional and even if this approximation was perfect, the goal remains simulating from the joint distribution. As such, multiplying the number of Metropolis iterations does not necessarily impact the convergence rate, only brings it closer to the standard Gibbs rate. Moreover, the improvement does varies with the chosen component, meaning that the different full conditionals have different characteristics that produce various levels of variance reduction:

  • if the targeted expectation only depends on one component of the Markov chain, multiplying the number of simulations for the other components has no clear impact, except in increasing time;
  • if the corresponding full conditional is very concentrated, repeating simulations should produce quasi-repetitions, and no gain.

The only advantage in computing time that I can see at this stage is when constructing the MCMC sampler for the full proposal is much more costly than repeating MCMC iterations, which are then almost free and contribute to the reduction of the variance of the estimator.

This analysis of MCMC-withing-Gibbs strategies reminds me of a recent X validated question, which was about the proper degree of splitting simulations from a marginal and from a corresponding conditional in the chain rule, the optimal balance being in my opinion dependent on the relative variances of the conditional expectations.

A last point is that recycling in the context of simulation and Monte Carlo methodology makes me immediately think of Rao-Blackwellisation, which is surprisingly absent from the current paperRao-Blackwellisation was introduced in the MCMC literature and to the MCMC community in the first papers of Alan Gelfand and Adrian Smith, in 1990. While this is not always producing a major gain in Monte Carlo variability, it remains a generic way of recycling auxiliary variables as shown, e.g., in the recycling paper we wrote with George Casella in 1996, one of my favourite papers.

Computing the variance of a conditional expectation via non-nested Monte Carlo

Posted in Books, pictures, Statistics, University life with tags , , , , on May 26, 2016 by xi'an

Fushimi Inari-Taisha shrine, Kyoto, June 27, 2012The recent arXival by Takashi Goda of Computing the variance of a conditional expectation via non-nested Monte Carlo led me to read it as I could not be certain of the contents from only reading the title! The short paper considers the issue of estimating the variance of a conditional expectation when able to simulate the joint distribution behind the quantity of interest. The second moment E(E[f(X)|Y]²) can be written as a triple integral with two versions of x given y and one marginal y, which means that it can approximated in an unbiased manner by simulating a realisation of y then conditionally two realisations of x. The variance requires a third simulation of x, which the author seems to deem too costly and that he hence replaces with another unbiased version based on two conditional generations only. (He notes that a faster biased version is available with bias going down faster than the Monte Carlo error, which makes the alternative somewhat irrelevant, as it is also costly to derive.) An open question after reading the paper stands with the optimal version of the generic estimator (5), although finding the optimum may require more computing time than it is worth spending. Another one is whether or not this version of the expected conditional variance is more interesting (computation-wise) that the difference between the variance and the expected conditional variance as reproduced in (3) given that both quantities can equally be approximated by unbiased Monte Carlo…

at CIRM

Posted in Books, Mountains, Running, Statistics, Travel, University life, Wines with tags , , , , , , , , , on March 1, 2016 by xi'an

Crêt Saint-Michel, Morgiou, Marseille, June 8, 2010Thanks to a very early start from Paris, and despite horrendous traffic jams in Marseilles, I managed to reach CIRM with ten minutes to spare before my course. After my one-hour class, I was suddenly made aware of the (simplistic) idea that the slice sampling uniforms are simply auxiliary, meaning they can be used in many different ways.

I noticed Natesh Pillai just arXived an extension of his earlier Cauchy paper with XL. He proves that the result on the Cauchy distribution of any convex combination of normal ratios still holds when the pair of vectors is distributed from a product of elliptically symmetric functions. Some of Natesh’s remarks reminded me of the 1970 Sankhyã paper by Kelker on spherically symmetric variables. Especially because of Kelker’s characterisation of elliptically symmetric functions as scale mixtures of normals, which makes perfect sense since the scale cancels.

As I skimmed through my slides yesterday, fearing everyone knew about the MCMC basics, I decided to present today the Rao-Blackwellisation slides I gave in Warwick a few months ago.

locally weighted MCMC

Posted in Books, Statistics, University life with tags , , , , , , , , on July 16, 2015 by xi'an

Street light near the St Kilda Road bridge, Melbourne, July 21, 2012Last week, on arXiv, Espen Bernton, Shihao Yang, Yang Chen, Neil Shephard, and Jun Liu (all from Harvard) proposed a weighting scheme to associated MCMC simulations, in connection with the parallel MCMC of Ben Calderhead discussed earlier on the ‘Og. The weight attached to each proposal is either the acceptance probability itself (with the rejection probability being attached to the current value of the MCMC chain) or a renormalised version of the joint target x proposal, either forward or backward. Both solutions are unbiased in that they have the same expectation as the original MCMC average, being some sort of conditional expectation. The proof of domination in the paper builds upon Calderhead’s formalism.

This work reminded me of several reweighting proposals we made over the years, from the global Rao-Blackwellisation strategy with George Casella, to the vanilla Rao-Blackwellisation solution we wrote with Randal Douc a few years ago, both of whom also are demonstrably improving upon the standard MCMC average. By similarly recycling proposed but rejected values. Or by diminishing the variability due to the uniform draw. The slightly parallel nature of the approach also connects with our parallel MCM version with Pierre Jacob (now Harvard as well!) and Murray Smith (who now leaves in Melbourne, hence the otherwise unrelated picture).