**C**olin Wei and Iain Murray arXived a new version of their paper on doubly-intractable distributions, which is to be presented at AISTATS. It builds upon the Russian roulette estimator of Lyne et al. (2015), which itself exploits the debiasing technique of McLeish et al. (2011) [found earlier in the physics literature as in Carter and Cashwell, 1975, according to the current paper]. Such an unbiased estimator of the inverse of the normalising constant can be used for pseudo-marginal MCMC, except that the estimator is sometimes negative and has to be so, as proved by Pierre Jacob and co-authors. As I discussed in my post on the Russian roulette estimator, replacing the negative estimate with its absolute value does not seem right because a negative value indicates that the quantity is close to zero, hence replacing it with zero would sound more appropriate.

Wei and Murray start from the property that, while the expectation of the importance weight is equal to the normalising constant, the expectation of the inverse of the importance weight converges to the inverse of the normalising constant for an MCMC chain. This however sounds like a harmonic mean estimate because the property would also stand for any substitute to the importance density, as it only requires the density to integrate to one… As noted in the paper, the variance of the resulting Roulette estimator “will be high” or even infinite.

Following Glynn et al. (2014), the authors build a coupled version of that solution, whose key feature is to cut the higher order terms in the debiasing estimator. This does not guarantee finite variance or positivity of the estimate, though. In order to decrease the variance (assuming it is finite), backward coupling is introduced, with a Rao-Blackwellisation step using our 1996 Biometrika derivation, which happens to be of lower cost than the standard Rao-Blackwellisation in that special case, O(N) versus O(N²), N being the stopping rule used in the debiasing estimator.

Under the assumption that the *inverse* importance weight has finite expectation [wrt the importance density], the resulting backward-coupling Russian roulette estimator can be proven to be unbiased, as it enjoys a finite expectation. (As in the generalised harmonic mean case, the constraint imposes thinner tails on the importance function, which then hampers the convergence of the MCMC chain.) No mention is made of achieving finite variance for those estimators, which again is a serious concern due to the similarity with harmonic means…
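To make the debiasing idea concrete, here is a minimal sketch of a Russian roulette estimate of an inverse normalising constant. It assumes the geometric series 1/Z = (1/c) Σₖ (1 − Z/c)ᵏ, with each power replaced by a product of independent unbiased estimates of Z and the series truncated at a random geometric time. The toy target (Z = 2 with uniform multiplicative noise), the constant `c`, the continuation probability and all function names are illustrative assumptions, not the construction used in the paper.

```python
import random


def roulette_inverse(sample_z_hat, c, p_continue, rng):
    """One Russian roulette draw whose expectation is 1/Z.

    sample_z_hat(rng) must return an unbiased estimate of Z, and c must be
    such that |1 - Z/c| < 1 so that the geometric series converges.
    """
    total = 1.0      # k = 0 term, always included
    product = 1.0    # running product of (1 - z_hat_i / c)
    survival = 1.0   # P(N >= k) = p_continue ** k
    while rng.random() < p_continue:
        product *= 1.0 - sample_z_hat(rng) / c
        survival *= p_continue
        total += product / survival   # reweight term k by 1 / P(N >= k)
    return total / c


def toy_z_hat(rng):
    """Hypothetical noisy but unbiased estimate of Z = 2."""
    return 2.0 * rng.uniform(0.5, 1.5)


rng = random.Random(0)
draws = [roulette_inverse(toy_z_hat, 2.5, 0.6, rng) for _ in range(100_000)]
print(sum(draws) / len(draws))  # should be close to 1/Z = 0.5
```

Each factor `1 - z_hat / c` turns negative whenever an estimate exceeds `c`, so nothing in the construction forces an individual draw to be positive, which is the issue raised above. The variance depends on how `p_continue` compares with the second moment of the factors, and the sketch offers no guarantee that it stays finite outside this toy setting.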

## Archive for Biometrika

## Russian roulette still rolling

Posted in Statistics with tags AISTATS 2017, Biometrika, coupling, debiasing, doubly intractable problems, harmonic mean estimator, MCMC, MCMC algorithm, normalising constant, Peter Glynn, pseudo-marginal MCMC, Rao-Blackwellisation, Russian roulette on March 22, 2017 by xi'an

## Wilfred Keith Hastings [1930-2016]

Posted in Books, Mountains, pictures, Statistics, Travel, University life with tags Bell Labs, Biometrika, Canada, Julian Besag, Metropolis-Hastings algorithm, obituary, Peskun ordering, University of Canterbury, University of Victoria, Victoria, Wilfred Keith Hastings on December 9, 2016 by xi'an

**A** few days ago I found on the page Jeff Rosenthal has dedicated to Hastings that he passed away peacefully on May 13, 2016 in Victoria, British Columbia, where he lived for 45 years as a professor at the University of Victoria, after holding positions at the University of Toronto, the University of Canterbury (New Zealand), and Bell Labs (New Jersey). As pointed out by Jeff, Hastings’ main paper is his 1970 Biometrika description of Markov chain Monte Carlo methods, *Monte Carlo sampling methods using Markov chains and their applications*, which would take close to twenty years to become known to the statistics world at large, although you can trace a path through Peskun (his only PhD student), Besag and others. I am sorry it took so long to come to my knowledge and also sorry it apparently went unnoticed by most of the computational statistics community.

## Savage-Dickey supermodels

Posted in Books, Mountains, pictures, Statistics, Travel, University life with tags astrostatistics, Bayes factor, Biometrika, Brad Carlin, bridge sampling, cosmology, encompassing model, MCMC, mixtures of distributions, nested sampling, Péru, Sid Chib on September 13, 2016 by xi'an

**A**. Mootoovaloo, B. Bassett, and M. Kunz just arXived a paper on the computation of Bayes factors by the Savage-Dickey representation through a supermodel (or encompassing model). (I wonder why Savage-Dickey is so popular in astronomy and cosmology statistical papers and not so much elsewhere.) Recall that the trick is to write the Bayes factor in favour of the encompassing model as the ratio of the posterior and of the prior for the tested parameter (thus eliminating nuisance or common parameters) at its null value,

B^{10}=π(φ⁰|x)/π(φ⁰).

Modulo some continuity constraints on the prior density, and the assumption that the conditional prior on the nuisance parameter is the same under the null model and the encompassing model [given the null value φ⁰]. If this sounds confusing or even shocking from a mathematical perspective, check the numerous previous entries on this topic on the ‘Og!

The supermodel created by the authors is a mixture of the original models, as in our paper, and… *hold the presses!*, it is a mixture of the likelihood functions, as in Phil O’Neill’s and Theodore Kypraios’ paper. Which is not mentioned in the current paper and should obviously be. In the current representation, the posterior distribution on the mixture weight α is a linear function of α involving both evidences, α(m¹-m²)+m², times the artificial prior on α. The resulting estimator of the Bayes factor thus shares features with bridge sampling, reversible jump, and the importance sampling version of nested sampling we developed in our Biometrika paper. In addition to O’Neill and Kypraios’s solution.
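As a toy illustration of the linear posterior on the mixture weight, the sketch below plugs in two known evidences, puts a uniform prior on α, samples α by rejection, and maps the posterior mean back to the ratio m¹/m². With a uniform prior the posterior mean is (2B+1)/(3(B+1)) for B = m¹/m², which inverts to B = (3E−1)/(2−3E). The evidence values, the uniform prior and the function names are assumptions made for this example only. In a realistic setting the evidences are unknown and α is simulated jointly with the model parameters.

```python
import random


def posterior_mean_alpha(m1, m2, n_draws, rng):
    """Monte Carlo posterior mean of alpha when p(alpha | x) is proportional
    to alpha * (m1 - m2) + m2 on [0, 1] (uniform prior on alpha)."""
    bound = max(m1, m2)   # the linear density is maximal at an endpoint
    total = 0.0
    accepted = 0
    while accepted < n_draws:
        alpha = rng.random()                      # uniform proposal
        if rng.random() * bound <= alpha * (m1 - m2) + m2:
            total += alpha
            accepted += 1
    return total / n_draws


def bayes_factor_from_mean(mean_alpha):
    """Invert E[alpha | x] = (2B + 1) / (3 (B + 1)) for B = m1 / m2."""
    return (3.0 * mean_alpha - 1.0) / (2.0 - 3.0 * mean_alpha)


rng = random.Random(0)
mean_alpha = posterior_mean_alpha(3.0, 1.0, 200_000, rng)
print(bayes_factor_from_mean(mean_alpha))  # should be near m1 / m2 = 3
```

The posterior mean is confined to (1/3, 2/3) whatever the evidences, and the map back to B has slope 3/(2−3E)², which grows quickly as E approaches either end. Small Monte Carlo errors on α are therefore amplified in the Bayes factor, at least under this uniform-prior assumption.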

The following quote is inaccurate since the MCMC algorithm needs simulating the parameters of the compared models in realistic settings, hence representing the multidimensional integrals by Monte Carlo versions.

“Though we have a clever way of avoiding multidimensional integrals to calculate the Bayesian Evidence, this new method requires very efficient sampling and for a small number of dimensions is not faster than individual nested sampling runs.”

I actually wonder at the sheer rationale of running an intensive MCMC sampler in such a setting, when the weight α is completely artificial. It is only used to jump from one model to the next, which sounds quite inefficient when compared with simulating from both models separately and independently. This approach can also be seen as a special case of Carlin’s and Chib’s (1995) alternative to reversible jump. Using instead the Savage-Dickey representation is of course infeasible. Which makes the overall reference to this method rather inappropriate in my opinion. Further, the examples processed in the paper all involve (natural) embedded models where the original Savage-Dickey approach applies. Creating an additional model to apply a pseudo-Savage-Dickey representation does not sound very compelling…

Incidentally, the paper also includes a discussion of a weird notion, the likelihood of the Bayes factor, B¹², which is plotted as a distribution in B¹², most strangely. The only other place I met this notion is in Murray Aitkin’s book. Something’s unclear there or in my head!

“One of the fundamental choices when using the supermodel approach is how to deal with common parameters to the two models.”

This is an interesting question, although maybe not so relevant for the Bayes factor issue where it should not matter. However, as in our paper, multiplying the number of parameters in the encompassing model may hinder convergence of the MCMC chain or reduce the precision of the approximation of the Bayes factor. Again, from a Bayes factor perspective, this does not matter [while it does in our perspective].

## Turing’s Bayesian contributions

Posted in Books, Kids, pictures, Running, Statistics, University life with tags Alan Turing, Banbury, Biometrika, Bletchley Park, Cryptonomicon, England, Enigma code machine, I.J. Good, Kullback-Leibler divergence, missing species problem, Shannon's information, statistical evidence, WW II on March 17, 2015 by xi'an

**F**ollowing The Imitation Game, this recent movie about Alan Turing played by Benedict “Sherlock” Cumberbatch, having been aired in French theatres, one of my colleagues in Dauphine asked me about the Bayesian contributions of Turing. I first tried to check in Sharon McGrayne’s book, but realised it had vanished from my bookshelves, presumably lent to someone a while ago. *(Please return it at your earliest convenience!)* So I told him about the Bayesian principle of updating priors with data and prior probabilities with likelihood evidence in code detecting algorithms and ultimately machines at Bletchley Park… I could not get much farther than that and hence went checking on the Internet for more fodder.

“Turing was one of the independent inventors of sequential analysis for which he naturally made use of the logarithm of the Bayes factor.” (p.393)

I came upon a few interesting entries but the most amazing one was a 1979 note by I.J. Good (assistant of Turing during the War) published in *Biometrika* retracing the contributions of Alan Mathison Turing during the War. From those few pages, it emerges that Turing’s statistical ideas revolved around the Bayes factor that Turing used “without the qualification `Bayes’.” (p.393) He also introduced the notion of ban as a unit for the weight of evidence, in connection with the town of Banbury (UK) where specially formatted sheets of papers were printed “for carrying out an important classified process called Banburismus” (p.394). Which shows that even in 1979, Good did not dare to get into the details of Turing’s work during the War… And explains why he was testing simple statistical hypothesis against simple statistical hypothesis. Good also credits Turing for the expected weight of evidence, which is another name for the Kullback-Leibler divergence and for the information of Shannon, whom Turing would visit in the U.S. after the War. In the final sections of the note, Turing is also associated with Gini’s index, the estimation of the number of species (processed by Good from Turing’s suggestion in a 1953 Biometrika paper, that is, prior to Turing’s suicide. In fact, Good states in this paper that “a very large part of the credit for the present paper should be given to [Turing]”, p.237), and empirical Bayes.
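For readers unfamiliar with the unit, a small sketch of the weight of evidence in bans when testing a simple hypothesis against another simple hypothesis may help. It takes the ban as the base-10 logarithm of the Bayes factor, as the unit is usually defined, and uses a hypothetical coin example (probability of heads 0.6 under H1 against 0.5 under H0). Function names and numbers are illustrative and are not taken from Good's note.

```python
import math


def weight_of_evidence_bans(p_h1, p_h0):
    """Weight of evidence of one observation for H1 against H0, in bans
    (log10 of the likelihood ratio, i.e. of the Bayes factor)."""
    return math.log10(p_h1 / p_h0)


def expected_weight_of_evidence(dist_h1, dist_h0):
    """Expected weight of evidence under H1, in bans. This is the
    Kullback-Leibler divergence from H1 to H0 with base-10 logarithms."""
    return sum(p * math.log10(p / q) for p, q in zip(dist_h1, dist_h0) if p > 0)


# Hypothetical coin: P(heads) = 0.6 under H1, 0.5 under H0.
prob_h1 = {"H": 0.6, "T": 0.4}
prob_h0 = {"H": 0.5, "T": 0.5}

# Weights of evidence add up over independent observations.
tosses = "HHT"
total_bans = sum(weight_of_evidence_bans(prob_h1[t], prob_h0[t]) for t in tosses)
print(10 * total_bans, "decibans")

print(expected_weight_of_evidence([0.6, 0.4], [0.5, 0.5]), "bans per toss")
```

Working on the logarithmic scale turns the sequential updating of the Bayes factor into a running sum, which fits the quote above about Turing naturally using the logarithm of the Bayes factor for sequential analysis.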

## proper likelihoods for Bayesian analysis

Posted in Books, Statistics, University life with tags ABC, approximate likelihood, asymptotic normality, Bayesian Analysis, Biometrika, Montpellier, Padova, summary statistics on April 11, 2013 by xi'an

**W**hile in Montpellier yesterday (where I also had the opportunity of tasting an excellent local wine!), I had a look at the 1992 Biometrika paper by Monahan and Boos on “*Proper likelihoods for Bayesian analysis*”. This is a paper I missed and that was pointed out to me during the discussions in Padova. The main point of this short paper is to decide when a method based on an approximate likelihood function is truly (or properly) Bayes. Just the very question a bystander would ask of ABC methods, wouldn’t it?! The validation proposed by Monahan and Boos is one of calibration of credible sets, just as in the recent arXiv paper of Dennis Prangle, Michael Blum, G. Popovic and Scott Sisson I reviewed three months ago. The idea is indeed to check by simulation that the true posterior coverage of an α-level set equals the nominal coverage α. In other words, the predictive based on the likelihood approximation should be uniformly distributed and this leads to a goodness-of-fit test based on simulations. As in our ABC model choice paper, *Proper likelihoods for Bayesian analysis* notices that Bayesian inference drawn upon an insufficient statistic is proper and valid, simply less accurate than the Bayesian inference drawn upon the whole dataset. The paper also states a conjecture:

An [approximate] likelihood L is a coverage proper Bayesian likelihood if and only if L has the form L(y|θ) = c(s) g(s|θ) where s=S(y) is a statistic with density g(s|θ) and c(s) some function depending on s alone.

conjecture that sounds incorrect in that noisy ABC is also well-calibrated. (I am not 100% sure of this argument, though.) An interesting section covers the case of pivotal densities as substitute likelihoods and of the confusion created by the double meaning of the parameter θ. The last section is also connected with ABC in that Monahan and Boos reflect on the use of large sample approximations, like normal distributions for estimates of θ which are a special kind of statistics, but do not report formal results on the asymptotic validation of such approximations. All in all, a fairly interesting paper!

**R**eading this highly interesting paper also made me realise that the criticism I had made in my review of Prangle et al. about the difficulty for this calibration method to address the issue of summary statistics was incorrect: when using the true likelihood function, the use of an arbitrary summary statistics is validated by this method and is thus proper.
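The coverage check described in this post can be sketched by simulation: draw θ from the prior, draw data given θ, build the credible interval from the working likelihood, and record how often θ falls inside. The example below assumes a conjugate normal model (θ ~ N(0,1), observations N(θ,1)) and a hypothetical misspecified likelihood that uses the wrong observation variance. None of these choices comes from Monahan and Boos.

```python
import math
import random
from statistics import NormalDist


def coverage(level, lik_var, n_obs, n_sims, rng):
    """Frequency with which the central credible interval of nominal level
    `level`, built from a normal likelihood with variance `lik_var`, contains
    the theta that generated the data (true observation variance is 1)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    precision = 1.0 + n_obs / lik_var          # N(0, 1) prior plus likelihood
    half_width = z / math.sqrt(precision)
    hits = 0
    for _ in range(n_sims):
        theta = rng.gauss(0.0, 1.0)                        # prior draw
        y_bar = rng.gauss(theta, 1.0 / math.sqrt(n_obs))   # sufficient statistic
        post_mean = (n_obs / lik_var) * y_bar / precision
        if abs(theta - post_mean) <= half_width:
            hits += 1
    return hits / n_sims


rng = random.Random(0)
print(coverage(0.9, 1.0, 10, 20_000, rng))   # exact likelihood
print(coverage(0.9, 0.25, 10, 20_000, rng))  # variance understated
```

With the exact likelihood the empirical coverage should match the nominal level up to Monte Carlo error, while the understated variance produces intervals that are too narrow and cover less often than advertised.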

## discussione a Padova

Posted in Statistics, University life with tags ABC, ABC model choice, Biometrika, calibration, credible set, discussion, empirical Bayes methods, empirical likelihood, Italia, Madrid, Padova, Trieste, Università degli studi di Padova on March 25, 2013 by xi'an

**H**ere are the slides of my talk in Padova for the workshop Recent Advances in statistical inference: theory and case studies (very similar to the slides for the Varanasi and Gainesville meetings, obviously!, with Peter Müller commenting [at last!] that I had picked the wrong photos from Khajuraho!)

**T**he worthy Padova addendum is that I had two discussants, Stefano Cabras from Universidad Carlos III in Madrid, whose slides are :

and Francesco Pauli, from Trieste, whose slides are:

**T**hese were kind and rich discussions with many interesting openings: Stefano’s idea of estimating the pivotal function *h* is opening new directions, obviously, as it indicates an additional degree of freedom in calibrating the method. Esp. when considering the high variability of the empirical likelihood fit depending on the function *h*. For instance, one could start with a large collection of candidate functions and build a regression or a principal component reparameterisation from this collection… (Actually I did not get point #1 about ignoring *f*: the empirical likelihood is by essence ignoring anything outside the identifying equation, so as long as the equation is valid..) Point #2: Opposing sample free and simulation free techniques is another interesting avenue, although I would not say ABC is “sample free”. As to point #3, I will certainly get a look at Monahan and Boos (1992) to see if this can drive the choice of a specific type of pseudo-likelihoods. I like the idea of checking the “coverage of posterior sets” and even more “the likelihood must be the density of a statistic, not necessarily sufficient” as it obviously relates to our current ABC model comparison work… Esp. when the very same paper is mentioned by Francesco as well. **Grazie, Stefano!** I also appreciate the survey made by Francesco on the consistency conditions, because I think this is an important issue that should be taken into consideration when designing ABC algorithms. (Just pointing out again that, in the theorem of Fearnhead and Prangle (2012) quoting Bernardo and Smith (1992), some conditions are missing for the mathematical consistency to apply.) I also like the agreement we seem to reach about ABC being evaluated per se rather than as a poor man’s Bayesian method.
Francesco’s analysis of Monahan and Boos (1992) as validating or not empirical likelihood points out a possible link with the recent coverage analysis of Prangle et al., discussed on the ‘Og a few weeks ago. And an unsuspected link with Larry Wasserman!

*Grazie, Francesco!*