This afternoon, at JSM 2015, in Seattle, we had the Bayesian Computation I and II sessions that Omiros Papaspiliopoulos and myself put together (sponsored by IMS and ISBA). Despite this being Sunday and hence having some of the participants still arriving, the sessions went on well in terms of audience. Thanks to Mark Girolami’s strict presidency, we were so much on time in Bayesian Computation I that we had 20mn left for a floor discussion that turned into a speakers’ discussion! All talks were of obvious interest for MCMCists, but Ryan Adams’ presentation on firefly Monte Carlo got me thinking for most of the afternoon on different ways of exploiting the existence of a bound on the terms composing the target. With little to show by the end of the afternoon! On the mundane side, I was sorry to miss Pierre Jacob, who was still in France due to difficulties in obtaining a working visa for Harvard (!), and surprised to see Dawn Woodard wearing a Uber tee-shirt, until she told us she was now working at Uber! Which a posteriori makes sense, given her work on traffic predictions!
Archive for MCMC algorithms
“…our findings shall lead to us be critical of certain current practices. Specifically, most papers seem content with comparing some new algorithm with Gibbs sampling, on a few small datasets, such as the well-known Pima Indians diabetes dataset (8 covariates). But we shall see that, for such datasets, approaches that are even more basic than Gibbs sampling are actually hard to beat. In other words, datasets considered in the literature may be too toy-like to be used as a relevant benchmark. On the other hand, if ones considers larger datasets (with say 100 covariates), then not so many approaches seem to remain competitive” (p.1)
Nicolas Chopin and James Ridgway (CREST, Paris) completed and arXived a paper they had “threatened” to publish for a while now, namely why using the Pima Indian R logistic or probit regression benchmark for checking a computational algorithm is not such a great idea! Given that I am definitely guilty of such a sin (in papers not reported in the survey), I was quite eager to read the reasons why! Beyond the debate on the worth of such a benchmark, the paper considers a wider perspective as to how Bayesian computation algorithms should be compared, including the murky waters of CPU time versus designer or programmer time. Which plays against most MCMC sampler.
As a first entry, Nicolas and James point out that the MAP can be derived by standard a Newton-Raphson algorithm when the prior is Gaussian, and even when the prior is Cauchy as it seems most datasets allow for Newton-Raphson convergence. As well as the Hessian. We actually took advantage of this property in our comparison of evidence approximations published in the Festschrift for Jim Berger. Where we also noticed the awesome performances of an importance sampler based on the Gaussian or Laplace approximation. The authors call this proposal their gold standard. Because they also find it hard to beat. They also pursue this approximation to its logical (?) end by proposing an evidence approximation based on the above and Chib’s formula. Two close approximations are provided by INLA for posterior marginals and by a Laplace-EM for a Cauchy prior. Unsurprisingly, the expectation-propagation (EP) approach is also implemented. What EP lacks in theoretical backup, it seems to recover in sheer precision (in the examples analysed in the paper). And unsurprisingly as well the paper includes a randomised quasi-Monte Carlo version of the Gaussian importance sampler. (The authors report that “the improvement brought by RQMC varies strongly across datasets” without elaborating for the reasons behind this variability. They also do not report the CPU time of the IS-QMC, maybe identical to the one for the regular importance sampling.) Maybe more surprising is the absence of a nested sampling version.
In the Markov chain Monte Carlo solutions, Nicolas and James compare Gibbs, Metropolis-Hastings, Hamiltonian Monte Carlo, and NUTS. Plus a tempering SMC, All of which are outperformed by importance sampling for small enough datasets. But get back to competing grounds for large enough ones, since importance sampling then fails.
“…let’s all refrain from now on from using datasets and models that are too simple to serve as a reasonable benchmark.” (p.25)
This is a very nice survey on the theme of binary data (more than on the comparison of algorithms in that the authors do not really take into account design and complexity, but resort to MSEs versus CPus). I however do not agree with their overall message to leave the Pima Indians alone. Or at least not for the reason provided therein, namely that faster and more accurate approximations methods are available and cannot be beaten. Benchmarks always have the limitation of “what you get is what you see”, i.e., the output associated with a single dataset that only has that many idiosyncrasies. Plus, the closeness to a perfect normal posterior makes the logistic posterior too regular to pause a real challenge (even though MCMC algorithms are as usual slower than iid sampling). But having faster and more precise resolutions should on the opposite be cause for cheers, as this provides a reference value, a golden standard, to check against. In a sense, for every Monte Carlo method, there is a much better answer, namely the exact value of the integral or of the optimum! And one is hardly aiming at a more precise inference for the benchmark itself: those Pima Indians [whose actual name is Akimel O’odham] with diabetes involved in the original study are definitely beyond help from statisticians and the model is unlikely to carry out to current populations. When the goal is to compare methods, as in our 2009 paper for Jim Berger’s 60th birthday, what matters is relative speed and relative ease of implementation (besides the obvious convergence to the proper target). In that sense bigger and larger is not always relevant. Unless one tackles really big or really large datasets, for which there is neither benchmark method nor reference value.
While visiting Dauphine, Natesh Pillai and Aaron Smith pointed out this interesting paper of Joris Bierkens (Warwick) that had escaped my arXiv watch/monitoring. The paper is about turning Metropolis-Hastings algorithms into non-reversible versions, towards improving mixing.
In a discrete setting, a way to produce a non-reversible move is to mix the proposal kernel Q with its time-reversed version Q’ and use an acceptance probability of the form
where ε is any weight. This construction is generalised in the paper to any vorticity (skew-symmetric with zero sum rows) matrix Γ, with the acceptance probability
where ε is small enough to ensure all numerator values are non-negative. This is a rather annoying assumption in that, except for the special case derived from the time-reversed kernel, it has to be checked over all pairs (x,y). (I first thought it also implied the normalising constant of π but everything can be set in terms of the unormalised version of π, Γ or ε included.) The paper establishes that the new acceptance probability preserves π as its stationary distribution. An alternative construction is to make the proposal change from Q in H such that H(x,y)=Q(x,y)+εΓ(x,y)/π(x). Which seems more pertinent as not changing the proposal cannot improve that much the mixing behaviour of the chain. Still, the move to the non-reversible versions has the noticeable plus of decreasing the asymptotic variance of the Monte Carlo estimate for any integrable function. Any. (Those results are found in the physics literature of the 2000’s.)
The extension to the continuous case is a wee bit more delicate. One needs to find an anti-symmetric vortex function g with zero integral [equivalent to the row sums being zero] such that g(x,y)+π(y)q(y,x)>0 and with same support as π(x)q(x,y) so that the acceptance probability of g(x,y)+π(y)q(y,x)/π(x)q(x,y) leads to π being the stationary distribution. Once again g(x,y)=ε(π(y)q(y,x)-π(x)q(x,y)) is a natural candidate but it is unclear to me why it should work. As the paper only contains one illustration for the discretised Ornstein-Uhlenbeck model, with the above choice of g for a small enough ε (a point I fail to understand since any ε<1 should provide a positive g(x,y)+π(y)q(y,x)), it is also unclear to me that this modification (i) is widely applicable and (ii) is relevant for genuine MCMC settings.
A question on Cross Validated led me to realise I had never truly considered the issue of periodic Gibbs samplers! In MCMC, non-aperiodic chains are a minor nuisance in that the skeleton trick of randomly subsampling the Markov chain leads to a aperiodic Markov chain. (The picture relates to the skeleton!) Intuitively, while the systematic Gibbs sampler has a tendency to non-reversibility, it seems difficult to imagine a sequence of full conditionals that would force the chain away from the current value..!In the discrete case, given that the current state of the Markov chain has positive probability for the target distribution, the conditional probabilities are all positive as well and hence the Markov chain can stay at its current value after one Gibbs cycle, with positive probabilities, which means strong aperiodicity. In the continuous case, a similar argument applies by considering a neighbourhood of the current value. (Incidentally, the same person asked a question about the absolute continuity of the Gibbs kernel. Being confused by our chapter on the topic!!!)
With my friends Peter Green (Bristol), Krzysztof Łatuszyński (Warwick) and Marcello Pereyra (Bristol), we just arXived the first version of “Bayesian computation: a perspective on the current state, and sampling backwards and forwards”, which first title was the title of this post. This is a survey of our own perspective on Bayesian computation, from what occurred in the last 25 years [a lot!] to what could occur in the near future [a lot as well!]. Submitted to Statistics and Computing towards the special 25th anniversary issue, as announced in an earlier post.. Pulling strength and breadth from each other’s opinion, we have certainly attained more than the sum of our initial respective contributions, but we are welcoming comments about bits and pieces of importance that we miss and even more about promising new directions that are not posted in this survey. (A warning that is should go with most of my surveys is that my input in this paper will not differ by a large margin from ideas expressed here or in previous surveys.)
Following the previous post, I went and had a (long) look at Puolamäki and Kaski’s paper. I must acknowledge that, despite having several runs through the paper, I still have trouble with the approach… From what I understand, the authors use a Bernoulli mixture pseudo-model to reallocate the observations to components. That is, given an MCMC output with simulated allocations variables (a.k.a., hidden or latent variables), they create a (TxK)xn matrix of component binary indicators e.g., for a three component mixture,
0 1 0 0 1 0…
1 0 0 0 0 0…
0 0 1 1 0 1…
0 1 0 0 1 1…
and estimate a probability to be in component j for each of the n observations, according to the (pseudo-)likelihood
It took me a few days, between morning runs and those wee hours when I cannot get back to sleep (!), to make some sense of this Bernoulli modelling. The allocation vectors are used together to estimate the probabilities of being “in” component j together. However the data—which is the outcome of an MCMC simulation and de facto does not originate from that Bernoulli mixture—does not seem appropriate, both because it is produced by an MCMC simulation and is made of blocks of highly correlated rows [which sum up to one]. The Bernoulli likelihood above also defines a new model, with many more parameters than in the original mixture model. And I fail to see why perfect, partial or inexistent label switching [in the MCMC sequence] is not going to impact the estimation of the Bernoulli mixture. And why an argument based on a fixed parameter value (Theorem 3) extends to an MCMC outcome where parameters themselves are subjected to some degree of label switching. Bemused, I remain…