## firefly Monte Carlo

**A**nd here is yet another arXived paper using a decomposition of the posterior distribution into a product of terms to run faster, better and higher MCMC algorithms! This one is by Dougal Maclaurin and Ryan Adams: “Firefly Monte Carlo: Exact MCMC with Subsets of Data“. (While a swarm of fireflies makes sense as an explanation for the name, I may be missing some cultural subliminal meaning in the title, as Firefly and Monte Carlo seem to be places in Las Vegas (?), as well as a car brand; Firefly is also a TV series, a clothes brand, and maybe other things…)

“The evolution of the chain evokes an image of fireflies, as the individual data blink on and off due to updates of the z_{n}.”

**T**he fundamental assumption of Maclaurin and Adams’ approach is that each term in the likelihood (expressed as a product) can be bounded from below by a cheaper function. This lower bound is used to create a Bernoulli auxiliary variable with probability equal to the ratio of the lower bound to the likelihood term, an auxiliary variable that helps reduce the number of evaluations of the original likelihood terms. Obviously, there is a gain only if (a) the lower bound is close or tight enough and (b) simulating the auxiliary variables is cheap enough.
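The mechanism can be sketched roughly as follows; `firefly_step`, `log_lik` and `log_bound` are hypothetical names, and the bright probability 1 − B_n/L_n mirrors the paper’s auxiliary-variable construction (note that this basic Gibbs refresh still pays one exact-likelihood evaluation per refreshed z_{n}):

```python
import numpy as np

def firefly_step(theta, z, log_lik, log_bound, resample_frac=0.1, rng=None):
    """One sweep of brightness updates (hypothetical helper names).

    log_lik(theta, n)   -> log L_n(theta), the exact n-th likelihood term
    log_bound(theta, n) -> log B_n(theta), a cheap lower bound with B_n <= L_n
    z                   -> boolean array, z[n] = True meaning point n is "bright"
    """
    rng = rng or np.random.default_rng()
    N = len(z)
    # Only a small fraction of the z_n's is refreshed per sweep, since each
    # refresh costs one exact-likelihood evaluation.
    idx = rng.choice(N, size=max(1, int(resample_frac * N)), replace=False)
    for n in idx:
        # Bright with probability 1 - B_n/L_n: a tight bound keeps most points dark.
        p_bright = 1.0 - np.exp(log_bound(theta, n) - log_lik(theta, n))
        z[n] = rng.random() < p_bright
    return z
```

When the bound is exactly tight, every refreshed point goes dark, and only the dark points’ cheap bounds enter the likelihood evaluation at the next θ move.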

**A**bout (a), the paper gives the tight example of a logistic likelihood, with a case of 98% tightness. How generic is that, and how can those bounds be derived in a cheap or automated manner? If one needs to run a variational Bayes approximation first, the gain in efficiency is unlikely to hold. About (b), I do not fully get it: if generating z_{n} requires the evaluation of the original likelihood, we lose the entire appeal of the method. Admittedly, I can see the point in changing only a very small portion α of the z_{n}‘s between moves on the parameter θ, since the number of likelihood evaluations is then the same portion α of the total number of terms N. But decreasing the portion α also reduces the mixing efficiency of the algorithm. In the efficient ways of updating the auxiliary brightness variables proposed in the paper, I get the idea of making a proposal first before eventually computing the true probability of the Bernoulli. A proposal making use of the previous value of the probability (i.e., for the previous value of the parameter θ) could also reduce the number of evaluations of likelihood terms. However, using a “cached” version of the likelihood is only relevant within the same simulation step, since a change in θ requires recomputing the likelihood.
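For the logistic case, a standard candidate is the Jaakkola–Jordan quadratic lower bound on log σ(x), which is exact at x = ±ξ; a minimal numerical check of its validity and tightness (an illustration of such a bound, not necessarily the exact one used in the paper):

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log of the logistic function.
    return -np.logaddexp(0.0, -x)

def jj_lower_bound(x, xi):
    """Jaakkola-Jordan quadratic lower bound on log sigma(x).

    Valid for all x, and tight (exact) at x = +/- xi, which is what makes
    high tightness figures achievable when xi is well chosen per data point.
    """
    lam = np.tanh(xi / 2.0) / (4.0 * xi)
    return log_sigmoid(xi) + (x - xi) / 2.0 - lam * (x**2 - xi**2)
```

Being quadratic in x (hence Gaussian-shaped in θ for a linear model), the bound is also the kind that can later be pooled across data points.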

“In each experiment we compared FlyMC, with two choices of bound selection, to regular full-posterior MCMC. We looked at the average number of likelihoods queried at each iteration and the number of effective samples generated per iteration, accounting for autocorrelation.”

**T**his comparison does not seem adequate to me: by construction, the algorithm in the paper reduces the number of likelihood evaluations, so this is not a proper comparative instrument. The effective sample size is a transform of the autocorrelation, not an indicator of convergence. For instance, if the z_{n}‘s hardly changed between iterations, so that the overall sampler was definitely far from converging, we would still get θ’s simulated from almost the same distribution, hence nearly uncorrelated. In other words, if the joint chain in (θ,z_{n}) does not converge, it is harder to establish that the subchain in θ converges at all. Indeed, in this logistic example, where the computation of the likelihood is not a massive constraint, I am surprised there is any possibility of a huge gain in using the method, unless the lower bound is essentially the likelihood, which is actually the case for logistic regression models. Another point, made by Dan Simpson, is that the whole dataset needs to remain on hold, full time, which may be a challenge for computer memory, and which stops short of providing really Big Data solutions.
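To make the ESS point concrete, here is a minimal initial-positive-sequence estimator of the effective sample size: it is purely a function of the chain’s autocorrelations, so a chain exploring only a small region (e.g., under frozen z_{n}‘s) can still report a large ESS:

```python
import numpy as np

def effective_sample_size(x):
    """ESS = n / (1 + 2 * sum of positive-prefix autocorrelations).

    This is purely a transform of the chain's autocorrelations: it says
    nothing about whether the chain has reached its stationary distribution.
    """
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    # Empirical autocorrelation function, normalized so that acf[0] == 1.
    acf = np.correlate(x, x, mode="full")[n - 1:] / np.sum(x * x)
    tau = 1.0
    for k in range(1, n):
        if acf[k] <= 0.0:  # truncate at the first non-positive lag
            break
        tau += 2.0 * acf[k]
    return n / tau
```

An i.i.d. sequence scores an ESS close to its length, a strongly autocorrelated chain much less, but neither number distinguishes a converged chain from one stuck near its starting point.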

May 15, 2016 at 8:10 pm

Thanks for the post about this paper. Even though I like the idea and contributions of the paper, I was wondering about any possible drawbacks, and it’s sometimes hard for me to do that without spending exorbitant amounts of time dissecting the material.

February 9, 2015 at 10:06 pm

So what are your comments on comparing this paper to the previous one, “Speeding up MCMC by efficient data subsampling”? Thanks!

April 2, 2014 at 3:09 am

Thanks for taking the time to read the paper and blog about it! I agree that there are various shortcomings here; the point of the paper is not to be the last word on this topic, but to show that it is actually possible to use subsets of data and still achieve the exact stationary distribution. There has been a lot of interest in this lately, and I think a lot of us have found it somewhat disappointing that these methods have only yielded approximate stationary distributions.

Your main criticisms are spot-on, although we did think carefully about how to update the auxiliary variables without having to touch the likelihoods again — hence the consideration of data structures and efficient M-H moves rather than doing the obvious Gibbs thing. Also, you’re totally correct that the auxiliary variables almost certainly hurt mixing — the question is just by how much. While we don’t have any theoretical results on this, we think that autocorrelations on the marginals of θ are illuminating. We of course used standard convergence heuristics to assess mixing, but these are imperfect as you well know.

The reason there is a huge gain in performance is that it is possible to collapse a large number of the bounds into a single bound, something you cannot easily do with the likelihood terms themselves, even for something as simple as a logistic likelihood. That is, once you have, e.g., Gaussian lower bounds for each likelihood term, you can summarize even millions of them with a single Gaussian and update it rapidly just by adding/subtracting data that become “bright”. So on the problem with 1.8M molecules, we can represent almost all of them with a single Gaussian evaluation.
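The collapse trick can be sketched as follows, with made-up per-term quadratic coefficients standing in for the Gaussian log-bounds (not the paper’s actual parameterization): pooling the coefficients once makes the total log-bound an O(d²) evaluation, independent of N:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 10_000, 3

# Made-up per-term quadratic coefficients: the n-th log-bound is taken to be
#   log B_n(theta) = theta' A_n theta + b_n' theta + c_n
A_n = rng.normal(size=(N, d, d))
b_n = rng.normal(size=(N, d))
c_n = rng.normal(size=N)

# Pool the coefficients once, in O(N); flipping a single z_k afterwards only
# needs A -= A_n[k] (and likewise for b, c) to move term k out of the bound.
A, b, c = A_n.sum(axis=0), b_n.sum(axis=0), c_n.sum(axis=0)

def collapsed(theta):
    # Total log-bound over all N dark terms in O(d^2), independent of N.
    return theta @ A @ theta + b @ theta + c

def naive(theta):
    # The same quantity, term by term, in O(N d^2).
    return sum(theta @ A_n[n] @ theta + b_n[n] @ theta + c_n[n] for n in range(N))
```

Since quadratics are closed under addition, the pooled bound is exact, not an approximation, which is what allows millions of dark points to be handled at the cost of one Gaussian evaluation.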

Regarding the issue of keeping things in core: I agree this is a long-term concern and we don’t have an immediate solution. It is ultimately a “systems” question and I think we can all agree that we want to worry about validity first.

Thanks again for the thoughtful comments.

April 2, 2014 at 8:10 am

Thanks for taking the time to reply to my ramblings. And get ready for another post on the prefetching technique, which is in the to-publish list now…