And here is yet another arXived paper using a decomposition of the posterior distirbution as a product of terms to run faster, better and higher MCMC algorithms! This one is by Douglas Maclaurin and Ryan Adams: “Firefly Monte Carlo: Exact MCMC with Subsets of Data“. (While a swarm of fireflies make sense to explain the name, I may miss some cultural subliminal meaning in the title as Firefly and Monte Carlo seem to be places in Las Vegas (?), and a car brand, Firefly is a TV series, a clothes brand, and maybe other things…)
“The evolution of the chain evokes an image of fireflies, as the individual data blink on and out due to updates of the zn.”
The fundamental assumption of Maclaurin’s and Adams’ approach is that each product term in the likelihood (expressed as a product) can be bounded by a cheaper lower bound. This lower bound is used to create a Bernoulli auxiliary variable with probability equal to the ratio of the lower bound to the likelihood term, auxiliary variable that helps to reduce the number of evaluations of the original likelihood terms. Obviously, there is a gain only if (a) the lower bound is close or tight enough and (b) simulating the auxiliary variables is cheap enough.
About (a), the paper gives the tight example of a logistic, with a case of a 98% tightness. How generic is that and how those bounds can be derived in a cheap or automated manner? If one needs to run a variational Bayes approximation first, the gain in efficiency is unlikely to hold. About (b), I do not fully get it: if generating zn requires the evaluation of the original likelihood we loose the entire appeal of the method. Admittedly, I can see the point in changing a very small portion α of the zn‘s between moves on the parameter θ, since the number of likelihood evaluations is the same portion α of the total number of terms N. But decreasing the portion α is also reducing the mixing efficiency of the algorithm. In the efficient ways of updating the auxiliary brightness variables (ways proposed in the paper), I get the idea of making a proposal first before eventually computing the true probability of a Bernoulli. A proposal making use of the previous value of the probability (i.e. for the previous value of the parameter θ) could also reduce the number of evaluations of likelihood terms. However, using a “cached” version of the likelihood is only relevant within the same simulation step since a change in θ requires recomputing the likelihood.
“In each experiment we compared FlyMC, with two choices of bound selection, to regular full-posterior MCMC. We looked at the average number of likelihoods queried at each iteration and the number of effective samples generated per iteration, accounting for autocorrelation.”
This comparison does not seem adequate to me: by construction, the algorithm in the paper reduces the number of likelihood evaluations, so this is not a proper comparative instrument. The effective sample size is a transform of the correlation, not an indicator of convergence. For instance, if the zn‘s were hardly to change between iterations, thus the overall sampler was definitely far from converging, we would get θ’s simulated from almost the same distribution, hence being uncorrelated. In other words, if the joint chain in (θ,zn) does not converge, it is harder to establish that the subchain in θ converges at all. Indeed, in this logistic example where the computation of the likelihood is not a massive constraint, I am surprised there is any possibility of a huge gain in using the method, unless the lower bound is essentially the likelihood, which is actually the case for logistic regression models. Another point made by Dan Simpson is that the whole dataset needs to remain on-hold, full-time, which may be a challenge to the computer memory. And stops short of providing really Big Data solutions.