accelerating MCMC via parallel predictive prefetching
¨The idea is to calculate multiple likelihoods ahead of time (“pre-fetching”), and only use the ones which are needed.” A. Brockwell, 2006
Yet another paper on parallel MCMC, just arXived by Elaine Angelino, Eddie Kohler, Amos Waterland, Margo Seltzer, and Ryan P. Adams. Now, besides “prefetching” found in the title, I spotted “speculative execution”, “slapdash treatment”, “scheduling decisions” in the very first pages: this paper definitely is far from shying away from using fancy terminology! I actually found the paper rather difficult to read to the point I had to give up my first attempt during an endless university board of governors meeting yesterday. (I also think “prefetching” is awfully painful to type!)
What is “prefetching” then? It refers to a 2006 JCGS paper by Anthony Brockwell. As explained in the above quote from Brockwell, prefetching means computing the 2², 2³, … values of the likelihood that will be needed in 2, 3, … iterations. Running a regular Metropolis-Hastings algorithm then means building a decision tree back to the current iteration and drawing 2,3, … uniform to go down the tree to the appropriate branch. So in the end only one path of the tree is exploited, which does not seem particularly efficient when vanilla Rao-Blackwellisation and recycling could be implemented almost for free.
“Another intriguing possibility, suggested to the author by an anonymous referee, arises in the case where one can guess whether or not acceptance probabilities will be “high” or “low.” In this case, the tree could be made deeper down “high” probability paths and shallower in the “low” probability paths.” A. Brockwell, 2006
The current paper stems from Brockwell’s 2006 final remark, as reproduced above, by those “speculative moves” that considers the reject branch of the prefetching tree more often that not, based on some preliminary or dynamic evaluation of the acceptance rate. Using a fast but close enough approximation to the true target (and a fixed sequence of uniforms) may also produce a “single most likely path on which” prefetched simulations can be run. The basic idea is thus to run simulations and costly likelihood computations on many parallel processors along a prefetched path, path that has been prefetched for its high approximate likelihood. (With of courses cases where this speculative simulation is not helpful because we end up following another path with the genuine target.) The paper actually goes further than the basic idea to avoid spending useless time on paths that will not be chosen, by constructing sequences of approximations for the precomputations. The proposition for the sequence found therein is to subsample the original data and use a normal approximation to the difference of the log (sub-)likelihoods. Even though the authors describe the system implementation of the progressive approximation idea, it remains rather unclear (to me) how the adaptive estimation of the acceptance probability is compatible with the parallelisation idea. Because it seems (to me) that it induces a lot of communication between the cores. Also, the method is advocated mainly for burnin’ (or warmup, to follow Andrew’s terminology!), which seems to remove the need to use exact targets: if the approximation is close enough, the Markov chain will quickly reach a region of interest for the true target and from there there seems to be little speedup in implementing this nonetheless most interesting strategy.