asynchronous distributed Gibbs sampling

Alexander Terenin, Dan Simpson, and David Draper just arXived a paper on an alternative brand of Gibbs sampler, which they think can revolutionise the sampler and overcome its well-known bottlenecks. David had sent me the paper in advance, so I had time to read it on the plane to Calgary. (This is also the very first paper I have seen acknowledging a pair of trousers..! With no connection whatsoever with bottlenecks!)

“Note that not all updates that are sent will be received by all other workers: because of network traffic congestion and other types of failures, a significant portion of the updates will be lost along the way.”

The approach is inherently parallel in that several “workers” (processors or graphics units) run Gibbs samplers in parallel, each using its current knowledge of the system. This means a worker updates a component of the model parameter, based on the information it last received, and then sends this new value back to the system. For physical reasons, this transmission is not instantaneous, so the workers do not necessarily all condition on the same “current” value. Perceiving this algorithm as a “garden of forking paths”, where each full conditional uses values picked at random from a collection of subchains (one for each worker), I can see why the algorithm should remain valid.
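To fix ideas, here is a minimal toy sketch of my own (not the authors' code) of that mechanism on a bivariate Gaussian target: two workers each own one coordinate, update it from its full conditional given the possibly stale copy of the other coordinate they last received, and broadcast the result, which is lost with some probability. The correlation, loss rate, and chain length are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
rho = 0.8          # correlation of the assumed target N(0, [[1, rho], [rho, 1]])
p_loss = 0.3       # assumed probability that a broadcast update is lost
n_iter = 20000

own = np.zeros(2)    # own[w]: worker w's current value of its own coordinate
view = np.zeros(2)   # view[w]: worker w's possibly stale copy of the other coordinate
cond_sd = np.sqrt(1.0 - rho ** 2)
samples = []

for t in range(n_iter):
    w = rng.integers(2)                            # a worker wakes up at random
    own[w] = rng.normal(rho * view[w], cond_sd)    # Gibbs update given the stale copy
    if rng.random() > p_loss:                      # the broadcast may be lost in transit
        view[1 - w] = own[w]                       # the other worker refreshes its copy
    samples.append(own.copy())

samples = np.array(samples[n_iter // 2:])          # drop burn-in
print("means:", samples.mean(axis=0))
print("correlation:", np.corrcoef(samples.T)[0, 1])

The point of the sketch is only to show the staleness and message loss, not to claim anything about the quality of the resulting chain.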

“Thus, the quality of this [ABC] method rises and falls with the ingenuity of the user in identifying nearly-sufficient statistics.”

It is also clear that this approach allows for any degree of parallelisation. However, it is less clear to me why this should constitute an improvement. With respect to the bottlenecks mentioned at the beginning of the paper, I do not truly see how the large-data problem is bypassed, except in cases where the conditionals only depend on small parts of the data. Nor why large dimensions can be more easily managed than with a plain Gibbs sampler or, better, parallel plain Gibbs samplers that would run on the same number of processors. (I do not think the paper runs the comparison in that manner, using instead a one-processor Gibbs sampler as its benchmark. Or fewer processors in the third example.) Since the forking paths repeatedly merge back at aperiodic stages, there is no multiplication or clear increase of the exploratory abilities of the sampler, except for having competing proposed values [or even proposals] selected randomly. So maybe reaching a wee bit farther from time to time.
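For concreteness, the benchmark I have in mind would look like the following toy sketch (again my own illustration, not anything from the paper): the same number of processors, each running an ordinary synchronous Gibbs chain on the same bivariate Gaussian target as above, with the draws pooled afterwards. The number of workers and chain length are assumptions.

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def plain_gibbs(seed, rho=0.8, n_iter=20000):
    # one ordinary synchronous Gibbs chain on the bivariate Gaussian target
    rng = np.random.default_rng(seed)
    x = np.zeros(2)
    out = np.empty((n_iter, 2))
    sd = np.sqrt(1.0 - rho ** 2)
    for t in range(n_iter):
        x[0] = rng.normal(rho * x[1], sd)   # update the first coordinate
        x[1] = rng.normal(rho * x[0], sd)   # then the second, always in sync
        out[t] = x
    return out[n_iter // 2:]                # drop burn-in

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:   # one chain per processor
        chains = list(pool.map(plain_gibbs, range(4)))
    pooled = np.vstack(chains)
    print("pooled correlation:", np.corrcoef(pooled.T)[0, 1])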

7 Responses to “asynchronous distributed Gibbs sampling”

  1. Gibbs sampling, huh? So retro! Fair enough, given that Aki and I have been doing a lot of importance sampling, and that’s an even more old-fashioned approach.

    Everything old is new again.

  2. I do not wish to resist la furia simpsoniese but still, I do not see why a comparison with parallel Gibbs samplers running on the same number of processors does not make sense.

    • Dan Simpson Says:

      Because there’s no obvious benchmark problem to compare on. Gibbs samplers are data-local, ADG isn’t, so any problem that a Gibbs sampler will work on is structurally different to the types of problems we’re targeting.

      Comparing across parallel mcmc methods is also hard. They all solve particular problems well, but it would be easy to (accidentally, on purpose, or anything in between) bias the comparison by the choice of problems.

  3. The main aim of parallelising a sampler is not to improve its exploration, but to improve its “independent samples per second”.

    It’s also worth noting that the reason we didn’t compare is that it’s hard to imagine a meaningful comparison of a serial method against a parallel one. The outcome will depend on the test case. We don’t come to bury, but to extend (etc etc etc)…

    I would also note that of course we have to deal with all of the data. The situation where this is not true (where you can independently send a subsample to a processor and ignore the rest of the data) is trivial in the sense that a) we know how to do that already and b) we could probably subsample already. So any non-i.i.d. parallel algorithm has a “minimum communication cost”.

    The two serious examples in this paper cover cases where the data is not i.i.d. and there is less over-specification. In particular, it’s hard to imagine a situation with GP regression or mixed effects models where you didn’t need all the data available at once. The advantage of this method is that the data don’t need to be available synchronously.

    (Also note the caching in the mixed-effects example – it is not a mistake. Again, I think it’s hard to argue for any method for this problem that doesn’t cache here.)

    Finally, you refer to “plain Gibbs samplers”. This is meaningless. In the case of GP regression, a “good” plain Gibbs sampler will have a spectral gap while a “bad” plain Gibbs sampler will (asymptotically) not. The (synchronous version of the) Gibbs sampler used in the second example isn’t “plain”, so much as “very carefully constructed to be sensible”. Once again, this comes down to the behaviour of serial (non-trivial) Gibbs samplers vs asynchronous ones. The winner of a battle like this will depend on the size of the problem, the underlying spectral gap, and the reliability of the cluster it’s being computed on.

    (And of course we do not compare parallel algorithms [or parallel and sequential algorithms] in general – how could we make a meaningful comparison?)
