## the fundamental incompatibility of HMC and data subsampling

**L**ast week, Michael Betancourt, from Warwick, arXived a neat wee note on the fundamental difficulties in running HMC on a subsample of the original data. The core message is that running HMC with only a fraction of the data, in the hope that it will preserve the stationary distribution, does not work. The only way to remove the bias is a Metropolis-Hastings step based on the whole data, a step that both kills most of the computing gain and has a very low acceptance probability. Even the strategy that subsamples anew for each step within a single trajectory fails: there cannot be a significant gain in time without a significant bias in the outcome. Too bad..! Now, there are ways of accelerating HMC, for instance by parallelising the computation of gradients, but, just as in any other approach (?), the information provided by the whole data is only available when looking at the whole data.
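As a toy illustration of the point (my own sketch, not an example from the note): HMC for the mean of a Gaussian sample, where every leapfrog gradient is computed on a random subsample, but where removing the bias still requires a final Metropolis-Hastings correction evaluated on the full data set. All names and tuning values below are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=10_000)  # toy data: unit-variance Gaussian, unknown mean

def log_post(theta):
    # full-data log posterior (flat prior), up to an additive constant
    return -0.5 * np.sum((data - theta) ** 2)

def grad_subsample(theta, batch):
    # rescaled subsample estimate of the full-data gradient of log_post
    return (len(data) / len(batch)) * np.sum(batch - theta)

def hmc_step(theta, eps=1e-4, n_leapfrog=50, batch_size=100):
    p = rng.normal()
    theta_new, p_new = theta, p
    for _ in range(n_leapfrog):
        # each leapfrog step only ever sees a random subsample of the data
        batch = rng.choice(data, size=batch_size, replace=False)
        p_new += 0.5 * eps * grad_subsample(theta_new, batch)
        theta_new += eps * p_new
        batch = rng.choice(data, size=batch_size, replace=False)
        p_new += 0.5 * eps * grad_subsample(theta_new, batch)
    # the bias-removing Metropolis-Hastings correction: needs the WHOLE data set
    log_a = (log_post(theta_new) - 0.5 * p_new**2) - (log_post(theta) - 0.5 * p**2)
    return theta_new if np.log(rng.uniform()) < log_a else theta
```

Dropping that final accept/reject, or evaluating it on a subsample as well, leaves a bias that no tuning of the step size or batch size removes, which is the note's core message.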

February 23, 2015 at 5:59 pm

The paper seems rather strange to me, seeing as he points out in the end that subsampling in a non-stochastic way can actually work. Indeed it does work, as I showed in my thesis twenty years ago, and as I mention in my review of HMC in the Handbook of Markov Chain Monte Carlo (see section 5.5.1.4). It works best when combined with my “windowing” scheme.

This has nothing to do with parallelizing MCMC by the way. More with not parallelizing, actually, since parallelization is more obviously applicable if you are computing a gradient from many independent data points, rather than a few. For parallelizing MCMC in general, you could try my “circular coupling” method.

February 24, 2015 at 8:52 am

Thanks, Radford! The question with these subsampling schemes is rather “how much does it work?”, i.e., what are the losses in terms of bias and variance, and how can they be quantified from the outcome, at little cost? Another comment, on the recent “Unbiased Bayes for Big Data” by Heiko Strathmann, Dino Sejdinovic and Mark Girolami, is scheduled for this Friday: keep posted!

February 24, 2015 at 12:06 pm

I’m assuming that we want an “exact” MCMC scheme, one that converges asymptotically to the correct distribution. So I assume that when HMC is used with gradients from subsamples, we accept or reject the result of computing the trajectory using the full data set. Since one may well want to simulate a trajectory that’s thousands of steps long, the final accept/reject decision may not dominate the cost even though it uses all the data. Of course, this won’t hold for really, really big data sets.
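A back-of-the-envelope version of that cost argument, with entirely made-up numbers:

```python
# Hypothetical cost accounting for one trajectory of subsampled-gradient HMC
# ending in a single full-data accept/reject (all figures invented for illustration).
N = 1_000_000   # full data set size
n = 1_000       # subsample size per leapfrog step
L = 5_000       # leapfrog steps in one trajectory

gradient_cost = L * n       # likelihood-term evaluations spent on gradients
accept_reject_cost = N      # one pass over the full data for the A/R decision
share = accept_reject_cost / (gradient_cost + accept_reject_cost)
print(f"A/R share of total cost: {share:.1%}")  # about 17% with these numbers
```

Holding n and L fixed while N grows, the accept/reject share tends to one, which is the caveat about really, really big data sets.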

February 24, 2015 at 12:17 pm

Thanks. Since this solution requires the full data set at the final accept/reject stage, it may not be palatable in really big data problems.

February 23, 2015 at 1:14 am

Or, to put it more succinctly, MCMC does not ever parallelise.

(Ever meaning, of course, unless your model is so phenomenally complicated that the parallel proposal hides the latency, yet so convenient that the acceptance ratio can be distributed. [i.e. never])

February 23, 2015 at 1:43 am

I kind of hope this paper kills off the field, to be honest. My interpretation of it is that in the situations where these “big Bayes” data-splitting algorithms work, there is so much information in the data that it’s unclear to me why there is any benefit in using Bayes at all. Why not whisper “Bernstein-von Mises” in front of a mirror three times and be done with it? The frequentist technology for solving big-data problems is significantly more advanced than “let’s just subsample and combine” (the bag of little bootstraps excepted).

Does Bayesian inference have any place with exchangeable data? (It’s not clear to me that it does, except in the extremely small data, subjective Bayes limit)

March 4, 2015 at 10:14 am

A lot of things do not give you normality even when n is big. Simple examples are logistic regression with rare events, or dynamic models.

Also, would you care to enumerate a few frequentist methods that are “significantly” more advanced than the “partition and combine” Bayesian algorithms?

First, frequentists take the same “partition and combine” approach to big data, for example using an “average” or “median” scheme to combine estimates from subsets.

Second, within-iteration parallel optimization does not always work well. For example, ADMM can take many more steps to converge, which is likely to kill the savings from parallelization.

Finally, stochastic gradient descent algorithms… well, people now use them just like “partition and combine” MCMC. You run the algorithm for hours and pick the point in the trajectory that gives the best function value as your estimator. The performance relies heavily on how you tune the learning rate, and, in most cases, you don’t know whether your algorithm will eventually converge or not…

Oh, I forgot to mention that MCMC also gives you uncertainty quantification and confidence statements.

A larger sample size does not necessarily imply higher accuracy. In fact, the shrinking of the confidence interval can increase the risk of losing robustness, because you know your models are wrong…

February 23, 2015 at 12:45 am

“…the information provided by the whole data is only available when looking at the whole data.” Exactly! If only we could convince everyone else of that!