And since Winter is coming, warm feelings are most important!

]]>Xian: We completely agree with your remark about the changing tails, and we never use Lemma 3.6 (or indeed any of the bounds in that section) to compare the original Metropolis-Hastings kernel to a `subsampled’ kernel for exactly that reason. Instead, we use Lemma 3.6 to compare the `subsampled’ kernel to a kernel that is defined only for mathematical convenience, and which serves to interpolate between the `subsampled’ kernel and the `true’ Metropolis-Hastings dynamics. Indeed, when these results are used later on, the subsampled kernel does have a proposal kernel with a different scaling than the original MH kernel.

Dan: I suspect we can probably get into agreement on most of these issues, however far we might seem. To concede most of your point right off the bat: I think you’re right that the community is pretty far from a convincing argument that sampling algorithms are generally a good idea, and we certainly don’t give a lot of useful practical advice in this paper. However, we also view this work as pretty preliminary work on a topic that has only attracted a lot of attention from this community in the last few years. There are quite a few avenues for research that are far from played out.

It seems pretty clear that, if you really have a hard-to-evaluate posterior (e.g. due to large amounts of data), you have to do something besides just blindly running your original MCMC chain with all the data. There are many options that are obviously better when you have a lot of data, including just picking a subsample all at once and running your original chain on this subsample. What we (and we suspect others) are trying to do is understand a little bit when you can do better than this sort of approach. Clearly, we’re not there yet. In particular, I think that the `naive’ subsampling algorithm that we use as a basis for comparison in this paper is terrible in a number of ways, and certainly don’t advocate anybody using it as written. Xian points out one important flaw: like most of the more sophisticated subsampling algorithms that people have proposed, it requires access to all of the data at every step. As we point out in one of the later parts of the paper, on many problems it doesn’t seem to do a lot better than `one-off’ subsampling even if you ignore that major problem. I think we’re a little more optimistic than you about some design-based subsampling algorithms (e.g. work of Friel et al or Quiroz et al), but we’ll leave that for another time.

So, why did we write the paper and what are we trying to do?

First, we give a useful way to decompose the finite sample variance of the approximate chain; hopefully this will be used by subsequent researchers in the field. Next, we initially got interested in just showing a pretty obvious fact about subsampling MCMC that doesn”t show up in the literature: when the amount of data is very large, and if you measure computing time in terms of subsample size, subsampling MCMC really does `better’ than `standard’ MCMC, in some sense. Doing this turned out to be surprisingly non-trivial.

Finally, subsampling MCMC can only give so much improvement for problems involving i.i.d. data – the cost of evaluating the posterior, and the amount of information in a subsample, both grow linearly in the size of the subsample, and so you’re only tweaking constants. But not all problems in the MCMC world have those properties, and we believe that subsampling and related ideas will be able to do much better in some of those other situations, such as in GP regression!

Thanks again to both,

Natesh and Aaron.

Dan, they give you that warm feeling inside that provided you have a uniformly ergodic chain the world is safe and comforting.

]]>My favourite of the “if you do an ok approximation of the markov chain, you get an ok answer” papers is still the Nicholls, Fox and Watt one, where they did it with an intuitive (and implementable) coupling argument. Aside for technicalities, I’m not sure what else these papers give you. (Except the MCMC for Tall Data paper, which told us not to use subsampling because it doesn’t work)

]]>