## ergodicity of approximate MCMC chains with applications to large datasets

**A**nother arXived paper I read on my way to Warwick! And yet another paper written by my friend Natesh Pillai (and his co-author Aaron Smith, from Ottawa). The goal of the paper is to study the ergodicity and the degree of approximation of the true posterior distribution of approximate MCMC algorithms that recently flourished as an answer to “Big Data” issues… *[Comments below are about the second version of this paper.] *One of the most curious results in the paper is the fact that the approximation may prove better than the original kernel, in terms of computing costs! If asymptotically in the computing cost. There also are acknowledged connections with the approximative MCMC kernel of Pierre Alquier, Neal Friel, Richard Everitt and A Boland, briefly mentioned in an earlier post.

The paper starts with a fairly theoretical part, to follow with an application to austerity sampling *[and, in the earlier version of the paper, to the Hoeffding bounds of Bardenet et al., both discussed earlier on the ‘Og, to exponential random graphs (the paper being rather terse on the description of the subsampling mechanism), to stochastic gradient Langevin dynamics (by Max Welling and Yee-Whye Teh), and to ABC-MCMC]*. The assumptions are about the transition kernels of a reference Markov kernel and of one associated with the approximation, imposing some bounds on the Wasserstein distance between those kernels, K and K’. Results being generic, there is no constraint as to how K is chosen or on how K’ is derived from K. Except in Lemma 3.6 and in the application section, where the same proposal kernel L is used for both Metropolis-Hastings algorithms K and K’. While I understand this makes for an easier coupling of the kernels, this also sounds like a restriction to me in that modifying the target begs for a similar modification in the proposal, if only because the tails they are a-changin’…

**I**n the case of subsampling the likelihood to gain computation time (as discussed by Korattikara et al. and by Bardenet et al.), the austerity algorithm as described in Algorithm 2 is surprising as the average of the sampled data log-densities and the log-transform of the remainder of the Metropolis-Hastings probability, which seem unrelated, are compared until they are close enough. I also find hard to derive from the different approximation theorems bounding exceedance probabilities a rule to decide on the subsampling rate as a function of the overall sample size and of the computing cost. (As a side if general remark, I remain somewhat reserved about the subsampling idea, given that it requires the entire dataset to be available at every iteration. This makes parallel implementations rather difficult to contemplate.)

September 1, 2015 at 4:57 am

Dear Xian, Dan: Thanks for the helpful comments! A few replies:

Xian: We completely agree with your remark about the changing tails, and we never use Lemma 3.6 (or indeed any of the bounds in that section) to compare the original Metropolis-Hastings kernel to a `subsampled’ kernel for exactly that reason. Instead, we use Lemma 3.6 to compare the `subsampled’ kernel to a kernel that is defined only for mathematical convenience, and which serves to interpolate between the `subsampled’ kernel and the `true’ Metropolis-Hastings dynamics. Indeed, when these results are used later on, the subsampled kernel does have a proposal kernel with a different scaling than the original MH kernel.

Dan: I suspect we can probably get into agreement on most of these issues, however far we might seem. To concede most of your point right off the bat: I think you’re right that the community is pretty far from a convincing argument that sampling algorithms are generally a good idea, and we certainly don’t give a lot of useful practical advice in this paper. However, we also view this work as pretty preliminary work on a topic that has only attracted a lot of attention from this community in the last few years. There are quite a few avenues for research that are far from played out.

It seems pretty clear that, if you really have a hard-to-evaluate posterior (e.g. due to large amounts of data), you have to do something besides just blindly running your original MCMC chain with all the data. There are many options that are obviously better when you have a lot of data, including just picking a subsample all at once and running your original chain on this subsample. What we (and we suspect others) are trying to do is understand a little bit when you can do better than this sort of approach. Clearly, we’re not there yet. In particular, I think that the `naive’ subsampling algorithm that we use as a basis for comparison in this paper is terrible in a number of ways, and certainly don’t advocate anybody using it as written. Xian points out one important flaw: like most of the more sophisticated subsampling algorithms that people have proposed, it requires access to all of the data at every step. As we point out in one of the later parts of the paper, on many problems it doesn’t seem to do a lot better than `one-off’ subsampling even if you ignore that major problem. I think we’re a little more optimistic than you about some design-based subsampling algorithms (e.g. work of Friel et al or Quiroz et al), but we’ll leave that for another time.

So, why did we write the paper and what are we trying to do?

First, we give a useful way to decompose the finite sample variance of the approximate chain; hopefully this will be used by subsequent researchers in the field. Next, we initially got interested in just showing a pretty obvious fact about subsampling MCMC that doesn”t show up in the literature: when the amount of data is very large, and if you measure computing time in terms of subsample size, subsampling MCMC really does `better’ than `standard’ MCMC, in some sense. Doing this turned out to be surprisingly non-trivial.

Finally, subsampling MCMC can only give so much improvement for problems involving i.i.d. data – the cost of evaluating the posterior, and the amount of information in a subsample, both grow linearly in the size of the subsample, and so you’re only tweaking constants. But not all problems in the MCMC world have those properties, and we believe that subsampling and related ideas will be able to do much better in some of those other situations, such as in GP regression!

Thanks again to both,

Natesh and Aaron.

August 31, 2015 at 5:12 am

“rather difficult to contemplate” is a very polite way to put it…

My favourite of the “if you do an ok approximation of the markov chain, you get an ok answer” papers is still the Nicholls, Fox and Watt one, where they did it with an intuitive (and implementable) coupling argument. Aside for technicalities, I’m not sure what else these papers give you. (Except the MCMC for Tall Data paper, which told us not to use subsampling because it doesn’t work)

September 1, 2015 at 1:19 am

Dan, they give you that warm feeling inside that provided you have a uniformly ergodic chain the world is safe and comforting.

September 1, 2015 at 5:50 am

And since Winter is coming, warm feelings are most important!