## Unbiased Bayes for Big Data: Path of partial posteriors [a reply from the authors]

*[Here is a reply by Heiko Strathmann to my post of yesterday, along with the slides of a talk in Oxford mentioned in the discussion.]*

Thanks for putting this up, and thanks for the discussion. Christian, as already exchanged via email, here are some answers to the points you make.

First of all, we don’t claim a free lunch, and we are honest about the limitations of the method (see the negative examples). Rather, we make the point that we *can* achieve computational savings in certain situations, essentially by exploiting redundancy (what Michael called “tall” data in his note on subsampling & HMC), which leads to fast convergence of posterior statistics.

Dan is of course correct in noting that if the posterior statistic does not converge nicely (i.e. all data counts), then the truncation time is “mammoth”. It is also correct that it might be questionable to aim for an unbiased Bayesian method in the presence of such redundancies. However, these are the two extreme perspectives on the topic. The message that we want to get across is that there is a trade-off between these extremes. In particular, the GP examples illustrate this nicely, as we are able to reduce MSE in a regime where posterior statistics have *not* yet stabilised; see e.g. figure 6.

“And the following paragraph is further confusing me as it seems to imply that convergence is not that important thanks to the de-biasing equation.”

To clarify, the paragraph refers to the *additional* convergence issues induced by the alternative Markov transition kernels of mini-batch-based full posterior sampling methods by Welling, Bardenet, Dougal & co. For example, Firefly MC’s mixing time is increased by a factor of 1/q, where qN is the mini-batch size. Mixing of stochastic gradient Langevin gets worse over time. This is *not* true for our scheme, as we can use standard transition kernels. It is still essential for the partial posterior Markov chains to converge (*if* MCMC is used). However, as this is a well-studied problem, we omit the topic in our paper and refer to standard tools for diagnosis. All this is independent of the debiasing device.

**About MCMC convergence.**

Yesterday in Oxford, Pierre Jacob pointed out that if MCMC is used for estimating partial posterior statistics, the overall result is *not* unbiased. We had a nice discussion about how this bias could be addressed via a two-stage debiasing procedure: debiasing the MC estimates as described in the “Unbiased Monte Carlo” paper by Agapiou et al., and then plugging those into the path estimators. It is not yet clear, though, how (and whether) this would work in our case.

In the current version of the paper, we do not address the bias present due to MCMC; we have a paragraph on this in section 3.2. Rather, we start from the premise that full posterior MCMC samples are a gold standard. Furthermore, the framework we study is not necessarily linked to MCMC: it could be that the posterior expectation is available in closed form, but simply costly in N. In this case, we can still unbiasedly estimate this posterior expectation (see GP regression).
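To make the closed-form case concrete, here is a minimal sketch of the debiasing idea on a toy problem. It is not the code from the paper, and everything in it is an assumption made for illustration: the conjugate Gaussian model, the doubling batch schedule, the truncation distribution with P(T = t) proportional to n_t^{-α}, and all function names. Because the data are i.i.d., the first n points stand in for a random subset of size n. The partial posterior means are telescoped along the path, the sum is cut at a random index T, and each increment is reweighted by 1/P(T ≥ t).

```python
import numpy as np

def partial_posterior_mean(x, n, sigma2=1.0, tau2=10.0):
    """Posterior mean of mu given the first n points, for x_i ~ N(mu, sigma2), mu ~ N(0, tau2)."""
    return (x[:n].sum() / sigma2) / (1.0 / tau2 + n / sigma2)

def batch_schedule(N, n0=10):
    """Doubling batch sizes n0, 2*n0, ..., with the last one capped at N (assumed schedule)."""
    sizes = []
    n = n0
    while n < N:
        sizes.append(n)
        n *= 2
    sizes.append(N)
    return sizes

def truncation_pmf(sizes, alpha=0.5):
    """P(T = t) proportional to n_t^(-alpha) on t = 0..L (a hypothetical choice)."""
    w = np.asarray(sizes, dtype=float) ** (-alpha)
    return w / w.sum()

def debiased_estimate(x, sizes, pmf, T):
    """Telescoping sum of partial posterior means, truncated at T and reweighted by P(T >= t)."""
    tail = np.cumsum(pmf[::-1])[::-1]  # tail[t] = P(T >= t), with tail[0] = 1
    phi_prev = partial_posterior_mean(x, sizes[0])
    est = phi_prev
    for t in range(1, T + 1):
        phi_t = partial_posterior_mean(x, sizes[t])
        est += (phi_t - phi_prev) / tail[t]
        phi_prev = phi_t
    return est

rng = np.random.default_rng(0)
N = 100_000
x = rng.normal(loc=2.0, scale=1.0, size=N)
sizes = batch_schedule(N)
pmf = truncation_pmf(sizes)

full = partial_posterior_mean(x, N)                # costs O(N)
T_draws = rng.choice(len(sizes), size=1000, p=pmf)  # most draws stop early
estimates = np.array([debiased_estimate(x, sizes, pmf, T) for T in T_draws])
print("full posterior mean:", full)
print("average of debiased estimates:", estimates.mean())
```

Since the path is finite and P(T ≥ t) > 0 for every t, the expectation over T equals the full posterior mean exactly, while most individual replications only touch a small part of the data. The variance of a single replication depends on how the truncation probabilities compare with the convergence rate of the partial posterior means.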

“The choice of the tail rate is thus quite delicate to validate against the variance constraints (2) and (3).”

It is true that the choice is crucial in order to control the variance. However, provided that partial posterior expectations converge at a rate n^{-β}, with n the size of a minibatch, computational complexity can be reduced to N^{1-α} (α<β) without the variance exploding. There is a trade-off: the faster the posterior expectations converge, the more computation can be saved. β is in general unknown, but can be roughly estimated with the “direct approach”, as we describe in the appendix.
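As a rough illustration of where the N^{1-α} figure comes from, here is a back-of-the-envelope calculation under assumptions that are ours for this sketch and not necessarily the exact setup of the paper: batch sizes double, n_t = n_0 2^t for t = 0, …, L with n_L = N; computing the t-th partial posterior expectation costs on the order of n_t; and the truncation time T satisfies P(T ≥ t) = (n_t / n_0)^{-α} with 0 < α < 1.

$$
\mathbb{E}[\text{cost}]
\;\approx\; \sum_{t=0}^{L} n_t \, P(T \ge t)
\;=\; n_0^{\alpha} \sum_{t=0}^{L} n_t^{1-\alpha}
\;=\; n_0 \, \frac{2^{(L+1)(1-\alpha)} - 1}{2^{1-\alpha} - 1}
\;=\; O\!\left(N^{1-\alpha}\right).
$$

The larger α is, the cheaper the estimator, but the increments along the path are divided by P(T ≥ t), so a heavier truncation inflates the variance. The constraint α<β above is what keeps this in check.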

**About the “direct approach”**

It is true that for certain classes of models and φ functionals, the direct averaging of expectations for increasing data sizes yields good results (see the log-normal example), and we state this. However, the GP regression experiments show that the direct averaging gives a larger MSE than with debiasing applied. This is exactly the trade-off mentioned earlier.

I also wonder what people think about the comparison to stochastic variational inference (GP for Big Data), as this hasn’t appeared in discussions yet. It is the comparison to “non-unbiased” schemes that Christian and Dan asked for.

February 27, 2015 at 1:34 am

On the issue of comparing with biased methods, I don’t know anything about stochastic variational approximations. (If it’s just VB for GPs, then no, we can probably do better than that).

It’s actually an interesting thing that happens here: it’s really really hard to make these sorts of comparisons. I think it really calls for a “big data test suite” of simulated experiments. In my naive dream world, each new “big data” algorithm would pick the problems it could solve and write a program that, for an instance of this “big data model generator” solves the problem. (This randomisation step is to guard against the chance of the chosen data set being good for a certain algorithm).
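A toy sketch of what such a harness might look like, purely to make the idea concrete: the generator, the two placeholder "algorithms" and all names here are hypothetical, and a real suite would need far richer models than a Gaussian mean.

```python
import numpy as np

def generate_instance(rng, n=5000):
    """Draw one random problem instance: Gaussian data around a randomly chosen true mean."""
    mu = rng.normal(0.0, 3.0)
    return {"x": rng.normal(mu, 1.0, size=n), "truth": mu}

def full_data_mean(x, rng):
    """Placeholder algorithm 1: use every data point."""
    return x.mean()

def subsample_mean(x, rng, m=100):
    """Placeholder algorithm 2: use a random subsample of m points."""
    return rng.choice(x, size=m, replace=False).mean()

def benchmark(algorithms, n_instances=200, seed=0):
    """Run every algorithm on the same randomly generated instances; report mean squared error."""
    rng = np.random.default_rng(seed)
    sq_err = {name: [] for name in algorithms}
    for _ in range(n_instances):
        inst = generate_instance(rng)
        for name, algo in algorithms.items():
            sq_err[name].append((algo(inst["x"], rng) - inst["truth"]) ** 2)
    return {name: float(np.mean(errs)) for name, errs in sq_err.items()}

results = benchmark({"full": full_data_mean, "subsample": subsample_mean})
print(results)
```

The point of drawing fresh instances inside `benchmark` is that no algorithm gets to be tuned against one fixed data set.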

Why would this be useful? Well, to be honest, most of the time the reason that you don’t test against all of the other state-of-the-art algorithms/models is that it’s basically impossible to implement them well, test them and make them work. This is just one of those realities that happens once an algorithm hits a certain complexity: it is an investment to make it work.

There’s also the broader (and more knotty) question of “what is good” here. An unbiased target is a solid thing to aim for, but if we don’t need that, then it really comes down to understanding what an algorithm is good for. I suspect that for properly complex (i.e. things at least as complex as your GP example) Big Data applications, we’re going to have to go back to the “likelihood + prior + utility” framework to target big-data algorithms correctly (no free lunches, no general solutions).

[ of course, this is what everyone’s been saying for ages, so i’m just paraphrasing. but it needs to be said ]

tl;dr: Big data means never having to say you’re sorry. Your method was just not aligned with the inferential aims of the study.