I just want to reiterate the second point raised by Remi. I guess we have not been clear enough in the paper. The paper is really not aiming at using an unbiased estimator of the likelihood function in a pseudo-marginal manner. Our aim was actually to provide some quantitative results explaining why it is not a good idea to use pseudo-marginal ideas in the tall data context. More precisely, we show that the “obvious” unbiased (and non-negative) estimates of the likelihood have a very high variance and thus would lead to very poorly mixing pseudo-marginal chains. This point motivates the development of methods which are not pseudo-marginal methods, i.e. they do not sample exactly from the posterior (even at equilibrium) but sample from a controlled approximation to it. We have focused in particular on subsampling methods coupled with variance reduction techniques.

I also agree with you that the real challenge is certainly not logistic regression but instead “non-iid hierarchical models with random effects and complex dependence structures”. However, we think it is important to first gain a good understanding of these methods in simple scenarios. It turns out that many recently proposed methods cannot satisfactorily handle even such simple problems, as demonstrated in our experiments.

]]>Hi Dan. For our concentration inequality approach, I see the trade-off more as an “error in total variation against computational cost” trade-off. You’re explicitly trading in a small bias in the limiting distribution (see Eqn 31) and a bit of ergodicity constant (see Eqn 30) for the chance to require significantly fewer data points per iteration.

For your second comment on the paragraph at the end of Section 4.2, I see now that it’s slightly misleading, thanks for pointing this out. In the first part of Section 4.2, we try and fail to build a pseudomarginal algorithm using Rhee and Glynn’s unbiasing technique. In the last paragraph of Section 4.2, we comment on the approach of Strathmann et al., who also use Rhee and Glynn’s unbiasing technique, but for a rather different purpose: they propose to feed Rhee and Glynn’s unbiasing algorithm with estimates built from independent MCMC runs. This was also discussed on this blog some posts ago.

]]>Second, in your second paragraph, you say most of the paper’s proposal deals with unbiased estimators of the likelihood that can be plugged into pseudomarginal MH. This is true of Section 4, which is part of the review sections, so we don’t claim to make any proposal of our own there. Our main proposal is detailed in Section 7, and our point is actually to forget about unbiasedness of likelihood estimators, accept that subsampling yields unbiasedness only in the log domain, and control the subsampling error using concentration inequalities. That said, we do require bounds on the individual log likelihood minus any control variate, and this is indeed a strong requirement.
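To spell out the “unbiasedness only in the log domain” remark, here is a short worked inequality. The notation is an assumption of this note, not necessarily the paper’s: $\ell_i(\theta)$ is the log likelihood of the $i$-th of $n$ data points, $\ell(\theta)$ their sum, and $i_1,\dots,i_t$ a uniformly drawn subsample.

```latex
\hat\ell(\theta) = \frac{n}{t}\sum_{j=1}^{t} \ell_{i_j}(\theta),
\qquad
\mathbb{E}\,\hat\ell(\theta) = \sum_{i=1}^{n} \ell_i(\theta) = \ell(\theta),
\qquad
\mathbb{E}\,\exp\!\big(\hat\ell(\theta)\big) \;\ge\; \exp\!\big(\mathbb{E}\,\hat\ell(\theta)\big) = \exp\!\big(\ell(\theta)\big).
```

The last step is Jensen’s inequality, and it is strict unless $\hat\ell(\theta)$ is almost surely constant, so exponentiating the subsampled log likelihood does not give an unbiased likelihood estimate.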

My third comment is on your remark that we shouldn’t need zillions of observations for a low-dimensional logistic regression. I agree, and we give some early elements of an answer in the paper. This is precisely the point of using concentration inequalities: the smallest subsample size needed to make an acceptance decision is automatically detected at each iteration. Figure 10b shows that, for a small sacrifice in total variation error, 1000 samples are roughly enough in an easy 2D logistic regression example. Indeed, increasing the size of the dataset tenfold divides the fraction of required data points by ten! Now, I agree that 1) this isn’t a practical answer, as these 1000 samples are sampled uniformly from the whole dataset at each iteration, and 2) whether the right subsample size to make acceptance decisions in MH is also the right number of observations for Bayesian inference in some sense is an interesting question that we haven’t tackled yet.
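To make the “automatically detected” subsample size concrete, here is a minimal Python sketch of such a stopping rule. It is an illustration under stated assumptions, not the algorithm of the paper: it uses a plain Hoeffding bound (which also holds when sampling without replacement), a batch-doubling schedule, and a simple split of the error budget `delta` across batches. The names `subsampled_accept`, `log_ratio_fn`, `psi` and `range_bound` are hypothetical; `psi` stands for the threshold that the average log likelihood ratio must exceed for the MH move to be accepted, and `range_bound` for a known bound on the range of the individual terms.

```python
import math
import random


def subsampled_accept(log_ratio_fn, n, psi, range_bound,
                      delta=0.01, batch=100, rng=None):
    """Decide whether the average of log_ratio_fn(i), i = 0..n-1, exceeds psi.

    Terms are drawn without replacement; we stop as soon as a Hoeffding
    bound separates the running mean from psi, or when all n terms are used.
    Returns (decision, number_of_terms_used).
    """
    if n < 1:
        raise ValueError("need at least one data point")
    rng = rng or random.Random()
    order = list(range(n))
    rng.shuffle(order)  # a uniform random order = sampling without replacement

    total, t, k = 0.0, 0, 0
    while True:
        k += 1
        upto = min(n, max(batch, 2 * t))  # double the subsample each round
        for i in order[t:upto]:
            total += log_ratio_fn(i)
        t = upto
        mean = total / t
        delta_k = delta / 2 ** k  # assumed split: the delta_k sum to at most delta
        # Hoeffding: |mean - true mean| <= bound with probability >= 1 - delta_k
        bound = range_bound * math.sqrt(math.log(2.0 / delta_k) / (2.0 * t))
        if t == n or abs(mean - psi) > bound:
            return mean > psi, t
```

When the running mean is far from `psi`, the test stops after the first batch; when it sits right at the threshold, the loop falls back to using all `n` terms, which is the exact MH decision. A Hoeffding bound ignores the empirical variance, so a variance-aware bound would typically stop earlier.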

As for “exchanging acceptance noise for subsampling noise”, we do make a quick comment on the limiting distribution. Basically, you could analyze this algorithm using exactly the same tools as we used in our ICML’14 paper. But we chose not to develop this here, for the reason given in the next paragraph: this would basically yield an O(n) cost per iteration, again. Furthermore, it is not clear in practice how we could have an empirical version of Berry-Esseen.

]]>It’s just not clear how to get a finite-time bias/variance tradeoff…

(The paragraph at the bottom of Section 4.2 seems to imply that pseudo-marginal Metropolis-Hastings gives unbiased estimators of posterior functionals, but I must be reading that wrong…)

]]>