## on Markov chain Monte Carlo methods for tall data

Rémi Bardenet, Arnaud Doucet, and Chris Holmes arXived a long paper (with the above title) a month ago, a paper that I did not have time to read in detail until today. The paper is quite comprehensive in its analysis of the current literature on MCMC for huge, tall, or big data. Even including our delayed acceptance paper! Now, it is indeed the case that we are all still struggling with this size difficulty, making proposals in a wide range of directions that hopefully improve the efficiency of dealing with tall data. However, we are not there yet, in that the outcome is either about as costly as the original MCMC implementation or its degree of approximation is unknown, even when bounds are available.

Most of the paper's proposal is based on aiming at an unbiased estimator of the likelihood function in a pseudo-marginal manner à la Andrieu and Roberts (2009) and on a random subsampling scheme that presumes (a) iid-ness and (b) a lower bound on each term in the likelihood. It seems to me slightly unrealistic to assume that a much cheaper and yet tight lower bound on those terms could be available. Firmly set in the iid framework, the problem itself is unclear: do we need 10⁸ observations of a logistic model with a few parameters? The real challenge is rather in non-iid hierarchical models with random effects and complex dependence structures, for which subsampling gets much more delicate. None of the methods surveyed in the paper touches upon such situations, where the entire data cannot be explored at once.
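The difficulty with the pseudo-marginal route can be seen in a two-minute simulation (a hypothetical Gaussian toy model, not taken from the paper): uniform subsampling gives an unbiased estimator of the *log*-likelihood, but its standard deviation scales like n/√m, and exponentiating it destroys unbiasedness by Jensen's inequality. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10_000, 100                       # full data size, subsample size
x = rng.normal(0.0, 1.0, n)              # toy iid N(0,1) data
theta = 0.1                              # parameter value being evaluated

# log f(x_i | theta) for a N(theta, 1) model
loglik_terms = -0.5 * (x - theta) ** 2 - 0.5 * np.log(2 * np.pi)
full_loglik = loglik_terms.sum()

def sub_loglik():
    """Subsampled estimator of the log-likelihood: unbiased, but noisy."""
    idx = rng.choice(n, size=m, replace=False)
    return (n / m) * loglik_terms[idx].sum()

draws = np.array([sub_loglik() for _ in range(2000)])
print("true log-likelihood:", full_loglik)
print("subsampled estimator: mean", draws.mean(), "std", draws.std())
# Unbiased for the LOG-likelihood, but the standard deviation is of order
# n/sqrt(m); exponentiating yields a biased (Jensen) and wildly variable
# estimate of the likelihood itself, hence the poorly mixing
# pseudo-marginal chains discussed in the paper.
```

The standard deviation here is in the hundreds on the log scale, which is why a nonnegative unbiased estimator with controlled variance requires the cheap tight lower bounds mentioned above.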

An interesting experiment therein, based on the Glynn and Rhee (2014) unbiased representation, shows that the approach does not work well. This could lead the community to reconsider its focus on unbiasedness by coming full circle to the opposition between bias and variance, and between intractable likelihood and representative subsample likelihood.

Reading the (superb) coverage of earlier proposals made me go back over the perceived appeal of the decomposition of Neiswanger et al. (2014), as I came to realise that the product of functions renormalised into densities has no immediate probabilistic connection with its components. As an extreme example, some terms may fail to integrate. (Of course, there are many Monte Carlo features that exploit such a decomposition, from pseudo-marginal to accept-reject algorithms. And more to come.) Sampling from the renormalised product is thus not directly related to sampling from each term, in opposition to the arithmetic mixture representation. I was first convinced by using a fraction of the prior in each term, but now find it unappealing because there is no reason the prior should change for a smaller sample, and there is no equivalent of the prohibition against using the data several times. At this stage, I would be much more in favour of raising a random portion of the likelihood function to the right power. An approach that I suggested to a graduate student earlier this year and which is also discussed in the paper. And considered too naïve and a “very poor approach” (Section 6, p.18), even though there must be versions that do not run afoul of the non-Gaussian nature of the log-likelihood ratio. I am certainly going to peruse this Section 6 of the paper more thoroughly.
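For the record, here is what “raising a random portion of the likelihood to the right power” looks like in a minimal hypothetical sketch (toy Gaussian mean model, flat prior, all settings invented for illustration). Both log-likelihoods are re-estimated on fresh subsamples at every iteration, so this is the naïve scheme the paper warns against, not a valid pseudo-marginal algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 1000, 500
data = rng.normal(1.0, 1.0, n)           # toy data: unknown mean, known sd

def scaled_sub_loglik(theta):
    """Log-likelihood of a random portion of the data, scaled by n/m,
    i.e. that portion of the likelihood raised to the power n/m."""
    sub = rng.choice(data, size=m, replace=False)
    return (n / m) * np.sum(-0.5 * (sub - theta) ** 2)

def naive_subsampled_mh(n_iter=5000, step=0.05):
    theta, chain = 0.0, np.empty(n_iter)
    for t in range(n_iter):
        prop = theta + step * rng.normal()
        # fresh subsamples for BOTH states: a noisy acceptance ratio
        if np.log(rng.uniform()) < scaled_sub_loglik(prop) - scaled_sub_loglik(theta):
            theta = prop
        chain[t] = theta
    return chain

chain = naive_subsampled_mh()
print("chain mean:", chain[2000:].mean(), " chain sd:", chain[2000:].std())
# The subsampling noise in the acceptance ratio typically inflates the
# spread of the chain well beyond the true posterior sd of about
# 1/sqrt(n), which is roughly 0.032 here.
```

Running this shows the chain wandering far more widely than the exact posterior would allow, the distortion quantified in the paper's Section 6.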

Another interesting suggestion in this definitely rich paper is the foray into an alternative bypassing the uniform sampling in the Metropolis-Hastings step, using instead the subsampled likelihood ratio. The authors call this “exchanging acceptance noise for subsampling noise” (p.22). However, there is no indication about the resulting stationary distribution, and I find the notion of *only* moving to higher likelihoods (or estimates thereof) counter to the spirit of Metropolis-Hastings algorithms. (I have also eventually realised the meaning of the log-normal “difficult” benchmark that I missed earlier: it means that log-normal data is modelled by a normal density.) And yet another innovation along the lines of a control variate for the log-likelihood ratio, even though it sounds somewhat surrealistic.
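To make the “exchanging acceptance noise for subsampling noise” idea concrete, here is a hypothetical sketch (toy Gaussian model, flat prior, symmetric proposal, all settings invented): the uniform draw is dropped and the move is accepted whenever the subsampled log-likelihood-ratio estimate is positive, so the only randomness in the decision comes from subsampling, and the stationary distribution, if any, is unspecified:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 1000, 100
data = rng.normal(0.5, 1.0, n)

def sub_log_ratio(theta, prop):
    """Subsampled estimate of the mean per-datum log-likelihood ratio."""
    sub = rng.choice(data, size=m, replace=False)
    return np.mean(-0.5 * (sub - prop) ** 2 + 0.5 * (sub - theta) ** 2)

def noisy_accept_mh(n_iter=4000, step=0.1):
    theta, chain = 0.0, np.empty(n_iter)
    for t in range(n_iter):
        prop = theta + step * rng.normal()
        # No uniform variable: subsampling noise replaces acceptance
        # noise, and the chain only moves to (estimated) higher
        # likelihoods.
        if sub_log_ratio(theta, prop) > 0.0:
            theta = prop
        chain[t] = theta
    return chain

chain = noisy_accept_mh()
print("chain mean:", chain[1000:].mean())
```

The chain hovers around the maximum likelihood region, with a spread driven entirely by the subsampling noise rather than by the posterior, which is precisely the concern about the missing stationary distribution.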

June 23, 2015 at 10:19 am

Thanks for your comments, Xi’an.

I just want to reiterate the second point raised by Rémi. I guess we have not been clear enough in the paper. The paper is really not aiming at using an unbiased estimator of the likelihood function in a pseudo-marginal manner. Our aim was actually to provide some quantitative results explaining why it is not a good idea to use pseudo-marginal ideas in the tall data context. More precisely, we show that the “obvious” unbiased (and non-negative) estimates of the likelihood have a very high variance and would thus lead to very poorly mixing pseudo-marginal chains. This point motivates the development of methods which are not pseudo-marginal methods, i.e. methods that do not sample exactly from the posterior (even at equilibrium) but sample from a controlled approximation to it. We have focused in particular on subsampling methods coupled with variance reduction techniques.

I also agree with you that the real challenge is certainly not logistic regression but instead “non-iid hierarchical models with random effects and complex dependence structures”. However, we think it is important to gain first a good understanding of these methods in simple scenarios. It turns out that many methods recently proposed cannot even handle satisfactorily such simple problems as demonstrated in our experiments.

June 22, 2015 at 11:33 pm

Thanks for the detailed comments, Xi’an. I just want to comment on some aspects. The first one is that my last name is slightly misspelled ;)

Second, in your second paragraph, you say most of the paper’s proposal deals with unbiased estimators of the likelihood that can be plugged into pseudo-marginal MH. This is true of Section 4, which is part of the review sections, so we don’t claim to be making any proposal there ourselves. Our main proposal is detailed in Section 7, and our point is actually to forget about unbiasedness of likelihood estimators, accept that subsampling yields unbiasedness only in the log domain, and control the subsampling error using concentration inequalities. That said, we do require bounds on the individual log-likelihood minus any control variate, and this is indeed a strong requirement.

My third comment is on your remark that we shouldn’t need zillions of observations for a low-dimensional logistic regression. I agree, and we show some early elements of an answer in the paper. This is precisely the point of using concentration inequalities: the smallest subsample size needed to make an acceptance decision is automatically detected at each iteration. Figure 10b precisely shows that, for a small sacrifice in total variation error, 1000 data points are roughly enough in an easy 2D logistic regression example. Indeed, increasing the size of the dataset tenfold divides the fraction of required data points by ten! Now, I agree that 1) this isn’t a practical answer, as these 1000 points are drawn uniformly from the whole dataset at each iteration, and 2) whether the right subsample size to make acceptance decisions in MH is also the right number of observations for Bayesian inference in some sense is an interesting question that we haven’t tackled yet.
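The automatic detection of the subsample size can be sketched as follows, in a simplified and hypothetical version of a concentration-based stopping rule (toy Gaussian model, parameters invented): for one acceptance decision, grow the subsample geometrically until an empirical Bernstein bound separates the running estimate of the mean log-likelihood ratio from the threshold psi. Note the sketch cheats by reading the range R off the full data; a genuine implementation needs an a priori bound or control variates.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
data = rng.normal(0.3, 1.0, n)                 # toy iid N(0.3, 1) data

def log_ratio_terms(theta, prop):
    """Per-datum log f(x_i | prop) - log f(x_i | theta), Gaussian model."""
    return -0.5 * (data - prop) ** 2 + 0.5 * (data - theta) ** 2

def adaptive_decision(theta, prop, psi, delta=0.01, batch=100):
    """Grow the subsample until an empirical Bernstein bound separates
    the estimated mean log-ratio from the acceptance threshold psi."""
    terms = log_ratio_terms(theta, prop)
    R = terms.max() - terms.min()   # range: read off all terms (a cheat)
    perm = rng.permutation(n)
    m = batch
    while True:
        sub = terms[perm[:m]]
        est = sub.mean()
        # empirical Bernstein concentration bound at confidence level delta
        c = sub.std() * np.sqrt(2 * np.log(3 / delta) / m) \
            + 3 * R * np.log(3 / delta) / m
        if abs(est - psi) > c or m == n:
            return est > psi, m
        m = min(2 * m, n)

accept, m_used = adaptive_decision(theta=0.0, prop=0.1, psi=0.0)
print("accept:", accept, " subsample used:", m_used, "of", n)
```

When the two parameter values are clearly distinguishable, the rule stops after looking at only a few thousand of the 100,000 points; as they get closer, the required subsample grows, which matches the adaptive behaviour described above.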

As for “exchanging acceptance noise for subsampling noise”, we do make a quick comment on the limiting distribution. Basically, you could analyze this algorithm using exactly the same tools as we used in our ICML’14 paper. But we chose not to develop this here, for the reason given in the next paragraph: this would basically yield a O(n) cost per iteration, again. Furthermore, it is not clear in practice how we could have an empirical version of Berry-Esseen.

June 22, 2015 at 12:09 pm

It has always seemed weird to me to trade off asymptotic bias and asymptotic variance. It makes sense, maybe, in a classical setting where we get near the asymptotic limit with a relatively small amount of data, but in MCMC we know that we’re nowhere near asymptopia, so it seems that this is the wrong target.

It’s just not clear how to get a finite-time bias/variance tradeoff…

(The paragraph at the bottom of section 4.2 seems to imply that pseudo-marginal Metropolis Hastings gives unbiased estimators of posterior functionals, but I must be reading that wrong…)

June 22, 2015 at 11:36 pm

Hi Dan. For our concentration inequality approach, I see the trade-off more as an “error in total variation against computational cost” trade-off. You’re explicitly trading a small bias in the limiting distribution (see Eqn 31) and a bit of the ergodicity constant (see Eqn 30) for the chance to require significantly fewer data points per iteration.

For your second comment on the paragraph at the end of Section 4.2, I see now that it’s slightly misleading, thanks for pointing this out. In the first part of Section 4.2, we try and fail to build a pseudo-marginal algorithm using Rhee and Glynn’s unbiasing technique. In the last paragraph of Section 4.2, we comment on the approach of Strathmann et al., who also use Rhee and Glynn’s unbiasing technique but for a rather different purpose: they propose to feed Rhee and Glynn’s unbiasing algorithm with estimates built from independent MCMC runs. This was also discussed on this blog some posts ago.