adaptive subsampling for MCMC

Oxford to Coventry, Feb. 25, 2012

“At equilibrium, we thus should not expect gains of several orders of magnitude.”

As was signaled to me several times during the MCqMC conference in Leuven, Rémi Bardenet, Arnaud Doucet and Chris Holmes (all from Oxford) just wrote a short paper for the proceedings of ICML on a way to speed up Metropolis-Hastings by reducing the number of terms one computes in the likelihood ratio involved in the acceptance probability, i.e.

\prod_{i=1}^n\frac{L(\theta^\prime|x_i)}{L(\theta|x_i)}.

The observations appearing in this likelihood ratio are a random subsample from the original sample. Even though this leads to an unbiased estimator of the true log-likelihood sum, this approach is not justified on a pseudo-marginal basis à la Andrieu-Roberts (2009). (Writing this in the train back to Paris, I am not convinced this approach is in fact applicable to this proposal as the likelihood itself is not estimated in an unbiased manner…)

In the paper, the quality of the approximation is evaluated by Hoeffding’s like inequalities, which serves as the basis for a stopping rule on the number of terms eventually evaluated in the random subsample. In fine, the method uses a sequential procedure to determine if enough terms are used to take the decision and the probability to take the same decision as with the whole sample is bounded from below. The sequential nature of the algorithm requires to either recompute the vector of likelihood terms for the previous value of the parameter or to store all of them for deriving the partial ratios. While the authors adress the issue of self-evaluating whether or not this complication is worth the effort, I wonder (from my train seat) why they focus so much on recovering the same decision as with the complete likelihood ratio and the same uniform. It would suffice to get the same distribution for the decision (an alternative that is easier to propose than to create of course). I also (idly) wonder if a Gibbs version would be manageable, i.e. by changing only some terms in the likelihood ratio at each iteration, in which case the method could be exact… (I found the above quote quite relevant as, in an alternative technique we are constructing with Marco Banterle, the speedup is particularly visible in the warmup stage.) Hence another direction in this recent flow of papers attempting to speed up MCMC methods against the incoming tsunami of “Big Data” problems.

2 Responses to “adaptive subsampling for MCMC”

  1. Thanks for your comments Christian. I’d like to mention a couple of points.

    * The methodology developed in the paper is not limited to large data sets but is applicable to any scenario where the Metropolis-Hastings ratio is intractable as long as one has exact confidence intervals for a Monte Carlo estimate of this ratio; e.g. this could be applied to doubly intractable targets where using standard pseudo-marginal techniques is often impossible as obtaining non-negative unbiased estimates requires perfect samples. It is also worth mentioning that the idea of using confidence intervals for deciding whether to accept/reject a proposal within Metropolis in such intractable scenarios is not original and has been proposed several times since the late 80’s in operation research (see the references in our paper). Unfortunately the approximate confidence intervals used in all previous work we are aware of can provide misleading results as demonstrated empirically in our paper. Our adaptive sampling strategy combined to exact confidence intervals is more conservative and more robust and it allows us to provide a quantitative bound between the target of interest and the perturbed target we are sampling from.

    * The large data sets scenario is not a particularly compelling application of the methodology as it is clearly acknowledged in the paper. The problem is that, in a large data set context, the estimate of the MH ratio is obtained using a “blind” Monte Carlo strategy and typically admits a large variance (as we might miss very informative observations). We could obviously use an importance sampling type strategy to reduce this variance (and consequently improve the performance of the algorithm) but this does not appear very realistic from a practical point of view in the large data set context.

    • I recently read this paper and wanted to briefly comment on it. I like the idea of incorporating concentration inequalities, and the algorithm makes intuitive sense (helped by Fig. 2). My concern, however, is that the actual computation of those concentration inequalities may be intractable in general. If you look at Equation 7 in this paper for Hoeffding’s inequality, the “C” term requires iterating through the whole data, and this can’t be done prior to the algorithm because the bounds are conditioned on the “theta” values being sampled. In some cases (e.g., the Gaussians discussed later) one can determine an equivalent bound analytically and thus avoid this computation, but unless I am missing something, will there be issues with computing the concentration inequalities in practice? Just curious.

      (I know this is coming a bit late so I don’t expect a response on this, but I just wanted to briefly comment. That’s all!)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.