## light and widely applicable MCMC: approximate Bayesian inference for large datasets

**F**lorian Maire (whose thesis was discussed in this post), Nial Friel, and Pierre Alquier (all in Dublin at some point) have arXived today a paper with the above title, aimed at quickly analysing large datasets. As reviewed in the early pages of the paper, this proposal follows a growing number of techniques advanced in the past years, like pseudo-marginals, Russian roulette, unbiased likelihood estimators, firefly Monte Carlo, adaptive subsampling, sub-likelihoods, telescoping debiased likelihood version, and even our very own delayed acceptance algorithm. (Which is incorrectly described as restricted to iid data, by the way!)

The lightweight approach is based on an ABC idea of working through a summary statistic that plays the role of a pseudo-sufficient statistic. The main theoretical result in the paper is indeed that, when subsampling in an exponential family, subsamples preserving the sufficient statistics (modulo a rescaling) are optimal in terms of distance to the true posterior. Subsamples are thus weighted in terms of the (transformed) difference between the full data statistic and the subsample statistic, assuming they are both normalised to be comparable. I am quite (positively) intrigued by this idea in that it allows one to compare, to some extent, inference based on two different samples. The weights of the subsets are then used in a pseudo-posterior that treats the subset as an auxiliary variable (and the weight as a substitute for the “missing” likelihood). This may sound a wee bit convoluted (!) but the algorithm description is not yet complete: simulating jointly from this pseudo-target is impossible because of the huge number of possible subsets. The authors thus suggest running an MCMC scheme targeting this joint distribution, with a proposed move on the set of subsets and a proposed move on the parameter set conditional on whether or not the proposed subset has been accepted.
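As a rough illustration of the scheme just described (and not the authors' code), here is a minimal Python sketch on a toy problem: inferring the mean θ of N(θ, 1) data under a flat prior, where the sample mean is the sufficient statistic. Everything specific in it is an assumption made for the example: the function name `lwa_mcmc_sketch`, the weight form exp(−ε·(difference of means)²), the one-point swap proposal on subsets, and the single random-walk step on θ at each iteration (the paper treats the parameter move differently depending on whether the subset was switched).

```python
import numpy as np

def lwa_mcmc_sketch(y, n=100, eps=1e4, n_iter=5000, step=0.02, seed=0):
    """Toy subset-refreshing sampler for the mean of N(theta, 1) data, flat prior.

    Hypothetical illustration only: weight form, proposals and names are assumptions.
    """
    rng = np.random.default_rng(seed)
    N = y.size
    s_full = y.mean()  # normalised full-data summary statistic

    def log_weight(idx):
        # Assumed subset weight: exp(-eps * (S(Y) - S(Y_U))^2), both S being means.
        return -eps * (s_full - y[idx].mean()) ** 2

    def log_pseudo_post(theta, idx):
        # Subsample Gaussian log-likelihood raised to the power N/n (flat prior).
        return -(N / n) * 0.5 * np.sum((y[idx] - theta) ** 2)

    in_set = np.zeros(N, dtype=bool)
    in_set[rng.choice(N, size=n, replace=False)] = True
    idx = np.flatnonzero(in_set)
    theta = y[idx].mean()
    thetas = np.empty(n_iter)
    gaps = np.empty(n_iter)

    for t in range(n_iter):
        # Subset move: swap one point out and one outside point in (symmetric proposal).
        out_pos = rng.integers(n)
        new = rng.integers(N)
        while in_set[new]:
            new = rng.integers(N)
        prop = idx.copy()
        prop[out_pos] = new
        if np.log(rng.uniform()) < log_weight(prop) - log_weight(idx):
            in_set[idx[out_pos]] = False
            in_set[new] = True
            idx = prop

        # Parameter move: one random-walk Metropolis step given the current subset.
        theta_prop = theta + step * rng.normal()
        log_ratio = log_pseudo_post(theta_prop, idx) - log_pseudo_post(theta, idx)
        if np.log(rng.uniform()) < log_ratio:
            theta = theta_prop

        thetas[t] = theta
        gaps[t] = abs(s_full - y[idx].mean())
    return thetas, gaps
```

Calling `thetas, gaps = lwa_mcmc_sketch(y)` on a simulated sample of 10⁴ points returns the θ chain and the running distance between the subset mean and the full-data mean. In this sketch the subset move is accepted on the weights alone, so only the subset chain has an obvious stationary distribution; no claim is made that the pair of moves leaves the pseudo-joint invariant.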

From an ABC perspective, the difficulty in calibrating the tolerance ε sounds more acute than usual, as the size of the subset comes as an additional computing parameter. Bootstrapping options seem impossible to implement in a large size setting.

An MCMC issue with this proposal is that designing the move across the subset space is both paramount for its convergence properties and lacking in geometric intuition. Indeed, two subsets with similar summary statistics may be very far apart… Funnily enough, in the representation of the joint Markov chain, the parameter subchain is secondary, if crucial to avoid intractable normalising constants. It is also unclear to me, from reading the paper maybe too quickly, whether or not the separate moves when switching and when not switching subsets retain the proper balance condition for the pseudo-joint to still be the stationary distribution. The stationarity for the subset Markov chain is straightforward by design, but it is not so for the parameter. In case of a switched subset, simulating from the true full conditional given the subset would work, but running a fixed number L of MCMC steps would not.
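The remark that two subsets with similar summary statistics may be very far apart can be checked numerically. The sketch below is purely illustrative, with a hypothetical function name and the sample mean assumed as the summary statistic: it draws candidate subsets from two disjoint halves of the data and returns the pair whose means are closest. The two subsets share no observation, so a one-point swap proposal would need as many swaps as the subset size to travel from one to the other.

```python
import numpy as np

def distant_subsets_with_close_means(y, n=100, n_candidates=200, seed=0):
    """Return two disjoint size-n subsets of y whose means nearly coincide."""
    rng = np.random.default_rng(seed)
    half = y.size // 2
    # Candidates for A use only the first half of the data, candidates for B
    # only the second half, so any pair (A, B) is disjoint by construction.
    cand_a = [rng.choice(half, size=n, replace=False)
              for _ in range(n_candidates)]
    cand_b = [half + rng.choice(y.size - half, size=n, replace=False)
              for _ in range(n_candidates)]
    means_a = np.array([y[c].mean() for c in cand_a])
    means_b = np.array([y[c].mean() for c in cand_b])
    # Absolute difference of means for every (A, B) pair; pick the smallest.
    diff = np.abs(means_a[:, None] - means_b[None, :])
    i, j = np.unravel_index(np.argmin(diff), diff.shape)
    return cand_a[i], cand_b[j], diff[i, j]
```

With `a, b, gap = distant_subsets_with_close_means(y)`, the returned `gap` is the difference between the two subset means, while `np.intersect1d(a, b)` is empty.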

The lightweight technology therein shows its muscles on a handwritten digit recognition example where it beats regular MCMC by a factor of 10 to 20, using only 100 datapoints instead of the 10⁴ original datapoints. While very nice and realistic, this example may be misleading in that 100 digit realisations may be enough to find a tolerable approximation to the true MAP. I was also intrigued by the processing of the probit example, until I realised the authors had integrated the covariate out and inferred about the mean of that covariate, which means it is not a genuine probit model.

March 24, 2015 at 4:24 pm

Many thanks, Christian, for reading and blogging about our paper.

The probit example is mostly presented here as a pedagogical (toy) example: we introduce it mainly as a means to illustrate our optimality result for exponential models. For the handwritten digits, you write “this example may be misleading in that 100 digits may be enough to find a tolerable approximation to the true MAP”. Here we want to make the important point that, in this case, if one uses a *fixed* sub-sample of 100 images throughout the algorithm, parameter estimation can be very poor. This is illustrated with the time series inference example, where we show that keeping a fixed subset yields a biased inference (see page 22). In particular, both examples show clear evidence that refreshing the sub-sample is a fundamental and novel aspect of the LWA MCMC methodology. Theoretically, the method indeed does not allow one to obtain samples from “the pseudo data-augmented” distribution: as you mentioned, switching (or refreshing) a subset disturbs the chain stability. Although of great interest, this remains an important open question. The main appealing aspect of LWA MCMC comes from the significant computational gain it allows (for a striking illustration of this, see the binary classification example!).