## ergodicity of approximate MCMC chains with applications to large datasets

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , on August 31, 2015 by xi'an

Another arXived paper I read on my way to Warwick! And yet another paper written by my friend Natesh Pillai (and his co-author Aaron Smith, from Ottawa). The goal of the paper is to study the ergodicity and the degree of approximation of the true posterior distribution of approximate MCMC algorithms that recently flourished as an answer to “Big Data” issues… [Comments below are about the second version of this paper.] One of the most curious results in the paper is the fact that the approximation may prove better than the original kernel, in terms of computing costs! If asymptotically in the computing cost. There also are acknowledged connections with the approximative MCMC kernel of Pierre Alquier, Neal Friel, Richard Everitt and A Boland, briefly mentioned in an earlier post.

The paper starts with a fairly theoretical part, to follow with an application to austerity sampling [and, in the earlier version of the paper, to the Hoeffding bounds of Bardenet et al., both discussed earlier on the ‘Og, to exponential random graphs (the paper being rather terse on the description of the subsampling mechanism), to stochastic gradient Langevin dynamics (by Max Welling and Yee-Whye Teh), and to ABC-MCMC]. The assumptions are about the transition kernels of a reference Markov kernel and of one associated with the approximation, imposing some bounds on the Wasserstein distance between those kernels, K and K’. Results being generic, there is no constraint as to how K is chosen or on how K’ is derived from K. Except in Lemma 3.6 and in the application section, where the same proposal kernel L is used for both Metropolis-Hastings algorithms K and K’. While I understand this makes for an easier coupling of the kernels, this also sounds like a restriction to me in that modifying the target begs for a similar modification in the proposal, if only because the tails they are a-changin’

In the case of subsampling the likelihood to gain computation time (as discussed by Korattikara et al. and by Bardenet et al.), the austerity algorithm as described in Algorithm 2 is surprising as the average of the sampled data log-densities and the log-transform of the remainder of the Metropolis-Hastings probability, which seem unrelated, are compared until they are close enough.  I also find hard to derive from the different approximation theorems bounding exceedance probabilities a rule to decide on the subsampling rate as a function of the overall sample size and of the computing cost. (As a side if general remark, I remain somewhat reserved about the subsampling idea, given that it requires the entire dataset to be available at every iteration. This makes parallel implementations rather difficult to contemplate.)

## lazy ABC

Posted in Books, Statistics, University life with tags , , , , , , , on June 9, 2014 by xi'an

“A more automated approach would be useful for lazy versions of ABC SMC algorithms.”

Dennis Prangle just arXived the work on lazy ABC he had presented in Oxford at the i-like workshop a few weeks ago. The idea behind the paper is to cut down massively on the generation of pseudo-samples that are “too far” from the observed sample. This is formalised through a stopping rule that puts the estimated likelihood to zero with a probability 1-α(θ,x) and otherwise divide the original ABC estimate by α(θ,x). Which makes the modification unbiased when compared with basic ABC. The efficiency appears when α(θ,x) can be computed much faster than producing the entire pseudo-sample and its distance to the observed sample. When considering an approximation to the asymptotic variance of this modification, Dennis derives a optimal (in the sense of the effective sample size) if formal version of the acceptance probability α(θ,x), conditional on the choice of a “decision statistic” φ(θ,x).  And of an importance function g(θ). (I do not get his Remark 1 about the case when π(θ)/g(θ) only depends on φ(θ,x), since the later also depends on x. Unless one considers a multivariate φ which contains π(θ)/g(θ) itself as a component.) This approach requires to estimate

$\mathbb{P}(d(S(Y),S(y^o))<\epsilon|\varphi)$

as a function of φ: I would have thought (non-parametric) logistic regression a good candidate towards this estimation, but Dennis is rather critical of this solution.

I added the quote above as I find it somewhat ironical: at this stage, to enjoy laziness, the algorithm has first to go through a massive calibration stage, from the selection of the subsample [to be simulated before computing the acceptance probability α(θ,x)] to the construction of the (somewhat mysterious) decision statistic φ(θ,x) to the estimation of the terms composing the optimal α(θ,x). The most natural choice of φ(θ,x) seems to be involving subsampling, still with a wide range of possibilities and ensuing efficiencies. (The choice found in the application is somehow anticlimactic in this respect.) In most ABC applications, I would suggest using a quick & dirty approximation of the distribution of the summary statistic.

A slight point of perplexity about this “lazy” proposal, namely the static role of ε, which is impractical because not set in stone… As discussed several times here, the tolerance is a function of many factors incl. all the calibration parameters of the lazy ABC, rather than an absolute quantity. The paper is rather terse on this issue (see Section 4.2.2). It seems to me that playing with a large collection of tolerances may be too costly in this setting.

## accelerated ABC

Posted in Books, Mountains, Statistics with tags , , , , , , on January 14, 2014 by xi'an

Richard Wilkinson arXived a paper on accelerated ABC during MCMSki 4, paper that I almost missed when quickly perusing the daily list. This is another illustration of the “invasion of Gaussian processes” in ABC settings. Maybe under the influence of machine learning.

The paper starts with a link to the synthetic likelihood approximation of Wood (2010, Nature), as in Richard Everitt’s talk last week. Richard (W.) presents the generalised ABC as a kernel-based acceptance probability, using a kernel π(y|x), when y is the observed data and x=x(θ) the simulated one. He proposes a Gaussian process modelling for the log-likelihood (at the observed data y), with a quadratic (in θ) mean and Matérn covariance matrix. Hence the connection with Wood’s synthetic likelihood. Another connection is with Nicolas’ talk on QMC(MC): the θ’s are chosen following a Sobol sequence “in order to minimize the number of design points”. Which requires a reparameterisation to [0,1]p… I find this “uniform” exploration of the whole parameter space delicate to envision in complex parameter spaces and realistic problems, since the likelihood is highly concentrated on a tiny subregion of the original [0,1]p. Not mentioning the issue of the spurious mass on the boundaries of the hypercube possibly induced by the change of variable. The sequential algorithm of Richard also attempts at eliminating implausible zones of the parameter space. i.e. zones where the likelihood is essentially zero. My worries with this interesting notion are that (a) the early Gaussian process approximations may be poor and hence exclude zones they should not; (b) all Gaussian process approximations at all iterations must be saved; (c) the Sobol sequences apply to the whole [0,1]p at each iteration but the non-implausible region shrinks at each iteration, which induces a growing inefficiency in the algorithm. The Sobol sequence should be restricted to the previous non-implausible zone.

Overall, an interesting proposal that would need more prodding to understand whether or not it is robust to poor initialisation and complex structures. And a proposal belonging to the estimated likelihood branch of ABC, which makes use of the final Gaussian process approximation to run an MCM algorithm. Without returning to pseudo-data simulation, replacing it with log-likelihood simulation.

“These algorithms sample space randomly and naively and do not learn from previous simulations”

The above criticism is moderated in a footnote about ABC-SMC using the “current parameter value to determine which move to make next [but] parameters visited in previous iterations are not taken into account”. I still find it excessive in that SMC algorithms and in particular ABC-SMC algorithms are completely free to use the whole past to build the new proposal. This was clearly enunciated in our earlier population Monte Carlo papers. For instance, the complete collection of past particles can be recycled by weights computing thru our AMIS algorithm, as illustrated by Jukka Corander in one genetics application.