Archive for ABC-MCMC

a simulated annealing approach to Bayesian inference

Posted in Books, pictures, Statistics, University life on October 1, 2015 by xi'an

Paris/Zürich, Oct. 3, 2011 A misleading title if any! Carlos Albert arXived a paper with this title this morning and I rushed to read it. Because it sounded like Bayesian analysis could be expressed as a special form of simulated annealing. But it happens to be a rather technical sequel [“that complies with physics standards”] to another paper I had missed, A simulated annealing approach to ABC, by Carlos Albert, Hans Künsch, and Andreas Scheidegger. Paper that appeared in Statistics and Computing last year, and which is most interesting!

“These update steps are associated with a flow of entropy from the system (the ensemble of particles in the product space of parameters and outputs) to the environment. Part of this flow is due to the decrease of entropy in the system when it transforms from the prior to the posterior state and constitutes the well-invested part of computation. Since the process happens in finite time, inevitably, additional entropy is produced. This entropy production is used as a measure of the wasted computation and minimized, as previously suggested for adaptive simulated annealing” (p.3)

The notion behind this simulated annealing intrusion into the ABC world is that the choice of the tolerance can be adapted along iterations according to a simulated annealing schedule. Both papers make use of thermodynamics notions that are completely foreign to me, like endoreversibility, but aim at minimising the “entropy production of the system, which is a measure for the waste of computation”. The central innovation is to introduce an augmented target on (θ,x) that is


where ε is the tolerance, while ρ(x,y) is a measure of distance to the actual observations, and to treat ε as an annealing temperature. In an ABC-MCMC implementation, the acceptance probability of a random walk proposal (θ’,x’) is then


Under some regularity constraints, the sequence of targets converges to


if ε decreases slowly enough to zero. While the representation of ABC-MCMC through kernels other than the Heaviside function can be found in the earlier ABC literature, the embedding of tolerance updating within the modern theory of simulated annealing is rather exciting.
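For readers who prefer symbols, the objects described above can be written out as follows. This is a sketch under standard ABC-MCMC assumptions and not necessarily the papers' exact expressions: π(θ) denotes the prior, f(x|θ) the simulator density, the random walk on θ is taken to be symmetric, and x' is drawn from f(·|θ').

```latex
% Assumed notation: \pi(\theta) prior, f(x|\theta) simulator density,
% \rho(x,y) distance to the observations y, \epsilon tolerance (temperature).
\begin{align*}
\pi_\epsilon(\theta,x) &\propto \pi(\theta)\,f(x\mid\theta)\,
    \exp\{-\rho(x,y)/\epsilon\},\\
\alpha\{(\theta,x),(\theta',x')\} &= \min\left\{1,\
    \frac{\pi(\theta')}{\pi(\theta)}\,
    \exp\left[\frac{\rho(x,y)-\rho(x',y)}{\epsilon}\right]\right\}.
\end{align*}
```

With this proposal the intractable f terms cancel in the Metropolis-Hastings ratio, which is what keeps the scheme likelihood-free, and the prior ratio disappears as well under a flat prior. As ε shrinks, the factor exp{-ρ(x,y)/ε} puts ever more weight on pseudo-data with ρ(x,y) close to zero, which is the sense in which the θ-marginal is expected to approach the posterior.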

“Furthermore, we will present an adaptive schedule that attempts convergence to the correct posterior while minimizing the required simulations from the likelihood. Both the jump distribution in parameter space and the tolerance are adapted using mean fields of the ensemble.” (p.2)

What I cannot infer from a rather quick perusal of the papers is whether or not the implementation gets in the way of the all-inclusive theory. For instance, how can the Markov chain keep moving as the tolerance gets to zero? Even with a particle population and a sequential Monte Carlo implementation, it is unclear why the proposal scale factor [as in equation (34)] does not collapse to zero in order to ensure a non-zero acceptance rate. In the published paper, the authors used the same toy mixture example as ours [from Sisson et al., 2007], where we earned the award of the “incredibly ugly squalid picture”, with improvements in the effective sample size, but this remains a toy example. (Hopefully a post to be continued in more depth…)
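The worry about the chain freezing can be seen on a toy problem. The sketch below is not the authors' algorithm: it assumes a made-up Gaussian simulator, a N(0, 10²) prior, a fixed (non-adaptive) random walk scale, and a hand-picked tolerance ladder, and it simply reports the acceptance rate at each tolerance level. All names are hypothetical.

```python
import math
import random

def abc_mcmc_annealed(y_obs, tolerances, steps_per_level=2000, step=1.0,
                      prior_sd=10.0, seed=0):
    """Toy ABC-MCMC with kernel exp(-rho/eps) and a decreasing tolerance.

    Assumed setup: theta ~ N(0, prior_sd^2), x | theta ~ N(theta, 1),
    rho(x, y) = |x - y|. Returns the final theta and one acceptance
    rate per tolerance level.
    """
    rng = random.Random(seed)
    theta = rng.gauss(0.0, prior_sd)
    x = rng.gauss(theta, 1.0)
    rates = []
    for eps in tolerances:
        accepted = 0
        for _ in range(steps_per_level):
            theta_new = theta + rng.gauss(0.0, step)  # symmetric random walk
            x_new = rng.gauss(theta_new, 1.0)         # fresh pseudo-data
            # log MH ratio: log prior ratio plus the change in rho over eps
            log_ratio = (theta ** 2 - theta_new ** 2) / (2.0 * prior_sd ** 2)
            log_ratio += (abs(x - y_obs) - abs(x_new - y_obs)) / eps
            # 1 - random() lies in (0, 1], so the log is always defined
            if math.log(1.0 - rng.random()) < log_ratio:
                theta, x = theta_new, x_new
                accepted += 1
        rates.append(accepted / steps_per_level)
    return theta, rates

if __name__ == "__main__":
    _, rates = abc_mcmc_annealed(1.0, [10.0, 1.0, 0.1, 0.01])
    print(rates)
```

With the scale kept fixed, the acceptance rate can be expected to fall sharply along the ladder, since a move is then only accepted when the new pseudo-data lands about as close to the observation as the current one. An adaptive scheme has to counter this in some way.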

ergodicity of approximate MCMC chains with applications to large datasets

Posted in pictures, Statistics, Travel, University life on August 31, 2015 by xi'an

Another arXived paper I read on my way to Warwick! And yet another paper written by my friend Natesh Pillai (and his co-author Aaron Smith, from Ottawa). The goal of the paper is to study the ergodicity and the degree of approximation of the true posterior distribution by approximate MCMC algorithms that recently flourished as an answer to “Big Data” issues… [Comments below are about the second version of this paper.] One of the most curious results in the paper is the fact that the approximation may prove better than the original kernel, in terms of computing costs! If asymptotically in the computing cost. There also are acknowledged connections with the approximate MCMC kernel of Pierre Alquier, Nial Friel, Richard Everitt and A Boland, briefly mentioned in an earlier post.

The paper starts with a fairly theoretical part, to follow with an application to austerity sampling [and, in the earlier version of the paper, to the Hoeffding bounds of Bardenet et al., both discussed earlier on the ‘Og, to exponential random graphs (the paper being rather terse on the description of the subsampling mechanism), to stochastic gradient Langevin dynamics (by Max Welling and Yee-Whye Teh), and to ABC-MCMC]. The assumptions are about the transition kernels of a reference Markov kernel and of one associated with the approximation, imposing some bounds on the Wasserstein distance between those kernels, K and K’. Results being generic, there is no constraint as to how K is chosen or on how K’ is derived from K. Except in Lemma 3.6 and in the application section, where the same proposal kernel L is used for both Metropolis-Hastings algorithms K and K’. While I understand this makes for an easier coupling of the kernels, this also sounds like a restriction to me in that modifying the target begs for a similar modification in the proposal, if only because the tails they are a-changin’…

In the case of subsampling the likelihood to gain computation time (as discussed by Korattikara et al. and by Bardenet et al.), the austerity algorithm as described in Algorithm 2 is surprising as the average of the sampled data log-densities and the log-transform of the remainder of the Metropolis-Hastings probability, which seem unrelated, are compared until they are close enough. I also find it hard to derive from the different approximation theorems bounding exceedance probabilities a rule to decide on the subsampling rate as a function of the overall sample size and of the computing cost. (As a side if general remark, I remain somewhat reserved about the subsampling idea, given that it requires the entire dataset to be available at every iteration. This makes parallel implementations rather difficult to contemplate.)
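To fix ideas on what an exact kernel K and a subsampled kernel K’ sharing the same proposal may look like, here is a deliberately naive sketch. It is not the austerity algorithm (there is no sequential test) and not the paper's construction: the N(θ,1) model, the flat prior, the fixed subsample size and every function name are assumptions made for illustration.

```python
import math
import random

def log_lik_terms(theta, data):
    """Per-observation N(theta, 1) log-densities, up to an additive constant."""
    return [-0.5 * (x - theta) ** 2 for x in data]

def mh_step(theta, data, rng, step=0.1, subsample=None):
    """One random-walk Metropolis-Hastings step under a flat prior.

    subsample=None uses the full-data log-likelihood ratio (kernel K).
    subsample=n estimates the ratio from n points drawn without
    replacement, rescaled by N/n (a naive approximate kernel K').
    """
    prop = theta + rng.gauss(0.0, step)  # same proposal for K and K'
    batch = data if subsample is None else rng.sample(data, subsample)
    scale = len(data) / len(batch)
    diff = sum(a - b for a, b in zip(log_lik_terms(prop, batch),
                                     log_lik_terms(theta, batch)))
    # 1 - random() lies in (0, 1], so the log is always defined
    if math.log(1.0 - rng.random()) < scale * diff:
        return prop
    return theta

def run_chain(data, n_iter, subsample=None, theta0=0.0, seed=0):
    """Run n_iter steps of either kernel and return the visited values."""
    rng = random.Random(seed)
    theta, chain = theta0, []
    for _ in range(n_iter):
        theta = mh_step(theta, data, rng, subsample=subsample)
        chain.append(theta)
    return chain
```

Running `run_chain(data, 2000)` against `run_chain(data, 2000, subsample=100)` on the same list `data` gives a crude feel for how far the two chains drift apart. The subsampled version still needs access to the whole dataset to draw its batch at each iteration, which is the reservation voiced above.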

astronomical evidence

Posted in pictures, Statistics, University life on July 24, 2015 by xi'an

As I have a huge arXiv backlog and an even higher non-arXiv backlog, I cannot be certain I will find time to comment on those three recent and quite exciting postings connecting ABC with astro- and cosmo-statistics [thanks to Ewan for pointing out those to me!]:

ABC à Montréal

Posted in Kids, pictures, Running, Statistics, Travel, University life on December 13, 2014 by xi'an

So today was the NIPS 2014 workshop, “ABC in Montréal”, which started with a fantastic talk by Juliane Liepe on some exciting applications of ABC to the migration of immune cells, with the analysis of movies involving those cells acting to heal a damaged fly wing and a cut fish tail. Quite amazing videos, really. (With the great entry line of ‘We have all cut a finger at some point in our lives’!) The statistical model behind those movies was a random walk on a grid, with different drift and bias features that served as model characteristics. Frank Wood managed to deliver his talk despite a severe case of food poisoning, with a great illustration of probabilistic programming that made me understand (at last!) the very idea of probabilistic programming. And Vikash Mansinghka presented some applications in image analysis. Those two talks led me to realise why probabilistic programming was so close to ABC, with a programming touch! Hence why I was invited to talk today! Then Dennis Prangle exposed his latest version of lazy ABC, that I have already commented on the ‘Og, somewhat connected with our delayed acceptance algorithm, to the point that maybe something common can stem out of the two notions. Michael Blum ended the day with provocative answers to the provocative question of Ted Meeds as to whether or not machine learning needed ABC (Ans. No!) and whether or not machine learning could help ABC (Ans. ???). With a happy mix-up between mechanistic and phenomenological models that helped generate discussion from the floor.

The posters were also of much interest, with calibration as a distance measure by Michael Gutmann, in continuation of the poster he gave at MCMski, Aaron Smith presenting his work with Luke Bornn, Natesh Pillai and Dawn Woodard, on why a single pseudo-sample is enough for ABC efficiency. This gave me the opportunity to discuss with him the apparent contradiction with the result of Krys Łatuszyński and Anthony Lee about the geometric convergence of ABC-MCMC only attained with a random number of pseudo-samples… And to wonder if there is a geometric versus binomial dilemma in this setting. Namely, whether or not simulating pseudo-samples until one is accepted would be more efficient than just running one and discarding it in case it is too far. So, although the audience was not that large (when compared with the other “ABC in…” and when considering the 2500+ attendees at NIPS over the week!), it was a great day where I learned a lot, did not have a doze during talks (!), [and even had an epiphany of sorts at the treadmill when I realised I just had to take longer steps to reach 16km/h without hyperventilating!] So thanks to my fellow organisers, Neil D Lawrence, Ted Meeds, Max Welling, and Richard Wilkinson for setting the program of that day! And, by the way, where’s the next “ABC in…”?! (Finland, maybe?)

another instance of ABC?

Posted in Statistics on December 2, 2014 by xi'an

“These characteristics are (1) likelihood is not available; (2) prior information is available; (3) a portion of the prior information is expressed in terms of functionals of the model that cannot be converted into an analytic prior on model parameters; (4) the model can be simulated. Our approach depends on an assumption that (5) an adequate statistical model for the data are available.”

A 2009 JASA paper by Ron Gallant and Rob McCulloch, entitled “On the Determination of General Scientific Models With Application to Asset Pricing”, may or may not have a connection with ABC, to wit the above quote, but I have trouble checking whether or not this is the case.

The true (scientific) model parametrised by θ is replaced with a (statistical) substitute that is available in closed form. And parametrised by g(θ). [If you can get access to the paper, I’d welcome opinions about Assumption 1 therein which states that the intractable density is equal to a closed-form density.] And the latter is over-parametrised when compared with the scientific model. As in, e.g., a N(θ,θ²) scientific model versus a N(μ,σ²) statistical model. In addition, the prior information is only available on θ. However, this does not seem to matter that much since (a) the Bayesian analysis is operated on θ only and (b) the Metropolis approach adopted by the authors involves simulating a massive number of pseudo-observations, given the current value of the parameter θ and the scientific model, so that the transform g(θ) can be estimated by maximum likelihood over the statistical model. The paper suggests using a secondary Markov chain algorithm to find this MLE. Which is claimed to be a simulated annealing resolution (p.121) although I do not see the temperature decreasing. The pseudo-model is then used in a primary MCMC step.

Hence, not truly an ABC algorithm. In the same setting, ABC would use a simulated dataset the same size as the observed dataset, compute the MLEs for both and compare them. Faster if less accurate when Assumption 1 [that the statistical model holds for a restricted parametrisation] does not stand.
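The ABC alternative mentioned here can be sketched on the N(θ,θ²) versus N(μ,σ²) pair used above. This is only an illustration under assumptions that are not in the paper: a Uniform(0.1, 10) prior on θ, a Euclidean distance between the two pairs of MLEs, and a fixed tolerance. All names are hypothetical.

```python
import math
import random

def normal_mle(data):
    """MLE (mu_hat, sigma_hat) under the N(mu, sigma^2) statistical model."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return mu, sigma

def abc_with_mle_summaries(observed, n_draws=20000, tol=0.3, seed=1):
    """ABC rejection for the N(theta, theta^2) scientific model.

    Each pseudo-dataset has the same size as the observed one, and the
    MLEs of the statistical model serve as summary statistics.
    """
    rng = random.Random(seed)
    mu_obs, sigma_obs = normal_mle(observed)
    kept = []
    for _ in range(n_draws):
        theta = rng.uniform(0.1, 10.0)                        # assumed prior
        pseudo = [rng.gauss(theta, theta) for _ in observed]  # sd equal to theta
        mu, sigma = normal_mle(pseudo)
        if math.hypot(mu - mu_obs, sigma - sigma_obs) < tol:
            kept.append(theta)
    return kept
```

Each proposed θ costs a single pseudo-dataset of the observed size rather than a massive simulation, which is where the gain in speed comes from, at the price of the usual ABC approximation through the tolerance.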

Another interesting aspect of the paper is about creating and using a prior distribution around the manifold η=g(θ). This clearly relates to my earlier query about simulating on measure zero sets. The paper does not bring a definitive answer, as it never simulates exactly on the manifold, but this constitutes another entry on this challenging problem…

thick disc formation scenario of the Milky Way evaluated by ABC

Posted in Statistics, University life on July 9, 2014 by xi'an

“The facts that the thick-disc episode lasted for several billion years, that a contraction is observed during the collapse phase, and that the main thick disc has a constant scale height with no flare argue against the formation of the thick disc through radial migration. The most probable scenario for the thick disc is that it formed while the Galaxy was gravitationally collapsing from well-mixed gas-rich giant clumps that were sustained by high turbulence, which prevented a thin disc from forming for a time, as proposed previously.”

Following discussions with astronomers from Besançon on the use of ABC methods to approximate posteriors, I was associated with their paper on assessing a formation scenario of the Milky Way, which was accepted a few weeks ago in Astronomy & Astrophysics. The central problem (was there a thin-then-thick disk?) somewhat escapes me, but this collaboration started when some of the astronomers leading the study contacted me about convergence issues with their MCMC algorithms and I realised they were using ABC-MCMC without any idea that it was in fact called ABC-MCMC and had been studied previously in another corner of the literature… The scale in the kernel was chosen to achieve an average acceptance rate of 5%-10%. Models are then compared by the combination of a log-likelihood approximation resulting from the ABC modelling and of a BIC ranking of the models. (Incidentally, I was impressed at the number of papers published in Astronomy & Astrophysics. The monthly issue contains dozens of papers!)

a pseudo-marginal perspective on the ABC algorithm

Posted in Mountains, pictures, Statistics, University life on May 5, 2014 by xi'an


My friends Luke Bornn, Natesh Pillai and Dawn Woodard just arXived along with Aaron Smith a short note on the convergence properties of ABC. When compared with acceptance-rejection or regular MCMC. Unsurprisingly, ABC does worse in both cases. What is central to this note is that ABC can be (re)interpreted as a pseudo-marginal method where the data comparison step acts like an unbiased estimator of the true ABC target (not of the original ABC target, mind!). From there, it is mostly an application of Christophe Andrieu’s and Matti Vihola’s results in this setup. The authors also argue that using a single pseudo-data simulation per parameter value is the optimal strategy (as compared with using several), when considering asymptotic variance. This makes sense in terms of simulating in a larger dimensional space but what of the cost of producing those pseudo-datasets against the cost of producing a new parameter? There are a few (rare) cases where the datasets are much cheaper to produce.
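The pseudo-marginal reading can be made concrete with a toy sampler in which the ABC likelihood at θ is estimated by the proportion of m pseudo-datasets falling within the tolerance. The Gaussian simulator, the flat prior on [-10, 10], the indicator kernel and all names below are assumptions for illustration, not the setting of the note.

```python
import random

def abc_likelihood_estimate(theta, y_obs, eps, m, rng):
    """Unbiased estimate of P(|x - y_obs| < eps | theta), with x ~ N(theta, 1),
    obtained by averaging m indicator draws from the toy simulator."""
    hits = sum(abs(rng.gauss(theta, 1.0) - y_obs) < eps for _ in range(m))
    return hits / m

def pseudo_marginal_abc(y_obs, eps=0.5, m=1, n_iter=5000, step=1.0, seed=0):
    """Pseudo-marginal ABC-MCMC under a flat prior on [-10, 10] (assumed)."""
    rng = random.Random(seed)
    theta, like = y_obs, 0.0
    while like == 0.0:  # start from a strictly positive estimate
        like = abc_likelihood_estimate(theta, y_obs, eps, m, rng)
    chain = []
    for _ in range(n_iter):
        prop = theta + rng.gauss(0.0, step)
        if abs(prop) <= 10.0:  # proposals outside the prior support are rejected
            like_prop = abc_likelihood_estimate(prop, y_obs, eps, m, rng)
            # the current estimate `like` is recycled, never refreshed
            if rng.random() * like < like_prop:
                theta, like = prop, like_prop
        chain.append(theta)
    return chain
```

With m=1 the estimate is either 0 or 1 and the sampler reduces to the usual ABC-MCMC accept/reject step on a single pseudo-dataset. Larger values of m reduce the variance of the estimate at m times the simulation cost per iteration, which is the trade-off questioned above.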

