Archive for ABC-SMC

at CIRM [#3]

Posted in Kids, Mountains, pictures, Running, Statistics, Travel, University life on March 4, 2016 by xi'an

Simon Barthelmé gave his mini-course on EP, with loads of details on the implementation of the method. Focussing on the EP-ABC and MCMC-EP versions today. Leaving open the difficulty of assessing to which limit EP is converging. But mentioning the potential for asynchronous EP (on which I would like to hear more). Ironically using several times a logistic regression example, if not on the Pima Indians benchmark! He also talked about approximate EP solutions that relate to consensus MCMC. With a connection to Mark Beaumont's talk at NIPS [at the same time as mine!] on the comparison with ABC. While we saw several talks on EP during this week, I am still agnostic about the potential of the approach. It certainly produces a fast proxy to the true posterior and hence can be exploited ad nauseam in inference methods based on pseudo-models like indirect inference. In conjunction with other quick and dirty approximations when available. As in ABC, it would be most useful to know how far from the (ideal) posterior distribution the approximation stands. Machine learning approaches presumably allow for an evaluation of the predictive performances, but less so for the modelling accuracy, even with new sampling steps. [But I know nothing, I know!]

Dennis Prangle presented some on-going research on high-dimensional [data] ABC. Raising the question of what is the true meaning of dimension in ABC algorithms. Or of sample size. Because the inference relies on the event d(s(y),s(y'))≤ξ or on the likelihood l(θ|x). Both one-dimensional. Mentioning Iain Murray's talk at NIPS [that I also missed]. Re-expressing as well the perspective that ABC can be seen as a missing or estimated normalising constant problem, as in Bornn et al. (2015), which I discussed earlier. The central idea is to use SMC to simulate a particle cloud evolving as the target tolerance ξ decreases. Which supposes a latent variable structure lurking in the background.
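To make the particle-cloud picture concrete, here is a minimal ABC-SMC sketch in R, in the spirit of Toni et al.'s sampler rather than of Dennis's construction; the toy normal-mean model, the tolerance ladder, and the perturbation kernel are all illustrative choices of mine.

```r
## Minimal ABC-SMC sketch: a particle cloud evolving as the tolerance decreases.
## Toy model y ~ N(theta, 1) with prior theta ~ N(0, 10); all tuning is mine.
set.seed(1)
yobs <- rnorm(20, mean = 2)
sobs <- mean(yobs)                              # summary statistic
N    <- 1000                                    # particle cloud size
xis  <- c(1, 0.5, 0.25, 0.1)                    # decreasing tolerance schedule
prior_sd <- sqrt(10)
sim_dist <- function(t) abs(mean(rnorm(20, t)) - sobs)

## stage 0: plain rejection ABC from the prior at the loosest tolerance
theta <- w <- numeric(N); i <- 0
while (i < N) {
  t0 <- rnorm(1, 0, prior_sd)
  if (sim_dist(t0) <= xis[1]) { i <- i + 1; theta[i] <- t0; w[i] <- 1 / N }
}

for (xi in xis[-1]) {
  tau <- sqrt(2 * var(theta))                   # perturbation kernel scale
  new_theta <- new_w <- numeric(N); i <- 0
  while (i < N) {
    tstar <- rnorm(1, sample(theta, 1, prob = w), tau)   # move a resampled particle
    if (sim_dist(tstar) <= xi) {
      i <- i + 1
      new_theta[i] <- tstar
      ## importance weight: prior over the mixture of perturbation kernels
      new_w[i] <- dnorm(tstar, 0, prior_sd) /
                  sum(w * dnorm(tstar, theta, tau))
    }
  }
  theta <- new_theta; w <- new_w / sum(new_w)
}
weighted.mean(theta, w)                         # crude ABC posterior mean
```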

Judith Rousseau gave her talk on non-parametric mixtures and the possibility to learn parametrically about the component weights. Starting with a rather "magic" result by Allman et al. (2009) that, with three repeated observations per individual, all terms in a mixture are identifiable. Maybe related to the simpler fact that mixtures of Bernoullis are not identifiable while mixtures of Binomials are identifiable, even when n=2. As "shown" in this plot made for X validated. Actually truly related because Allman et al. (2009) prove identifiability through a finite dimensional model. (I am surprised I missed this most interesting paper!) With the side condition that a mixture of p components made of r Bernoulli products is identifiable when r ≥ 2⌈log₂ p⌉ + 1, where log₂ is the base-2 logarithm and ⌈x⌉ the upper rounding. I also find most relevant this distinction between the weights and the remainder of the mixture, as weights behave quite differently, hardly parameters in a sense.
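For what it is worth, the Bernoulli versus Binomial(n=2) remark can be checked numerically in a few lines of R (my own toy weights and probabilities, not the plot from X validated):

```r
## Two-component Bernoulli mixture: the marginal pmf only depends on the overall
## success probability, so it collapses to a single Bernoulli (non-identifiable).
w <- 0.3; p1 <- 0.2; p2 <- 0.8
pbar <- w * p1 + (1 - w) * p2
rbind(mixture = w * dbinom(0:1, 1, p1) + (1 - w) * dbinom(0:1, 1, p2),
      single  = dbinom(0:1, 1, pbar))             # identical rows

## Two-component Binomial(2) mixture: no single Binomial(2, q) reproduces it.
mix2 <- w * dbinom(0:2, 2, p1) + (1 - w) * dbinom(0:2, 2, p2)
q    <- sum((0:2) * mix2) / 2                     # moment-matched candidate
rbind(mixture = mix2, single = dbinom(0:2, 2, q)) # rows differ
```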

a simulated annealing approach to Bayesian inference

Posted in Books, pictures, Statistics, University life on October 1, 2015 by xi'an

Paris/Zürich, Oct. 3, 2011

A misleading title if any! Carlos Albert arXived a paper with this title this morning and I rushed to read it. Because it sounded like Bayesian analysis could be expressed as a special form of simulated annealing. But it happens to be a rather technical sequel ["that complies with physics standards"] to another paper I had missed, A simulated annealing approach to ABC, by Carlos Albert, Hans Künsch, and Andreas Scheidegger. Paper that appeared in Statistics and Computing last year, and which is most interesting!

“These update steps are associated with a flow of entropy from the system (the ensemble of particles in the product space of parameters and outputs) to the environment. Part of this flow is due to the decrease of entropy in the system when it transforms from the prior to the posterior state and constitutes the well-invested part of computation. Since the process happens in finite time, inevitably, additional entropy is produced. This entropy production is used as a measure of the wasted computation and minimized, as previously suggested for adaptive simulated annealing” (p.3)

The notion behind this simulated annealing intrusion into the ABC world is that the choice of the tolerance can be adapted along iterations according to a simulated annealing schedule. Both papers make use of thermodynamics notions that are completely foreign to me, like endoreversibility, but aim at minimising the “entropy production of the system, which is a measure for the waste of computation”. The central innovation is to introduce an augmented target on (θ,x) that is
π_ε(θ,x) ∝ π(θ) f(x|θ) exp{−ρ(x,y)/ε}
where ε is the tolerance, while ρ(x,y) is a measure of distance to the actual observations, and to treat ε as an annealing temperature. In an ABC-MCMC implementation, the acceptance probability of a random walk proposal (θ’,x’) is then
min{ 1, [π(θ′)/π(θ)] exp{[ρ(x,y) − ρ(x′,y)]/ε} }

[the f(x′|θ′) terms cancelling since x′ is simulated from the model]
Under some regularity constraints, the sequence of targets converges to
π(θ|y) ∝ π(θ) f(y|θ)
if ε decreases slowly enough to zero. While the representation of ABC-MCMC through kernels other than the Heaviside function can be found in the earlier ABC literature, the embedding of tolerance updating within the modern theory of simulated annealing is rather exciting.
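As a rough illustration of the mechanism (and not of Albert et al.'s actual adaptive schedule), the following R sketch runs ABC-MCMC with the exponential kernel exp{−ρ(x,y)/ε} and a hand-picked geometric cooling of ε; the toy model, flat prior, and cooling rate are all my own choices.

```r
## ABC-MCMC with kernel exp{-rho(x,y)/eps} and a slowly decreasing tolerance.
## Toy model y ~ N(theta, 1); flat prior and geometric cooling are my choices.
set.seed(2)
yobs <- rnorm(30, mean = 1)
rho  <- function(x) abs(mean(x) - mean(yobs))    # distance to the observations

niter <- 5000
eps   <- 1                                       # initial tolerance / temperature
theta <- 0
x     <- rnorm(30, theta)                        # current pseudo-data
draws <- numeric(niter)

for (t in 1:niter) {
  eps    <- max(0.995 * eps, 0.01)               # geometric cooling, floored
  thetap <- theta + rnorm(1, 0, 0.5)             # random walk proposal
  xp     <- rnorm(30, thetap)                    # pseudo-data from the model
  ## flat prior + symmetric proposal: the ratio reduces to the kernel ratio
  if (log(runif(1)) < (rho(x) - rho(xp)) / eps) { theta <- thetap; x <- xp }
  draws[t] <- theta
}
mean(draws[-(1:1000)])                           # rough ABC posterior mean
```

Freezing the tolerance at a small floor, as above, is precisely the kind of ad hoc fix the adaptive schedule of the paper is meant to avoid.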

"Furthermore, we will present an adaptive schedule that attempts convergence to the correct posterior while minimizing the required simulations from the likelihood. Both the jump distribution in parameter space and the tolerance are adapted using mean fields of the ensemble." (p.2)

What I cannot infer from a rather quick perusal of the papers is whether or not the implementation gets in the way of the all-inclusive theory. For instance, how can the Markov chain keep moving as the tolerance gets to zero? Even with a particle population and a sequential Monte Carlo implementation, it is unclear why the proposal scale factor [as in equation (34)] does not collapse to zero in order to ensure a non-zero acceptance rate. In the published paper, the authors used the same toy mixture example as ours [from Sisson et al., 2007], where we earned the award of the "incredibly ugly squalid picture", with improvements in the effective sample size, but this remains a toy example. (Hopefully a post to be continued in more depth…)

ABC by population annealing

Posted in Statistics, University life on January 6, 2015 by xi'an

The paper "Bayesian Parameter Inference and Model Selection by Population Annealing in Systems Biology" by Yohei Murakami got published in PLoS One last August, but I only became aware of it when ResearchGate pointed it out to me [by mentioning that one of our ABC papers was cited there].

“We are recommended to try a number of annealing schedules to check the influence of the schedules on the simulated data (…) As a whole, the simulations with the posterior parameter ensemble could, not only reproduce the data used for parameter inference, but also capture and predict the data which was not used for parameter inference.”

Population annealing is a notion introduced by Y. Iba, the very same Iba who introduced the notion of population Monte Carlo that we studied in subsequent papers. It reproduces the setting found in many particle filter papers of a sequence of (annealed or rather tempered) targets ranging from an easy (i.e., almost flat) target to the genuine target, and of an update of a particle set by MCMC moves and reweighting. I actually have trouble perceiving the difference with other sequential Monte Carlo schemes such as those exposed in Del Moral, Doucet and Jasra (2006, Series B). And the same is true of the ABC extension covered in this paper. (Where the annealed intermediate targets correspond to larger tolerances.) This sounds like a traditional ABC-SMC algorithm. Without the adaptive scheme on the tolerance ε found e.g. in Del Moral et al., since the sequence is set in advance. [However, the discussion about the implementation includes the above quote that suggests a vague form of cross-validated tolerance construction.] The approximation of the marginal likelihood also sounds standard, the marginal being approximated by the proportion of accepted pseudo-samples. Or more exactly by the sum of the SMC weights at the end of the annealing simulation. This actually raises several questions: (a) this estimator is always between 0 and 1, while the marginal likelihood is not restricted [but this is due to a missing 1/ε in the likelihood estimate that cancels from both numerator and denominator]; (b) seeing the kernel as a non-parametric estimate of the likelihood led me to wonder why different ε could not be used for different models, in that the pseudo-data used for each model under comparison differs. If we were in a genuine non-parametric setting the bandwidth would be derived from the pseudo-data.
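For the simplest version of that evidence approximation (the proportion of accepted prior predictive pseudo-samples), a toy R sketch makes point (a) explicit; the model, the tolerance, and the uniform-kernel rescaling by 1/(2ε) are my own illustrative choices.

```r
## Basic ABC evidence approximation: proportion of prior predictive pseudo-samples
## falling within the tolerance. Toy model y ~ N(theta, 1), prior theta ~ N(0, 10).
set.seed(3)
yobs  <- rnorm(10, mean = 1); sobs <- mean(yobs)
N     <- 1e5; eps <- 0.1
theta <- rnorm(N, 0, sqrt(10))
d     <- abs(sapply(theta, function(t) mean(rnorm(10, t))) - sobs)
mean(d <= eps)               # always in (0, 1), as noted in point (a)
mean(d <= eps) / (2 * eps)   # rescaled by the uniform-kernel constant, which
                             # cancels from Bayes factors when the same eps is used
```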

“Thus, Bayesian model selection by population annealing is valid.”

The discussion about the use of ABC population annealing somewhat misses the point of using ABC, which is to approximate the genuine posterior distribution, to wit the above quote: that the ABC Bayes factors favour the correct model in the simulation tells nothing about the degree of approximation wrt the original Bayes factor. [The issue of non-consistent Bayes factors does not apply here as there is no summary statistic applied to the few observations in the data.] Further, the magnitude of the variability of the values of this Bayes factor as ε varies, from 1.3 to 9.6, mostly indicates that the numerical value is difficult to trust. (I also fail to explain the huge jump in Monte Carlo variability from 0.09 to 1.17 in Table 1.) That this form of ABC-SMC improves upon the basic ABC rejection approach is clear. However, it needs to build in some self-control to avoid arbitrary calibration steps and to reduce the instability of the final estimates.

"The weighting function is set to be large value when the observed data and the simulated data are 'close', small value when they are 'distant', and constant when they are 'equal'."

The above quote is somewhat surprising as the estimated likelihood f(xobs|xobs,θ) is naturally constant when xobs=xsim… I also failed to understand how the model intervened in the indicator function used as a default ABC kernel.

ABC à Montréal

Posted in Kids, pictures, Running, Statistics, Travel, University life on December 13, 2014 by xi'an

So today was the NIPS 2014 workshop, "ABC in Montréal", which started with a fantastic talk by Juliane Liepe on some exciting applications of ABC to the migration of immune cells, with the analysis of movies involving those cells acting to heal a damaged fly wing and a cut fish tail. Quite amazing videos, really. (With the great entry line of 'We have all cut a finger at some point in our lives'!) The statistical model behind those movies was a random walk on a grid, with different drift and bias features that served as model characteristics. Frank Wood managed to deliver his talk despite a severe case of food poisoning, with a great illustration of probabilistic programming that made me understand (at last!) the very idea of probabilistic programming. And Vikash Mansinghka presented some applications in image analysis. Those two talks led me to realise why probabilistic programming was so close to ABC, with a programming touch! Hence why I was invited to talk today! Then Dennis Prangle exposed his latest version of lazy ABC, that I have already commented on the 'Og, somewhat connected with our delayed acceptance algorithm, to the point that maybe something common can stem out of the two notions. Michael Blum ended the day with provocative answers to the provocative question of Ted Meeds as to whether or not machine learning needed ABC (Ans. No!) and whether or not machine learning could help ABC (Ans. ???). With a happy mix-up between mechanistic and phenomenological models that helped generate discussion from the floor.

The posters were also of much interest, with calibration as a distance measure by Michael Gutmann, in continuation of the poster he gave at MCMski, and Aaron Smith presenting his work with Luke Bornn, Natesh Pillai and Dawn Woodard on why a single pseudo-sample is enough for ABC efficiency. This gave me the opportunity to discuss with him the apparent contradiction with the result of Krys Łatuszyński and Anthony Lee about the geometric convergence of ABC-MCMC only being attained with a random number of pseudo-samples… And to wonder if there is a geometric versus binomial dilemma in this setting. Namely, whether or not simulating pseudo-samples until one is accepted would be more efficient than just running one and discarding it in case it is too far. So, although the audience was not that large (when compared with the other "ABC in…" workshops and when considering the 2500+ attendees at NIPS over the week!), it was a great day where I learned a lot, did not doze off during talks (!), [and even had an epiphany of sorts on the treadmill when I realised I just had to take longer steps to reach 16km/h without hyperventilating!] So thanks to my fellow organisers, Neil D. Lawrence, Ted Meeds, Max Welling, and Richard Wilkinson for setting the program of that day! And, by the way, where's the next "ABC in…"?! (Finland, maybe?)

lazy ABC

Posted in Books, Statistics, University life on June 9, 2014 by xi'an

“A more automated approach would be useful for lazy versions of ABC SMC algorithms.”

Dennis Prangle just arXived the work on lazy ABC he had presented in Oxford at the i-like workshop a few weeks ago. The idea behind the paper is to cut down massively on the generation of pseudo-samples that are "too far" from the observed sample. This is formalised through a stopping rule that sets the estimated likelihood to zero with probability 1−α(θ,x) and otherwise divides the original ABC estimate by α(θ,x). Which makes the modification unbiased when compared with basic ABC. The efficiency appears when α(θ,x) can be computed much faster than producing the entire pseudo-sample and its distance to the observed sample. When considering an approximation to the asymptotic variance of this modification, Dennis derives an optimal (in the sense of the effective sample size), if formal, version of the acceptance probability α(θ,x), conditional on the choice of a "decision statistic" φ(θ,x). And of an importance function g(θ). (I do not get his Remark 1 about the case when π(θ)/g(θ) only depends on φ(θ,x), since the latter also depends on x. Unless one considers a multivariate φ which contains π(θ)/g(θ) itself as a component.) This approach requires one to estimate


as a function of φ: I would have thought (non-parametric) logistic regression a good candidate for this estimation, but Dennis is rather critical of this solution.
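The stopping-rule mechanism itself is easy to mimic; here is a toy R sketch where the decision statistic, the continuation probability α, and the model are all ad hoc choices of mine (nothing like Dennis's calibrated optimum), simply to show the unbiased reweighting at work:

```r
## Lazy ABC in caricature: abandon the simulation with probability 1 - alpha
## based on a cheap partial run, and divide the surviving ABC weights by alpha.
set.seed(4)
yobs <- rnorm(50, mean = 1); sobs <- mean(yobs); eps <- 0.05

lazy_weight <- function(theta) {
  x1   <- rnorm(10, theta)                      # cheap partial simulation
  phi  <- abs(mean(x1) - sobs)                  # ad hoc decision statistic
  alph <- max(0.05, exp(-5 * phi))              # ad hoc continuation probability
  if (runif(1) > alph) return(0)                # stop early: weight set to zero
  x2 <- c(x1, rnorm(40, theta))                 # complete the costly simulation
  as.numeric(abs(mean(x2) - sobs) <= eps) / alph   # de-biased ABC weight
}

theta <- rnorm(1e4, 0, sqrt(10))                # importance function g = prior here
w     <- sapply(theta, lazy_weight)
weighted.mean(theta, w)                         # lazy ABC estimate of the posterior mean
```

Since the surviving weights are divided by α, their conditional expectation given (θ, x1) is unchanged by the stopping step, which is all the unbiasedness amounts to.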

I added the quote above as I find it somewhat ironical: at this point, to enjoy laziness, the algorithm first has to go through a massive calibration stage, from the selection of the subsample [to be simulated before computing the acceptance probability α(θ,x)] to the construction of the (somewhat mysterious) decision statistic φ(θ,x) to the estimation of the terms composing the optimal α(θ,x). The most natural choice of φ(θ,x) seems to involve subsampling, still with a wide range of possibilities and ensuing efficiencies. (The choice found in the application is somewhat anticlimactic in this respect.) In most ABC applications, I would suggest using a quick and dirty approximation of the distribution of the summary statistic.

A slight point of perplexity about this "lazy" proposal is the static role of ε, which is impractical because the tolerance is never set in stone in practice… As discussed several times here, the tolerance is a function of many factors, incl. all the calibration parameters of the lazy ABC, rather than an absolute quantity. The paper is rather terse on this issue (see Section 4.2.2). It seems to me that playing with a large collection of tolerances may be too costly in this setting.

¼th i-like workshop in St. Anne’s College, Oxford

Posted in pictures, Statistics, Travel, University life on March 27, 2014 by xi'an

Due to my previous travelling to and from Nottingham for the seminar, and getting back home early enough to avoid the dreary evening trains from Roissy airport (no luck there, even at 8pm the RER train was not operating efficiently!, and no fast lane is planned prior to 2023…), I did not see many talks at the i-like workshop. About ¼th, roughly… I even missed the poster session (and the most attractive title of Lazy ABC by Dennis Prangle) thanks to another dreary train ride from Derby to Oxford.

As it happened, I had already heard or read parts of the talks in the Friday morning session, but this helped me understand them better. As in Banff, Paul Fearnhead's talk on reparameterisations for pMCMC on hidden Markov models opened a wide door to possible experiments on those algorithms. The examples in the talk were mostly of the parameter duplication type, somewhat creating unidentifiability to decrease correlation, but I also wondered at the possibility of introducing frequent replicas of the hidden chain in order to fight degeneracy. Then Sumeet Singh gave a talk on the convergence properties of noisy ABC for approximate MLE. Although I had read some of the papers behind the talk, it made me realise how keeping balls around each observation in the ABC acceptance step was not leading to extinction as the number of observations increased. (Sumeet also had a good line with his ABCDE algorithm, standing for ABC done exactly!) Anthony Lee covered his joint work with Krys Łatuszyński on the ergodicity conditions on the ABC-MCMC algorithm, the only positive case being the 1-hit algorithm as discussed in an earlier post. This result will hopefully get more publicity, as I frequently read that increasing the number of pseudo-samples has no clear impact on the ABC approximation. Krys Łatuszyński concluded the morning with an aggregate of the various results he and his co-authors had obtained on the fascinating Bernoulli factory. Including constructive derivations.
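Since the 1-hit kernel keeps coming up on this blog, here is my rough recollection of its race construction as an R toy example (see Lee & Łatuszyński's paper for the exact kernel; the flat prior, symmetric proposal, and tuning constants are mine):

```r
## Rough sketch of a 1-hit ABC-MCMC race: current and proposed values simulate
## pseudo-data in turn until one of them lands within the tolerance ball, and the
## proposal is accepted if it hits no later than the current value.
set.seed(5)
yobs <- rnorm(20, mean = 1); sobs <- mean(yobs); eps <- 0.1
hit  <- function(theta) abs(mean(rnorm(20, theta)) - sobs) <= eps

niter <- 2000; theta <- sobs; draws <- numeric(niter)
for (t in 1:niter) {
  thetap <- theta + rnorm(1, 0, 0.5)            # symmetric random walk proposal
  ## with a flat prior the usual prior/proposal ratio step is trivially passed
  repeat {
    hp <- hit(thetap); hc <- hit(theta)         # one fresh pseudo-sample each
    if (hp || hc) break                         # race until the first hit
  }
  if (hp) theta <- thetap                       # proposal hit (no later): accept
  draws[t] <- theta
}
mean(draws[-(1:500)])
```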

After a few discussions on and around research topics, it was too soon time to take advantage of the grand finale of a March shower to walk from St. Anne's College to Oxford Station, in order to start the trip back home. I was lucky enough to find a seat and could start experimenting in R with the new idea my trip to Nottingham had raised! While discussing a wee bit with my neighbour, a delightful old lady from the New Forest travelling to Coventry, recovering from a brain seizure, wondering about my LaTeX code syntax despite the tiny fonts, and who most suddenly popped a small screen from her bag to start playing Candy Crush!, apologizing all the same. The overall trip was just long enough for my R code to validate this idea of mine, making this week in England quite a profitable one!!!

threshold schedules for ABC-SMC (reply from the authors)

Posted in Statistics on October 19, 2012 by xi'an

(Below is a reply from the authors of threshold schedules for ABC-SMC I discussed two days ago.)

In our experience the problems that we seek to address arise routinely if adaptive quantile schedules are employed, and are not due to poor initial exploration of the parameter space. Rather, when sub-optimal quantiles are chosen we observe what has often been called the "survival of the fattest". Here a finite population of particles is easily accepted into regions that satisfy some less stringent cost function (e.g. simulated annealing at high temperature), but thereby drifts (in the genetic sense) away from better solutions that would also satisfy tighter thresholds. We do not claim to be the first to point this out; but we believe that, depending on the problem, any quantile method will be susceptible to this. A similar point has been made, perhaps more beautifully, by Calvet and Czellar (2012).

The toy example is maybe slightly misleading in the sense that parameter-space and data-space can be mapped bijectively. It is not, however, fanciful: it is an illustration of “survival of the fattest”. And it also demonstrates that reliance on a fixed quantile (say 10%) for any type of problem is likely to result in systematically biased, plainly incorrect “posteriors” in some if not most cases.
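For reference, the "quantile method" under discussion simply sets the next tolerance to a fixed quantile of the current distances, along the lines of this small R helper (the placeholder distances are mine, purely for illustration):

```r
## The adaptive quantile schedule being criticised: epsilon_{t+1} is the q-quantile
## (say 10%) of the distances attached to the current particle population.
next_tolerance <- function(distances, q = 0.10) unname(quantile(distances, probs = q))

set.seed(6)
d <- rexp(1000, rate = 2)        # placeholder distances from a current population
next_tolerance(d)                # the 10% quantile becomes the next threshold
```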

We do not claim that the optimal ε is zero. But clearly it is not the one dictated by some fixed computational cost you want to get away with (as is the case when people take the best n out of N simulations). What we try to argue is that the threshold should be set based on the problem under consideration. Therefore the choice of ε should make use of the hypothetical acceptance curve, ℵ.

Clearly we do not know ℵ but show that the UT is one way of second-guessing its shape. That each guess depends on the present population is both obvious and necessary.

A CDF-based estimate for ℵ using the empirical distances from the previous population would ignore the role of the perturbation kernel that we employ (and which should also be problem-specific). This, we found (and in hindsight it should have been obvious), has a huge impact on the efficiency, in particular on the validity of ℵ. Of course, if we already had the CDF of the present target population (i.e. the whole set of perturbed particles) we would be fine. But knowing the answer makes any question look easy. An answer to the types of problems that interest us is definitely obtained more quickly when we use the UT.

Maybe the problems that we are dealing with in our “day-jobs” (in the analysis of biomolecular and non-linear dynamical systems) make the shortcomings of the “quantile-method” more readily apparent; against this background the proposed approach is perhaps more easily understood.

