Archive for ABC-SMC

¼th i-like workshop in St. Anne’s College, Oxford

Posted in pictures, Statistics, Travel, University life on March 27, 2014 by xi'an

Due to my previous travelling to and from Nottingham for the seminar, and to heading back home early enough to avoid the dreary evening trains from Roissy airport (no luck there: even at 8pm the RER train was not operating efficiently! and no fast lane is planned prior to 2023…), I did not see many talks at the i-like workshop. About ¼th, roughly… I even missed the poster session (and the most attractive title of Lazy ABC by Dennis Prangle) thanks to another dreary train ride from Derby to Oxford.

As it happened, I had already heard or read parts of the talks in the Friday morning session, but hearing them again made them easier to follow. As in Banff, Paul Fearnhead's talk on reparameterisations for pMCMC on hidden Markov models opened a wide door to possible experiments on those algorithms. The examples in the talk were mostly of the parameter-duplication type, somewhat creating unidentifiability in order to decrease correlation, but I also wondered about the possibility of introducing frequent replicas of the hidden chain in order to fight degeneracy. Then Sumeet Singh gave a talk on the convergence properties of noisy ABC for approximate MLE. Although I had read some of the papers behind the talk, it made me realise how keeping balls around each observation in the ABC acceptance step does not lead to extinction as the number of observations increases. (Sumeet also had a good line with his ABCDE algorithm, standing for ABC done exactly!) Anthony Lee covered his joint work with Krys Łatuszyński on the ergodicity conditions for the ABC-MCMC algorithm, the only positive case being the 1-hit algorithm discussed in an earlier post. This result will hopefully get more publicity, as I frequently read that increasing the number of pseudo-samples has no clear impact on the ABC approximation. Krys Łatuszyński concluded the morning with an aggregate of the various results he and his co-authors had obtained on the fascinating Bernoulli factory, including constructive derivations.

After a few discussions on and around research topics, it was all too soon time to take advantage of the grand finale of a March shower to walk from St. Anne's College to Oxford Station and start the trip back home. I was lucky enough to find a seat and could start experimenting in R with the new idea my trip to Nottingham had raised! I also chatted a wee bit with my neighbour, a delightful old lady from the New Forest travelling to Coventry, recovering from a brain seizure, who wondered about my LaTeX syntax despite the tiny fonts and then most suddenly popped a small screen out of her bag to start playing Candy Crush, apologising all the same! The overall trip was just long enough for my R code to validate this idea of mine, making this week in England quite a profitable one!!!

threshold schedules for ABC-SMC (reply from the authors)

Posted in Statistics on October 19, 2012 by xi'an

(Below is a reply from the authors of threshold schedules for ABC-SMC, the paper I discussed two days ago.)

In our experience the problems that we seek to address arise routinely if adaptive quantile schedules are employed, and they are not due to poor initial exploration of the parameter space. Rather, when sub-optimal quantiles are chosen we observe what has often been called the “survival of the fattest”: a finite population of particles easily gets accepted into regions that satisfy some less stringent cost function (e.g. simulated annealing at high temperature), but thereby drifts (in the genetic sense) away from better solutions that would also satisfy tighter thresholds. We do not claim to be the first to point this out; but we believe that, depending on the problem, any quantile method will be susceptible to this. A similar point has been made, perhaps more beautifully, by Calvet and Czellar (2012).

The toy example is maybe slightly misleading in the sense that parameter-space and data-space can be mapped bijectively. It is not, however, fanciful: it is an illustration of “survival of the fattest”. And it also demonstrates that reliance on a fixed quantile (say 10%) for any type of problem is likely to result in systematically biased, plainly incorrect “posteriors” in some if not most cases.

We do not claim that the optimal ε is zero. But clearly it is not the one dictated by some fixed computational cost you want to get away with (as is the case when people take the best n out of N simulations). What we try to argue is that the threshold should be set based on the problem under consideration. Therefore the choice of ε should make use of the hypothetical acceptance curve, ℵ.

Clearly we do not know ℵ, but we show that the unscented transform (UT) is one way of second-guessing its shape. That each guess depends on the present population is both obvious and necessary.

A CDF-based estimate for ℵ using the empirical distances from the previous population would ignore the role of the perturbation kernel that we employ (and which should also be problem-specific). This, we found (and with hindsight it should have been obvious), has a huge impact on the efficiency, and in particular on the validity, of ℵ. Of course, if we already had the CDF of the present target population (i.e. the whole set of perturbed particles) we would be fine; but knowing the answer makes any question look easy. An answer to the types of problems that interest us is definitely obtained more quickly when we use the UT.
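In symbols (using generic ABC-SMC notation), the particles of iteration t are proposed from the perturbed mixture

q_t(\theta)=\sum_{j=1}^{N}\omega_{t-1}^{(j)}\,K_t\big(\theta\mid\theta_{t-1}^{(j)}\big),

so the distances recorded for the previous population are not draws from the predictive induced by q_t; an empirical CDF built from those old distances therefore leaves the effect of K_t out of the picture.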

Maybe the problems that we are dealing with in our “day-jobs” (in the analysis of biomolecular and non-linear dynamical systems) make the shortcomings of the “quantile-method” more readily apparent; against this background the proposed approach is perhaps more easily understood.

threshold schedules for ABC-SMC

Posted in Statistics, University life on October 17, 2012 by xi'an

Daniel Silk, Sarah Filippi and Michael Stumpf just posted on arXiv a new paper on ABC-SMC. They attack therein a typical problem with SMC (and PMC, and even MCMC!) methods, namely their propensity to miss (global) modes of the target due to a poor initial exploration. If a proposal is built on previous simulations and those simulations have missed an important mode, it is quite likely that this proposal will concentrate on other parts of the space and miss the important mode even more. This is also why simulated annealing and other stochastic optimisation algorithms are so fickle: a poor choice of tuning parameters (like the temperature schedule) induces the sequence to converge to a local optimum rather than to the global optimum. Since sequential ABC is a form of simulated annealing, with the decreasing tolerance (or threshold) playing the part of the temperature, it is no surprise that it suffers from this generic disease…

The method proposed in the paper aims at avoiding this difficulty by looking at sharp changes in the acceptance rate curve (as a function of the tolerance ε),

\aleph_t(\epsilon)=\int p_t(x)\mathbb{I}(\Delta(x,x^\star)\le \epsilon)\,\text{d}x,

suggesting as threshold the value of ε that maximises the second derivative of this curve. Now, before getting to the issue of turning this principle into a practical algorithm, let me linger on the motivation for it:

“To see this, one can imagine an ε-ball expanding about the true data; at first the ball only encompasses a small number of particles that were drawn from very close to the global maximum, corresponding to the low gradient at the foot of the shape. Once ε is large enough we are able to accept the relatively large number of particles sitting in the local maximum, which causes the increase in gradient.” (p.5)

The argument for looking at values of ε preceding fast increases in the acceptance rate ℵ is thus that we avoid flatter and lower regions of the posterior support corresponding to a local maximum. It clearly covers the case studied by the authors, that of a highly concentrated global mode surrounded by flatter lower modes, but it seems to me that this is not necessarily the only possible reason for changes in the shape of the acceptance probability ℵ. First, we are looking at an ABC acceptance rate, not at a genuine likelihood. Second, this acceptance rate function depends on the current (and necessarily rough) approximation to the genuine predictive, pt. Furthermore, when taking into account this rudimentary replacement of the true likelihood function, it is rather difficult to envision how it impacts the correspondence between a proximity in the data and a proximity in the parameter space. (The toy example is both illuminating and misleading, because it considers a setting where the data is a deterministic transform of the parameter.) I thus think the analysis should refer more strongly to the non-parametric literature, and in particular to the k-nearest-neighbour approach recently reformulated by Biau et al.: there is no reason to push the tolerance ε all the way down to zero, as this limit does not correspond to the optimal value of the tolerance. If one does not use a non-parametric determination of the “right” tolerance, the lingering question is when and why to stop the sequence of ABC simulations.
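To make the k-nearest-neighbour point a bit more concrete, here is a minimal sketch (a made-up normal toy model, nothing from the paper under discussion) of the corresponding tolerance choice: keep the k closest simulations, i.e. set ε to the k-th order statistic of the distances rather than pushing it towards zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy setting: theta ~ N(0, 10), one observation x | theta ~ N(theta, 1)
x_star = 1.3                      # observed data (made up for the illustration)
N = 10_000                        # number of prior simulations
k = int(N ** 0.5)                 # k-NN rule of thumb: k grows with N while k/N -> 0

theta = rng.normal(0.0, np.sqrt(10.0), size=N)   # draws from the prior
pseudo = rng.normal(theta, 1.0)                  # one pseudo-observation per draw
dist = np.abs(pseudo - x_star)                   # distances to the observed data

eps = np.sort(dist)[k - 1]        # tolerance = k-th smallest distance, not zero
accepted = theta[dist <= eps]     # approximate posterior sample of size (about) k
print(f"epsilon = {eps:.3f}, accepted draws = {accepted.size}")
```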

The acceptance rate function ℵ is approximated using an unscented transform. I understand neither the implementation of, nor the motivation for, this choice. Given that the function ℵ is a one-dimensional function of the tolerance and that it actually corresponds to the cdf of the distance between the true data and the (current) pseudo-data, why couldn't we use a smooth version of a cdf estimate based on the simulated distances, given that these distances are already computed…? I must be missing something.
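For what it is worth, here is a rough sketch of the alternative I have in mind, under my own simplifying assumptions and certainly not the authors' unscented-transform construction: smooth the distribution of the already-computed distances and pick the tolerance maximising the second derivative of the resulting cdf.

```python
import numpy as np
from scipy.stats import gaussian_kde

def next_tolerance(distances, grid_size=512):
    """Choose the next tolerance as the epsilon maximising the second derivative
    of a smoothed cdf of the simulated distances (a crude stand-in for aleph_t)."""
    distances = np.asarray(distances)
    kde = gaussian_kde(distances)                 # smoothed density of the distances
    grid = np.linspace(0.0, distances.max(), grid_size)
    pdf = kde(grid)                               # first derivative of the smoothed cdf
    curvature = np.gradient(pdf, grid)            # hence its second derivative
    return grid[np.argmax(curvature)]

# toy usage on distances that could come from a previous ABC-SMC population
rng = np.random.default_rng(1)
print(next_tolerance(np.abs(rng.normal(size=2000))))
```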

interacting particles ABC

Posted in Statistics on August 27, 2012 by xi'an

Carlo Albert and Hans Kuensch recently posted an arXiv paper which provides a new perspective on ABC. It relates to ABC-MCMC and to ABC-SMC in different ways, but the major point is to propose a sequential schedule for decreasing the tolerance that ensures convergence. Although there exist other proofs of convergence in the literature, this one is quite novel in that it connects ABC with the cooling schedules of simulated annealing. (That the sample size does not appear, as it does in Fearnhead and Prangle's non-parametric perspective, can be deemed less practical, but I think this is simply another perspective on the problem!) The corresponding ABC algorithm is a mix of MCMC and SMC in that it lets a population of N particles evolve in a quasi-independent manner, the population only being used to update the parameters of the independent (normal) proposal and those of the cooling tolerance. Each particle in the population moves according to a Metropolis-Hastings step, but this is not an ABC-MCMC scheme in that the algorithm works with a population at all times, and it is not an ABC-SMC scheme in that there is no weighting and no resampling.
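Here is how I picture that structure, as a crude one-dimensional sketch under my own toy model and tuning choices (so definitely not the authors' exact scheme; in particular their tolerance updates are driven by the population rather than by the geometric cooling used below):

```python
import numpy as np

rng = np.random.default_rng(2)

# toy model: theta ~ N(0, 10), one observation x | theta ~ N(theta, 1)
x_star, N, T = 1.3, 200, 30
prior_sd = np.sqrt(10.0)

def log_prior(theta):
    return -0.5 * theta ** 2 / prior_sd ** 2

theta = rng.normal(0.0, prior_sd, size=N)             # initial population from the prior
eps = np.abs(rng.normal(theta, 1.0) - x_star).max()   # loose starting tolerance

for t in range(T):
    # independent normal proposal fitted to the whole current population
    mu, sd = theta.mean(), theta.std() + 1e-8
    for i in range(N):                                # quasi-independent particle moves
        prop = rng.normal(mu, sd)
        if np.abs(rng.normal(prop, 1.0) - x_star) <= eps:   # pseudo-data hits the ball
            # Metropolis-Hastings ratio for the independence proposal
            log_ratio = (log_prior(prop) - log_prior(theta[i])
                         + 0.5 * (prop - mu) ** 2 / sd ** 2
                         - 0.5 * (theta[i] - mu) ** 2 / sd ** 2)
            if np.log(rng.uniform()) < log_ratio:
                theta[i] = prop
    eps *= 0.85                                       # hand-picked cooling schedule

print(f"final tolerance {eps:.3f}, posterior mean ~ {theta.mean():.3f}")
```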

Maybe I can add two remarks about the conclusion: the authors do not seem aware of other works using penalties other than the 0-1 kernel, but those abound, see e.g. the discussion paper of Fearnhead and Prangle, or Ratmann et al. The other missing connection is about adaptive tolerance construction, which is also found in the literature, see e.g. Doucet et al. or Drovandi and Pettitt.

On optimality of kernels for ABC-SMC

Posted in Statistics, University life on December 11, 2011 by xi'an

This freshly arXived paper by Sarah Filippi, Chris Barnes, Julien Cornebise, and Michael Stumpf is in the lineage of our 2009 Biometrika ABC-PMC (population Monte Carlo) paper with Marc Beaumont, Jean-Marie Cornuet and Jean-Michel Marin. (I actually missed the first posting while in Berlin last summer. Flying to Utah gave me the opportunity to read it at length!) The paper focusses on the impact of the transition kernel in our PMC scheme: while we used component-wise adaptive proposals, the paper studies multivariate adaptivity with a covariance matrix adapted from the whole population, locally, or from an approximation to the information matrix. The simulation study run in the paper shows that, even when accounting for the additional cost due to the derivation of the matrix, the multivariate adaptation can improve the acceptance rate by a fair amount. So this is an interesting and positive sequel to our paper (that I may well end up refereeing one of these days, like an earlier paper from some of the authors!)
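For illustration only, and merely as the generic weighted-covariance recipe rather than the specific kernels compared in the paper (the names and the inflation factor below are my own choices), adapting a multivariate Gaussian transition kernel to the whole previous population could look like this:

```python
import numpy as np

def multivariate_kernel(particles, weights, inflation=2.0):
    """Gaussian perturbation kernel adapted to the whole weighted population:
    covariance = inflation * weighted covariance of the particles.
    (Component-wise adaptation would keep only the diagonal of this matrix.)"""
    particles = np.asarray(particles, dtype=float)     # shape (N, d)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    mean = weights @ particles
    centred = particles - mean
    cov = inflation * (centred.T * weights) @ centred  # sum_i w_i c_i c_i^T, inflated
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(cov.shape[0]))

    def perturb(theta, rng):
        return theta + chol @ rng.standard_normal(len(theta))

    return perturb

# toy usage: perturb one particle drawn from a previous weighted population
rng = np.random.default_rng(3)
population = rng.normal(size=(500, 2))
weights = rng.uniform(size=500)
kernel = multivariate_kernel(population, weights)
print(kernel(population[0], rng))
```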

The main criticism I may have about the paper is that the selection of the tolerance sequence is not done in an adaptive way, while it could be, given the recent developments of Del Moral et al. and of Drovandi and Pettitt (as well as our even more recent, still-un-arXived submission to Stat & Computing!). While the target is the same for all transition kernels, so that the comparison still makes sense as is, the final goal is to build a complete adaptive scheme that comes as close as possible to the genuine posterior.
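Just to fix ideas, the simplest version of such an adaptive choice (in the spirit of, but much cruder than, the rules of Del Moral et al. or Drovandi and Pettitt; the quantile level is an arbitrary tuning choice of mine) would set the next tolerance from the distances achieved by the current population:

```python
import numpy as np

def next_epsilon(current_distances, alpha=0.5):
    """Next ABC tolerance = alpha-quantile of the distances attained by the
    current population (alpha=0.5: about half the particles would still pass)."""
    return np.quantile(current_distances, alpha)

rng = np.random.default_rng(4)
distances = np.abs(rng.normal(size=1000))   # stand-in for distances of accepted particles
print(next_epsilon(distances))
```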

This paper also raised a new question: there is a slight distinction between the Kullback-Leibler divergence we used and the Kullback-Leibler divergence the authors use here. (In fact, we do not account for the change in the tolerance.) Now, since only the distribution of the current particles matters, while the distribution of the past particles is needed to compute the double integral leading to the divergence, there is complete freedom in the choice of this past distribution. As in Del Moral et al., the backward kernel L(θt-1|θt) could therefore be chosen to target an optimal acceptance rate or something akin to it. I wonder if anyone ever looked at this…
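For the record, in the generic SMC-sampler notation of Del Moral et al. (my shorthand, not necessarily that of either paper), this freedom sits in the backward kernel entering the incremental weight

w_t \propto \frac{\pi_t(\theta_t)\,L_{t-1}(\theta_{t-1}\mid\theta_t)}{\pi_{t-1}(\theta_{t-1})\,K_t(\theta_t\mid\theta_{t-1})},

and any valid choice of Lt-1 leaves the target πt unchanged, which is what opens the door to selecting it towards a better acceptance rate.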

