Last week, I flew down to Montpellier for two days of work on ABC model choice with Jean-Michel Marin and Pierre Pudlo. Although we missed the COLT 2014 deadline, we are now close to completing this work that will propose a rather radical change in our advocacy of how ABC model choice should be conducted. We actually spent the second day on the wonderful campus of INRA at Montferrier-sur-Lez, just outside Montpellier, discussing of the implications of this approach with our friends at CBGP, Jean-Marie Cornuet and Arnaud Estoup. With possible impact on the DIYABC software. It was a very profitable trip (not mentioning tasting great Grés de Montpellier wine!) and I hope to manage completing the paper with Pierre during the next week in Banff. Unfortunately, when I came back to my train station, I found some idiots had a go at my bike and bent the back wheel which then needed to be replaced…
Archive for ABC
On Friday, I am giving a talk on ABC at Université Laval, in the old city of Québec. While on my way to the 14w5125 workshop on scalable Bayesian computation at BIRS, Banff. I have not visited Laval since the late 1980′s (!) even though my last trip to Québec (the city) was in 2009, when François Perron took me there for ice-climbing and skiing after a seminar in Montréal… (This trip, I will not stay long enough in Québec, alas. Keeping my free day-off for another attempt at ice-climbing near Banff.) Here are slides I have used often in the past year, but this may be the last occurrence as we are completing a paper on the topic with my friends from Montpellier.
Reinaldo sent me this email a long while ago
Could you recommend me a nice reference about measures to evaluate stochastic algorithms (in particular focus in approximating posterior distributions).
and I hope he is still reading the ‘Og, despite my lack of prompt reply! I procrastinated and procrastinated in answering this question as I did not have a ready reply… We have indeed seen (almost suffered from!) a flow of MCMC convergence diagnostics in the 90′s. And then it dried out. Maybe because of the impossibility to be “really” sure, unless running one’s MCMC much longer than “necessary to reach” stationarity and convergence. The heat of the dispute between the “single chain school” of Geyer (1992, Statistical Science) and the “multiple chain school” of Gelman and Rubin (1992, Statistical Science) has since long evaporated. My feeling is that people (still) run their MCMC samplers several times and check for coherence between the outcomes. Possibly using different kernels on parallel threads. At best, but rarely, they run (one or another form of) tempering to identify the modal zones of the target. And instances where non-trivial control variates are available are fairly rare. Hence, a non-sequitur reply at the MCMC level. As there is no automated tool available, in my opinion. (Even though I did not check the latest versions of BUGS.)
As it happened, Didier Chauveau from Orléans gave today a talk at Big’MC on convergence assessment based on entropy estimation, a joint work with Pierre Vandekerkhove. He mentioned SamplerCompare which is an R package that appeared in 2010. Soon to come is their own EntropyMCMC package, using parallel simulation. And k-nearest neighbour estimation.
If I re-interpret the question as focussed on ABC algorithms, it gets both more delicate and easier. Easy because each ABC distribution is different. So there is no reason to look at the unreachable original target. Delicate because there are several parameters to calibrate (tolerance, choice of summary, …) on top of the number of MCMC simulations. In DIYABC, the outcome is always made of the superposition of several runs to check for stability (or lack thereof). But this tells us nothing about the distance to the true original target. The obvious but impractical answer is to use some basic bootstrapping, as it is generally much too costly.
Crakel and Flegal just arXived a short paper running ABC for doing inference on the parameters of two families of bivariate betas. And I could not but read it thru. And wonder why ABC was that necessary to handle the model. The said bivariate betas are defined from
This makes each term in the pair Beta and the two components dependent. This construct was proposed by Arnold and Ng (2011). (The five-parameter version cancels the gammas for i=3,4,5.)
Since the pdf of the joint distribution is not available in closed form, Crakel and Flegal zoom on ABC-MCMC as the method of choice and discuss simulation experiments. (The choice of the tolerance ε as an absolute rather than relative value, ε=0.2,0.0.6,0.8, puzzles me, esp. since the distance between the summary statistics is not scaled.) I however wonder why other approaches are impossible. (Or why it is necessary to use this distribution to model correlated betas. Unless I am confused copulas were invented to this effect.) First, this is a latent variable model, so latent variables could be introduced inside an MCMC scheme. A wee bit costly but feasible. Second, several moments of those distributions are known so a empirical likelihood approach could be considered.
…unsurprisingly, the performances of ABC comparing true data of size n with synthetic data of size m>n are not great. However, there exists another way of reducing the variance in the synthetic data, namely by repeating simulations of samples of size n and averaging the indicators for proximity, resulting in a frequency rather than a 0-1 estimator. See e.g. Del Moral et al. (2009). In this sense, increasing the computing power reduces the variability of the ABC approximation. (And I thus fail to see the full relevance of Result 1.)
Taking the average of the indicators from multiple simulations will reduce the variability of the estimated ABC likelihood but because it is only still an unbiased estimate it will not alter the target and will not improve the ABC approximation (Andrieu and Roberts 2009). It will only have the effect of improving the mixing of MCMC ABC. Result 1 is used to contrast ABC II and BIL as they behave quite differently as n is increased.
The authors make several assumptions of unicity that I somewhat find unclear. While assuming that the MLE for the auxiliary model is unique could make sense (Assumption 2), I do not understand the corresponding indexing of this estimator (of the auxiliary parameter) on the generating (model) parameter θ. It should only depend on the generated/simulated data x. The notion of a noisy mapping is just confusing to me.
The dependence on θ is a little confusing I agree (especially in the context of ABC II methods). It starts to become more clear in the context of BIL. As n goes to infinity, the effect of the simulated data is removed and then we obtain the function φ(θ) (so we need to remember which θ simulated the data), which is referred to as the mapping or binding function in the II literature. If we somehow knew the binding function, BIL would proceed straightforwardly. But of course we don’t in practice, so we try to estimate it via simulated data (which, for computational reasons, needs to be a finite sample) from the true model based on theta. Thus we obtain a noisy estimate of the mapping. One way forward might be to fit some (non-parametric?) regression model to smooth out the noise and try to recover the true mapping (without ever taking n to infinity) and run a second BIL with this estimated mapping. I plan to investigate this in future work.
The assumption that the auxiliary score function at the auxiliary MLE for the observed data and for a simulated dataset (Assumption 3) is unique proceeds from the same spirit. I however fail to see why it matters so much. If the auxiliary MLE is the result of a numerical optimisation algorithm, the numerical algorithm may return local modes. This only adds to the approximative effect of the ABC-I schemes.
The optimiser failing to find the MLE (local mode) is certainly an issue shared by all BII methods, apart from ABC IS (which only requires 1 optimisation, so more effort to find the MLE can be applied here). Assuming the optimiser can obtain the MLE, I think the uniqueness assumptions makes sense. It basically says that, for a particular simulated dataset we would like a unique value for the ABC discrepancy function.
Given that the paper does not produce convergence results for those schemes, unless the auxiliary model contains the genuine model, such theoretical assumptions do not feel that necessary.
Actually, the ABC II methods will never converge to the true posterior (in general) due to lack of sufficiency. This is even the case if the true model is a special case of the auxiliary model! (in which case BIL can converge to the true posterior)
The paper uses normal mixtures as an auxiliary model: the multimodality of this model should not be such an hindrance (and reordering is transparent, i.e. does not “reduce the flexibility of the auxiliary model”, and does not “increase the difficulty of implementation”, as stated p.16).
The paper concludes from a numerical study to the superiority of the Bayesian indirect inference of Gallant and McCulloch (2009). Which simply replaces the true likelihood with the maximal auxiliary model likelihood estimated from a simulated dataset. (This is somehow similar to our use of the empirical likelihood in the PNAS paper.) It is however moderated by the cautionary provision that “the auxiliary model [should] describe the data well”. As for empirical likelihood, I would suggest resorting to this Bayesian indirect inference as a benchmark, providing a quick if possibly dirty reference against which to test more elaborate ABC schemes. Or other approximations, like empirical likelihood or Wood’s synthetic likelihood.
Unfortunately the methods are not quick (apart from ABC IS when the scores are analytic), but good approximations can be obtained. The majority of Bayesian methods that deal with intractable likelihoods do not target the true posterior (there are a couple of exceptions in special cases) and thus also suffer from some dirtiness, and BII does not escape from that. But, if a reasonable auxiliary model can be found, then I would suggest that (at least one of the) BII methods will be competitive.
On reflection for BIL it is not necessary for the auxiliary model to fit the data, since the generative model being proposed may be mis-specified and also not fit the data well. BIL needs an auxiliary model that mimics well the likelihood of the generative model for values of theta in non-negligible posterior regions. For ABC II, we are simply looking for a good summarisation of the data. Therefore it would seem useful if the auxiliary model did fit the data well. Note this process is independent of the generative model being proposed. Therefore the auxiliary model would be the same regardless of the chosen generative model. Very different considerations indeed.
Inspired by a discussion with Anthony Lee, it appears that the (Bayesian version) of synthetic likelihood you mentioned is actually also a BIL method but where the auxiliary model is applied to the summary statistic likelihood rather than the full data likelihood. The synthetic likelihood is nice from a numerical/computational point of view as the MLE of the auxiliary model is analytic.