Archive for population genetics

ABC+EL=no D(ata)

Posted in Books, pictures, R, Statistics, University life on May 28, 2012 by xi'an

It took us a loooong while [for various and uninteresting reasons] but we finally completed a paper on ABC using empirical likelihood (EL), a project that started with me listening to Brunero Liseo’s tutorial at O’Bayes-2011 in Shanghai… Brunero mentioned empirical likelihood as a semi-parametric technique w/o much Bayesian connection, and this got me thinking about a possible recycling within ABC. I won’t get into the details of empirical likelihood, referring instead to Art Owen’s book “Empirical Likelihood” for a comprehensive entry. The core idea of empirical likelihood is to use a maximum entropy discrete distribution supported by the data and constrained by estimating equations related to the parameters of interest/of the model. As such, it is a non-parametric approach in the sense that the distribution of the data does not need to be specified, only some of its characteristics. Econometricians have been quite busy developing this kind of approach over the years (see, e.g., Gouriéroux and Monfort’s Simulation-Based Econometric Methods). However, this empirical likelihood technique can also be seen as a convergent approximation to the likelihood and hence exploited in cases when the exact likelihood cannot be derived, for instance as a substitute for the exact likelihood in Bayes’ formula. Here is, for instance, a comparison of a true normal-normal posterior with a sample of 10³ points simulated using the empirical likelihood based on the moment constraint.
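To fix ideas, here is a minimal R sketch of the profile empirical likelihood under a single mean constraint (the function name el_logratio and the numerical safeguards are my own illustrative choices, not taken from the paper):

# maximise prod(p_i) over weights p_i >= 0 with sum(p_i) = 1 and
# sum(p_i * (x_i - mu)) = 0; the solution is p_i = 1/(n*(1 + lambda*(x_i - mu))),
# lambda being the root of the profile equation g below
el_logratio <- function(x, mu) {
  z <- x - mu
  n <- length(z)
  if (max(z) <= 0 || min(z) >= 0) return(-Inf)  # mu outside the convex hull of the data
  g <- function(lambda) sum(z / (1 + lambda * z))
  eps <- 1e-8  # keeps every weight strictly positive at the bracketing bounds
  lambda <- uniroot(g, lower = -1/max(z) + eps, upper = -1/min(z) - eps)$root
  p <- 1 / (n * (1 + lambda * z))
  sum(log(n * p))  # log EL ratio, equal to zero at mu = mean(x)
}

(The emplik R package provides a tested implementation of this computation via its el.test function, if memory serves.)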

The paper we wrote with Kerrie Mengersen and Pierre Pudlo thus examines the consequences of using an empirical likelihood in ABC contexts. Although we called the derived algorithm ABCel, it differs from genuine ABC algorithms in that it does not simulate pseudo-data. Hence the title of this post. (The title of the paper is “Approximate Bayesian computation via empirical likelihood”. It should be arXived by the time the post appears: “Your article is scheduled to be announced at Mon, 28 May 2012 00:00:00 GMT”.) We had indeed started looking at a simulated-data version, but it was rather poor, and we thus opted for an importance sampling version where the parameters are simulated from an importance distribution (e.g., the prior) and then weighted by the empirical likelihood (times a regular importance factor if the importance distribution is not the prior). The above graph is an illustration in a toy example.
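In sketchy R terms (a toy normal-mean model, with the prior as importance distribution and el_logratio as in the sketch above; all settings here are mine, for illustration only):

x <- rnorm(50, mean = 1)                # observed sample (simulated for the toy example)
theta <- rnorm(1e3, mean = 0, sd = 10)  # parameters simulated from the prior
logw <- sapply(theta, function(mu) el_logratio(x, mu))  # log empirical likelihood weights
w <- exp(logw - max(logw[is.finite(logw)]))             # normalised for numerical stability
sum(w * theta) / sum(w)                 # weighted approximation to the posterior mean

Since the importance distribution is the prior here, no extra importance factor is needed.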

The difficulty with the method is in connecting the parameters (of interest/of the assumed distribution) with moments of the (iid) data. While this operates rather straightforwardly for quantile distributions, it is less clear for dynamic models like ARCH and GARCH, where we have to reconstruct the underlying iid process. (ABCel nonetheless clearly improves upon ABC for the GARCH(1,1) model, though it remains less informative than a regular MCMC analysis. Incidentally, this study led to my earlier post on the unreliable garch() function in the tseries package!) And it is even harder for population genetic models, where parameters like divergence dates, effective population sizes, mutation rates, &tc., cannot be expressed as moments of the distribution of the sample at a given locus. In particular, the datapoints are not iid. Pierre Pudlo then had the brilliant idea to resort instead to a composite likelihood, approximating the intra-locus likelihood by a product of pairwise likelihoods over all pairs of genes in the sample at a given locus. Indeed, in Kingman’s coalescent theory, the pairwise likelihoods can be expressed in closed form, hence we can derive the pairwise composite scores. The comparison with optimal ABC outcomes shows an improvement brought by ABCel in the approximation, at an overall computing cost that is negligible compared with ABC (i.e., it takes minutes to produce the ABCel outcome, compared with hours for ABC).
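To illustrate the pairwise idea (my own sketch of the principle, not the paper’s implementation): under the infinite-sites version of Kingman’s coalescent, the number of differences between two genes is geometric with success probability 1/(1+θ), θ being the scaled mutation rate, so the intra-locus composite log-likelihood can be written as

pairwise_comp_loglik <- function(diffs, theta) {
  # diffs: number of differences for every pair of genes at the locus;
  # pairs overlap, hence a composite rather than a genuine likelihood
  sum(dgeom(diffs, prob = 1 / (1 + theta), log = TRUE))
}

and its score in θ provides the kind of estimating equation to be plugged into the empirical likelihood.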

We are now looking for extensions and improvements of ABCel, both at the methodological and at the genetic levels, and we would of course welcome any comment at this stage. The paper has been submitted to PNAS, as we hope it will appeal to the ABC community at large, i.e. beyond statisticians…

Confronting intractability in Bristol

Posted in pictures, Running, Statistics, Travel, University life, Wines on April 18, 2012 by xi'an

Here are the (revised) slides of my talk this afternoon at the Confronting Intractability in Statistical Inference workshop in Bristol, supported by SuSTain. The novelty is in the final part, where we managed to apply our result to a three-population genetic scenario, using one versus two δμ summary statistics. This should be the central new example in the incoming revision of our paper to Series B.

More generally, the meeting is very interesting, with great talks and highly relevant topics: e.g., yesterday, I finally understood what transportation models meant (at the general level) and how they relate to copula modelling, saw a possible connection from computer models to ABC, got inspiration to mix Gaussian processes with simulation output, and listened to the whole exposition of Simon Wood’s alternative to ABC (much more informative than the four pages of his paper in Nature!). Despite (or due to?) sampling Bath ales last night, I even woke up early enough this morning to run over and under the Clifton suspension bridge, with a slight drizzle that could not really be characterized as rain…

Still not confident…

Posted in Statistics, University life on June 14, 2011 by xi'an

About 45 days after submitting our revised version of “Lack of confidence in ABC model choice” to PNAS, we have received the reviews and they still ask for more clarity in our conclusions. In particular, one referee does not buy the distinction between ABC point (and confidence) estimation and ABC model choice, arguing that ABC may equally go wrong in the former case when using a poor selection of summary statistics. This is correct, even though a stepwise ABC estimation à la Marjoram could bring some measure of confidence in the set of summary statistics used for estimation. This referee’s conclusion is then that “the statistics we use for model checking must be different from those used in inference”. This is sound when considering the simple normal example we include in the paper; however, in realistic situations, there is no sufficiency to aim at and it is therefore impossible to know how good or how different those summary statistics are. The second referee maintains her or his earlier position that there is nothing wrong with ABC when using a sufficiently large collection of summary statistics, the proof being given in our own paper by the fact that, for the larger dataset, we “get statistically reliable results using ABC for model choice (without using sufficient statistics)”. First, this is only one experiment, with no reason to extrapolate to other phylogenetic trees, and even less to models outside population genetics. Second, we only argue that the ABC Bayes factor has no reason to converge to the true Bayes factor. It may well be a consistent quantity for model selection, but theoretical arguments are currently missing (although an extension of the perspective adopted in Fearnhead and Prangle and in Dean et al., namely to see ABC as an inference method per se rather than an approximation to an inaccessible Bayesian methodology, could be considered here as well). In conclusion, this means we have to go for yet another round of revision that will tone down our conclusions even further and presumably soften the distinction between point estimation and hypothesis testing… A good thing Natesh starts his visit to Paris today!