Archive for Kingman’s coalescent

JSM 2015 [day #3]

Posted in Books, Statistics, University life with tags , , , , , , , , , , on August 12, 2015 by xi'an

My first morning session was about inference for philogenies. While I was expecting some developments around Kingman’s  coalescent models my coauthors needed and developped ABC for, I was surprised to see models that were producing closed form (or close enough to) likelihoods. Due to strong restrictions on the population sizes and migration possibilities, as explained later to me by Vladimir Minin. No need for ABC there since MCMC was working on the species trees, with Vladimir Minin making use of [the Savage Award winner] Vinayak Rao’s approach on trees that differ from the coalescent. And enough structure to even consider and demonstrate tree identifiability in Laura Kubatko’s case.

I then stopped by the astrostatistics session as the first talk by Gwendolin Eddie was about galaxy mass estimation, a problem I may actually be working on in the Fall, but it ended up being a completely different problem and I was further surprised that the issue of whether or not the data was missing at random was not considered by the authors.searise3

Christening a session Unifying foundation(s) may be calling for trouble, at least from me! In this spirit, Xiao Li Meng gave a talk attempting at a sort of unification of the frequentist, Bayesian, and fiducial paradigms by introducing the notion of personalized inference, which is a notion I had vaguely thought of in the past. How much or how far do you condition upon? However, I have never thought of this justifying fiducial inference in any way and Xiao Li’s lively arguments during and after the session not enough to convince me of the opposite: Prior-free does not translate into (arbitrary) choice-free. In the earlier talk about confidence distributions by Regina Liu and Minge Xie, that I partly missed for Galactic reasons, I just entered into the room at the very time when ABC was briefly described as a confidence distribution because it was not producing a convergent approximation to the exact posterior, a logic that escapes me (unless those confidence distributions are described in such a loose way as to include about any method f inference). Dongchu Sun also gave us a crash course on reference priors, with a notion of random posteriors I had not heard of before… As well as constructive posteriors… (They seemed to mean constructible matching priors as far as I understood.)

The final talk in this session by Chuanhei Liu on a new approach (modestly!) called inferential model was incomprehensible, with the speaker repeatedly stating that the principles were too hard to explain in five minutes and needed an incoming book… I later took a brief look at an associated paper, which relates to fiducial inference and to Dempster’s belief functions. For me, it has the same Münchhausen feeling of creating a probability out of nothing, creating a distribution on the parameter by ignoring the fact that the fiducial equation x=a(θ,u) modifies the distribution of u once x is observed.

posterior predictive checks for admixture models

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , on July 8, 2014 by xi'an

rainbows-four-studies2In a posting coincidence, just a few days after we arXived our paper on ABC model choice with random forests, where we use posterior predictive errors for assessing the variability of the random forest procedure, David Mimno, David Blei, and Barbara Engelhardt arXived a paper on posterior predictive checks to quantify lack-of-fit in admixture models of latent population structure, which deals with similar data and models, while also using the posterior predictive as a central tool. (Marginalia: the paper is a wee bit difficult to read [esp. with France-Germany playing in the airport bar!] as the modelling is only clearly described at the very end. I suspect this arXived version was put together out of a submission to a journal like Nature or PNAS, with mentions of a Methods section that does not appear here and of Supplementary Material that turned into subsections of the Discussion section.)

The dataset are genomic datasets made of SNPs (single nucleotide polymorphisms). For instance, the first (HapMap) dataset corresponds to 1,043 individuals and 468,167 SNPs. The model is simpler than Kingman’s coalescent, hence its likelihood does not require ABC steps to run inference. The admixture model in the paper is essentially a mixture model over ancestry indices with individual dependent weights with Bernoulli observations, hence resulting into a completed likelihood of the form

\prod_{i=1}^n\prod_{\ell=1}^L\prod_j \phi_{\ell,z_{i,\ell,j}}^{x_{i,\ell,j}}(1-\phi_{\ell,z_{i,\ell,j}})^{1-x_{i,\ell,j}}\theta_{i,z_{i,\ell,j}}

(which looks more formidable than it truly is!). Regular Bayesian inference is thus possible in this setting, implementing e.g. Gibbs sampling. The authors chose instead to rely on EM and thus derived the maximum likelihood estimators of the (many) parameters of the admixture. And of the latent variables z. Their posterior predictive check is based on the simulation of pseudo-observations (as in ABC!) from the above likelihood, with parameters and latent variables replaced with their EM estimates (unlike ABC). There is obviously some computational reason in doing this instead of simulating from the posterior, albeit implicit in the paper. I am however slightly puzzled by the conditioning on the latent variable estimate , as its simulation is straightforward and as a latent variable is more a missing observation than a parameter. Given those 30 to 100 replications of the data, an empirical distribution of a discrepancy function is used to assess whether or not the equivalent discrepancy for the observation is an outlier. If so, the model is not appropriate for the data. (Interestingly, the discrepancy is measured via the Bayes factor of z-scores.)

The connection with our own work is that the construction of discrepancy measures proposed in this paper could be added to our already large collection of summary statistics to check to potential impact in model comparison, i.e. for a significant contribution to the random forest nodes.  Conversely, the most significant summary statistics could then be tested as discrepancy measures. Or, more in tune with our Series B paper on the proper selection of summary variables, the distribution of those discrepancy measures could be compared across potential models. Assuming this does not take too much computing power…

improved approximate-Bayesian model-choice method for estimating shared evolutionary history [reply from the author]

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , on June 3, 2014 by xi'an

[Here is a very kind and detailed reply from Jamie Oakes to the comments I made on his ABC paper a few days ago:]

First of all, many thanks for your thorough review of my pre-print! It is very helpful and much appreciated. I just wanted to comment on a few things you address in your post.

I am a little confused about how my replacement of continuous uniform probability distributions with gamma distributions for priors on several parameters introduces a potentially crippling number of hyperparameters. Both uniform and gamma distributions have two parameters. So, the new model only has one additional hyperparameter compared to the original msBayes model: the concentration parameter on the Dirichlet process prior on divergence models. Also, the new model offers a uniform prior over divergence models (though I don’t recommend it).

Your comment about there being no new ABC technique is 100% correct. The model is new, the ABC numerical machinery is not. Also, your intuition is correct, I do not use the divergence times to calculate summary statistics. I mention the divergence times in the description of the ABC algorithm with the hope of making it clear that the times are scaled (see Equation (12)) prior to the simulation of the data (from which the summary statistics are calculated). This scaling is simply to go from units proportional to time, to units that are proportional to the expected number of mutations. Clearly, my attempt at clarity only created unnecessary opacity. I’ll have to make some edits.

Regarding the reshuffling of the summary statistics calculated from different alignments of sequences, the statistics are not exchangeable. So, reshuffling them in a manner that is not conistent across all simulations and the observed data is not mathematically valid. Also, if elements are exchangeable, their order will not affect the likelihood (or the posterior, barring sampling error). Thus, if our goal is to approximate the likelihood, I would hope the reshuffling would also have little affect on the approximate posterior (otherwise my approximation is not so good?).

You are correct that my use of “bias” was not well defined in reference to the identity line of my plots of the estimated vs true probability of the one-divergence model. I think we can agree that, ideally (all assumptions are met), the estimated posterior probability of a model should estimate the probability that the model is correct. For large numbers of simulation
replicates, the proportion of the replicates for which the one-divergence model is true will approximate the probability that the one-divergence model is correct. Thus, if the method has the desirable (albeit “frequentist”) behavior such that the estimated posterior probability of the one-divergence model is an unbiased estimate of the probability that the one-divergence model is correct, the points should fall near the identity line. For example, let us say the method estimates a posterior probability of 0.90 for the one-divergence model for 1000 simulated datasets. If the method is accurately estimating the probability that the one-divergence model is the correct model, then the one-divergence model should be the true model for approximately 900 of the 1000 datasets. Any trend away from the identity line indicates the method is biased in the (frequentist) sense that it is not correctly estimating the probability that the one-divergence model is the correct model. I agree this measure of “bias” is frequentist in nature. However, it seems like a worthwhile goal for Bayesian model-choice methods to have good frequentist properties. If a method strongly deviates from the identity line, it is much more difficult to interpret the posterior probabilites that it estimates. Going back to my example of the posterior probability of 0.90 for 1000 replicates, I would be alarmed if the model was true in only 100 of the replicates.

My apologies if my citation of your PNAS paper seemed misleading. The citation was intended to be limited to the context of ABC methods that use summary statistics that are insufficient across the models under comparison (like msBayes and the method I present in the paper). I will definitely expand on this sentence to make this clearer in revisions. Thanks!

Lastly, my concluding remarks in the paper about full-likelihood methods in this domain are not as lofty as you might think. The likelihood function of the msBayes model is tractable, and, in fact, has already been derived and implemented via reversible-jump MCMC (albeit, not readily available yet). Also, there are plenty of examples of rich, Kingman-coalescent models implemented in full-likelihood Bayesian frameworks. Too many to list, but a lot of them are implemented in the BEAST software package. One noteworthy example is the work of Bryant et al. (2012, Molecular Biology and Evolution, 29(8), 1917–32) that analytically integrates over all gene trees for biallelic markers under the coalescent.

improved approximate-Bayesian model-choice method for estimating shared evolutionary history

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on May 14, 2014 by xi'an

“An appealing approach would be a comparative, Bayesian model-choice method for inferring the probability of competing divergence histories while integrating over uncertainty in mutational and ancestral processes via models of nucleotide substitution and lineage coalescence.” (p.2)

Jamies Oaks arXived (a few months ago now) a rather extensive Monte-Carlo study on the impact of prior modelling on the model-choice performances of ABC model choice. (Of which I only became aware recently.) As in the earlier paper I commented on the Óg, the issue here has much more to do with prior assessment and calibration than with ABC implementation per se. For instance, the above quote recaps the whole point of conducting Bayesian model choice. (As missed by Templeton.)

“This causes divergence models with more divergence-time parameters to integrate over a much greater parameter space with low likelihood yet high prior density, resulting in small marginal likelihoods relative to models with fewer divergence-time parameters.” (p.2)

This second quote is essentially stressing the point with Occam’s razor argument. Which I deem [to be] a rather positive feature of Bayesian model choice. A reflection on the determination of the prior distribution, getting away from uniform priors, thus sounds most timely! The current paper takes place within a rather extensive exchange between Oak’s group and Hickerson’s group on what makes Bayesian model choice (and the associated software msBayes) pick or not the correct model. Oak and coauthors objected to the use of “narrow, empirically informed uniform priors”, arguing that this leads to a bias towards models with less parameters, a “statistical issue” in their words, while Hickerson et al. (2014) think this is due to msBayes way of selecting models and their parameters at random. However it refrains from reproducing earlier criticisms of or replies to Hickerson et al.

The current paper claims to have reached a satisfactory prior modelling with ¨improved robustness, accuracy, and power” (p.3).  If I understand correctly, the changes are in replacing a uniform distribution with a Gamma or a Dirichlet prior. Which means introducing a seriously large and potentially crippling number of hyperparameters into the picture. Having a lot of flexibility in the prior also means a lot of variability in the resulting inference… In other words, with more flexibility comes more responsibility, to paraphrase Voltaire.

“I have introduced a new approximate-Bayesian model choice method.” (p.21)

The ABC part is rather standard, except for the strange feature that the divergence times are used to construct summary statistics (p.10). Strange because these times are not observed for the actual data. So I must be missing something. (And I object to the above quote and to the title of the paper since there is no new ABC technique there, simply a different form of prior.)

“ABC methods in general are known to be biased for model choice.” (p.21)

I do not understand much the part about (reshuffling) introducing bias as detailed on p.11: every approximate method gives a “biased” answer in the sense this answer is not the true and proper posterior distribution. Using a different (re-ordered) vector of statistics provides a different ABC outcome,  hence a different approximate posterior, for which it seems truly impossible to check whether or not it increases the discrepancy from the true posterior, compared with the other version. I must admit I always find annoying to see the word bias used in a vague meaning and esp. within a Bayesian setting. All Bayesian methods are biased. End of the story. Quoting our PNAS paper as concluding that ABC model choice is biased is equally misleading: the intended warning represented by the paper was that Bayes factors and posterior probabilities could be quite unrelated with those based on the whole dataset. That the proper choice of summary statistics leads to a consistent model choice shows ABC model choice is not necessarily “biased”… Furthermore, I also fail to understand why the posterior probability of model i should be distributed as a uniform (“If the method is unbiased, the points should fall near the identity line”) when the data is from model i: this is not a p-value but a posterior probability and the posterior probability is not the frequentist coverage…

My overall problem is that, all in all, this is a single if elaborate Monte Carlo study and, as such, it does not carry enough weight to validate an approach that remains highly subjective in the selection of its hyperparameters. Without raising any doubt about an hypothetical “fixing” of those hyperparameters, I think this remains a controlled experiment with simulated data where the true parameters are know and the prior is “true”. This obviously helps in getting better performances.

“With improving numerical methods (…), advances in Monte Carlo techniques and increasing efficiency of likelihood calculations, analyzing rich comparative phylo-geographical models in a full-likelihood Bayesian framework is becoming computationally feasible.” (p.21)

This conclusion of the paper sounds over-optimistic and rather premature. I do not know of any significant advance in computing the observed likelihood for the population genetics models ABC is currently handling. (The SMC algorithm of Bouchard-Côté, Sankaraman and Jordan, 2012, does not apply to Kingman’s coalescent, as far as I can tell.) This is certainly a goal worth pursuing and borrowing strength from multiple techniques cannot hurt, but it remains so far a lofty goal, still beyond our reach… I thus think the major message of the paper is to reinforce our own and earlier calls for caution when interpreting the output of an ABC model choice (p.20), or even of a regular Bayesian analysis, agreeing that we should aim at seeing “a large amount of posterior uncertainty” rather than posterior probability values close to 0 and 1.

interacting particle systems as… facebook

Posted in Books, Statistics, University life with tags , , , , , on October 8, 2013 by xi'an

Among the many interesting arXived papers this Friday, I first read David Aldous’ “Interacting particle systems as stochastic social dynamics“. Being unfamiliar with those systems (despite having experts in offices down the hall in Paris-Dauphine!), I read this typology of potential models (published in Bernoulli) with a keen interest! The paper stemmed from a short course given in 2012 in Warwick and Cornell. I think the links exhibited there with (social) networks should be relevant for statisticians working on networks (!) and dynamic graphical models. Statistics is not mentioned in the paper, except for the (misleading) connection with statistical physics, but there is obviously a huge potential for statistical inference, from parameter estimation to model comparison. (As pointed out by David Aldous, there is usually “no data or evidence linking the model to the asserted real-world phenomena”.) The paper then introduces some basic models like the token, the pandemic and the averaging process, plus the voter model that relates to Kingman’s coalescent. A very nice read opening new vistas for sure (and a source of projects for graduate students most certainly!)

inference in Kingman’s coalescent with pMCMC

Posted in Books, Statistics, University life with tags , , , , , , , on May 22, 2013 by xi'an

As I was checking the recent stat postings on arXiv, I noticed the paper by Chen and Xie entitled inference in Kingman’s coalescent with pMCMC.  (And surprisingly deposited in the machine learning subdomain.) The authors compare a pMCMC implementation for Kingman’s coalescent with importance sampling (à la Stephens & Donnelly), regular MCMC and SMC.  The specifics of their pMCMC algorithm is that they simulate the coalescent times conditional on the tree structure and the tree structure conditional on the coalescent times (via SMC). The results reported in the paper consider up to five loci and agree with earlier experiments showing poor performances of MCMC algorithms (based on the LAMARC software and apparently using independent proposals).  They show similar performances between importance sampling and pMCMC. While I find this application of pMCMC interesting, I wonder at the generality of the approach: when I was introduced to ABC techniques, the motivation was that importance sampling was deteriorating very quickly with the number of parameters. Here it seems the authors only considered one parameter θ. I wonder what happens when the number of parameters increases. And how pMCMC would then compare with ABC.

ABC+EL=no D(ata)

Posted in Books, pictures, R, Statistics, University life with tags , , , , , , , , , , , , on May 28, 2012 by xi'an

It took us a loooong while [for various and uninteresting reasons] but we finally ended up completing a paper on ABC using empirical likelihood (EL) that was started by me listening to Brunero Liseo’s tutorial in O’Bayes-2011 in Shanghai… Brunero mentioned empirical likelihood as a semi-parametric technique w/o much Bayesian connections and this got me thinking of a possible recycling within ABC. I won’t get into the details of empirical likelihood, referring to Art Owen’s book “Empirical Likelihood” for a comprehensive entry, The core idea of empirical likelihood is to use a maximum entropy discrete distribution supported by the data and constrained by estimating equations related with the parameters of interest/of the model. As such, it is a non-parametric approach in the sense that the distribution of the data does not need to be specified, only some of its characteristics. Econometricians have been quite busy at developing this kind of approach over the years, see e.g. Gouriéroux and Monfort’s  Simulation-Based Econometric Methods). However, this empirical likelihood technique can also be seen as a convergent approximation to the likelihood and hence exploited in cases when the exact likelihood cannot be derived. For instance, as a substitute to the exact likelihood in Bayes’ formula. Here is for instance a comparison of a true normal-normal posterior with a sample of 10³ points simulated using the empirical likelihood based on the moment constraint.

The paper we wrote with Kerrie Mengersen and Pierre Pudlo thus examines the consequences of using an empirical likelihood in ABC contexts. Although we called the derived algorithm ABCel, it differs from genuine ABC algorithms in that it does not simulate pseudo-data. Hence the title of this post. (The title of the paper is “Approximate Bayesian computation via empirical likelihood“. It should be arXived by the time the post appears: “Your article is scheduled to be announced at Mon, 28 May 2012 00:00:00 GMT“.) We had indeed started looking at a simulated data version, but it was rather poor, and we thus opted for an importance sampling version where the parameters are simulated from an importance distribution (e.g., the prior) and then weighted by the empirical likelihood (times a regular importance factor if the importance distribution is not the prior). The above graph is an illustration in a toy example.

The difficulty with the method is in connecting the parameters (of interest/of the assumed distribution) with moments of the (iid) data. While this operates rather straightforwardly for quantile distributions, it is less clear for dynamic models like ARCH and GARCH, where we have to reconstruct the underlying iid process. (Where ABCel clearly improves upon ABC for the GARCH(1,1) model but remains less informative than a regular MCMC analysis. Incidentally, this study led to my earlier post on the unreliable garch() function in the tseries package!) And it is even harder for population genetic models, where parameters like divergence dates, effective population sizes, mutation rates, &tc., cannot be expressed as moments of the distribution of the sample at a given locus. In particular, the datapoints are not iid. Pierre Pudlo then had the brilliant idea to resort instead to a composite likelihood, approximating the intra-locus likelihood by a product of pairwise likelihoods over all pairs of genes in the sample at a given locus. Indeed, in Kingman’s coalescent theory, the pairwise likelihoods can be expressed in closed form, hence we can derive the pairwise composite scores. The comparison with optimal ABC outcomes shows an improvement brought by ABCel in the approximation, at an overall computing cost that is negligible against ABC (i.e., it takes minutes to produce the ABCel outcome, compared with hours for ABC.)

We are now looking for extensions and improvements of ABCel, both at the methodological and at the genetic levels, and we would of course welcome any comment at this stage. The paper has been submitted to PNAS, as we hope it should appeal to the ABC community at large, i.e. beyond statisticians…