Archive for phylogenetic models

Nature snapshot [Volume 539 Number 7627]

Posted in Books, Statistics, University life on November 15, 2016 by xi'an

A number of entries of interest [to me] in that Nature issue: from the Capuchin monkeys that break stones in a way that resembles early hominins biface tools, to the persistent association between some sounds and some meanings across numerous languages, to the use of infected mosquitoes in South America to fight Zika, to the call for more maths in psychiatry by the NIMH director, where since prevision is mentioned I presumed stats is included, to the potentially earthshaking green power revolution in Africa, to the reconstruction of the first HIV strains in North America, along with the deconstruction of the “Patient 0” myth, helped by Bayesian phylogenetic analyses, to a cover of the Open Syllabus Project, with Monte Carlo Statistical Methods arriving first [in the Monte Carlo list]….

“Observations should not converge on one model but aim to find anomalies that carry clues about the nature of dark matter, dark energy or initial conditions of the Universe. Further observations should be motivated by testing unconventional interpretations of those anomalies (such as exotic forms of dark matter or modified theories of gravity). Vast data sets may contain evidence for unusual behaviour that was unanticipated when the projects were conceived.” Avi Loeb

One editorial particularly drew my attention, Good data are not enough, by the astronomer Avi Loeb. As illustrated by the quote above, Loeb objects to data being interpreted, and even to data being collected, solely towards the assessment of the standard model. While I agree that this model contains a lot of fudge factors like dark matter and dark energy, which apparently constitute most of the available matter, the discussion is quite curious, in that interpreting data according to alternative theories sounds impossible, and certainly beyond the reach of most PhD students [as Loeb criticises the analysis of some data in a recent thesis he evaluated].

“modern cosmology is augmented by unsubstantiated, mathematically sophisticated ideas — of the multiverse, anthropic reasoning and string theory.”

The author argues for always allowing alternative interpretations of the data, which sounds fine at a primary level but again calls for the conception of such alternative models. When discrepancies are found between the standard model and the data, they can be due to errors in the measurement itself, in the measurement model, or in the theoretical model. However, they may be impossible to analyse outside the model, in the neutral way called for and wished for by Loeb. Designing neutral experiments sounds even less meaningful. Which is why I am fairly taken aback by the call to “a research frontier [that] should maintain at least two ways of interpreting data so that new experiments will aim to select the correct one”! Why two and not more?! And which ones?! I am not aware of fully developed alternative theories and cannot see how experiments designed under one model could produce indications about a new and incomplete model.

“Such simple, off-the-shelf remedies could help us to avoid the scientific fate of the otherwise admirable Mayan civilization.”

Hence I am bemused by the whole exercise, whose deepest arguments seem to be a paper written by the author last year and an interdisciplinary centre on black holes recently launched by the same author.

Methodological developments in evolutionary genomics [3-year postdoc in Montpellier]

Posted in pictures, Statistics, Travel, University life, Wines on November 26, 2014 by xi'an

[Here is a call for a post-doctoral position in Montpellier, South of France, not Montpelier, Vermont!, in a population genetics group with whom I am working. Highly recommended if you are currently looking for a postdoc!]

Three-year post-doctoral position at the Institute of Computational Biology (IBC), Montpellier (France) :
Methodological developments in evolutionary genomics.

One young investigator position opens immediately at the Institute for Computational Biology (IBC) of Montpellier (France) to work on the development of innovative inference methods and software in population genomics or phylogenetics to analyze large-scale genomic data in the fields of health, agronomy and environment (Work Package 2 « evolutionary genomics » of the IBC). The candidate will develop their own research on some of the following topics: selective processes, demographic history, spatial genetic processes, reconstruction of very large phylogenies, gene/species tree reconciliation, using maximum likelihood, Bayesian and simulation-based inference. We are seeking a candidate with a strong background in mathematical and computational evolutionary biology, with an interest in applications and software development. The successful candidate will work on their own project, built in collaboration with researchers involved in the WP2 project and working at the IBC labs (AGAP, CBGP, ISEM, I3M, LIRMM, MIVEGEC).

IBC hires young investigators, typically with a PhD plus some post-doc experience, a strong publication record, strong communication abilities, and a taste for multidisciplinary research. Working full-time at IBC, these young researchers will play a key role in Institute life. Most of their time will be devoted to scientific projects. In addition, they are expected to actively participate in the coordination of workpackages, in the hosting of foreign researchers and in the organization of seminars and events (summer schools, conferences…). In exchange, these young researchers will benefit from an exceptional environment thanks to the presence of numerous leading international researchers, not to mention significant autonomy in their work. Montpellier hosts one of the most vibrant communities of biodiversity research in Europe, with several research centers of excellence in the field. This position is open for up to 3 years, with a salary well above French post-doc standards. The starting date is open to discussion.

The application deadline is January 31, 2015.

Living in Montpellier:

Contacts at WP2 « Evolutionary Genomics »:

Jean-Michel Marin
François Rousset
Vincent Ranwez
Olivier Gascuel

Submit an application:

Semi-automatic ABC [revised]

Posted in Statistics on April 18, 2011 by xi'an

Paul Fearnhead and Dennis Prangle have posted a revised version of their semi-automatic ABC paper. Compared with the earlier version, commented upon in an earlier post, the paper makes a better case for the ABC algorithm, considered there from a purely inferential viewpoint and calibrated for estimation purposes. In particular, the paper contains an important result in the form of a consistency theorem showing that ABC is a convergent estimation method when the number of observations or datasets grows to infinity. I had not seen this result before and it definitely is an argument to remember when presenting ABC methods to newcomers.

Of course, I still remain skeptical about the “optimality” resulting from the choice of summary statistics in the paper, partly because

  • practice shows that proper approximation to genuine posterior distributions stems from using a (much) larger number of summary statistics than the dimension of the parameter;
  • the validity of the approximation to the optimal summary statistics depends on the quality of the pilot run;
  • important inferential issues like model choice are not covered by this approach.

But, nonetheless, the paper provides a way to construct default summary statistics that should come as a supplement to summary statistics provided by the experts, if not as a substitute.
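For readers new to the approach, the mechanics of the semi-automatic construction can be sketched in a toy normal location model: a pilot run simulates pairs of parameters and raw summary statistics, the parameter is regressed on the summaries, and the fitted linear predictor (an estimate of the posterior mean) becomes the single summary statistic for the subsequent ABC run. All modelling choices below (uniform prior, mean and median as raw summaries, pilot size) are mine, for illustration only:

```python
import random

# Toy sketch of a semi-automatic summary statistic:
# theta ~ Uniform(-3, 3), data x_1..x_n ~ Normal(theta, 1).
# All choices (prior, raw summaries, pilot size) are illustrative.

random.seed(1)
n = 30

def raw_summaries(x):
    xs = sorted(x)
    return [sum(x) / len(x), xs[len(x) // 2]]  # sample mean and sample median

# Pilot run: simulate (theta, raw summaries) pairs from the prior predictive.
pilot = []
for _ in range(5000):
    theta = random.uniform(-3.0, 3.0)
    x = [random.gauss(theta, 1.0) for _ in range(n)]
    pilot.append((theta, raw_summaries(x)))

def solve(a, rhs):
    # Gaussian elimination with partial pivoting for a small linear system.
    m = [row[:] + [bi] for row, bi in zip(a, rhs)]
    k = len(m)
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(i + 1, k):
            f = m[r][i] / m[i][i]
            for c in range(i, k + 1):
                m[r][c] -= f * m[i][c]
    out = [0.0] * k
    for i in range(k - 1, -1, -1):
        out[i] = (m[i][k] - sum(m[i][c] * out[c] for c in range(i + 1, k))) / m[i][i]
    return out

# Least squares fit of theta on (1, mean, median) via the normal equations.
Z = [[1.0] + s for _, s in pilot]
t = [th for th, _ in pilot]
A = [[sum(z[i] * z[j] for z in Z) for j in range(3)] for i in range(3)]
b = [sum(z[i] * ti for z, ti in zip(Z, t)) for i in range(3)]
beta = solve(A, b)

def summary(x):
    # Fitted linear predictor, estimating E[theta | x]: this is the single
    # summary statistic used in the subsequent ABC run.
    s = raw_summaries(x)
    return beta[0] + beta[1] * s[0] + beta[2] * s[1]

print("fitted coefficients:", [round(bi, 3) for bi in beta])
```

In the paper itself the regression runs on a much richer set of (possibly transformed) raw statistics from the pilot run; the sketch only conveys the mechanics of turning many candidate summaries into one.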

The new version of Section 3 is much more satisfactory with respect to the criticisms I voiced earlier, spelling out the computing cost and, more importantly, the connection with indirect inference. A clear strength of the paper remains Section 4, which provides a major simulation experiment. My only criticism is the absence of a phylogeny example that would relate to the models that launched ABC methods. This is less of a mainstream statistics example, but it would be highly convincing to the primary users of ABC.

In conclusion, I find the paper both exciting and bringing new questions to the front. The appeal of this new field and the particularly highly debated issue of the choice of summary statistics will certainly create the opportunity for a wide discussion by the ABC community, were it to become a discussion paper.

Lack of confidence is back

Posted in Statistics, University life on April 4, 2011 by xi'an

While in Bristol, I received the very good news that our ABC model choice submission to PNAS had passed the first round: the manuscript was not rejected, but the editor instead asked for a revision clarifying our message and the difference with the Bayesian Analysis paper. This revision should thus be manageable, even though the most negative reviewer considers that ABC model choice only works when using summary statistics that are informative for choosing between the models, a point that sounds rather tautological to me, hence difficult to answer. It is indeed always possible to come up with summary statistics that make the ABC Bayes factor close to the genuine Bayes factor; however, this requires a huge computing effort in order to validate the choice by cross-validation. Anyway, I have rather good hopes that the paper could eventually be accepted in PNAS, which is obviously a quite exciting prospect! (Thanks to Michael Stumpf for pointing out the missing sentence!)

Incoherent phylogeographic inference [accepted]

Posted in Statistics, University life on August 30, 2010 by xi'an

The letter we submitted to PNAS about Templeton’s surprising diatribe on Bayesian inference has now been accepted:

Title: “Incoherent Phylogeographic Inference”
Tracking #: 2010-08762
Authors: Berger et al.

Dear Prof. Robert,
We are pleased to inform you that the PNAS Editorial Board has given final approval of your letter to the Editor for online publication. The author(s) of the published manuscript have been invited to respond to your feedback. If they provide a response, it may appear online concurrently with your letter.

Now we are looking forward (?) to Alan Templeton’s answer, even though I suspect this short letter is not going to have any impact on his views!

Evidence and evolution (5)

Posted in Books, Statistics on April 29, 2010 by xi'an

“Tout étant fait pour une fin, tout est nécessairement pour la meilleure fin. Remarquez bien que les nez ont été faits pour porter des lunettes, aussi avons-nous des lunettes.” [“Everything being made for an end, everything is necessarily for the best end. Observe that noses were made to wear spectacles; and so we have spectacles.”] Voltaire, Candide, Chapitre 1.

I am now done with my review of Sober’s Evidence and Evolution: The Logic Behind the Science. Posting about each chapter along the way helped me a lot to write down the review over the past few days. Its conclusion is that

Evidence and Evolution is very well-written, with hardly any typo (the unbiasedness property of AIC is stated at the bottom of page 101 with the expectation symbol E on the wrong side of the equation, Figure 3.8c is used instead of Figure 3.7c on page 204, Figure 4.7 is used instead of Figure 4.8 on page 293, Simon Tavaré’s name is always spelled Taveré, vaules rather than values is repeated four times on page 339). The style is sometimes too light and often too verbose, with an abundance of analogies that I regard as sidetracking, but this makes for an easier reading (except for the sentence “the key to answering the second question is that the observation that X = 1 and Y = 1 produces stronger evidence favoring CA over SA the lower the probability is that the ancestors postulated by the two hypotheses were in state 1”, on page 314, that still eludes me!). As detailed in this review, I have points of contentions with the philosophical views about testing in Evidence and Evolution as well as about the methods exposed therein, but this does not detract from the appeal of reading the book. (The lack of completely worked out statistical hypotheses in realistic settings remains the major issue in my criticism of the book.) While the criticisms of the Bayesian paradigm are often shallow (like the one on page 97 ridiculing Bayesians drawing inference based on a single observation), there is nothing fundamentally wrong with the statistical foundations of the book. I therefore repeat my earlier recommendation in favour of Evidence and Evolution, Chapters 1 and (paradoxically) 5 being the easier entries. Obviously, readers familiar with Sober’s earlier papers and books will most likely find a huge overlap with those but others will gather Sober’s viewpoints on the notion of testing hypotheses in a (mostly) unified perspective.

And, as illustrated by the above quote, I found the sentence from Voltaire’s Candide I wanted to include. Of course, this 12-page review may be overly long for the journal it was intended for, Human Genetics, in which case I will have to find another outlet for the current arXived version. But I enjoyed reading this book with a pencil and gathered enough remarks along the way to fill those twelve pages.

Incoherent inference

Posted in Statistics, University life on March 28, 2010 by xi'an

“The probability of the nested special case must be less than or equal to the probability of the general model within which the special case is nested. Any statistic that assigns greater probability to the special case is incoherent. An example of incoherence is shown for the ABC method.” Alan Templeton, PNAS, 2010

Alan Templeton just published an article in PNAS about “coherent and incoherent inference” (with applications to phylogeography and human evolution). While his (misguided) arguments are mostly those found in an earlier paper of his, discussed in this post as well as in the defence of model-based inference twenty-two of us published in Molecular Ecology a few months ago, the paper contains a more general critical perspective on Bayesian model comparison, aligning argument after argument about the incoherence of the Bayesian approach (and not of ABC, as presented there). The notion of coherence is borrowed from the 1991 (Bayesian) paper of Lavine and Schervish on Bayes factors, which shows that Bayes factors may be non-monotonic in the alternative hypothesis (but also that posterior probabilities are not!). Templeton’s first argument proceeds from the quote above, namely that larger models should have larger probabilities, or else this violates logic and coherence! The author presents the reader with a Venn diagram to explain why a larger set should have a larger measure. Obviously, he does not account for the fact that in model choice different models induce different parameter spaces and that those spaces are endowed with orthogonal measures, especially if the spaces are of different dimensions. In the larger space, P(\theta_1=0)=0. (This point is not even touching the issue of defining “the” probability over the collection of models, which Templeton seems to take for granted but which does not make sense outside a Bayesian framework.) Talking therefore of nested models having a smaller probability than the encompassing model, or of “partially overlapping models”, does not make sense from a measure-theoretic (hence mathematical) perspective. (The fifty-one occurrences of coherent/incoherent in the paper do not bring additional weight to the argument!)
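To make the measure-theoretic point concrete, here is a minimal numerical illustration (my own toy example, not taken from either paper): a point-null model nested within a Gaussian alternative can perfectly coherently receive the higher posterior probability, because the nested value has measure zero under the continuous prior of the larger model and only gets positive mass through its own prior weight.

```python
import math

# Toy check that a nested model can coherently receive *higher* posterior
# probability than the model containing it, once each model carries its own
# prior measure (all numerical choices below are illustrative):
#   M1: x ~ Normal(0, 1)                        (theta fixed at 0)
#   M2: x ~ Normal(theta, 1), theta ~ Normal(0, 1)
# Under M2 the prior on theta is continuous, so P(theta = 0 | M2) = 0;
# M1 only receives positive mass through its own prior weight P(M1).

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

x = 0.1                                     # observation close to the null value
m1 = normal_pdf(x, 0.0, 1.0)                # marginal likelihood under M1
m2 = normal_pdf(x, 0.0, math.sqrt(2.0))     # under M2, marginally x ~ N(0, 2)
post_m1 = 0.5 * m1 / (0.5 * m1 + 0.5 * m2)  # equal prior weights on M1 and M2

print(f"Bayes factor B12 = {m1 / m2:.3f}, P(M1 | x) = {post_m1:.3f}")
```

For an observation this close to the null value, the Bayes factor favours the nested model (B12 ≈ 1.41), with no violation of any probability axiom: the two models simply live on different parameter spaces with their own measures.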

“Approximate Bayesian computation (ABC) is presented as allowing statistical comparisons among models. ABC assigns posterior probabilities to a finite set of simulated a priori models.” Alan Templeton, PNAS, 2010

An issue common to all recent criticisms by Templeton is the misleading or misled confusion between the ABC method and the resulting Bayesian inference. For instance, Templeton distinguishes between the incoherence in the ABC model choice procedure and the incoherence in the Bayes factor, when ABC is used as a computational device to approximate the Bayes factor. There is therefore no inferential aspect linked with ABC per se: it is simply a numerical tool to approximate Bayesian procedures and, with enough computer power, the approximation can get as precise as one wishes. In this paper, Templeton also reiterates the earlier criticism that marginal likelihoods are not comparable across models, because they “are not adjusted for the dimensionality of the data or the models” (sic!). This point misses the whole purpose of using marginal likelihoods, namely that they account for the dimensionality of the parameter by providing a natural Ockham’s razor penalising the larger model without requiring the specification of a penalty term. (This is one reason why BIC is so successful: it provides an approximation to this penalty, as does the alternative DIC.) The second criticism of ABC (i.e. of the Bayesian approach) is that model choice requires a collection of models and cannot decide outside this collection. This is indeed the purpose of Bayesian model choice, and studies like Berger and Sellke (1987, JASA) have shown the difficulty of reasoning within a single model. Furthermore, Templeton advocates the use of a likelihood ratio test, which necessarily implies using two models. Another Venn diagram also explains why Bayes’ formula, when used for model choice, is “mathematically and logically incorrect” because marginal likelihoods cannot be added up when models “overlap”: according to him, “there can be no universal denominator, because a simple sum always violates the constraints of logic when logically overlapping models are tested“. Once more, this simply shows a poor understanding of the probabilistic modelling involved in model choice.
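The Ockham effect of the marginal likelihood, and the way BIC approximates it, can be checked numerically in a conjugate toy model of my own choosing (note that with this particular prior the O(1) terms happen to cancel, so the agreement is tighter than in general):

```python
import math

# Toy numerical check that the log marginal likelihood penalises dimension
# the way BIC approximates: for x_1..x_n ~ Normal(theta, 1) with prior
# theta ~ Normal(0, 1), the marginal likelihood is available in closed form,
# and BIC's penalised log-likelihood tracks it as n grows.

def log_marginal(x):
    # m(x) = integral of prod_i N(x_i; theta, 1) * N(theta; 0, 1) dtheta
    n = len(x)
    s, ss = sum(x), sum(xi * xi for xi in x)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * ss
            + 0.5 * s * s / (n + 1) - 0.5 * math.log(n + 1))

def bic_approx(x):
    # log-likelihood at the MLE minus (k/2) log n, with k = 1 parameter
    n = len(x)
    theta_hat = sum(x) / n
    loglik = sum(-0.5 * math.log(2 * math.pi) - 0.5 * (xi - theta_hat) ** 2
                 for xi in x)
    return loglik - 0.5 * math.log(n)

for n in (10, 100, 1000):
    x = [0.0] * n   # degenerate data, just to keep the check deterministic
    print(n, round(log_marginal(x), 4), round(bic_approx(x), 4))
```

The gap between the two quantities shrinks as n grows, which is exactly the sense in which BIC approximates the penalty that the marginal likelihood builds in automatically.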

“The central equation of ABC is inherently incoherent for three separate reasons, two of which are applicable in every case that deals with overlapping hypotheses.” Alan Templeton, PNAS, 2010

This argument relies on the representation of the “ABC equation” (sic!)

P(H_i|H,S^*) = \dfrac{G_i(||S_i-S^*||) \Pi_i}{\sum_{j=1}^n G_j(||S_j-S^*||) \Pi_j}

where S^* is the observed summary statistic, S_i is “the vector of expected (simulated) summary statistics under model i” and “G_i is a goodness-of-fit measure“. Templeton states that this “fundamental equation is mathematically incorrect in every instance (..) of overlap.” This representation of the ABC approximation is again misleading or misled in that the simulation algorithm ABC produces an approximation to a posterior sample from \pi_i(\theta_i|S^*). The resulting approximation to the marginal likelihood under model M_i is a regular Monte Carlo step that replaces an integral with a weighted sum, not a “goodness-of-fit measure.”  The subsequent argument  of Templeton’s about the goodness-of-fit measures being “not adjusted for the dimensionality of the data” (re-sic!) and the resulting incoherence is therefore void of substance. The following argument repeats an earlier misunderstanding with the probabilistic model involved in Bayesian model choice: the reasoning that, if

\sum_j \Pi_j = 1

“the constraints of logic are violated [and] the prior probabilities used in the very first step of their Bayesian analysis are incoherent”, does not take into account the issue of measures over mutually exclusive spaces.
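As a reminder of what ABC model choice actually computes, here is a bare-bones rejection sketch (my own toy setup, with all tuning choices arbitrary): the model index is drawn from its prior, data are simulated, and the frequency of each model among the accepted draws is a Monte Carlo estimate of its posterior probability given the summary statistic, i.e. a plain numerical approximation with no new inferential principle involved.

```python
import random

# Toy rejection-ABC model choice (all tuning choices illustrative):
#   M1: x_i ~ Normal(0, 1)                      (no free parameter)
#   M2: x_i ~ Normal(theta, 1), theta ~ Normal(0, 1)
# Summary statistic: the sample mean; observed data simulated under M1.

random.seed(3)
n = 25
x_obs = [random.gauss(0.0, 1.0) for _ in range(n)]
s_obs = sum(x_obs) / n

def simulate_mean(model):
    theta = 0.0 if model == 1 else random.gauss(0.0, 1.0)
    return sum(random.gauss(theta, 1.0) for _ in range(n)) / n

accepted = []
eps = 0.1                               # tolerance on the summary statistic
for _ in range(40_000):
    m = random.choice([1, 2])           # uniform prior over the two models
    if abs(simulate_mean(m) - s_obs) < eps:
        accepted.append(m)

p1 = accepted.count(1) / len(accepted)  # Monte Carlo estimate of P(M1 | s_obs)
print(f"acceptances: {len(accepted)}, estimated P(M1 | s_obs): {p1:.2f}")
```

With the uniform model prior, p1/(1 - p1) is also the ABC approximation to the Bayes factor based on the sample mean; shrinking eps and increasing the number of simulations makes the approximation as precise as the computing budget allows.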

“ABC is used for parameter estimation in addition to hypothesis testing and another source of incoherence is suggested from the internal discrepancy between the posterior probabilities generated by ABC and the parameter estimates found by ABC.” Alan Templeton, PNAS, 2010

The point corresponding to the above quote is that, while the posterior probability that \theta_1=0 (model M_1) is much higher than the posterior probability of the opposite (model M_2), the Bayes estimate of \theta_1 under model M_2 is “significantly different from zero“. Again, this reflects both a misunderstanding of the probability model, namely that \theta_1=0 is impossible [has measure zero] under model M_2, and a confusion between confidence intervals (that are model specific) and posterior probabilities (that work across models). The concluding message that “ABC is a deeply flawed Bayesian procedure in which ignorance overwhelms data to create massive incoherence” is thus unsubstantiated.

“Incoherent methods, such as ABC, Bayes factor, or any simulation approach that treats all hypotheses as mutually exclusive, should never be used with logically overlapping hypotheses.” Alan Templeton, PNAS, 2010

In conclusion, I am quite surprised at this controversial piece of work being published in PNAS, as the mathematical and statistical arguments of Professor Templeton should have been assessed by referees who are mathematicians and statisticians, in which case they would have spotted the obvious inconsistencies!