Archive for phylogenetic models

over-confident about mis-specified models?

Posted in Books, pictures, Statistics, University life on April 30, 2019 by xi'an

Ziheng Yang and Tianqi Zhu published a paper in PNAS last year that criticises Bayesian posterior probabilities used in the comparison of models under misspecification as “overconfident”. The paper is written from a phylogeneticist’s point of view, rather than from a statistician’s perspective, as shown by the Editor in charge of the paper [although I thought that, after Steve Fienberg’s intervention!, a statistician had to be involved in a submission relying on statistics!], but the analysis is rather problematic, at least seen through my own lenses… With no statistical novelty, apart from looking at the distribution of posterior probabilities in toy examples. The starting argument is that Bayesian model comparison often reports posterior probabilities in favour of a particular model that are close or even equal to 1.

“The Bayesian method is widely used to estimate species phylogenies using molecular sequence data. While it has long been noted to produce spuriously high posterior probabilities for trees or clades, the precise reasons for this over confidence are unknown. Here we characterize the behavior of Bayesian model selection when the compared models are misspecified and demonstrate that when the models are nearly equally wrong, the method exhibits unpleasant polarized behaviors, supporting one model with high confidence while rejecting others. This provides an explanation for the empirical observation of spuriously high posterior probabilities in molecular phylogenetics.”

The paper focuses on the tendency of posterior probabilities to strongly support one model against the others when the sample size is large enough, “even when” all models are wrong, the argument being apparently that the correct output should be one of equal probability between models, or maybe a uniform distribution of these model probabilities over the probability simplex. Why should it be so?! The construction of the posterior probabilities is based on a meta-model that assumes the generating model to be part of a list of mutually exclusive models. It does not account for cases where “all models are wrong” or cases where “all models are right”. The reported probability is furthermore epistemic, in that it is relative to the measure defined by the prior modelling, not to a promise of a frequentist stabilisation in an ill-defined asymptotia. By which I mean that a 99.3% probability of model M¹ being “true” does not have a universal and objective meaning. (Moderation note: the high polarisation of posterior probabilities was instrumental in our investigation of model choice with ABC tools and in proposing instead error rates in ABC random forests.)

The notion that two models are equally wrong because they are both exactly at the same Kullback-Leibler distance from the generating process (when optimised over the parameter) is such a formal [or cartoonesque] notion that it does not make much sense. There is always one model that is slightly closer and eventually takes over. It is also bizarre that the argument does not account for the complexity of each model and the resulting (Occam’s razor) penalty. Even two models with a single parameter are not necessarily of intrinsic dimension one, as shown by DIC. And thus it is not a surprise if the posterior probability mostly favours one versus the other. In any case, a healthily sceptical approach to Bayesian model choice means looking at the behaviour of the procedure (Bayes factor, posterior probability, posterior predictive, mixture weight, &tc.) under various assumptions (model M¹, M², &tc.) to calibrate the numerical value, rather than taking it at face value. By which I do not mean a frequentist evaluation of this procedure. Actually, it is rather surprising that the authors of the PNAS paper do not jump on the case when the posterior probability of model M¹, say, is uniformly distributed, since this would be a perfect setting where the posterior probability is a p-value. (This is also what happens to the bootstrapped version, see the last paragraph of the paper on p.1859, the year Darwin published his Origin of Species.)
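The polarisation at stake is easy to reproduce in a toy setting. The following is only a minimal sketch and not the simulation of the PNAS paper: it assumes data generated from N(0,1) and two parameter-free models, N(-0.5,1) and N(0.5,1), which are at exactly the same Kullback-Leibler distance from the truth, with prior weights 1/2 each. The function names and the 0.95 cut-off are arbitrary choices made for the illustration.

```python
import numpy as np


def posterior_prob_m1(x, mu1=-0.5, mu2=0.5):
    """Posterior probability of M1: N(mu1, 1) against M2: N(mu2, 1).

    Both models get prior weight 1/2; x has shape (replications, n).
    """
    # log Bayes factor = sum of log f1(x_i) - log f2(x_i)
    log_bf = np.sum(0.5 * (x - mu2) ** 2 - 0.5 * (x - mu1) ** 2, axis=-1)
    # logistic transform of the log Bayes factor, written in a stable form
    return 0.5 * (1.0 + np.tanh(0.5 * log_bf))


def polarised_fraction(n, reps=2000, cut=0.95, seed=0):
    """Fraction of replicated datasets of size n for which one of the two
    (equally wrong) models receives posterior probability above `cut`."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, size=(reps, n))  # the truth is N(0, 1)
    p = posterior_prob_m1(x)
    return float(np.mean((p > cut) | (p < 1.0 - cut)))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, polarised_fraction(n))
```

Under these assumptions the log Bayes factor is minus the sum of the observations, hence distributed as N(0,n), so the fraction of strongly polarised datasets grows towards one with n. Moving mu1 slightly towards zero makes M1 the closer model, which then takes over as n increases.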

Nature snapshot [Volume 539 Number 7627]

Posted in Books, Statistics, University life on November 15, 2016 by xi'an

A number of entries of interest [to me] in that Nature issue: from the Capuchin monkeys that break stones in a way that resembles early hominins’ biface tools, to the persistent association between some sounds and some meanings across numerous languages, to the use of infected mosquitoes in South America to fight Zika, to the call for more maths in psychiatry by the NIMH director, where, since prediction is mentioned, I presume stats is included, to the potentially earthshaking green power revolution in Africa, to the reconstruction of the first HIV strains in North America, along with the deconstruction of the “Patient 0” myth, helped by Bayesian phylogenetic analyses, to a coverage of the Open Syllabus Project, with Monte Carlo Statistical Methods arriving first [in the Monte Carlo list]…

“Observations should not converge on one model but aim to find anomalies that carry clues about the nature of dark matter, dark energy or initial conditions of the Universe. Further observations should be motivated by testing unconventional interpretations of those anomalies (such as exotic forms of dark matter or modified theories of gravity). Vast data sets may contain evidence for unusual behaviour that was unanticipated when the projects were conceived.” Avi Loeb

One editorial particularly drew my attention, Good data are not enough, by the astronomer Avi Loeb. As illustrated by the quote above, Loeb objects to data being interpreted and even to data being collected towards the assessment of the standard model. While I agree that this model contains a lot of fudge factors like dark matter and dark energy, which apparently constitute most of the available matter, the discussion is quite curious, in that interpreting data according to alternative theories sounds impossible and certainly beyond the reach of most PhD students [as Loeb criticises the analysis of some data in a recent thesis he evaluated].

“modern cosmology is augmented by unsubstantiated, mathematically sophisticated ideas — of the multiverse, anthropic reasoning and string theory.”

The author argues to always allow for alternative interpretations of the data, which sounds fine at a primary level but again calls for the conception of such alternative models. When discrepancies are found between the standard model and the data, they can be due to errors in the measurement itself, in the measurement model, or in the theoretical model. However, they may be impossible to analyse outside the model, in the neutral way called for and wished for by Loeb. Designing neutral experiments sounds even less meaningful. Which is why I am fairly taken aback by the call for “a research frontier [that] should maintain at least two ways of interpreting data so that new experiments will aim to select the correct one”! Why two and not more?! And which ones?! I am not aware of fully developed alternative theories and cannot see how experiments designed under one model could produce indications about a new and incomplete model.

“Such simple, off-the-shelf remedies could help us to avoid the scientific fate of the otherwise admirable Mayan civilization.”

Hence I am bemused by the whole exercise, whose deepest arguments seem to be a paper written by the author last year and an interdisciplinary centre on black holes also launched recently by the same author.

Methodological developments in evolutionary genomics [3 years postdoc in Montpellier]

Posted in pictures, Statistics, Travel, University life, Wines on November 26, 2014 by xi'an

[Here is a call for a post-doctoral position in Montpellier, South of France, not Montpelier, Vermont!, in a population genetics group with whom I am working. Highly recommended if you are currently looking for a postdoc!]

Three-year post-doctoral position at the Institute of Computational Biology (IBC), Montpellier (France) :
Methodological developments in evolutionary genomics.

One young investigator position opens immediately at the Institute for Computational Biology (IBC) of Montpellier (France) to work on the development of innovative inference methods and software in population genomics or phylogenetics to analyze large-scale genomic data in the fields of health, agronomy and environment (Work Package 2 « evolutionary genomics » of the IBC). The candidate will develop their own research on some of the following topics: selective processes, demographic history, spatial genetic processes, reconstruction of very large phylogenies, gene/species tree reconciliation, using maximum likelihood, Bayesian and simulation-based inference. We are seeking a candidate with a strong background in mathematical and computational evolutionary biology, with interest in applications and software development. The successful candidate will work on their own project, built in collaboration with any researcher involved in the WP2 project and working at the IBC labs (AGAP, CBGP, ISEM, I3M, LIRMM, MIVEGEC).

IBC hires young investigators, typically with a PhD plus some post-doc experience, a high level of publishing, strong communication abilities, and a taste for multidisciplinary research. Working full-time at IBC, these young researchers will play a key role in Institute life. Most of their time will be devoted to scientific projects. In addition, they are expected to actively participate in the coordination of workpackages, in the hosting of foreign researchers and in the organization of seminars and events (summer schools, conferences…). In exchange, these young researchers will benefit from an exceptional environment thanks to the presence of numerous leading international researchers, not to mention significant autonomy for their work. Montpellier hosts one of the most vibrant communities of biodiversity research in Europe with several research centers of excellence in the field. This position is open for up to 3 years with a salary well above the French post-doc standards. Starting date is open to discussion.

 The application deadline is January 31, 2015.

Living in Montpellier: http://www.agropolis.org/english/guide/index.html

 

Contacts at WP2 « Evolutionary Genetics » :

 

Jean-Michel Marin : http://www.math.univ-montp2.fr/~marin/

François Rousset : http://www.isem.univ-montp2.fr/recherche/teams/evolutionary-genetics/staff/roussetfrancois/?lang=en

Vincent Ranwez : https://sites.google.com/site/ranwez/

Olivier Gascuel : http://www.lirmm.fr/~gascuel/

Submit my application : http://www.ibc-montpellier.fr/open-positions/young-investigators#wp2-evolution

Semi-automatic ABC [revised]

Posted in Statistics on April 18, 2011 by xi'an

Paul Fearnhead and Dennis Prangle have posted a revised version of their semi-automatic ABC paper. Compared with the earlier version I commented on in a previous post, the paper makes a better case for the ABC algorithm, considered there from a purely inferential viewpoint and calibrated for estimation purposes. In particular, the paper contains an important result in the form of a consistency theorem that shows that ABC is a convergent estimation method when the number of observations or datasets grows to infinity. I had not seen this result before and it definitely is an argument to remember when presenting ABC methods to newcomers.

Of course, I still remain skeptical about the “optimality” resulting from the choice of summary statistics in the paper, partly because

  • practice shows that proper approximation to genuine posterior distributions stems from using a (much) larger number of summary statistics than the dimension of the parameter;
  • the validity of the approximation to the optimal summary statistics depends on the quality of the pilot run;
  • important inferential issues like model choice are not covered by this approach.

But, nonetheless, the paper provides a way to construct default summary statistics that should come as a supplement to summary statistics provided by the experts, if not as a substitute.
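For readers new to the approach, here is a rough sketch of the semi-automatic idea as I understand it: a pilot run of simulations is used to regress the parameter on the data, and the fitted regression (an estimate of the posterior mean) then serves as the summary statistic in a standard ABC rejection step. Everything specific below is an assumption made for the illustration and not taken from the paper: the toy model (n observations from N(θ,1) with a N(0,9) prior), the use of the raw observations as regressors, the function names, and the 1% acceptance quantile.

```python
import numpy as np

PRIOR_SD = 3.0  # hypothetical prior: theta ~ N(0, PRIOR_SD^2)


def simulate(theta, n, rng):
    """One dataset of n observations from N(theta_i, 1) per value theta_i."""
    return rng.normal(theta[:, None], 1.0, size=(theta.size, n))


def fit_summary(n, rng, pilot_size=5000):
    """Pilot run: regress theta on the simulated data by least squares and
    return the fitted linear predictor as the summary statistic."""
    theta = rng.normal(0.0, PRIOR_SD, pilot_size)
    x = simulate(theta, n, rng)
    design = np.column_stack([np.ones(pilot_size), x])
    coef, *_ = np.linalg.lstsq(design, theta, rcond=None)
    return lambda data: coef[0] + data @ coef[1:]


def abc_rejection(x_obs, summary, rng, n_sim=50000, keep=0.01):
    """Keep the prior draws whose simulated summary is closest to the
    observed summary (the closest `keep` fraction of the simulations)."""
    theta = rng.normal(0.0, PRIOR_SD, n_sim)
    x = simulate(theta, x_obs.size, rng)
    dist = np.abs(summary(x) - summary(x_obs))
    return theta[dist <= np.quantile(dist, keep)]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x_obs = rng.normal(2.0, 1.0, 20)
    post = abc_rejection(x_obs, fit_summary(20, rng), rng)
    exact_mean = x_obs.sum() / (x_obs.size + 1.0 / PRIOR_SD**2)
    print(post.mean(), exact_mean)
```

In this conjugate toy case the exact posterior mean is available in closed form, which gives a check on the ABC output. It also illustrates the second bullet above: the quality of the summary statistic is only as good as the pilot regression.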

The new version of Section 3 is much more satisfactory wrt the criticisms I voiced earlier, spelling out the computing cost and more importantly the connection with indirect inference. A clear strength of the paper remains with Section 4 which provides a major simulation experiment. My only criticism is the absence of a phylogeny example that would relate to the models that launched ABC methods. This is less of a mainstream statistics example, but it would be highly convincing to those primary users of ABC.

In conclusion, I find the paper both exciting and bringing new questions to the front. The appeal of this new field and the particularly highly debated issue of the choice of summary statistics will certainly create the opportunity for a wide discussion by the ABC community, were it to become a discussion paper.

Lack of confidence is back

Posted in Statistics, University life on April 4, 2011 by xi'an

While in Bristol, I received the very good news that our ABC model choice submission to PNAS had passed the first round, namely, the manuscript was not rejected but instead the editor asks for a revision clarifying our message and the difference with the Bayesian Analysis paper. This revision should thus be manageable, even though the most negative reviewer considers that ABC model choice works rather when using summary statistics that are informative for choosing between the models, a point that sounds rather tautological to me, hence difficult to answer. It is indeed always possible to come up with summary statistics that make the ABC Bayes factor close to the genuine Bayes factor; however, this requires a huge computing effort in order to validate the choice by cross-validation. Anyway, I have rather good hopes the paper could eventually be accepted in PNAS, which is obviously a quite exciting prospect! (Thanks to Michael Stumpf for pointing out the missing sentence!)

Incoherent phylogeographic inference [accepted]

Posted in Statistics, University life on August 30, 2010 by xi'an

The letter we submitted to PNAS about Templeton’s surprising diatribe on Bayesian inference has now been accepted:

Title: “Incoherent Phylogeographic Inference”
Tracking #: 2010-08762
Authors: Berger et al.

Dear Prof. Robert,
We are pleased to inform you that the PNAS Editorial Board has given final approval of your letter to the Editor for online publication. The author(s) of the published manuscript have been invited to respond to your feedback. If they provide a response, it may appear online concurrently with your letter.

Now we are looking forward (?) to Alan Templeton’s answer, even though I suspect this short letter is not going to have any impact on his views!

Evidence and evolution (5)

Posted in Books, Statistics on April 29, 2010 by xi'an

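“Tout étant fait pour une fin, tout est nécessairement pour la meilleure fin. Remarquez bien que les nez ont été faits pour porter des lunettes, aussi avons-nous des lunettes.” [Everything being made for an end, everything is necessarily for the best end. Note well that noses were made to bear spectacles, and so we have spectacles.] Voltaire, Candide, Chapitre 1.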

I am now done with my review of Sober’s Evidence and Evolution: The Logic Behind the Science. Posting about each chapter along the way helped me a lot to write down the review over the past few days. Its conclusion is that

Evidence and Evolution is very well-written, with hardly any typos (the unbiasedness property of AIC is stated at the bottom of page 101 with the expectation symbol E on the wrong side of the equation, Figure 3.8c is used instead of Figure 3.7c on page 204, Figure 4.7 is used instead of Figure 4.8 on page 293, Simon Tavaré’s name is always spelled Taveré, vaules rather than values is repeated four times on page 339). The style is sometimes too light and often too verbose, with an abundance of analogies that I regard as sidetracking, but this makes for easier reading (except for the sentence “the key to answering the second question is that the observation that X = 1 and Y = 1 produces stronger evidence favoring CA over SA the lower the probability is that the ancestors postulated by the two hypotheses were in state 1”, on page 314, that still eludes me!). As detailed in this review, I have points of contention with the philosophical views about testing in Evidence and Evolution as well as about the methods exposed therein, but this does not detract from the appeal of reading the book. (The lack of completely worked out statistical hypotheses in realistic settings remains the major issue in my criticism of the book.) While the criticisms of the Bayesian paradigm are often shallow (like the one on page 97 ridiculing Bayesians drawing inference based on a single observation), there is nothing fundamentally wrong with the statistical foundations of the book. I therefore repeat my earlier recommendation in favour of Evidence and Evolution, Chapters 1 and (paradoxically) 5 being the easier entries. Obviously, readers familiar with Sober’s earlier papers and books will most likely find a huge overlap with those but others will gather Sober’s viewpoints on the notion of testing hypotheses in a (mostly) unified perspective.

And, as illustrated by the above quote, I found the sentence from Voltaire’s Candide I wanted to include. Of course, this 12-page review may be overly long for the journal it was intended for, Human Genetics, in which case I will have to find another outlet for the current arXived version. But I enjoyed reading this book with a pencil and gathered enough remarks along the way to fill those twelve pages.