Archive for phylogeography

Bayesian phylogeographic inference of SARS-CoV-2

Posted in Books, Statistics, Travel, University life on December 14, 2020 by xi'an

Nature Communications of 10 October has a paper by Philippe Lemey et al. (incl. Marc Suchard) on incorporating travel history and correcting for sampling bias in the study of the virus' spread. (Which I was asked to review for a CNRS COVID watch platform, Bibliovid.)

The data consists of curated genomes available in GISAID on March 10, that is, before lockdown even started in France, with (trustworthy?) travel history data for over 20% of the sampled patients. (And an unwelcome reminder that Hong Kong is part of China, at a time of repression and “mainlandisation” by the CCP.)

“we model a discrete diffusion process between 44 locations within China, including 13 provinces, one municipality (Beijing), and one special administrative area (Hong Kong). We fit a generalized linear model (GLM) parameterization of the discrete diffusion process…”

The diffusion is actually a continuous-time Markov process, with a phylogeny that incorporates nodes associated with locations. The Bayesian analysis of the model is conducted by MCMC since, contrary to ABC, the likelihood can be computed by Felsenstein’s pruning algorithm. The covariates are used to calibrate the Markov process transitions between locations. The paper also includes a posterior predictive accuracy assessment.
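As a rough sketch of how such a discrete phylogeographic likelihood can be computed (with a made-up three-location example rather than the paper's 44 locations, and entirely hypothetical covariates, coefficients, tree, and branch lengths), the GLM-parameterized transition rates define a CTMC generator whose transition probabilities enter Felsenstein's pruning recursion:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
n = 3                                   # three locations (the paper uses 44)
X = rng.normal(size=(n, n, 2))          # two hypothetical covariates per pair
beta = np.array([0.5, -0.3])            # hypothetical GLM coefficients

# GLM parameterization: log transition rates are linear in the covariates.
rates = np.exp(X @ beta)
np.fill_diagonal(rates, 0.0)
Q = rates - np.diag(rates.sum(axis=1))  # CTMC generator: rows sum to zero

def P(t):
    """Transition probabilities over a branch of length t: P(t) = exp(Qt)."""
    return expm(Q * t)

# Toy tree ((A:0.1, B:0.2):0.3, C:0.4) with tip locations 0, 1 and 2.
def tip(state):
    v = np.zeros(n)
    v[state] = 1.0
    return v

# Felsenstein's pruning: a node's conditional likelihood vector is the
# elementwise product, over its children, of P(branch) @ (child's vector).
L_AB = (P(0.1) @ tip(0)) * (P(0.2) @ tip(1))
L_root = (P(0.3) @ L_AB) * (P(0.4) @ tip(2))

pi = np.full(n, 1.0 / n)                # uniform root frequencies (assumed)
loglik = np.log(pi @ L_root)
```

Since the tree is small, the recursion is written out by hand; in a real phylogeny the same product-and-propagate step is applied in post-order over all internal nodes.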

“…we generate Markov jump estimates of the transition histories that are averaged over the entire posterior in our Bayesian inference.”

In particular the paper describes “travel-aware reconstruction” analyses that track the spatial path followed by a virus until collection, illustrated by a graph of the posterior probability distribution of this path. Given the lack of representativeness, the authors also develop an additional “approach that adds unsampled taxa to assess the sensitivity of inferences to sampling bias”, although it mostly reflects the assumptions made in producing the artificial data (with a possible connection to ABC?). If I understood correctly, they added 458 taxa for 14 locations.

An interesting opening made in the conclusion about the scalability of the approach:

“With the large number of SARS-CoV-2 genomes now available, the question arises how scalable the incorporation of un-sampled taxa will be. For computationally expensive Bayesian inferences, the approach may need to go hand in hand with down-sampling procedures or more detailed examination of specific sub-lineages.”

In the end, I find it hard, as with other COVID-related papers I have read, to check how much the limitations, errors, truncations, etc., attached to the data at hand impact the validation of this phylogeographic reconstruction, and how the model can help beyond reconstructing histories of contamination at a (relatively) early stage.

all models are wrong

Posted in Statistics, University life on September 27, 2014 by xi'an

“Using ABC to evaluate competing models has various hazards and comes with recommended precautions (Robert et al. 2011), and unsurprisingly, many if not most researchers have a healthy scepticism as these tools continue to mature.”

Michael Hickerson just published an open-access letter with the above title in Molecular Ecology. (As in several earlier papers, incl. the (in)famous ones by Templeton, Hickerson confuses running an ABC algorithm with conducting Bayesian model comparison, but this is not the main point of this post.)

“Rather than using ABC with weighted model averaging to obtain the three corresponding posterior model probabilities while allowing for the handful of model parameters (θ, τ, γ, Μ) to be estimated under each model conditioned on each model’s posterior probability, these three models are sliced up into 143 ‘submodels’ according to various parameter ranges.”

The letter is in fact a supporting argument for the earlier paper by Pelletier and Carstens (2014, Molecular Ecology), which conducted the above splitting experiment. I could not read this paper, so I cannot judge the relevance of splitting the parameter range in this way. From what I understand, it amounts to using mutually exclusive priors through different supports.

“Specifically, they demonstrate that as greater numbers of the 143 sub-models are evaluated, the inference from their ABC model choice procedure becomes increasingly.”

An interestingly cut sentence. Increasingly unreliable? mediocre? weak?

“…with greater numbers of models being compared, the most probable models are assigned diminishing levels of posterior probability. This is an expected result…”

True, if the number of models under consideration increases, under a uniform prior over model indices, the posterior probability of a given model mechanically decreases. But the pairwise Bayes factors should not be impacted by the number of models under comparison and the letter by Hickerson states that Pelletier and Carstens found the opposite:

“…pairwise Bayes factor[s] will always be more conservative except in cases when the posterior probabilities are equal for all models that are less probable than the most probable model.”

Which means that the “Bayes factor” in this study is computed as the ratio of a marginal likelihood and of a compound (or super-marginal) likelihood, averaged over all models and hence incorporating the prior probabilities of the model indices as well. I had never encountered such a proposal before. Contrary to the letter’s claim:

“…using the Bayes factor, incorporating all models is perhaps more consistent with the Bayesian approach of incorporating all uncertainty associated with the ABC model choice procedure.”

Besides the needless inclusion of ABC, this is a somewhat confusing sentence, as Bayes factors are not, stricto sensu, Bayesian procedures, since they remove the prior probabilities of the models from the picture.
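To see the distinction numerically (with made-up marginal likelihood values, as I have no access to the actual ones), the pairwise Bayes factor between two models is unaffected by enlarging the model set, whereas the "super-marginal" version against the prior-weighted average over all models is not, and the best model's posterior probability shrinks at the same time:

```python
import numpy as np

def compound_bf(marginals):
    """Ratio of the first model's marginal likelihood to the prior-weighted
    average over all models, under a uniform prior on model indices."""
    m = np.asarray(marginals, dtype=float)
    return m[0] / m.mean()

small = [3.0, 2.0]                    # hypothetical marginal likelihoods
large = [3.0, 2.0, 1.0, 0.5, 0.25]   # same two models plus three weaker ones

# The pairwise Bayes factor m1/m2 ignores the rest of the model set...
bf_pair = small[0] / small[1]
assert bf_pair == large[0] / large[1]

# ...while the compound version moves as weaker models are added, and the
# best model's posterior probability simultaneously decreases.
post_small = np.array(small) / np.sum(small)
post_large = np.array(large) / np.sum(large)
```

Under a uniform prior the posterior odds between models 1 and 2 equal the pairwise Bayes factor in both cases; only the set-dependent quantities change.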

“Although the outcome of model comparison with ABC or other similar likelihood-based methods will always be dependent on the composition of the model set, and parameter estimates will only be as good as the models that are used, model-based inference provides a number of benefits.”

All models are wrong, but the very fact that they are models allows for producing pseudo-data from those models and for checking whether the pseudo-data is similar enough to the observed data, in the components that matter most to the experimenter. Hence a loss function of sorts…
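A minimal sketch of this pseudo-data logic, in the spirit of ABC rejection (everything here is hypothetical: a normal model, two summary statistics, a Euclidean distance, and an arbitrary tolerance):

```python
import numpy as np

rng = np.random.default_rng(0)
obs = rng.normal(1.0, 2.0, size=50)   # stand-in "observed" data

# Summaries the experimenter cares about: the components that matter most.
def summaries(x):
    return np.array([x.mean(), x.std()])

s_obs = summaries(obs)

# ABC rejection under a hypothetical normal model with vague uniform priors:
# keep the parameters whose pseudo-data have summaries close to the observed.
accepted = []
for _ in range(20_000):
    mu = rng.uniform(-5, 5)
    sigma = rng.uniform(0.1, 5)
    pseudo = rng.normal(mu, sigma, size=50)
    dist = np.linalg.norm(summaries(pseudo) - s_obs)  # the loss of sorts
    if dist < 0.5:
        accepted.append((mu, sigma))
```

The distance on the chosen summaries is precisely where the experimenter encodes which components of the data the model is required to reproduce.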