## Bayesian phylogeographic inference of SARS-CoV-2

Posted in Books, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , on December 14, 2020 by xi'an

Nature Communications of 10 October has a paper by Philippe Lemey et al. (incl. Marc Suchard) on including travel history and removing sampling bias on the study of the virus spread. (Which I was asked to review for a CNRS COVID watch platform, Bibliovid.)

The data is made of curated genomes available in GISAID on March 10, that is, before lockdown even started in France. With (trustworthy?) travel history data for over 20% of the sampled patients. (And an unwelcome reminder that Hong Kong is part of China, at a time of repression and “mainlandisation” by the CCP.)

“we model a discrete diffusion process between 44 locations within China, including 13 provinces, one municipality (Beijing), and one special administrative area (Hong Kong). We fit a generalized linear model (GLM) parameterization of the discrete diffusion process…”

The diffusion is actually a continuous-time Markov process, with a phylogeny that incorporates nodes associated with location. The Bayesian analysis of the model is made by MCMC, since, contrary to ABC, the likelihood can be computed by Felsenstein’s pruning algorithm. The covariates are used to calibrate the Markov process transitions between locations. The paper also includes a posterior predictive accuracy assessment.

“…we generate Markov jump estimates of the transition histories that are averaged over the entire posterior in our Bayesian inference.”

In particular the paper describes “travel-aware reconstruction” analyses that track the spatial path followed by a virus until collection, as below. The top graph represents the posterior probability distribution of this path.Given the lack of representativity, the authors also develop an additional “approach that adds unsampled taxa to assess the sensitivity of inferences to sampling bias”, although it mostly reflects the assumptions made in producing the artificial data. (With a possible connection with ABC?). If I understood correctly, they added 458 taxa for 14 locations,

An interesting opening made in the conclusion about the scalability of the approach:

“With the large number of SARS-CoV-2 genomes now available, the question arises how scalable the incorporation of un-sampled taxa will be. For computationally expensive Bayesian inferences, the approach may need to go hand in hand with down-sampling procedures or more detailed examination of specific sub-lineages.”

In the end, I find it hard, as with other COVID-related papers I read, to check how much the limitations, errors, truncations, &tc., attached with the data at hand impact the validation of this philogeographic reconstruction, and how the model can help further than reconstructing histories of contamination at the (relatively) early stage.

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , , , , , , , on December 9, 2013 by xi'an

This week, thanks to a lack of clear instructions (from me) to my students in the Reading Classics student seminar, four students showed up with a presentation! Since I had planned for two teaching blocks, three of them managed to fit within the three hours, while the last one nicely accepted to wait till next week to present a paper by David Cox…

The first paper discussed therein was A new look at the statistical model identification, written in 1974 by Hirotugu Akaike. And presenting the AIC criterion. My student Rozan asked to give the presentation in French as he struggled with English, but it was still a challenge for him and he ended up being too close to the paper to provide a proper perspective on why AIC is written the way it is and why it is (potentially) relevant for model selection. And why it is not such a definitive answer to the model selection problem. This is not the simplest paper in the list, to be sure, but some intuition could have been built from the linear model, rather than producing the case of an ARMA(p,q) model without much explanation. (I actually wonder why the penalty for this model is (p+q)/T, rather than (p+q+1)/T for the additional variance parameter.) Or simulation ran on the performances of AIC versus other xIC’s…

The second paper was another classic, the original GLM paper by John Nelder and his coauthor Wedderburn, published in 1972 in Series B. A slightly easier paper, in that the notion of a generalised linear model is presented therein, with mathematical properties linking the (conditional) mean of the observation with the parameters and several examples that could be discussed. Plus having the book as a backup. My student Ysé did a reasonable job in presenting the concepts, but she would have benefited from this extra-week in including properly the computations she ran in R around the glm() function… (The definition of the deviance was somehow deficient, although this led to a small discussion during the class as to how the analysis of deviance was extending the then flourishing analysis of variance.) In the generic definition of the generalised linear models, I was also reminded of the
generality of the nuisance parameter modelling, which made the part of interest appear as an exponential shift on the original (nuisance) density.

The third paper, presented by Bong, was yet another classic, namely the FDR paper, Controlling the false discovery rate, of Benjamini and Hochberg in Series B (which was recently promoted to the should-have-been-a-Read-Paper category by the RSS Research Committee and discussed at the Annual RSS Conference in Edinburgh four years ago, as well as published in Series B). This 2010 discussion would actually have been a good start to discuss the paper in class, but Bong was not aware of it and mentioned earlier papers extending the 1995 classic. She gave a decent presentation of the problem and of the solution of Benjamini and Hochberg but I wonder how much of the novelty of the concept the class grasped. (I presume everyone was getting tired by then as I was the only one asking questions.) The slides somewhat made it look too much like a simulation experiment… (Unsurprisingly, the presentation did not include any Bayesian perspective on the approach, even though they are quite natural and emerged very quickly once the paper was published. I remember for instance the Valencia 7 meeting in Teneriffe where Larry Wasserman discussed about the Bayesian-frequentist agreement in multiple testing.)

## Core [still] minus one…

Posted in Books, pictures, R, Running, Statistics, Travel, University life with tags , , , , , , on September 23, 2012 by xi'an

Another full day spent working with Jean-Michel Marin on the new edition of Bayesian Core (soon to be Bayesian Essentials with R!) and the remaining hierarchical Bayes chapter… I have reread and completed the regression and GLM chapters, sent to very friendly colleagues for a last round of comments. Now, I am essentially idle, waiting for Jean-Michel to finish his part on the hierarchical Bayes chapter, so that I can do the final editing.round. Jean-Michel had a very long day on that chapter, leaving Montpellier at 5am to return only at half past midnight, due to massive delays in the train schedule (which is why I always fly to Montpellier…)

## ASC 2012 (#3, also available by mind-reading)

Posted in Running, Statistics, University life with tags , , , , , , , , , on July 13, 2012 by xi'an

This final morning at the ASC 2012 conference in Adelaide, I attended a keynote lecture by Sophia Rabe-Hesketh on GLMs that I particularly appreciated, as I am quite fond of those polymorphous and highly adaptable models (witness the rich variety of applications at the INLA conference in Trondheim last month). I then gave my talk on ABC model choice, trying to cover the three episodes in the series within the allocated 40 minutes (and got from Terry Speed the trivia information that Renfrey Potts, father to the Potts model, spent most of his life in Adelaide, where he died in 2005! Terry added that he used to run along the Torrens river, being a dedicated marathon runner. This makes Adelaide the death place of both R.A. Fisher and R. Potts.)

Later in the morning, Christl Donnelly  gave a fascinating talk on her experiences with government bodies during the BSE and foot-and-mouth epidemics in Britain in the past decades. It was followed by  a frankly puzzling [keynote Ozcots] talk delivered by Jessica Utts on the issue of parapsychology tests, i.e. the analysis of experiments testing for “psychic powers”. Nothing less. Actually, I first thought this was a pedagogical trick to capture the attention of students and debunk, however Utts’ focus on exhibiting such “powers” was definitely dead serious and she concluded that “psychic functioning appears to be a real effect”. So it came as a shock that she was truly believing in psychic paranormal abilities! I had been under the wrong impression that the 2005 Statistical Science paper of hers was demonstrating the opposite but it clearly belongs to the tradition of controversial Statistical Science that started with the Bible code paper… I also found it flabbergasting to learn that the U.S. Army is/was funding research in this area and is/was actually employing “psychics”, as well that the University of Edinburgh has a parapsychology unit within the department of psychology. (But, after all, UK universities also have long had schools of Divinity, so let the irrational in a while ago!) Continue reading

## Hyper-g priors

Posted in Books, R, Statistics with tags , , , , , , on August 31, 2010 by xi'an

Earlier this month, Daniel Sabanés Bové and Leo Held posted a paper about g-priors on arXiv. While I glanced at it for a few minutes, I did not have the chance to get a proper look at it till last Sunday. The g-prior was first introduced by the late Arnold Zellner for (standard) linear models, but they can be extended to generalised linear models (formalised by the late John Nelder) at little cost. In Bayesian Core, Jean-Michel Marin and I do centre the prior modelling in both linear and generalised linear models around g-priors, using the naïve extension for generalised linear models,

$\beta \sim \mathcal{N}(0,g \sigma^2 (\mathbf{X}^\text{T}\mathbf{X})^{-1})$

as in the linear case. Indeed, the reasonable alternative would be to include the true information matrix but since it depends on the parameter $\beta$ outside the normal case this is not truly an alternative. Bové and Held propose a slightly different version

$\beta \sim \mathcal{N}(0,g \sigma^2 c (\mathbf{X}^\text{T}\mathbf{W}\mathbf{X})^{-1})$

where W is a diagonal weight matrix and c is a family dependent scale factor evaluated at the mode 0. As in Liang et al. (2008, JASA) and most of the current literature, they also separate the intercept $\beta_0$ from the other regression coefficients. They also burn their “improperness joker” by choosing a flat prior on $\beta_0$, which means they need to use a proper prior on g, again as Liang et al. (2008, JASA), for the corresponding Bayesian model comparison to be valid. In Bayesian Core, we do not separate $\beta_0$ from the other regression coefficients and hence are left with one degree of freedom that we spend in choosing an improper prior on g instead. (Hence I do not get the remark of Bové and Held that our choice “prohibits Bayes factor comparisons with the null model“. As argued in Bayesian Core, the factor g being an hyperparameter shared by all models, we can use the same improper prior on g in all models and hence use standard Bayes factors.) In order to achieve closed form expressions, the authors use Cui and George ‘s (2008) prior

$\pi(g) \propto (1+g)^{1+a}\exp\{-b/(1+g)\}$

which requires the two hyper-hyper-parameters a and b to be specified.

The second part of the paper considers computational issues. It compares the ILA solution of Rue, Martino and Chopin (2009, Series B) with an MCMC solution based on an independent proposal on g resulting from linear interpolations (?). The marginal likelihoods are approximated by Chib and Jeliazkov (2001, JASA) for the MCMC part. Unsurprisingly, ILA does much better, even with a 97% acceptance rate in the MCMC algorithm.

The paper is very well-written and quite informative about the existing literature. It also uses the Pima Indian dataset  (The authors even dug out a 1991 paper of mine I had completely forgotten!) I am actually thinking of using the review in our revision of Bayesian Core, even though I think we should stick to our choice of including $\beta_0$ within the set of parameters…