Archive for Markov process

Xing glass bridges [or not]

Posted in Books, Kids, pictures, R, Statistics on November 10, 2021 by xi'an

A riddle from the Riddler surfing on Squid Game. Evaluating the number of survivors (out of 16 players) able to X the glass bridge, when said bridge is made of 18 consecutive steps, each involving a choice between a tempered and a non-tempered glass square. Stepping on a non-tempered square means death, while all following players are aware of the paths of the earlier ones. Each player thus moves at least one step further than the previous (and unlucky) player. The total number of steps used by the players is therefore a Negative Binomial Neg(16,½) variate truncated at 19 (when counting attempts rather than failures), with the probability of reaching 19 being .999. When counting the number of survivors, a direct simulation gives an estimate very close to 7:

   # 10^6 replications: each player's number of attempts is Geometric(½)+1,
   # and the k-th player crosses when the cumulated attempts exceed 18
   mean(apply(apply(matrix(rgeom(16*1e6,.5)+1,nc=16),1,cumsum)>18,2,sum))

but the expectation is not exactly 7! Indeed, this value is a sum of probabilities that the cumulated sums of Geometric variates are larger than 18, which has no closed form as far as I can see

   sum(1-pnbinom(size=1:16,q=17:2,prob=.5))

but whose value is 7.000076. In the Korean TV series, there are only three survivors, which would have had a .048 probability of occurring. (Not accounting for the fact that one player was temporarily able to figure out which square was right and that two players fell through at the same time.)
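As a cross-check outside R, the same quantities can be recomputed from scratch in Python, with a hand-rolled Negative Binomial cdf following the same failures-before-successes convention as R's pnbinom (a sketch; the .048 figure is read here as the probability that the first thirteen players all fall, hence at most three survivors):

```python
from math import comb

def nbinom_cdf(q, size, p=0.5):
    # P(F <= q) for F ~ Neg(size, p), with F counting the failures
    # before the size-th success (R's pnbinom convention)
    return sum(comb(size + j - 1, j) * p**size * (1 - p)**j
               for j in range(q + 1))

# expected number of survivors: player k crosses when the failures of
# the first k players exceed the 18 - k remaining unknown squares
esurv = sum(1 - nbinom_cdf(18 - k, k) for k in range(1, 17))
print(round(esurv, 6))               # 7.000076, as claimed

# probability that the first 13 players all fall (at most 3 survivors)
print(round(nbinom_cdf(5, 13), 3))   # 0.048
```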

Looking later at on-line discussions, I found that the question was quite popular, with a whole spectrum of answers… Including a wrong Binomial B(18, ½) modelling that does not account for the fact that all 16 (incredibly unlucky) players could have died before the last steps.

And reading the solution on The Riddler a week later, I was sorry to see this representation of the distribution of survivors, as if it was a continuous distribution!

Bayesian phylogeographic inference of SARS-CoV-2

Posted in Books, Statistics, Travel, University life on December 14, 2020 by xi'an

Nature Communications of 10 October has a paper by Philippe Lemey et al. (incl. Marc Suchard) on incorporating travel history and addressing sampling bias in the study of the spread of the virus. (Which I was asked to review for Bibliovid, a CNRS COVID watch platform.)

The data is made of curated genomes available on GISAID as of March 10, that is, before the lockdown even started in France, with (trustworthy?) travel history data for over 20% of the sampled patients. (And an unwelcome reminder that Hong Kong is part of China, at a time of repression and “mainlandisation” by the CCP.)

“we model a discrete diffusion process between 44 locations within China, including 13 provinces, one municipality (Beijing), and one special administrative area (Hong Kong). We fit a generalized linear model (GLM) parameterization of the discrete diffusion process…”

The diffusion is actually a continuous-time Markov process, over a phylogeny whose nodes are associated with locations. The Bayesian analysis of the model is conducted by MCMC since, contrary to ABC settings, the likelihood can be computed exactly by Felsenstein’s pruning algorithm. The covariates are used to calibrate the transition rates of the Markov process between locations. The paper also includes a posterior predictive accuracy assessment.
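Felsenstein’s pruning algorithm is what makes the exact likelihood tractable: partial likelihoods are propagated from the tips of the phylogeny up to the root. A minimal sketch on a made-up two-state CTMC and a three-leaf toy tree (rates, branch lengths, and topology are all invented for illustration, far from the 44-location model of the paper):

```python
from itertools import product
from math import exp

A, B = 1.0, 2.0                      # made-up CTMC rates, 0->1 and 1->0

def pmat(t):
    """Transition matrix P(t) of the two-state CTMC, in closed form."""
    s, e = A + B, exp(-(A + B) * t)
    return [[B/s + A/s*e, A/s*(1 - e)],
            [B/s*(1 - e), A/s + B/s*e]]

def partial(node, tips):
    """Pruning recursion: P(tip data below node | state at node)."""
    if isinstance(node, str):        # tip: indicator of the observed state
        return [1.0 if s == tips[node] else 0.0 for s in (0, 1)]
    (left, tl), (right, tr) = node   # internal node with two timed branches
    pl, pr = partial(left, tips), partial(right, tips)
    Pl, Pr = pmat(tl), pmat(tr)
    return [sum(Pl[s][j] * pl[j] for j in (0, 1)) *
            sum(Pr[s][j] * pr[j] for j in (0, 1)) for s in (0, 1)]

def likelihood(tips):
    """Toy tree ((B,C),A), invented branch lengths, stationary root."""
    tree = (("A", 0.3), ((("B", 0.2), ("C", 0.2)), 0.1))
    pi = [B/(A + B), A/(A + B)]      # stationary distribution at the root
    return sum(p * q for p, q in zip(pi, partial(tree, tips)))

# sanity check: the likelihood sums to one over all 2^3 tip assignments
total = sum(likelihood(dict(zip("ABC", s))) for s in product((0, 1), repeat=3))
```

(In the paper, the same recursion runs over the sampled genomes with a 44×44 rate matrix whose entries are given by the GLM parameterization on the covariates.)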

“…we generate Markov jump estimates of the transition histories that are averaged over the entire posterior in our Bayesian inference.”

In particular the paper describes “travel-aware reconstruction” analyses that track the spatial path followed by a virus until collection; in the paper, the top graph represents the posterior probability distribution of this path. Given the lack of representativeness, the authors also develop an additional “approach that adds unsampled taxa to assess the sensitivity of inferences to sampling bias”, although it mostly reflects the assumptions made in producing the artificial data. (With a possible connection to ABC?) If I understood correctly, they added 458 taxa for 14 locations.

An interesting opening made in the conclusion about the scalability of the approach:

“With the large number of SARS-CoV-2 genomes now available, the question arises how scalable the incorporation of un-sampled taxa will be. For computationally expensive Bayesian inferences, the approach may need to go hand in hand with down-sampling procedures or more detailed examination of specific sub-lineages.”

In the end, I find it hard, as with other COVID-related papers I read, to assess how much the limitations, errors, truncations, &tc., attached to the data at hand impact the validation of this phylogeographic reconstruction, and how the model can help beyond reconstructing histories of contamination at this (relatively) early stage.
