## prime suspects [book review]

Posted in Books, Kids, University life with tags , , , , , , , , , , , , , , on August 6, 2019 by xi'an

I was contacted by Princeton University Press to comment on the comic book/graphic novel Prime Suspects (The Anatomy of Integers and Permutations), by Andrew Granville (mathematician) & Jennifer Granville (writer), and Robert Lewis (illustrator), and they sent me the book. I am not a big fan of graphic book entries to mathematical even less than to statistical notions (Logicomix being sort of an exception for its historical perspective and nice drawing style) and this book did nothing to change my perspective on the subject. First, the plot is mostly a pretense at introducing number theory concepts and I found it hard to follow it for more than a few pages. The [noires maths] story is that “forensic maths” detectives are looking at murders that connects prime integers and permutations… The ensuing NCIS-style investigation gives the authors the opportunity to skim through the whole cenacle of number theorists, plus a few other mathematicians, who appear as more or less central characters. Even illusory ones like Nicolas Bourbaki. And Alexander Grothendieck as a recluse and clairvoyant hermit [who in real life did not live in a Pyrénées cavern!!!]. Second, I [and nor is Andrew who was in my office when the book arrived!] am not particularly enjoying the drawings or the page composition or the colours of this graphic novel, especially because I find the characters drawn quite inconsistently from one strip to the next, to the point of being unrecognisable, and, if it matters, hardly resembling their real-world equivalent (as seen in the portrait of Persi Diaconis). To be completely honest, the drawings look both ugly and very conventional to me, in that I do not find much of a characteristic style to them. To contemplate what Jacques TardiFrançois Schuiten or José Muñoz could have achieved with the same material… (Or even Edmond Baudoin, who drew the strips for the graphic novels he coauthored with Cédric Villani.) The graphic novel (with a prime 181 pages) is postfaced with explanations about the true persons behind the characters, from Carl Friedriech Gauß to Terry Tao, and of course on the mathematical theory for the analogies between the prime and cycles frequencies behind the story. Which I find much more interesting and readable, obviously. (With a surprise appearance of Kingman’s coalescent!) But also somewhat self-defeating in that so much has to be explained on the side for the links between the story, the characters and the background heavily loaded with “obscure references” to make sense to more than a few mathematician readers. Who may prove to be the core readership of this book.

There is also a bit of a Gödel-Escher-and-Bach flavour in that a piece by Robert Schneider called Réverie in Prime Time Signature is included, while an Escher’s infinite stairway appears in one page, not far from what looks like Milano Vittorio Emmanuelle gallery (On the side, I am puzzled by the footnote on p.208 that “I should clarify that selecting a random permutation and a random prime, as described, can be done easily, quickly, and correctly”. This may be connected to the fact that the description of Bach’s algorithm provided therein is incomplete.)

[Disclaimer about potential self-plagiarism: this post or an edited version will eventually appear in my Books Review section in CHANCE. As appropriate for a book about Chance!]

## JSM 2015 [day #3]

Posted in Books, Statistics, University life with tags , , , , , , , , , , on August 12, 2015 by xi'an

My first morning session was about inference for philogenies. While I was expecting some developments around Kingman’s  coalescent models my coauthors needed and developped ABC for, I was surprised to see models that were producing closed form (or close enough to) likelihoods. Due to strong restrictions on the population sizes and migration possibilities, as explained later to me by Vladimir Minin. No need for ABC there since MCMC was working on the species trees, with Vladimir Minin making use of [the Savage Award winner] Vinayak Rao’s approach on trees that differ from the coalescent. And enough structure to even consider and demonstrate tree identifiability in Laura Kubatko’s case.

I then stopped by the astrostatistics session as the first talk by Gwendolin Eddie was about galaxy mass estimation, a problem I may actually be working on in the Fall, but it ended up being a completely different problem and I was further surprised that the issue of whether or not the data was missing at random was not considered by the authors.

Christening a session Unifying foundation(s) may be calling for trouble, at least from me! In this spirit, Xiao Li Meng gave a talk attempting at a sort of unification of the frequentist, Bayesian, and fiducial paradigms by introducing the notion of personalized inference, which is a notion I had vaguely thought of in the past. How much or how far do you condition upon? However, I have never thought of this justifying fiducial inference in any way and Xiao Li’s lively arguments during and after the session not enough to convince me of the opposite: Prior-free does not translate into (arbitrary) choice-free. In the earlier talk about confidence distributions by Regina Liu and Minge Xie, that I partly missed for Galactic reasons, I just entered into the room at the very time when ABC was briefly described as a confidence distribution because it was not producing a convergent approximation to the exact posterior, a logic that escapes me (unless those confidence distributions are described in such a loose way as to include about any method f inference). Dongchu Sun also gave us a crash course on reference priors, with a notion of random posteriors I had not heard of before… As well as constructive posteriors… (They seemed to mean constructible matching priors as far as I understood.)

The final talk in this session by Chuanhei Liu on a new approach (modestly!) called inferential model was incomprehensible, with the speaker repeatedly stating that the principles were too hard to explain in five minutes and needed an incoming book… I later took a brief look at an associated paper, which relates to fiducial inference and to Dempster’s belief functions. For me, it has the same Münchhausen feeling of creating a probability out of nothing, creating a distribution on the parameter by ignoring the fact that the fiducial equation x=a(θ,u) modifies the distribution of u once x is observed.

## posterior predictive checks for admixture models

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , on July 8, 2014 by xi'an

In a posting coincidence, just a few days after we arXived our paper on ABC model choice with random forests, where we use posterior predictive errors for assessing the variability of the random forest procedure, David Mimno, David Blei, and Barbara Engelhardt arXived a paper on posterior predictive checks to quantify lack-of-fit in admixture models of latent population structure, which deals with similar data and models, while also using the posterior predictive as a central tool. (Marginalia: the paper is a wee bit difficult to read [esp. with France-Germany playing in the airport bar!] as the modelling is only clearly described at the very end. I suspect this arXived version was put together out of a submission to a journal like Nature or PNAS, with mentions of a Methods section that does not appear here and of Supplementary Material that turned into subsections of the Discussion section.)

The dataset are genomic datasets made of SNPs (single nucleotide polymorphisms). For instance, the first (HapMap) dataset corresponds to 1,043 individuals and 468,167 SNPs. The model is simpler than Kingman’s coalescent, hence its likelihood does not require ABC steps to run inference. The admixture model in the paper is essentially a mixture model over ancestry indices with individual dependent weights with Bernoulli observations, hence resulting into a completed likelihood of the form

$\prod_{i=1}^n\prod_{\ell=1}^L\prod_j \phi_{\ell,z_{i,\ell,j}}^{x_{i,\ell,j}}(1-\phi_{\ell,z_{i,\ell,j}})^{1-x_{i,\ell,j}}\theta_{i,z_{i,\ell,j}}$

(which looks more formidable than it truly is!). Regular Bayesian inference is thus possible in this setting, implementing e.g. Gibbs sampling. The authors chose instead to rely on EM and thus derived the maximum likelihood estimators of the (many) parameters of the admixture. And of the latent variables z. Their posterior predictive check is based on the simulation of pseudo-observations (as in ABC!) from the above likelihood, with parameters and latent variables replaced with their EM estimates (unlike ABC). There is obviously some computational reason in doing this instead of simulating from the posterior, albeit implicit in the paper. I am however slightly puzzled by the conditioning on the latent variable estimate , as its simulation is straightforward and as a latent variable is more a missing observation than a parameter. Given those 30 to 100 replications of the data, an empirical distribution of a discrepancy function is used to assess whether or not the equivalent discrepancy for the observation is an outlier. If so, the model is not appropriate for the data. (Interestingly, the discrepancy is measured via the Bayes factor of z-scores.)

The connection with our own work is that the construction of discrepancy measures proposed in this paper could be added to our already large collection of summary statistics to check to potential impact in model comparison, i.e. for a significant contribution to the random forest nodes.  Conversely, the most significant summary statistics could then be tested as discrepancy measures. Or, more in tune with our Series B paper on the proper selection of summary variables, the distribution of those discrepancy measures could be compared across potential models. Assuming this does not take too much computing power…

## improved approximate-Bayesian model-choice method for estimating shared evolutionary history [reply from the author]

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , on June 3, 2014 by xi'an

[Here is a very kind and detailed reply from Jamie Oakes to the comments I made on his ABC paper a few days ago:]

First of all, many thanks for your thorough review of my pre-print! It is very helpful and much appreciated. I just wanted to comment on a few things you address in your post.

I am a little confused about how my replacement of continuous uniform probability distributions with gamma distributions for priors on several parameters introduces a potentially crippling number of hyperparameters. Both uniform and gamma distributions have two parameters. So, the new model only has one additional hyperparameter compared to the original msBayes model: the concentration parameter on the Dirichlet process prior on divergence models. Also, the new model offers a uniform prior over divergence models (though I don’t recommend it).

Your comment about there being no new ABC technique is 100% correct. The model is new, the ABC numerical machinery is not. Also, your intuition is correct, I do not use the divergence times to calculate summary statistics. I mention the divergence times in the description of the ABC algorithm with the hope of making it clear that the times are scaled (see Equation (12)) prior to the simulation of the data (from which the summary statistics are calculated). This scaling is simply to go from units proportional to time, to units that are proportional to the expected number of mutations. Clearly, my attempt at clarity only created unnecessary opacity. I’ll have to make some edits.

Regarding the reshuffling of the summary statistics calculated from different alignments of sequences, the statistics are not exchangeable. So, reshuffling them in a manner that is not conistent across all simulations and the observed data is not mathematically valid. Also, if elements are exchangeable, their order will not affect the likelihood (or the posterior, barring sampling error). Thus, if our goal is to approximate the likelihood, I would hope the reshuffling would also have little affect on the approximate posterior (otherwise my approximation is not so good?).

You are correct that my use of “bias” was not well defined in reference to the identity line of my plots of the estimated vs true probability of the one-divergence model. I think we can agree that, ideally (all assumptions are met), the estimated posterior probability of a model should estimate the probability that the model is correct. For large numbers of simulation
replicates, the proportion of the replicates for which the one-divergence model is true will approximate the probability that the one-divergence model is correct. Thus, if the method has the desirable (albeit “frequentist”) behavior such that the estimated posterior probability of the one-divergence model is an unbiased estimate of the probability that the one-divergence model is correct, the points should fall near the identity line. For example, let us say the method estimates a posterior probability of 0.90 for the one-divergence model for 1000 simulated datasets. If the method is accurately estimating the probability that the one-divergence model is the correct model, then the one-divergence model should be the true model for approximately 900 of the 1000 datasets. Any trend away from the identity line indicates the method is biased in the (frequentist) sense that it is not correctly estimating the probability that the one-divergence model is the correct model. I agree this measure of “bias” is frequentist in nature. However, it seems like a worthwhile goal for Bayesian model-choice methods to have good frequentist properties. If a method strongly deviates from the identity line, it is much more difficult to interpret the posterior probabilites that it estimates. Going back to my example of the posterior probability of 0.90 for 1000 replicates, I would be alarmed if the model was true in only 100 of the replicates.

My apologies if my citation of your PNAS paper seemed misleading. The citation was intended to be limited to the context of ABC methods that use summary statistics that are insufficient across the models under comparison (like msBayes and the method I present in the paper). I will definitely expand on this sentence to make this clearer in revisions. Thanks!

Lastly, my concluding remarks in the paper about full-likelihood methods in this domain are not as lofty as you might think. The likelihood function of the msBayes model is tractable, and, in fact, has already been derived and implemented via reversible-jump MCMC (albeit, not readily available yet). Also, there are plenty of examples of rich, Kingman-coalescent models implemented in full-likelihood Bayesian frameworks. Too many to list, but a lot of them are implemented in the BEAST software package. One noteworthy example is the work of Bryant et al. (2012, Molecular Biology and Evolution, 29(8), 1917–32) that analytically integrates over all gene trees for biallelic markers under the coalescent.

## improved approximate-Bayesian model-choice method for estimating shared evolutionary history

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on May 14, 2014 by xi'an

“An appealing approach would be a comparative, Bayesian model-choice method for inferring the probability of competing divergence histories while integrating over uncertainty in mutational and ancestral processes via models of nucleotide substitution and lineage coalescence.” (p.2)

Jamies Oaks arXived (a few months ago now) a rather extensive Monte-Carlo study on the impact of prior modelling on the model-choice performances of ABC model choice. (Of which I only became aware recently.) As in the earlier paper I commented on the Óg, the issue here has much more to do with prior assessment and calibration than with ABC implementation per se. For instance, the above quote recaps the whole point of conducting Bayesian model choice. (As missed by Templeton.)

“This causes divergence models with more divergence-time parameters to integrate over a much greater parameter space with low likelihood yet high prior density, resulting in small marginal likelihoods relative to models with fewer divergence-time parameters.” (p.2)

This second quote is essentially stressing the point with Occam’s razor argument. Which I deem [to be] a rather positive feature of Bayesian model choice. A reflection on the determination of the prior distribution, getting away from uniform priors, thus sounds most timely! The current paper takes place within a rather extensive exchange between Oak’s group and Hickerson’s group on what makes Bayesian model choice (and the associated software msBayes) pick or not the correct model. Oak and coauthors objected to the use of “narrow, empirically informed uniform priors”, arguing that this leads to a bias towards models with less parameters, a “statistical issue” in their words, while Hickerson et al. (2014) think this is due to msBayes way of selecting models and their parameters at random. However it refrains from reproducing earlier criticisms of or replies to Hickerson et al.

The current paper claims to have reached a satisfactory prior modelling with ¨improved robustness, accuracy, and power” (p.3).  If I understand correctly, the changes are in replacing a uniform distribution with a Gamma or a Dirichlet prior. Which means introducing a seriously large and potentially crippling number of hyperparameters into the picture. Having a lot of flexibility in the prior also means a lot of variability in the resulting inference… In other words, with more flexibility comes more responsibility, to paraphrase Voltaire.

“I have introduced a new approximate-Bayesian model choice method.” (p.21)

The ABC part is rather standard, except for the strange feature that the divergence times are used to construct summary statistics (p.10). Strange because these times are not observed for the actual data. So I must be missing something. (And I object to the above quote and to the title of the paper since there is no new ABC technique there, simply a different form of prior.)

“ABC methods in general are known to be biased for model choice.” (p.21)

I do not understand much the part about (reshuffling) introducing bias as detailed on p.11: every approximate method gives a “biased” answer in the sense this answer is not the true and proper posterior distribution. Using a different (re-ordered) vector of statistics provides a different ABC outcome,  hence a different approximate posterior, for which it seems truly impossible to check whether or not it increases the discrepancy from the true posterior, compared with the other version. I must admit I always find annoying to see the word bias used in a vague meaning and esp. within a Bayesian setting. All Bayesian methods are biased. End of the story. Quoting our PNAS paper as concluding that ABC model choice is biased is equally misleading: the intended warning represented by the paper was that Bayes factors and posterior probabilities could be quite unrelated with those based on the whole dataset. That the proper choice of summary statistics leads to a consistent model choice shows ABC model choice is not necessarily “biased”… Furthermore, I also fail to understand why the posterior probability of model i should be distributed as a uniform (“If the method is unbiased, the points should fall near the identity line”) when the data is from model i: this is not a p-value but a posterior probability and the posterior probability is not the frequentist coverage…

My overall problem is that, all in all, this is a single if elaborate Monte Carlo study and, as such, it does not carry enough weight to validate an approach that remains highly subjective in the selection of its hyperparameters. Without raising any doubt about an hypothetical “fixing” of those hyperparameters, I think this remains a controlled experiment with simulated data where the true parameters are know and the prior is “true”. This obviously helps in getting better performances.

“With improving numerical methods (…), advances in Monte Carlo techniques and increasing efficiency of likelihood calculations, analyzing rich comparative phylo-geographical models in a full-likelihood Bayesian framework is becoming computationally feasible.” (p.21)

This conclusion of the paper sounds over-optimistic and rather premature. I do not know of any significant advance in computing the observed likelihood for the population genetics models ABC is currently handling. (The SMC algorithm of Bouchard-Côté, Sankaraman and Jordan, 2012, does not apply to Kingman’s coalescent, as far as I can tell.) This is certainly a goal worth pursuing and borrowing strength from multiple techniques cannot hurt, but it remains so far a lofty goal, still beyond our reach… I thus think the major message of the paper is to reinforce our own and earlier calls for caution when interpreting the output of an ABC model choice (p.20), or even of a regular Bayesian analysis, agreeing that we should aim at seeing “a large amount of posterior uncertainty” rather than posterior probability values close to 0 and 1.