## Approximate Bayesian model choice

Posted in Books, R, Statistics, Travel, University life on March 17, 2014 by xi'an

The above is the running head of the arXived paper with full title “Implications of uniformly distributed, empirically informed priors for phylogeographical model selection: A reply to Hickerson et al.” by Oaks, Linkem and Sukumaran, which I (again) read on the plane to Montréal (the third one in this series, and the last, because I also watched the Japanese psycho-thriller Midsummer’s Equation, featuring a physicist turned detective in one of many TV episodes. I just found some common features with The Devotion of Suspect X, only to discover now that the book has been turned into another episode in the series.)

“Here we demonstrate that the approach of Hickerson et al. (2014) is dangerous in the sense that the empirically-derived priors often exclude from consideration the true values of the models’ parameters. On a more fundamental level, we question the value of adopting an empirical Bayesian stance for this model-choice problem, because it can mislead model posterior probabilities, which are inherently measures of belief in the models after prior knowledge is updated by the data.”

This paper is actually a reply to Hickerson et al. (2014, Evolution), which is itself a reply to an earlier paper by Oaks et al. (2013, Evolution). [Warning: I did not check those earlier references!] The authors object to the use of “narrow, empirically informed uniform priors”, for the reason reproduced in the above quote, in connection with the msBayes of Huang et al. (2011, BMC Bioinformatics). The discussion is less about ABC used for model choice and posterior probabilities of models and more about the impact of vague priors, Oaks et al. (2013) arguing that this leads to a bias towards models with fewer parameters, a “statistical issue” in their words, while Hickerson et al. (2014) think this is due to the way msBayes selects models and their parameters at random.

“…it is difficult to choose a uniformly distributed prior on divergence times that is broad enough to confidently contain the true values of parameters while being narrow enough to avoid spurious support of models with less parameter space.”

So quite an interesting debate that takes us in fine far away from the usual worries about ABC model choice! We are more at the level of empirical versus natural Bayes, as seen in the literature of the 80’s. (The meaning of empirical Bayes is not that clear in the early pages, as the authors seem to include any method using the data “twice”.) I actually do not remember reading papers about the formal properties of model choice done through classical empirical Bayes techniques, except for the special case of Aitkin’s (1991, 2009) integrated likelihood, which is essentially the analysis performed on the coin toy example (p.7).

“…models with more divergence parameters will be forced to integrate over much greater parameter space, all with equal prior density, and much of it with low likelihood.”

The above argument is an interesting rephrasing of Lindley’s paradox, which I cannot dispute, but of course it does not solve the fundamental issue of how to choose the prior away from vague uniform priors… I also like the quote “the estimated posterior probability of a model is a single value (rather than a distribution) lacking a measure of posterior uncertainty” as this is an issue on which we are currently working. I fully agree with the statement and we think an alternative assessment to posterior probabilities could be more appropriate for model selection in ABC settings (paper soon to come, hopefully!).
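To make the dilution effect behind this rephrasing of Lindley’s paradox concrete, here is a minimal Python sketch. It is not taken from Oaks et al. or from msBayes: it assumes a toy normal-mean problem (sample mean `xbar` of `n` observations with known standard deviation `sigma`), a point null θ = 0, and a uniform prior on (-`half_width`, `half_width`) under the alternative. All function names are made up for the illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, sd):
    """Distribution function of N(mean, sd^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def bayes_factor_01(xbar, n, half_width, sigma=1.0):
    """Bayes factor of H0: theta = 0 against H1: theta ~ U(-half_width, half_width).

    Toy assumption: xbar ~ N(theta, sigma^2 / n).
    """
    s = sigma / math.sqrt(n)
    # Likelihood of the sample mean under the point null.
    m0 = normal_pdf(xbar, 0.0, s)
    # Likelihood averaged over the uniform prior (closed form via the normal cdf).
    mass = normal_cdf(half_width, xbar, s) - normal_cdf(-half_width, xbar, s)
    m1 = mass / (2.0 * half_width)
    return m0 / m1


def posterior_prob_null(xbar, n, half_width, prior_null=0.5, sigma=1.0):
    """Posterior probability of H0 for a given prior probability of H0."""
    odds = bayes_factor_01(xbar, n, half_width, sigma) * prior_null / (1.0 - prior_null)
    return odds / (1.0 + odds)


if __name__ == "__main__":
    n, xbar = 100, 0.25  # z statistic of 2.5 against theta = 0
    for half_width in (1.0, 10.0, 100.0, 1000.0):
        print(half_width,
              bayes_factor_01(xbar, n, half_width),
              posterior_prob_null(xbar, n, half_width))
```

With the data held fixed, each tenfold widening of the uniform prior multiplies the Bayes factor in favour of the null by about ten in this toy setting, so that a sample mean 2.5 standard errors away from zero ends up supporting the simpler model once the prior is vague enough.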

## on alternative perspectives and solutions on Bayesian tests

Posted in Statistics, Travel, University life on December 16, 2013 by xi'an

Here are the slides of my tutorial at O’Bayes 2013 today, a pot-pourri of various, recent and less recent, criticisms (with, albeit less than usual, a certain proportion of recycled slides).

## “an outstanding paper that covers the Jeffreys-Lindley paradox”…

Posted in Statistics, University life on December 4, 2013 by xi'an

“This is, in this revised version, an outstanding paper that covers the Jeffreys-Lindley paradox (JLP) in exceptional depth and that unravels the philosophical differences between different schools of inference with the help of the JLP. From the analysis of this paradox, the author convincingly elaborates the principles of Bayesian and severity-based inferences, and engages in a thorough review of the latter’s account of the JLP in Spanos (2013).” Anonymous

I have now received a second round of reviews of my paper, “On the Jeffreys-Lindley’s paradox” (submitted to Philosophy of Science), and the reports are quite positive (or even extremely positive, as in the above quote!). The requests for changes are aimed at clarifying points, improving the background coverage, and simplifying my heavy style (e.g., cutting Proustian sentences). These requests were easily addressed (hopefully to the satisfaction of the reviewers) and, thanks to the week in Warwick, I have already sent the paper back to the journal, with high hopes for acceptance. The new version has also been arXived. I must add that some parts of the reviews sounded much better than my original prose and I was almost tempted to include them in the final version. Take for instance:

“As a result, the reader obtains not only a better insight into what is at stake in the JLP, going beyond the results of Spanos (2013) and Sprenger (2013), but also a much better understanding of the epistemic function and mechanics of statistical tests. This is a major achievement given the philosophical controversies that have haunted the topic for decades. Recent insights from Bayesian statistics are integrated into the article and make sure that it is mathematically up to date, but the technical and foundational aspects of the paper are well-balanced.” Anonymous

## on the Jeffreys-Lindley’s paradox (revision)

Posted in Statistics, University life on September 17, 2013 by xi'an

As mentioned here a few days ago, I have been revising my paper on the Jeffreys-Lindley’s paradox for Philosophy of Science. It came as a bit of a (very pleasant) surprise that this journal was ready to consider a revised version of the paper, given that I have no formal training in philosophy and that the (first version of the) paper was rather hurriedly made of a short text written for the 95th birthday of Dennis Lindley and of my blog post on Aris Spanos’ “Who should be afraid of the Jeffreys-Lindley paradox?”, recently published in Philosophy of Science. So I found both reviewers very supportive and I am grateful for their suggestions to improve both the scope and the presentation of the paper. It has been resubmitted and rearXived, and I am now waiting for the decision of the editorial team with the appropriate philosophical sense of detachment…

Posted in Books, Statistics, University life on September 13, 2013 by xi'an

“In the asymptotic limit, the Bayesian cannot justify the strictly positive probability of H0 as an approximation to testing the hypothesis that the parameter value is close to θ0 — which is the hypothesis of real scientific interest”

While revising my Jeffreys-Lindley’s paradox paper for Philosophy of Science, it was suggested (to me) that I read the incoming paper by Jan Sprenger on this paradox. The paper is entitled Testing a Precise Null Hypothesis: The Case of Lindley’s Paradox and it defends the thesis that the regular Bayesian approach (hence the Bayes factor used in the Jeffreys-Lindley’s paradox) is forced to put a prior on the (point) null hypothesis when all that really matters is the vicinity of the null. (I think Andrew would agree there as he positively hates point null hypotheses. See also Rissanen’s perspective about the maximal precision allowed by a given sample.) Sprenger then advocates the use of the log score for comparing the full model with the point-null sub-model, i.e. the posterior expectation of the Kullback-Leibler distance between both models:

$\mathbb{E}^\pi\left[\mathbb{E}_\theta\{\log f(X|\theta)/ f(X|\theta_0)\}|x\right],$

rejoining  José Bernardo and Phil Dawid on this ground.
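As a worked illustration, in a toy setting that is an assumption made here and is not taken from Sprenger’s paper, take $X=(X_1,\ldots,X_n)$ to be an i.i.d. $\mathcal{N}(\theta,\sigma^2)$ sample with known σ and π to be the flat prior on θ. The inner expectation is then the Kullback-Leibler divergence

$\mathbb{E}_\theta\{\log f(X|\theta)/ f(X|\theta_0)\} = n(\theta-\theta_0)^2/2\sigma^2,$

and, since the posterior is $\theta|x\sim\mathcal{N}(\bar{x},\sigma^2/n)$, which implies $\mathbb{E}^\pi[(\theta-\theta_0)^2|x]=(\bar{x}-\theta_0)^2+\sigma^2/n$, the score reduces to

$\mathbb{E}^\pi\left[\mathbb{E}_\theta\{\log f(X|\theta)/ f(X|\theta_0)\}|x\right] = \frac{1}{2}\left\{1+z^2\right\},\qquad z=\sqrt{n}(\bar{x}-\theta_0)/\sigma,$

which depends on the data only through the usual z statistic and remains well-defined under this improper prior.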

While I agree with the notion that it is impossible to distinguish a small enough departure from the null from the null (no typo!), and I also support the argument that “all models are wrong”, hence that a point null should eventually (meaning with enough data) be rejected, I find the Bayesian solution through the Bayes factor rather appealing because it uses the prior distribution to weight the alternative values of θ in order to oppose their averaged likelihood to the likelihood in θ0. (Note I did not mention Occam!) Further, while the notion of opposing a point null to the rest of the Universe may sound silly, what truly matters is the decisional setting, namely that we want to select a simpler model and use it for later purposes. It is therefore this issue that should be tested, rather than whether or not θ is exactly equal to θ0. I incidentally find it amusing that Sprenger picks the ESP experiment as his illustration in that this is a (the?) clear-cut case where the point null, “there is no such thing as ESP”, makes sense. Now, it can be argued that what the statistical experiment is assessing is the ESP experiment, for which many objective causes (beyond ESP!) may induce a departure from the null (and from the binomial model). But then this prevents any rational analysis of the test (as is indeed the case!).

The paper thus objects to the use of Bayes factors (and of p-values) and proposes instead to compare scores in the Bernardo-Dawid spirit. As discussed earlier, this approach has several appealing features, from recovering the Kullback-Leibler divergence between models as a measure of fit, to allowing for the incorporation of improper priors (a point Andrew may disagree with), to avoiding the double use of the data. It is however incomplete in that it creates a discrepancy or an imbalance between both models, thus making the comparison of more than two models difficult to fathom, and it does not readily incorporate the notion of nuisance parameters in the embedded model, seemingly forcing the inclusion of pseudo-priors as in the Bayesian analysis of Aitkin’s integrated likelihood.

## my first week at work

Posted in Running, Travel, University life on September 9, 2013 by xi'an

After attending the first day of the RSS annual conference in Newcastle, I took the train to Coventry to join the Department of Statistics at the University of Warwick (this may sound confusing, but the University of Warwick is located in Coventry, not in Warwick, 8 miles south, and is not to be confused with Coventry University, which is a former polytechnic; it is located in Warwickshire, though, which is why it took this name), where I now have a part-time professor position. I will thus be at the department a week at a time, roughly every five weeks, during the teaching terms, and I obviously look forward to the huge opportunities to interact with faculty and students there. The “first week at work” was quite smooth, which is not entirely surprising given my numerous previous visits to Warwick, with hardly any bureaucratic step in settling in. This gave me the opportunity to start the revision of the Jeffreys-Lindley’s paradox paper for Philosophy of Science and to reconnoitre longer running routes…

Posted in pictures, R, Running, Statistics, Travel, University life on March 22, 2013 by xi'an

Needless to say, it is with great pleasure that I am back in beautiful Padova for the workshop Recent Advances in statistical inference: theory and case studies, organised by Laura Ventura and Walter Racugno, especially when considering that this is one of the last places where I met with George Casella, in June 2010, and that we have plenty of opportunities to remember him with so many of his friends here. (Tomorrow we will run around Prato della Valle in his memory.)

The workshop is of a “traditional Bayesian facture”, I mean one I enjoy very much: long talks with predetermined discussants and discussion from the floor. This makes for fewer talks (although we had eight today!) but also for more exciting sessions if the talks are broad and innovative. This was the case today (not including my talk of course) and I enjoyed the sessions a lot.

Jim Berger gave the first talk on “global” objective priors, starting from the desiderata to build a “general” reference prior when one does not want to separate parameters of interest from nuisance parameters and when one already has marginal reference priors on those parameters. This setting was actually addressed in Berger and Sun (AoS, 2008) and Jim presented some of the solutions therein: while I could not really see a strong incentive in using an arithmetic average of those, because it does not make much sense with improper priors, I definitely liked the notion of geometric averages, which evacuate the problem of the normalising constants. (There are open questions as well, about whether one improper prior could dwarf another one in the geometric average. Tail-wise for instance. Gauri Datta mentioned in his discussion that the geometric average is a specific Kullback-Leibler optimum.)
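The point about normalising constants can be spelled out in a few lines. The notation below is an assumption made for the sake of illustration (k reference priors $\pi_1,\ldots,\pi_k$ on a common parameter θ) and is not claimed to match the one in Berger and Sun. The two averages are

$\pi_A(\theta)=\frac{1}{k}\sum_{i=1}^k \pi_i(\theta)\qquad\text{and}\qquad \pi_G(\theta)\propto\prod_{i=1}^k \pi_i(\theta)^{1/k}.$

Since an improper prior is only defined up to a multiplicative constant, each $\pi_i$ can be replaced with $c_i\pi_i$ for an arbitrary $c_i>0$, which turns the two averages into

$\frac{1}{k}\sum_{i=1}^k c_i\pi_i(\theta)\qquad\text{and}\qquad \Big(\prod_{i=1}^k c_i^{1/k}\Big)\prod_{i=1}^k \pi_i(\theta)^{1/k}.$

The arithmetic average thus changes with the arbitrary relative weights $c_i$, while the geometric average is only multiplied by a constant and hence defines the same improper prior. For instance, $\pi_1(\sigma)=1/\sigma$ and $\pi_2(\sigma)=1/\sigma^2$ give $\pi_G(\sigma)\propto\sigma^{-3/2}$ whatever the $c_i$’s.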

In his discussion of Tom Severini’s paper on integrated likelihood (which really stands at the margin of Bayesian inference), Brunero Liseo proposed a new use of ABC to approximate the likelihood function (while regular ABC relies on an approximation of the likelihood), a bit à la Chib. I cannot tell about the precision of this approximation but this is rather exciting!

Laura Ventura presented four of her current papers on the use of high order asymptotics in approximating (Bayesian) posteriors, following the JASA 2012 paper by Ventura, Cabras and Racugno. (The same issue featured a paper by Gill and Casella, coincidentally.) She showed the improvement brought by moving from first order (normal) to third order (non-normal). This is in a sense at the antipode of ABC; for instance, I’d like to see the requirements on the likelihood functions to be able to come up with a manageable Laplace approximation. She also mentioned a resolution of the Jeffreys-Lindley paradox via the Pereira et al. (2008) evidence, which computes a sort of Bayesian p-value by assessing the posterior probability of the posterior density being lower than its value at the null. I had missed or forgotten about this idea, but I wonder about some caveats like the impact of parameterisation, the connection with the testing problem, the calibration of the quantity, the extension to non-nested models, etc. (Note that Ventura et al. developed an R package called hoa, for higher-order asymptotics.)
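The evidence measure described above can be sketched in Python as follows. This is only a reading of the one-sentence description in this post (raw posterior density, no reference density), not the definition or the code of Pereira et al. (2008). The posterior is assumed to be N(`mean`, `sd`²) for the illustration, all function names are hypothetical, and the second computation redoes the exercise for η = exp(θ) to illustrate the parameterisation caveat.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def lognormal_pdf(y, mu, sd):
    """Density of eta = exp(theta) when theta ~ N(mu, sd^2)."""
    if y <= 0.0:
        return 0.0
    return normal_pdf(math.log(y), mu, sd) / y


def evidence_on_grid(density, null_value, lower, upper, steps=200_000):
    """Midpoint-rule approximation of P(density(theta) <= density(null_value))."""
    threshold = density(null_value)
    width = (upper - lower) / steps
    low_mass = 0.0
    total_mass = 0.0
    for i in range(steps):
        point = lower + (i + 0.5) * width
        value = density(point)
        total_mass += value
        if value <= threshold:
            low_mass += value
    # Dividing by the total mass corrects for the truncation to [lower, upper].
    return low_mass / total_mass


def evidence_normal(null_value, mean, sd):
    """Closed form for a normal posterior: 2 * (1 - Phi(|null - mean| / sd))."""
    z = abs(null_value - mean) / sd
    return math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    mean, sd = 1.0, 0.5  # assumed posterior of theta, null value theta0 = 0
    ev_theta = evidence_on_grid(lambda t: normal_pdf(t, mean, sd),
                                0.0, mean - 10 * sd, mean + 10 * sd)
    # Same posterior and same null, expressed in terms of eta = exp(theta).
    ev_eta = evidence_on_grid(lambda y: lognormal_pdf(y, mean, sd),
                              1.0, 0.0, math.exp(mean + 8 * sd))
    print(ev_theta, evidence_normal(0.0, mean, sd), ev_eta)
```

In this toy case the two parameterisations return different values for the same posterior and the same null, since the Jacobian of the transform modifies the density being compared with its value at the null.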

David Dunson presented some very recent work on compressed sensing that boiled down for me to the idea of massively projecting (huge vectors of) regressors into convex combinations of much smaller dimension, using random matrices for the projections. This point was somewhat unclear to me, and to the first discussant Michael Wiper as well, who stressed that a completely random selection of those matrices could produce “mostly rubbish”, unless a learning mechanism was instated. The second discussant, Peter Müller, made the same point about this completely random search in a space of huge dimension, while suggesting that considering the survival frequency of covariates could help towards the efficiency of the method.
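For readers who want to picture the projection step, here is a deliberately naive Python sketch of compressing a long regressor vector into a handful of random convex combinations. It is an assumption-laden illustration of the sentence above and not Dunson’s method: the way the random weights are drawn (normalised exponential variates) and both function names are invented for the example.

```python
import random


def random_convex_weights(n_rows, n_cols, rng):
    """Draw an n_rows x n_cols matrix whose rows are non-negative and sum to one."""
    matrix = []
    for _ in range(n_rows):
        raw = [rng.expovariate(1.0) for _ in range(n_cols)]
        total = sum(raw)
        matrix.append([value / total for value in raw])
    return matrix


def compress(regressors, weights):
    """Map a length-p regressor vector to len(weights) convex combinations."""
    return [sum(w * x for w, x in zip(row, regressors)) for row in weights]


if __name__ == "__main__":
    rng = random.Random(0)
    p, m = 1000, 5  # original and compressed dimensions (arbitrary choices)
    weights = random_convex_weights(m, p, rng)
    x = [rng.gauss(0.0, 1.0) for _ in range(p)]
    print(compress(x, weights))
```

Any subsequent regression would then be run on the m compressed regressors instead of the p original ones, with the matrix drawn here with no regard to the response, which is the completely random selection both discussants were worried about.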