**J**ust received an email today that our discussion with Judith of Chris Holmes and James Watson’s paper was now published as *Statistical Science 2016, Vol. 31, No. 4, 506-510*… While it is almost identical to the arXiv version, it can be read on-line.

## Archive for Statistical Science

## nonparametric Bayesian clay for robust decision bricks

Posted in Statistics with tags Bayesian robustness, Chris Holmes, Conan Doyle, discussion paper, James Watson, Judith Rousseau, Statistical Science on January 30, 2017 by xi'an## ISBA 2016 [#3]

Posted in pictures, Running, Statistics, Travel, University life, Wines with tags ABC, approximate likelihood, Calasetta, ISBA 2016, j-ISBA, loss function, restricted inference, San' Antioco, Sardinia, Statistical Science, summary statistics on June 16, 2016 by xi'anAmong the sessions I attended yesterday, I really liked the one on robustness and model mispecification. Especially the talk by Steve McEachern on Bayesian inference based on insufficient statistics, with a striking graph of the degradation of the Bayes factor as the prior variance increases. I sadly had no time to grab a picture of the graph, which compared this poor performance against a stable rendering when using a proper summary statistic. It clearly relates to our work on ABC model choice, as well as to my worries about the Bayes factor, so this explains why I am quite excited about this notion of restricted inference. In this session, Chris Holmes also summarised his two recent papers on loss-based inference, which I discussed here in a few posts, including the Statistical Science discussion Judith and I wrote recently. I also went to the j-ISBA [section] session which was sadly under-attended, maybe due to too many parallel sessions, maybe due to the lack of unifying statistical theme.

## comments on Watson and Holmes

Posted in Books, pictures, Statistics, Travel with tags Bayesian Analysis, Bayesian robustness, Conan Doyle, decision theory, Dirichlet process, Kullback-Leibler divergence, Saint Giles cemetery, Sherlock Holmes, Statistical Science, University of Oxford on April 1, 2016 by xi'an

“The world is full of obvious things which nobody by any chance ever observes.” The Hound of the Baskervilles

**I**n connection with the incoming publication of James Watson’s and Chris Holmes’ Approximating models and robust decisions in Statistical Science, Judith Rousseau and I wrote a discussion on the paper that has been arXived yesterday.

“Overall, we consider that the calibration of the Kullback-Leibler divergence remains an open problem.” (p.18)

While the paper connects with earlier ones by Chris and coauthors, and possibly *despite* the overall critical tone of the comments!, I really appreciate the renewed interest in robustness advocated in this paper. I was going to write *Bayesian robustness* but to differ from the perspective adopted in the 90’s where robustness was mostly about the prior, I would say this is rather a Bayesian approach to model robustness from a decisional perspective. With definitive innovations like considering the impact of posterior uncertainty over the decision space, uncertainty being defined e.g. in terms of Kullback-Leibler neighbourhoods. Or with a Dirichlet process distribution on the posterior. This may step out of the standard Bayesian approach but it remains of definite interest! (And note that this discussion of ours [reluctantly!] refrained from capitalising on the names of the authors to build easy puns linked with the most Bayesian of all detectives!)

## Harold Jeffreys’ default Bayes factor [for psychologists]

Posted in Books, Statistics, University life with tags Bayesian hypothesis testing, Dickey-Savage ratio, Harold Jeffreys, overfitting, Statistical Science, testing, Theory of Probability on January 16, 2015 by xi'an*“One of Jeffr**eys’ goals was to create default Bayes factors by using prior distributions that obeyed a series of general desiderata.”*

**T**he paper *Harold Jeffreys’s default Bayes factor hypothesis tests: explanation, extension, and application in Psychology* by Alexander Ly, Josine Verhagen, and Eric-Jan Wagenmakers is both a survey and a reinterpretation *cum* explanation of Harold Jeffreys‘ views on testing. At about the same time, I received a copy from Alexander and a copy from the journal it had been submitted to! This work starts with a short historical entry on Jeffreys’ work and career, which includes four of his principles, quoted *verbatim* from the paper:

- “scientific progress depends primarily on induction”;
- “in order to formalize induction one requires a logic of partial belief” [enters the Bayesian paradigm];
- “scientific hypotheses can be assigned prior plausibility in accordance with their complexity” [a.k.a., Occam’s razor];
- “classical “Fisherian” p-values are inadequate for the purpose of hypothesis testing”.

“The choice of π(σ) therefore irrelevant for the Bayes factor as long as we use the same weighting function in both models”

A very relevant point made by the authors is that Jeffreys *only* considered embedded or nested hypotheses, a fact that allows for having common parameters between models and hence some form of reference prior. Even though (a) I dislike the notion of “common” parameters and (b) I do not think it is entirely legit (I was going to write proper!) from a mathematical viewpoint to use the same (improper) prior on both sides, as discussed in our Statistical Science paper. And in our most recent alternative proposal. The most delicate issue however is to derive a reference prior on the parameter of interest, which is *fixed* under the null and *unknown* under the alternative. Hence preventing the use of improper priors. Jeffreys tried to calibrate the corresponding prior by imposing asymptotic consistency under the alternative. And exact indeterminacy under “completely uninformative” data. Unfortunately, this is not a well-defined notion. In the normal example, the authors recall and follow the proposal of Jeffreys to use an improper prior π(σ)∝1/σ on the nuisance parameter and argue in his defence the quote above. I find this argument quite weak because suddenly the prior on σ becomes a *weighting function..*. A notion foreign to the Bayesian cosmology. If we use an improper prior for π(σ), the marginal likelihood on the data is no longer a probability density and I do not buy the argument that one should use the *same* measure with the *same* constant both on σ alone [for the nested hypothesis] and on the σ part of (μ,σ) [for the nesting hypothesis]. We are considering two spaces with different dimensions and hence orthogonal measures. This quote thus sounds more like wishful thinking than like a justification. Similarly, the assumption of independence between δ=μ/σ and σ does not make sense for σ-finite measures. Note that the authors later point out that (a) the posterior on σ varies between models despite using the *same* data [which shows that the parameter σ is far from common to both models!] and (b) the [testing] Cauchy prior on δ is only useful for the testing part and should be replaced with another [estimation] prior when the model has been selected. Which may end up as a backfiring argument about this default choice.

“Each updated weighting function should be interpreted as a posterior in estimating σ within their own context, the model.”

The re-derivation of Jeffreys’ conclusion that a Cauchy prior should be used on δ=μ/σ makes it clear that this choice only proceeds from an imperative of fat tails in the prior, without solving the calibration of the Cauchy scale. (Given the now-available modern computing tools, it would be nice to see the impact of this scale γ on the numerical value of the Bayes factor.) And maybe it also proceeds from a “hidden agenda” to achieve a Bayes factor that *solely* depends on the *t* statistic. Although this does not sound like a compelling reason to me, since the *t* statistic is not sufficient in this setting.

In a differently interesting way, the authors mention the Savage-Dickey ratio (p.16) as a way to represent the Bayes factor for nested models, without necessarily perceiving the mathematical difficulty with this ratio that we pointed out a few years ago. For instance, in the psychology example processed in the paper, the test is between δ=0 and δ≥0; however, if I set π(δ=0)=0 under the alternative prior, which should not matter *[from a measure-theoretic perspective where the density is uniquely defined almost everywhere]*, the Savage-Dickey representation of the Bayes factor returns zero, instead of 9.18!

“In general, the fact that different priors result in different Bayes factors should not come as a surprise.”

The second example detailed in the paper is the test for a zero Gaussian correlation. This is a sort of “ideal case” in that the parameter of interest is between -1 and 1, hence makes the choice of a uniform U(-1,1) easy or easier to argue. Furthermore, the setting is also “ideal” in that the Bayes factor simplifies down into a marginal over the sample correlation only, under the usual Jeffreys priors on means and variances. So we have a second case where the frequentist statistic behind the frequentist test[ing procedure] is also the single (and insufficient) part of the data used in the Bayesian test[ing procedure]. Once again, we are in a setting where Bayesian and frequentist answers are in one-to-one correspondence (at least for a fixed sample size). And where the Bayes factor allows for a closed form through hypergeometric functions. Even in the one-sided case. (This is a result obtained by the authors, not by Jeffreys who, as the proper physicist he was, obtained approximations that are remarkably accurate!)

“The fact that the Bayes factor is independent of the intention with which the data have been collected is of considerable practical importance.”

The authors have a side argument in this section in favour of the Bayes factor against the p-value, namely that the “Bayes factor does not depend on the sampling plan” (p.29), but I find this fairly weak (or tongue in cheek) as the Bayes factor *does* depend on the sampling distribution imposed on top of the data. It appears that the argument is mostly used to defend sequential testing.

“The Bayes factor (…) balances the tension between parsimony and goodness of fit, (…) against overfitting the data.”

*In fine*, I liked very much this re-reading of Jeffreys’ approach to testing, maybe the more because I now think we should get away from it! I am not certain it will help in convincing psychologists to adopt Bayes factors for assessing their experiments as it may instead frighten them away. And it does not bring an answer to the vexing issue of the relevance of point null hypotheses. But it constitutes a lucid and innovative of the major advance represented by Jeffreys’ formalisation of Bayesian testing.

## did I mean endemic? [pardon my French!]

Posted in Books, Statistics, University life with tags Air France, Bayesian Analysis, censoring, endemic, Glasgow, guest editors, information theory, Larry Wasserman, Robins-Wasserman paradox, Statistical Science, translation, Ubiquitous Chip on June 26, 2014 by xi'an**D**eborah Mayo wrote a Saturday night special column on our Big Bayes stories issue in *Statistical Science*. She (predictably?) focussed on the critical discussions, esp. David Hand’s most forceful arguments where he essentially considers that, due to our (special issue editors’) selection of successful stories, we biased the debate by providing a “one-sided” story. And that we or the editor of *Statistical Science* should also have included frequentist stories. To which Deborah points out that demonstrating that “only” a frequentist solution is available may be beyond the possible. And still, I could think of partial information and partial inference problems like the “paradox” raised by Jamie Robbins and Larry Wasserman in the past years. (Not the normalising constant paradox but the one about censoring.) Anyway, the goal of this special issue was to provide a range of realistic illustrations where Bayesian analysis was a most reasonable approach, not to raise the Bayesian flag against other perspectives: in an ideal world it would have been more interesting to get discussants produce alternative analyses bypassing the Bayesian modelling but obviously discussants only have a limited amount of time to dedicate to their discussion(s) and the problems were complex enough to deter any attempt in this direction.

**A**s an aside and in explanation of the cryptic title of this post, Deborah wonders at my use of *endemic* in the preface and at the possible mis-translation from the French. I did mean *endemic* (and *endémique*) in a half-joking reference to a disease one cannot completely get rid of. At least in French, the term extends beyond diseases, but presumably *pervasive* would have been less confusing… Or *ubiquitous* (as in Ubiquitous Chip for those with Glaswegian ties!). She also expresses “surprise at the choice of name for the special issue. Incidentally, the “big” refers to the bigness of the problem, not big data. Not sure about “stories”.” Maybe another occurrence of lost in translation… I had indeed no intent of connection with the “big” of “Big Data”, but wanted to convey the notion of a big as in major problem. And of a story explaining why the problem was considered and how the authors reached a satisfactory analysis. The story of the Air France Rio-Paris crash resolution is representative of that intent. (Hence the explanation for the above picture.)