## Archive for Bayesian Analysis

## full Bayesian significance test

Posted in Books, Statistics with tags Bayes factor, Bayesian Analysis, Bayesian model choice, e-values, full Bayesian significance test, logic journal of the IGPL, measure theory, Murray Aitkin, p-values, São Paulo, statistical inference on December 18, 2014 by xi'an

**A**mong the many comments (thanks!) I received when posting our Testing via mixture estimation paper came the suggestion to relate this approach to the notion of full Bayesian significance test (FBST) developed by (Julio, not Hal) Stern and Pereira, from São Paulo, Brazil. I thus had a look at this alternative and read the Bayesian Analysis paper they published in 2008, as well as a paper recently published in the Logic Journal of the IGPL. (I could not find what IGPL stands for.) The central notion in these papers is the *e-value*, which provides the *posterior probability that the posterior density is larger than the largest posterior density over the null set*. This definition bothers me, first because the *null* set has measure zero under an absolutely continuous prior (BA, p.82). Hence the posterior density is only defined in an arbitrary manner over the *null* set, and the maximum over that set is itself arbitrary. (An issue that invalidates my 1993 version of the Lindley-Jeffreys paradox!) And second because it considers the posterior probability of an event that does not exist a priori, being conditional on the data. This sounds in fact quite similar to *Statistical Inference*, Murray Aitkin’s (2009) book using a posterior distribution of the likelihood function. With the same drawback of using the data twice, plus the other issues discussed in our commentary on the book. (As a much-on-the-side remark, the authors incidentally forgot me when citing our 1992 Annals of Statistics paper about decision theory on accuracy estimators…!)
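
In symbols (my notation, not the authors’), writing Θ₀ for the null set and π(·|x) for the posterior density, the quantity described above is

$$
\mathbb{P}\left(\pi(\theta\mid x) > \sup_{\theta_0\in\Theta_0}\pi(\theta_0\mid x)\,\middle|\,x\right),
$$

which makes both objections visible at once: the supremum runs over a set Θ₀ of posterior (and prior) measure zero, and the event whose probability is taken is itself built from the posterior, hence from the data.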

## about the strong likelihood principle

Posted in Books, Statistics, University life with tags ABC, ABC model choice, Alan Birnbaum, Bayesian Analysis, conditioning, sufficient statistics, The Likelihood Principle, weak conditionality principle on November 13, 2014 by xi'an

**D**eborah Mayo arXived a Statistical Science paper a few days ago, along with discussions by Jan Bjørnstad, Phil Dawid, Don Fraser, Michael Evans, Jan Hannig, R. Martin and C. Liu. I am very glad that this discussion paper came out and that it came out in Statistical Science, although I am rather surprised to find no discussion by Jim Berger or Robert Wolpert, and even though I still cannot entirely follow the deductive argument in the rejection of Birnbaum’s proof, just as in the earlier version in Error & Inference. But I somehow do not feel like going again into a new debate about this critique of Birnbaum’s derivation. (Even though statements such as the claim that the SLP “would preclude the use of sampling distributions” (p.227) would call for a rebuttal.)

“It is the imprecision in Birnbaum’s formulation that leads to a faulty impression of exactly what is proved.” M. Evans

Indeed, at this stage, I fear that [for me] a more relevant issue is whether or not the debate does matter… At a logical cum foundational [and maybe cum historical] level, it makes perfect sense to uncover which, if any, of the myriad versions of Birnbaum’s likelihood Principle holds. [Although trying to uncover Birnbaum’s motives and positions over time may not be so relevant.] I think the paper and the discussions acknowledge that *some* version of the weak conditionality Principle does not imply *some* version of the strong likelihood Principle, while other logical implications remain true. At a methodological level, I am much less sure it matters. Each time I taught this notion, I got blank stares and incomprehension from my students, to the point that I have now stopped teaching the likelihood Principle in class altogether. And most of my co-authors do not seem to care very much about it. At a purely mathematical level, I wonder if there even is ground for a debate, since the notions involved can be defined in various imprecise ways, as pointed out by Michael Evans above and in his discussion. At a statistical level, sufficiency eventually is a strange notion in that it seems to make plenty of sense until one realises there is no interesting sufficiency outside exponential families (courtesy of the Pitman-Koopman lemma). Just as there are very few parameter transforms for which unbiased estimators can be found. So I also spend very little time teaching, and even less worrying about, sufficiency. (As it happens, I taught the notion this morning!) At another and presumably more significant statistical level, what matters is information, e.g., conditioning means adding information (i.e., about which experiment has been used). While complex settings may prohibit the use of the entire information provided by the data, at a formal level there is no argument for not using the entire information, i.e. conditioning upon the entire data. (At a computational level, this is no longer true, witness ABC and similar limited-information techniques, as sketched below. By the way, ABC demonstrates, if needed, why sampling distributions matter so much to Bayesian analysis.)
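
(For concreteness, here is a minimal rejection-ABC sketch for the mean of a normal sample — purely illustrative, with a prior, tolerance and summary statistic of my own choosing: every accepted draw requires simulating an entire dataset from the sampling distribution, which is precisely the point.)

```python
import numpy as np

# Minimal rejection ABC for the mean of a N(theta, 1) sample. Each candidate
# theta is kept only if data simulated from the sampling distribution yield
# a summary close to the observed one; prior and tolerance are illustrative.
rng = np.random.default_rng(1)
x_obs = rng.normal(2.0, 1.0, size=50)          # "observed" data
s_obs = x_obs.mean()                           # summary statistic (here sufficient)

eps, accepted = 0.05, []
while len(accepted) < 1_000:
    theta = rng.normal(0.0, 10.0)              # draw from a vague N(0, 10^2) prior
    x_sim = rng.normal(theta, 1.0, size=50)    # simulate from the sampling distribution
    if abs(x_sim.mean() - s_obs) < eps:        # accept if summaries agree within eps
        accepted.append(theta)

print(np.mean(accepted), np.std(accepted))     # ABC posterior mean and spread
```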

“Non-subjective Bayesians who (…) have to live with some violations of the likelihood principle (…) since their prior probability distributions are influenced by the sampling distribution.” D. Mayo (p.229)

In the end, the fact that the prior may depend on the form of the sampling distribution, and hence does violate the likelihood Principle, does not worry me so much. In most models I consider, the parameters are endogenous to those sampling distributions and do not live an ethereal existence independently from the model: they are substantiated and calibrated by the model itself, which makes the discussion about the LP rather vacuous. See, e.g., the coefficients of a linear model. In complex models, or with large datasets, it is even impossible to handle the whole data or the whole model, and proxies have to be used instead, making worries about the structure of the (original) likelihood moot. I think we have now reached a stage of statistical inference where models are no longer accepted as ideal truth and where approximation is the hard reality, imposed by the massive amounts of data relentlessly calling for immediate processing. Hence a stage where the self-validation or invalidation of such approximations in terms of predictive performances is the relevant issue. Provided we can at all face the challenge…

## plenty of new arXivals!

Posted in Statistics, University life with tags arXiv, Bayesian Analysis, Ising, Monte Carlo methods, simulation, Statistics on October 2, 2014 by xi'an

**H**ere are some entries I spotted in the past days as being of potential interest, which I will not have enough time to comment upon:

- arXiv:1410.0163: Instrumental Variables: An Econometrician’s Perspective by Guido Imbens
- arXiv:1410.0123: Deep Tempering by Guillaume Desjardins, Heng Luo, Aaron Courville, Yoshua Bengio
- arXiv:1410.0255: Variance reduction for irreversible Langevin samplers and diffusion on graphs by Luc Rey-Bellet, Konstantinos Spiliopoulos
- arXiv:1409.8502: Combining Particle MCMC with Rao-Blackwellized Monte Carlo Data Association for Parameter Estimation in Multiple Target Tracking by Juho Kokkala, Simo Särkkä
- arXiv:1409.8185: Adaptive Low-Complexity Sequential Inference for Dirichlet Process Mixture Models by Theodoros Tsiligkaridis, Keith W. Forsythe
- arXiv:1409.7986: Hypothesis testing for Markov chain Monte Carlo by Benjamin M. Gyori, Daniel Paulin
- arXiv:1409.7672: Order-invariant prior specification in Bayesian factor analysis by Dennis Leung, Mathias Drton
- arXiv:1409.7458: Beyond Maximum Likelihood: from Theory to Practice by Jiantao Jiao, Kartik Venkat, Yanjun Han, Tsachy Weissman
- arXiv:1409.7419: Identifying the number of clusters in discrete mixture models by Cláudia Silvestre, Margarida G. M. S. Cardoso, Mário A. T. Figueiredo
- arXiv:1409.7287: Identification of jump Markov linear models using particle filters by Andreas Svensson, Thomas B. Schön, Fredrik Lindsten
- arXiv:1409.7074: Variational Pseudolikelihood for Regularized Ising Inference by Charles K. Fisher

## Bayes’ Rule [book review]

Posted in Books, Statistics, University life with tags Amazon, Bayes formula, Bayes rule, Bayes theorem, Bayesian Analysis, England, introductory textbooks, publishing, short course, Thomas Bayes' portrait, tutorial on July 10, 2014 by xi'an

**T**his introduction to Bayesian Analysis, Bayes’ Rule, was written by James Stone from the University of Sheffield, who contacted CHANCE suggesting a review of his book. I thus bought it from Amazon to check the contents. And write a review.

**F**irst, the format of the book. It is a short book of 127 pages, plus 40 pages of glossary, appendices, references and index. I eventually found the name of the publisher, Sebtel Press, but for a while thought the book was self-produced. While the LaTeX output is fine and the (Matlab) graphs readable, pictures are not of the best quality and the display editing is minimal, in that there are several huge white spaces between pages. Nothing major there, obviously; it simply makes the book look like course notes, but this is in no way detrimental to its potential appeal. (I will not comment on the numerous appearances of Bayes’ alleged portrait in the book.)

“… (on average) the adjusted value θ^{MAP} is more accurate than θ^{MLE}.” (p.82)

Bayes’ Rule has the interesting feature that, in the very first chapter, after spending a rather long time on Bayes’ formula, it introduces Bayes factors (p.15). With the somewhat confusing choice of calling the *prior* probabilities of hypotheses *marginal* probabilities. Even though they are indeed marginals of the joint distribution, *marginal* is usually reserved for the sample, as in *marginal likelihood*. Before returning to more (binary) applications of Bayes’ formula for the rest of the chapter. The second chapter is about probability theory, which means here introducing the three axioms of probability and discussing geometric interpretations of those axioms and of Bayes’ rule. Chapter 3 moves to the case of discrete random variables with more than two values, i.e. contingency tables, over which probability distributions are (re-)defined, leading to a new statement of Bayes’ rule. And to the MAP estimate. Given this pattern, it is not surprising that Chapter 4 does the same for continuous parameters, starting with the parameter of a coin flip. This allows for a discussion of uniform and reference priors. Including maximum entropy priors à la Jaynes. And bootstrap samples presented as approximating the posterior distribution under the “fairest prior”. And even two pages on standard loss functions. This chapter is followed by a short chapter dedicated to estimating a normal mean, then another short one exploring the notion of a continuous joint (Gaussian) density.
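
(As a quick numerical check of the p.82 claim — my own simulation, not the book’s: under a Beta(a, b) prior on the coin bias, the MAP estimate (k + a − 1)/(n + a + b − 2) shrinks the MLE k/n toward the prior mode, which lowers the average squared error when the true bias is itself random.)

```python
import numpy as np

# Compare MLE and MAP for a coin-flip probability, averaging squared error
# over many experiments. The Beta(2,2) prior is an arbitrary choice of mine,
# not the book's; the true biases are drawn uniformly.
rng = np.random.default_rng(42)
n, a, b = 10, 2.0, 2.0
theta_true = rng.uniform(size=100_000)

k = rng.binomial(n, theta_true)             # heads observed in each experiment
theta_mle = k / n                           # maximum likelihood estimate
theta_map = (k + a - 1) / (n + a + b - 2)   # mode of the Beta(k+a, n-k+b) posterior

print("MSE of MLE:", np.mean((theta_mle - theta_true) ** 2))
print("MSE of MAP:", np.mean((theta_map - theta_true) ** 2))
```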

“To some people the word *Bayesian* is like a red rag to a bull.” (p.119)

Bayes’ Rule concludes with a chapter entitled *Bayesian wars*. A rather surprising choice, given the intended audience, which is bound to be confused by it… The first part is about probabilistic ways of representing information, leading to subjective probability. The discussion goes on for a few pages to justify the use of priors, but I find completely unfair the argument that, because Bayes’ rule is a mathematical theorem, it “has been proven to be true”. It is indeed a maths theorem; however, that does not imply that any inference based on this theorem is correct! (A surprising parallel is Kadane’s Principles of Uncertainty, with its anti-objective final chapter.)

**A**ll in all, I remain puzzled after reading Bayes’ Rule. Puzzled by the intended audience, as, contrary to other books I recently reviewed, the author does not shy away from mathematical notations and concepts, even though he proceeds quite gently through the basics of probability. Potential readers therefore need a modicum of mathematical background that some students may miss (although it actually corresponds to what my kids would have learned in high school). The book could thus constitute a soft entry to Bayesian concepts, before taking a formal course on Bayesian analysis. Hence doing no harm to the perception of the field.

## did I mean endemic? [pardon my French!]

Posted in Books, Statistics, University life with tags Air France, Bayesian Analysis, censoring, endemic, Glasgow, guest editors, information theory, Larry Wasserman, Robins-Wasserman paradox, Statistical Science, translation, Ubiquitous Chip on June 26, 2014 by xi'an

**D**eborah Mayo wrote a Saturday night special column on our Big Bayes stories issue in *Statistical Science*. She (predictably?) focussed on the critical discussions, esp. David Hand’s most forceful arguments, where he essentially considers that, due to our (special issue editors’) selection of successful stories, we biased the debate by providing a “one-sided” story. And that we, or the editor of *Statistical Science*, should also have included frequentist stories. To which Deborah points out that demonstrating that “only” a frequentist solution is available may be beyond the possible. And still, I could think of partial information and partial inference problems like the “paradox” raised by Jamie Robins and Larry Wasserman in the past years. (Not the normalising constant paradox but the one about censoring.) Anyway, the goal of this special issue was to provide a range of realistic illustrations where Bayesian analysis was a most reasonable approach, not to raise the Bayesian flag against other perspectives: in an ideal world it would have been more interesting to have discussants produce alternative analyses bypassing the Bayesian modelling, but obviously discussants only have a limited amount of time to dedicate to their discussion(s), and the problems were complex enough to deter any attempt in this direction.

**A**s an aside and in explanation of the cryptic title of this post, Deborah wonders at my use of *endemic* in the preface and at the possible mis-translation from the French. I did mean *endemic* (and *endémique*) in a half-joking reference to a disease one cannot completely get rid of. At least in French, the term extends beyond diseases, but presumably *pervasive* would have been less confusing… Or *ubiquitous* (as in Ubiquitous Chip for those with Glaswegian ties!). She also expresses “surprise at the choice of name for the special issue. Incidentally, the “big” refers to the bigness of the problem, not big data. Not sure about “stories”.” Maybe another occurrence of lost in translation… I had indeed no intent of connection with the “big” of “Big Data”, but wanted to convey the notion of a big as in major problem. And of a story explaining why the problem was considered and how the authors reached a satisfactory analysis. The story of the Air France Rio-Paris crash resolution is representative of that intent. (Hence the explanation for the above picture.)

## David Blei smile in Paris (seminar)

Posted in Statistics, Travel, University life with tags Bayesian Analysis, David Blei, INRIA, machine learning, Paris, Princeton University, seminar, SMILE seminar, variational Bayes methods on October 30, 2013 by xi'an

**N**icolas Chopin just reminded me of a seminar given by David Blei in Paris tomorrow (at 4pm, SMILE seminar, INRIA, 23 avenue d’Italie, 5th floor, orange room) on **Stochastic Variational Inference and Scalable Topic Models**, a machine learning seminar that I will alas miss, being busy giving mine at CMU. Here is the abstract:

```
Probabilistic topic modeling provides a suite of tools for analyzing
large collections of electronic documents. With a collection as
input, topic modeling algorithms uncover its underlying themes and
decompose its documents according to those themes. We can use topic
models to explore the thematic structure of a large collection of
documents or to solve a variety of prediction problems about text.
Topic models are based on hierarchical mixed-membership models,
statistical models where each document expresses a set of components
(called topics) with individual per-document proportions. The
computational problem is to condition on a collection of observed
documents and estimate the posterior distribution of the topics and
per-document proportions. In modern data sets, this amounts to
posterior inference with billions of latent variables.
How can we cope with such data? In this talk I will describe
stochastic variational inference, a general algorithm for
approximating posterior distributions that are conditioned on massive
data sets. Stochastic inference is easily applied to a large class of
hierarchical models, including time-series models, factor models, and
Bayesian nonparametric models. I will demonstrate its application to
topic models fit with millions of articles. Stochastic inference
opens the door to scalable Bayesian computation for modern data
analysis.
```
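
(To fix ideas on what the stochastic update looks like, here is a minimal sketch of the SVI recursion on a toy conjugate model — a normal mean with known variance — rather than a full topic model; the model, names and step-size choices are mine, purely illustrative, not Blei’s code.)

```python
import numpy as np

# Toy sketch of stochastic variational inference: x_i ~ N(mu, 1) with prior
# mu ~ N(0, 1), so the variational posterior q(mu) = N(m, s2) is tracked
# through its natural parameters (m/s2, -1/(2 s2)).
rng = np.random.default_rng(0)
N = 100_000
x = rng.normal(2.5, 1.0, size=N)     # synthetic data with true mean 2.5

prior = np.array([0.0, -0.5])        # natural parameters of the N(0, 1) prior
eta = prior.copy()                   # initialise q(mu) at the prior

for t in range(1, 5001):
    i = rng.integers(N)                           # sample one data point
    eta_hat = prior + N * np.array([x[i], -0.5])  # conjugate intermediate estimate
    rho = (t + 10.0) ** -0.7                      # Robbins-Monro step size
    eta = (1 - rho) * eta + rho * eta_hat         # stochastic natural-gradient step

s2 = -1.0 / (2.0 * eta[1])           # back to mean/variance parameterisation
m = eta[0] * s2
print(f"SVI:   N({m:.3f}, {s2:.2e})")
print(f"exact: N({x.sum() / (N + 1):.3f}, {1.0 / (N + 1):.2e})")
```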