**J**eremias Knoblauch, Jack Jewson and Theodoros Damoulas, all affiliated with Warwick (hence a potentially biased reading!), arXived a paper on loss-based Bayesian inference that Jack discussed with me on my last visit to Warwick. As I was somewhat scared by the 61 pages, of which the 8 first pages are in NeurIPS style. The authors argue for a decision-theoretic approach to Bayesian inference that involves a loss over distributions and a divergence from the prior. For instance, when using the log-score as the loss and the Kullback-Leibler divergence, the regular posterior emerges, as shown by Arnold Zellner. Variational inference also falls under this hat. The argument for this generalization is that any form of loss can be used and still returns a distribution that is used to assess uncertainty about the parameter (of interest). In the axioms they produce for justifying the derivation of the optimal procedure, including cases where the posterior is restricted to a certain class, one [Axiom 4] generalizes the likelihood principle. Given the freedom brought by this general framework, plenty of fringe Bayes methods like standard variational Bayes can be seen as solutions to such a decision problem. Others like EP do not. Of interest to me are the potentials for this formal framework to encompass misspecification and likelihood-free settings, as well as for assessing priors, which is always a fishy issue. (The authors mention in addition the capacity to build related specific design Bayesian deep networks, of which I know nothing.) The obvious reaction of mine is one of facing an abundance of wealth (!) but encompassing approximate Bayesian solutions within a Bayesian framework remains an exciting prospect.

## Archive for Likelihood Principle

## a generalized representation of Bayesian inference

Posted in Books with tags approximate Bayesian inference, Bayesian decision theory, Bayesian robustness, Kullback-Leibler divergence, Likelihood Principle, University of Warwick, variational inference on July 5, 2019 by xi'an## Computational Bayesian Statistics [book review]

Posted in Books, Statistics with tags ABC, Bayes factor, Bayesian model selection, Bayesian p-values, Bayesian paradigm, Bayesian textbook, BayesX, book review, Cambridge University Press, coda, computational Bayesian methods, cup, Gibbs sampling, information criterion, INLA, JAGS, Jeffreys prior, Kalman filter, Laplace approximation, Likelihood Principle, MCMC, Metropolis-Hastings algorithm, model assessment, Monte Carlo Statistical Methods, OpenBUGS, R, sequential Monte Carlo, STAN, subjective versus objective Bayes on February 1, 2019 by xi'an**T**his Cambridge University Press book by M. Antónia Amaral Turkman, Carlos Daniel Paulino, and Peter Müller is an enlarged translation of a set of lecture notes in Portuguese. *(Warning: I have known Peter Müller from his PhD years in Purdue University and cannot pretend to perfect objectivity. For one thing, Peter once brought me frozen-solid beer: revenge can also be served cold!)* Which reminds me of my 1994 French edition of Méthodes de Monte Carlo par chaînes de Markov, considerably upgraded into Monte Carlo Statistical Methods (1998) thanks to the input of George Casella. *(Re-warning: As an author of books on the same topic(s), I can even less pretend to objectivity.)*

“The “great idea” behind the development of computational Bayesian statistics is the recognition that Bayesian inference can be implemented by way of simulation from the posterior distribution.”

The book is written from a strong, almost militant, subjective Bayesian perspective (as, e.g., when *half-Bayesians* are mentioned!). Subjective (and militant) as in Dennis Lindley‘s writings, eminently quoted therein. As well as in Tony O’Hagan‘s. Arguing that the sole notion of a Bayesian estimator is the entire posterior distribution. Unless one brings in a loss function. The book also discusses the Bayes factor in a critical manner, which is fine from my perspective. (Although the ban on improper priors makes its appearance in a very indirect way at the end of the last exercise of the first chapter.)

Somewhat at odds with the subjectivist stance of the previous chapter, the chapter on prior construction only considers non-informative and conjugate priors. Which, while understandable in an introductory book, is a wee bit disappointing. (When mentioning Jeffreys’ prior in multidimensional settings, the authors allude to using univariate Jeffreys’ rules for the marginal prior distributions, which is not a well-defined concept or else Bernardo’s and Berger’s reference priors would not have been considered.) The chapter also mentions the likelihood principle at the end of the last exercise, without a mention of the debate about its derivation by Birnbaum. Or Deborah Mayo’s recent reassessment of the strong likelihood principle. The following chapter is a sequence of illustrations in classical exponential family models, classical in that it is found in many Bayesian textbooks. (Except for the Poison model found in Exercise 3.3!)

Nothing to complain (!) about the introduction of Monte Carlo methods in the next chapter, especially about the notion of inference by Monte Carlo methods. And the illustration by Bayesian design. The chapter also introduces Rao-Blackwellisation [prior to introducing Gibbs sampling!]. And the simplest form of bridge sampling. (Resuscitating the weighted bootstrap of Gelfand and Smith (1990) may not be particularly urgent for an introduction to the topic.) There is furthermore a section on sequential Monte Carlo, including the Kalman filter and particle filters, in the spirit of Pitt and Shephard (1999). This chapter is thus rather ambitious in the amount of material covered with a mere 25 pages. Consensus Monte Carlo is even mentioned in the exercise section.

“This and other aspects that could be criticized should not prevent one from using this [Bayes factor] method in some contexts, with due caution.”

Chapter 5 turns back to inference with model assessment. Using Bayesian p-values for model assessment. (With an harmonic mean spotted in Example 5.1!, with no warning about the risks, except later in 5.3.2.) And model comparison. Presenting the whole collection of xIC information criteria. from AIC to WAIC, including a criticism of DIC. The chapter feels somewhat inconclusive but methinks this is the right feeling on the current state of the methodology for running inference about the model itself.

“Hint: There is a very easy answer.”

Chapter 6 is also a mostly standard introduction to Metropolis-Hastings algorithms and the Gibbs sampler. (The argument given later of a Metropolis-Hastings algorithm with acceptance probability one does not work.) The Gibbs section also mentions demarginalization as a [latent or auxiliary variable] way to simulate from complex distributions [as we do], but without defining the notion. It also references the precursor paper of Tanner and Wong (1987). The chapter further covers slice sampling and Hamiltonian Monte Carlo, the later with sufficient details to lead to reproducible implementations. Followed by another standard section on convergence assessment, returning to the 1990’s feud of single versus multiple chain(s). The exercise section gets much larger than in earlier chapters with several pages dedicated to most problems. Including one on ABC, maybe not very helpful in this context!

“…dimension padding (…) is essentially all that is to be said about the reversible jump. The rest are details.”

The next chapter is (somewhat logically) the follow-up for trans-dimensional problems and marginal likelihood approximations. Including Chib’s (1995) method [with no warning about potential biases], the spike & slab approach of George and McCulloch (1993) that I remember reading in a café at the University of Wyoming!, the somewhat antiquated MC³ of Madigan and York (1995). And then the much more recent array of Bayesian lasso techniques. The trans-dimensional issues are covered by the pseudo-priors of Carlin and Chib (1995) and the reversible jump MCMC approach of Green (1995), the later being much more widely employed in the literature, albeit difficult to tune [and even to comprehensively describe, as shown by the algorithmic representation in the book] and only recommended for a large number of models under comparison. Once again the exercise section is most detailed, with recent entries like the EM-like variable selection algorithm of Ročková and George (2014).

The book also includes a chapter on analytical approximations, which is also the case in ours [with George Casella] despite my reluctance to bring them next to exact (simulation) methods. The central object is the INLA methodology of Rue et al. (2009) [absent from our book for obvious calendar reasons, although Laplace and saddlepoint approximations are found there as well]. With a reasonable amount of details, although stopping short of implementable reproducibility. Variational Bayes also makes an appearance, mostly following the very recent Blei et al. (2017).

The gem and originality of the book are primarily to be found in the final and ninth chapter where four software are described, all with interfaces to R: OpenBUGS, JAGS, BayesX, and Stan, plus R-INLA which is processed in the second half of the chapter (because this is *not* a simulation method). As in the remainder of the book, the illustrations are related to medical applications. Worth mentioning is the reminder that BUGS came in parallel with Gelfand and Smith (1990) Gibbs sampler rather than as a consequence. Even though the formalisation of the Markov chain Monte Carlo principle by the later helped in boosting the power of this software. (I also appreciated the mention made of Sylvia Richardson’s role in this story.) Since every software is illustrated in depth with relevant code and output, and even with the shortest possible description of its principle and *modus vivendi*, the chapter is 60 pages long [and missing a comparative conclusion]. Given my total ignorance of the very existence of the BayesX software, I am wondering at the relevance of its inclusion in this description rather than, say, other general R packages developed by authors of books such as Peter Rossi. The chapter also includes a description of CODA, with an R version developed by Martin Plummer [now a Warwick colleague].

In conclusion, this is a high-quality and all-inclusive introduction to Bayesian statistics and its computational aspects. By comparison, I find it much more ambitious and informative than Albert’s. If somehow less pedagogical than the thicker book of Richard McElreath. (The repeated references to Paulino et al. (2018) in the text do not strike me as particularly useful given that this other book is written in Portuguese. Unless an English translation is in preparation.)

*Disclaimer: this book was sent to me by CUP for endorsement and here is what I wrote in reply for a back-cover entry:*

An introduction to computational Bayesian statistics cooked to perfection, with the right mix of ingredients, from the spirited defense of the Bayesian approach, to the description of the tools of the Bayesian trade, to a definitely broad and very much up-to-date presentation of Monte Carlo and Laplace approximation methods, to an helpful description of the most common software. And spiced up with critical perspectives on some common practices and an healthy focus on model assessment and model selection. Highly recommended on the menu of Bayesian textbooks!

And this review is likely to appear in CHANCE, in my book reviews column.

## Measuring statistical evidence using relative belief [book review]

Posted in Books, Statistics, University life with tags ABC, Bayes factor, CHANCE, CRC Press, discrepancies, Error and Inference, improper prior, integrated likelihood, Jeffreys-Lindley paradox, Likelihood Principle, marginalisation paradoxes, model checking, model validation, Monty Hall problem, Murray Aitkin, p-value, point null hypotheses, relative belief ratio, University of Toronto on July 22, 2015 by xi'an

“It is necessary to be vigilant to ensure that attempts to be mathematically general do not lead us to introduce absurdities into discussions of inference.” (p.8)

**T**his new book by Michael Evans (Toronto) summarises his views on statistical evidence (expanded in a large number of papers), which are a quite unique mix of Bayesian principles and less-Bayesian methodologies. I am quite glad I could receive a version of the book before it was published by CRC Press, thanks to Rob Carver (and Keith O’Rourke for warning me about it).* [Warning: this is a rather long review and post, so readers may chose to opt out now!]*

“The Bayes factor does not behave appropriately as a measure of belief, but it does behave appropriately as a measure of evidence.” (p.87)

## maximum likelihood: an introduction

Posted in Books, Statistics with tags Bahadur, International Statistical Review, Likelihood Principle, Lucien Le Cam, maximum likelihood estimation on December 20, 2014 by xi'an

“Basic Principle 0. Do not trust any principle.” L. Le Cam (1990)

**H**ere is the abstract of a International Statistical Rewiew 1990 paper by Lucien Le Cam on maximum likelihood. ISR keeping a tradition of including an abstract in French for every paper, Le Cam (most presumably) wrote his own translation [or maybe wrote the French version first], which sounds much funnier to me and so I cannot resist posting both, pardon my/his French! *[I just find “Ce fait” rather unusual, as I would have rather written “Ceci fait”…]*:

Maximum likelihood estimates are reported to be best under all circumstances. Yet there are numerous simple examples where they plainly misbehave. One gives some examples for problems that had not been invented for the purpose of annoying maximum likelihood fans. Another example, imitated from Bahadur, has been specially created with just such a purpose in mind. Next, we present a list of principles leading to the construction of good estimates. The main principle says that one should not believe in principles but study each problem for its own sake.

L’auteur a ouï dire que la méthode du maximum de vraisemblance est la meilleure méthode d’estimation. C’est bien vrai, et pourtant la méthode se casse le nez sur des exemples bien simples qui n’avaient pas été inventés pour le plaisir de montrer que la méthode peut être très désagréable. On en donne quelques-uns, plus un autre, imité de Bahadur et fabriqué exprès pour ennuyer les admirateurs du maximum de vraisemblance. Ce fait, on donne une savante liste de principes de construction de bons estimateurs, le principe principal étant qu’il ne faut pas croire aux principes.

The entire paper is just as witty, as in describing the mixture model as “contaminated and not fit to drink”! Or in “Everybody knows that taking logarithms is unfair”. Or, again, in “biostatisticians, being complicated people, prefer to work out not with the dose y but with its logarithm”… And a last line: “One possibility is that there are too many horse hairs in e”.

## Deborah Mayo’s talk in Montréal (JSM 2013)

Posted in Books, Statistics, Uncategorized with tags Allan Birnbaum, Deborah Mayo, JSM 2013, Likelihood Principle, Montréal, Sufficiency principle, weak conditionality principle on July 31, 2013 by xi'an**A**s posted on her blog, Deborah Mayo is giving a lecture at JSM 2013 in Montréal about why Birnbaum’s derivation of the Strong Likelihood Principle (SLP) is wrong. Or, more accurately, why *“WCP entails SLP”*. It would have been a great opportunity to hear Deborah presenting her case and I am sorry I am missing this opportunity. (Although not sorry to be in the beautiful Dolomites at that time.) Here are the slides:

**D**eborah’s argument is the same as previously: there is no reason for the inference in the mixed (or Birnbaumized) experiment to be equal to the inference in the conditional experiment. As previously, I do not get it: the weak conditionality principle (WCP) implies that inference from the mixture output, once we know which component is used (hence rejecting the* “and we don’t know which”* on slide 8), should only be dependent on that component. I also fail to understand why either WCP or the Birnbaum experiment refers to a mixture (sl.13) in that the index of the experiment is assumed to be known, contrary to mixtures. Thus (still referring at slide 13), the presentation of Birnbaum’s experiment is erroneous. It is indeed impossible to force the outcome of y* if tail and of x* if head *but* it is possible to choose the experiment index at random, 1 versus 2, and then, if y* is observed, to report (E_{1},x*) as a sufficient statistic. (Incidentally, there is a typo on slide 15, it should be “likewise for x*”.)

## Birnbaum’s proof missing one bar?!

Posted in Statistics with tags Allan Birnbaum, Likelihood Principle, set theory, Sufficiency principle, weak conditionality principle on March 4, 2013 by xi'an**M**ichael Evans just posted a new paper on arXiv yesterday about Birnbaum’s proof of his likelihood principle theorem. There has recently been a lot of activity around this theorem (some of which reported on the ‘Og!) and the flurry of proofs, disproofs, arguments, counterarguments, and counter-counterarguments, mostly by major figures in the field, is rather overwhelming! This paper is however highly readable as it sets everything in terms of set theory and relations. While I am not completely convinced that the conclusion holds, the steps in the paper seem correct. The starting point is that the likelihood relation, L, the invariance relation, G, and the sufficiency relation, S, all are equivalence relations (on the set of inference bases/parametric families). The conditionality relation,C, however fails to be transitive and hence an equivalence relation. Furthermore, the smallest equivalence relation containing the conditionality relation is the likelihood relation. Then Evans proves that the conjunction of the sufficiency and the conditionality relations is strictly included in the likelihood relation, which is the smallest equivalence relation containing the union. Furthermore, the fact that the smallest equivalence relation containing the conditionality relation is the likelihood relation means that sufficiency is irrelevant (in this sense, and in this sense only!).

**T**his is a highly interesting and well-written document. I just do not know what to think of it in correspondence with my understanding of the likelihood principle. That

rather than

makes a difference from a mathematical point of view, however I cannot relate it to the statistical interpretation. Like, why would we have to insist upon equivalence? why does invariance appear in some lemmas? why is a maximal ancillary statistics relevant at this stage when it does not appear in the original proof of Birbaum (1962)? why is there no mention made of weak versus strong conditionality principle?