Archive for Bayes factor

dodging bullets, IEDs, and fingerprint detection at SimStat19

Posted in pictures, Statistics, University life with tags , , , , , , , , , , , , , , , , , on September 10, 2019 by xi'an

I attended a fairly interesting forensic science session at SimStat 2019 in Salzburg as it concentrated on evidence and measures of evidence rather than on strict applications of Bayesian methodology to forensic problems. Even though American administrations like the FBI or various police departments were involved. It was a highly coherent session and I had a pleasant discussion with some of the speakers after the session. For instance, my friend Alicia Carriquiry presented an approach to determined from images of bullets whether or not they have been fired from the same gun, leading to an interesting case for a point null hypothesis where the point null makes complete sense. The work has been published in Annals of Applied Statistics and is used in practice. The second talk by Danica Ommen on fiducial forensics on IED, asking whether or not copper wires used in the bombs are the same, which is another point null illustration. Which also set an interesting questioning on the dependence of the alternative prior on the distribution of material chosen as it is supposed to cover all possible origins for the disputed item. But more interestingly this talk launched into a discussion of making decision based on finite samplers and unknown parameters, not that specific to forensics, with a definitely surprising representation of the Bayes factor as an expected likelihood ratio which made me first reminiscent of Aitkin’s (1991) infamous posterior likelihood (!) before it dawned on me this was a form of bridge sampling identity where the likelihood ratio only involved parameters common to both models, making it an expression well-defined under both models. This identity could be generalised to the general case by considering a ratio of integrated likelihoods, the extreme case being the ratio equal to the Bayes factor itself. The following two talks by Larry Tang and Christopher Saunders were also focused on the likelihood ratio and their statistical estimates, debating the coherence of using a score function and presenting a functional ABC algorithm where the prior is a Dirichlet (functional) prior. Thus a definitely relevant session from a Bayesian perspective!

 

Bertrand-Borel debate

Posted in Books, Statistics with tags , , , , , , , , , , , , , on May 6, 2019 by xi'an

On her blog, Deborah Mayo briefly mentioned the Bertrand-Borel debate on the (in)feasibility of hypothesis testing, as reported [and translated] by Erich Lehmann. A first interesting feature is that both [starting with] B mathematicians discuss the probability of causes in the Bayesian spirit of Laplace. With Bertrand considering that the prior probabilities of the different causes are impossible to set and then moving all the way to dismiss the use of probability theory in this setting, nipping the p-values in the bud..! And Borel being rather vague about the solution probability theory has to provide. As stressed by Lehmann.

“The Pleiades appear closer to each other than one would naturally expect. This statement deserves thinking about; but when one wants to translate the phenomenon into numbers, the necessary ingredients are lacking. In order to make the vague idea of closeness more precise, should we look for the smallest circle that contains the group? the largest of the angular distances? the sum of squares of all the distances? the area of the spherical polygon of which some of the stars are the vertices and which contains the others in its interior? Each of these quantities is smaller for the group of the Pleiades than seems plausible. Which of them should provide the measure of implausibility? If three of the stars form an equilateral triangle, do we have to add this circumstance, which is certainly very unlikely apriori, to those that point to a cause?” Joseph Bertrand (p.166)

 

“But whatever objection one can raise from a logical point of view cannot prevent the preceding question from arising in many situations: the theory of probability cannot refuse to examine it and to give an answer; the precision of the response will naturally be limited by the lack of precision in the question; but to refuse to answer under the pretext that the answer cannot be absolutely precise, is to place oneself on purely abstract grounds and to misunderstand the essential nature of the application of mathematics.” Emile Borel (Chapter 4)

Another highly interesting objection of Bertrand is somewhat linked with his conditioning paradox, namely that the density of the observed unlikely event depends on the choice of the statistic that is used to calibrate the unlikeliness, which makes complete sense in that the information contained in each of these statistics and the resulting probability or likelihood differ to an arbitrary extend, that there are few cases (monotone likelihood ratio) where the choice can be made, and that Bayes factors share the same drawback if they do not condition upon the entire sample. In which case there is no selection of “circonstances remarquables”. Or of uniformly most powerful tests.

Jeffreys priors for hypothesis testing [Bayesian reads #2]

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , , , , on February 9, 2019 by xi'an

A second (re)visit to a reference paper I gave to my OxWaSP students for the last round of this CDT joint program. Indeed, this may be my first complete read of Susie Bayarri and Gonzalo Garcia-Donato 2008 Series B paper, inspired by Jeffreys’, Zellner’s and Siow’s proposals in the Normal case. (Disclaimer: I was not the JRSS B editor for this paper.) Which I saw as a talk at the O’Bayes 2009 meeting in Phillie.

The paper aims at constructing formal rules for objective proper priors in testing embedded hypotheses, in the spirit of Jeffreys’ Theory of Probability “hidden gem” (Chapter 3). The proposal is based on symmetrised versions of the Kullback-Leibler divergence κ between null and alternative used in a transform like an inverse power of 1+κ. With a power large enough to make the prior proper. Eventually multiplied by a reference measure (i.e., the arbitrary choice of a dominating measure.) Can be generalised to any intrinsic loss (not to be confused with an intrinsic prior à la Berger and Pericchi!). Approximately Cauchy or Student’s t by a Taylor expansion. To be compared with Jeffreys’ original prior equal to the derivative of the atan transform of the root divergence (!). A delicate calibration by an effective sample size, lacking a general definition.

At the start the authors rightly insist on having the nuisance parameter v to differ for each model but… as we all often do they relapse back to having the “same ν” in both models for integrability reasons. Nuisance parameters make the definition of the divergence prior somewhat harder. Or somewhat arbitrary. Indeed, as in reference prior settings, the authors work first conditional on the nuisance then use a prior on ν that may be improper by the “same” argument. (Although conditioning is not the proper term if the marginal prior on ν is improper.)

The paper also contains an interesting case of the translated Exponential, where the prior is L¹ Student’s t with 2 degrees of freedom. And another one of mixture models albeit in the simple case of a location parameter on one component only.

Computational Bayesian Statistics [book review]

Posted in Books, Statistics with tags , , , , , , , , , , , , , , , , , , , , , , , , , , , , on February 1, 2019 by xi'an

This Cambridge University Press book by M. Antónia Amaral Turkman, Carlos Daniel Paulino, and Peter Müller is an enlarged translation of a set of lecture notes in Portuguese. (Warning: I have known Peter Müller from his PhD years in Purdue University and cannot pretend to perfect objectivity. For one thing, Peter once brought me frozen-solid beer: revenge can also be served cold!) Which reminds me of my 1994 French edition of Méthodes de Monte Carlo par chaînes de Markov, considerably upgraded into Monte Carlo Statistical Methods (1998) thanks to the input of George Casella. (Re-warning: As an author of books on the same topic(s), I can even less pretend to objectivity.)

“The “great idea” behind the development of computational Bayesian statistics is the recognition that Bayesian inference can be implemented by way of simulation from the posterior distribution.”

The book is written from a strong, almost militant, subjective Bayesian perspective (as, e.g., when half-Bayesians are mentioned!). Subjective (and militant) as in Dennis Lindley‘s writings, eminently quoted therein. As well as in Tony O’Hagan‘s. Arguing that the sole notion of a Bayesian estimator is the entire posterior distribution. Unless one brings in a loss function. The book also discusses the Bayes factor in a critical manner, which is fine from my perspective.  (Although the ban on improper priors makes its appearance in a very indirect way at the end of the last exercise of the first chapter.)

Somewhat at odds with the subjectivist stance of the previous chapter, the chapter on prior construction only considers non-informative and conjugate priors. Which, while understandable in an introductory book, is a wee bit disappointing. (When mentioning Jeffreys’ prior in multidimensional settings, the authors allude to using univariate Jeffreys’ rules for the marginal prior distributions, which is not a well-defined concept or else Bernardo’s and Berger’s reference priors would not have been considered.) The chapter also mentions the likelihood principle at the end of the last exercise, without a mention of the debate about its derivation by Birnbaum. Or Deborah Mayo’s recent reassessment of the strong likelihood principle. The following chapter is a sequence of illustrations in classical exponential family models, classical in that it is found in many Bayesian textbooks. (Except for the Poison model found in Exercise 3.3!)

Nothing to complain (!) about the introduction of Monte Carlo methods in the next chapter, especially about the notion of inference by Monte Carlo methods. And the illustration by Bayesian design. The chapter also introduces Rao-Blackwellisation [prior to introducing Gibbs sampling!]. And the simplest form of bridge sampling. (Resuscitating the weighted bootstrap of Gelfand and Smith (1990) may not be particularly urgent for an introduction to the topic.) There is furthermore a section on sequential Monte Carlo, including the Kalman filter and particle filters, in the spirit of Pitt and Shephard (1999). This chapter is thus rather ambitious in the amount of material covered with a mere 25 pages. Consensus Monte Carlo is even mentioned in the exercise section.

“This and other aspects that could be criticized should not prevent one from using this [Bayes factor] method in some contexts, with due caution.”

Chapter 5 turns back to inference with model assessment. Using Bayesian p-values for model assessment. (With an harmonic mean spotted in Example 5.1!, with no warning about the risks, except later in 5.3.2.) And model comparison. Presenting the whole collection of xIC information criteria. from AIC to WAIC, including a criticism of DIC. The chapter feels somewhat inconclusive but methinks this is the right feeling on the current state of the methodology for running inference about the model itself.

“Hint: There is a very easy answer.”

Chapter 6 is also a mostly standard introduction to Metropolis-Hastings algorithms and the Gibbs sampler. (The argument given later of a Metropolis-Hastings algorithm with acceptance probability one does not work.) The Gibbs section also mentions demarginalization as a [latent or auxiliary variable] way to simulate from complex distributions [as we do], but without defining the notion. It also references the precursor paper of Tanner and Wong (1987). The chapter further covers slice sampling and Hamiltonian Monte Carlo, the later with sufficient details to lead to reproducible implementations. Followed by another standard section on convergence assessment, returning to the 1990’s feud of single versus multiple chain(s). The exercise section gets much larger than in earlier chapters with several pages dedicated to most problems. Including one on ABC, maybe not very helpful in this context!

“…dimension padding (…) is essentially all that is to be said about the reversible jump. The rest are details.”

The next chapter is (somewhat logically) the follow-up for trans-dimensional problems and marginal likelihood approximations. Including Chib’s (1995) method [with no warning about potential biases], the spike & slab approach of George and McCulloch (1993) that I remember reading in a café at the University of Wyoming!, the somewhat antiquated MC³ of Madigan and York (1995). And then the much more recent array of Bayesian lasso techniques. The trans-dimensional issues are covered by the pseudo-priors of Carlin and Chib (1995) and the reversible jump MCMC approach of Green (1995), the later being much more widely employed in the literature, albeit difficult to tune [and even to comprehensively describe, as shown by the algorithmic representation in the book] and only recommended for a large number of models under comparison. Once again the exercise section is most detailed, with recent entries like the EM-like variable selection algorithm of Ročková and George (2014).

The book also includes a chapter on analytical approximations, which is also the case in ours [with George Casella] despite my reluctance to bring them next to exact (simulation) methods. The central object is the INLA methodology of Rue et al. (2009) [absent from our book for obvious calendar reasons, although Laplace and saddlepoint approximations are found there as well]. With a reasonable amount of details, although stopping short of implementable reproducibility. Variational Bayes also makes an appearance, mostly following the very recent Blei et al. (2017).

The gem and originality of the book are primarily to be found in the final and ninth chapter where four software are described, all with interfaces to R: OpenBUGS, JAGS, BayesX, and Stan, plus R-INLA which is processed in the second half of the chapter (because this is not a simulation method). As in the remainder of the book, the illustrations are related to medical applications. Worth mentioning is the reminder that BUGS came in parallel with Gelfand and Smith (1990) Gibbs sampler rather than as a consequence. Even though the formalisation of the Markov chain Monte Carlo principle by the later helped in boosting the power of this software. (I also appreciated the mention made of Sylvia Richardson’s role in this story.) Since every software is illustrated in depth with relevant code and output, and even with the shortest possible description of its principle and modus vivendi, the chapter is 60 pages long [and missing a comparative conclusion]. Given my total ignorance of the very existence of the BayesX software, I am wondering at the relevance of its inclusion in this description rather than, say, other general R packages developed by authors of books such as Peter Rossi. The chapter also includes a description of CODA, with an R version developed by Martin Plummer [now a Warwick colleague].

In conclusion, this is a high-quality and all-inclusive introduction to Bayesian statistics and its computational aspects. By comparison, I find it much more ambitious and informative than Albert’s. If somehow less pedagogical than the thicker book of Richard McElreath. (The repeated references to Paulino et al.  (2018) in the text do not strike me as particularly useful given that this other book is written in Portuguese. Unless an English translation is in preparation.)

Disclaimer: this book was sent to me by CUP for endorsement and here is what I wrote in reply for a back-cover entry:

An introduction to computational Bayesian statistics cooked to perfection, with the right mix of ingredients, from the spirited defense of the Bayesian approach, to the description of the tools of the Bayesian trade, to a definitely broad and very much up-to-date presentation of Monte Carlo and Laplace approximation methods, to an helpful description of the most common software. And spiced up with critical perspectives on some common practices and an healthy focus on model assessment and model selection. Highly recommended on the menu of Bayesian textbooks!

And this review is likely to appear in CHANCE, in my book reviews column.

mixture modelling for testing hypotheses

Posted in Books, Statistics, University life with tags , , , , , , , , , , on January 4, 2019 by xi'an

After a fairly long delay (since the first version was posted and submitted in December 2014), we eventually revised and resubmitted our paper with Kaniav Kamary [who has now graduated], Kerrie Mengersen, and Judith Rousseau on the final day of 2018. The main reason for this massive delay is mine’s, as I got fairly depressed by the general tone of the dozen of reviews we received after submitting the paper as a Read Paper in the Journal of the Royal Statistical Society. Despite a rather opposite reaction from the community (an admittedly biased sample!) including two dozens of citations in other papers. (There seems to be a pattern in my submissions of Read Papers, witness our earlier and unsuccessful attempt with Christophe Andrieu in the early 2000’s with the paper on controlled MCMC, leading to 121 citations so far according to G scholar.) Anyway, thanks to my co-authors keeping up the fight!, we started working on a revision including stronger convergence results, managing to show that the approach leads to an optimal separation rate, contrary to the Bayes factor which has an extra √log(n) factor. This may sound paradoxical since, while the Bayes factor  converges to 0 under the alternative model exponentially quickly, the convergence rate of the mixture weight α to 1 is of order 1/√n, but this does not mean that the separation rate of the procedure based on the mixture model is worse than that of the Bayes factor. On the contrary, while it is well known that the Bayes factor leads to a separation rate of order √log(n) in parametric models, we show that our approach can lead to a testing procedure with a better separation rate of order 1/√n. We also studied a non-parametric setting where the null is a specified family of distributions (e.g., Gaussians) and the alternative is a Dirichlet process mixture. Establishing that the posterior distribution concentrates around the null at the rate √log(n)/√n. We thus resubmitted the paper for publication, although not as a Read Paper, with hopefully more luck this time!

a glaring mistake

Posted in Statistics with tags , , , , , , on November 28, 2018 by xi'an

Someone posted this question about Bayes factors in my book on Saturday morning and I could not believe the glaring typo pointed out there had gone through the centuries without anyone noticing! There should be no index 0 or 1 on the θ’s in either integral (or indices all over). I presume I made this typo when cutting & pasting from the previous formula (which addressed the case of two point null hypotheses), but I am quite chagrined that I sabotaged the definition of the Bayes factor for generations of readers of the Bayesian Choice. Apologies!!!

simulated summary statistics [in the sky]

Posted in Statistics with tags , , , , , , , on October 10, 2018 by xi'an

Thinking it was related with ABC, although in the end it is not!, I recently read a baffling cosmology paper by Jeffrey and Abdalla. The data d there means an observed (summary) statistic, while the summary statistic is a transform of the parameter, μ(θ), which calibrates the distribution of the data. With nuisance parameters. More intriguing to me is the sentence that the correct likelihood of d is indexed by a simulated version of μ(θ), μ'(θ), rather than by μ(θ). Which seems to assume that the pseudo- or simulated data can be produced for the same value of the parameter as the observed data. The rest of the paper remains incomprehensible for I do not understand how the simulated versions are simulated.

“…the corrected likelihood is more than a factor of exp(30) more probable than the uncorrected. This is further validation of the corrected likelihood; the model (i.e. the corrected likelihood) shows a better goodness-of-fit.”

The authors further ressort to Bayes factors to compare corrected and uncorrected versions of the likelihoods, which leads (see quote) to picking the corrected version. But are they comparable as such, given that the corrected version involves simulations that are treated as supplementary data? As noted by the authors, the Bayes factor  unsurprisingly goes to one as the number M of simulations grows to infinity, as supported by the graph below.