## Archive for Bayes factors

## leave Bayes factors where they once belonged

Posted in Statistics with tags Bayes factors, Bayesian Analysis, Bayesian decision theory, cross validated, prior comparison, prior predictive, prior selection, The Bayesian Choice, The Beatles, using the data twice, xkcd on February 19, 2019 by xi'an

**I**n the past weeks I have received and read several papers (and X validated entries) where the Bayes factor is used to compare priors. Which does not look right to me, not on the basis of my general dislike of Bayes factors!, but simply because this seems to clash with the (my?) concept of Bayesian model choice and also because data should not play a role in that situation. The objections range from the data being used to select a *prior*, hence at least twice to run the inference, to resorting to a *single* parameter value (namely the one behind the data) to decide between two distributions, to having no asymptotic justification, to eventually favouring the prior concentrated on the maximum likelihood estimator. And more. But I fear that this reluctance to test for prior adequacy also extends to the prior predictive, or Box’s p-value, namely the probability under this prior predictive of observing something “more extreme” than the current observation, to quote from David Spiegelhalter.

## a question from McGill about The Bayesian Choice

Posted in Books, pictures, Running, Statistics, Travel, University life with tags Bayes factors, Bayesian hypothesis testing, Canada, cross validated, improper prior, McGill University, Montréal, posterior probability on December 26, 2018 by xi'an

**I** received an email from a group of McGill students working on Bayesian statistics and using The Bayesian Choice (although the exercise pictured below is not in the book, the closest being exercise 1.53, inspired by Raiffa and Schlaifer, 1961, and exercise 5.10 as mentioned in the email):

There was a question that some of us cannot seem to decide what is the correct answer. Here are the issues,

Some people believe that the answer to both is ½, while others believe it is 1. The reasoning for ½ is that since Beta is a continuous distribution, we never could have θ exactly equal to ½. Thus regardless of α, the probability that θ=½ in that case is 0. Hence it is ½. I found a related stack exchange question that seems to indicate this as well.

The other side is that by Markov property and mean of Beta(a,a), as α goes to infinity, we will approach ½ with probability 1. And hence the limit as α goes to infinity for both (a) and (b) is 1. I think this also could make sense in another context, as if you use the Bayes factor representation. This is similar I believe to the questions in the Bayesian Choice, 5.10, and 5.11.

As it happens, the answer is ½ in the first case (a) because π(H⁰) is ½ regardless of α, and 1 in the second case (b) because the evidence against H⁰ goes to zero as α goes to zero *(watch out!)*, along with the mass of the prior on any compact of (0,1), since Γ(2α)/Γ(α)² goes to zero. (The limit does not correspond to a proper prior and hence is somewhat meaningless.) However, when α goes to infinity, the evidence against H⁰ goes to infinity and the posterior probability of ½ goes to zero, despite the prior under the alternative being more and more concentrated around ½!
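As a quick numerical check of the Γ(2α)/Γ(α)² argument, here is a minimal Python sketch. The function names, the compact [0.1, 0.9] and the midpoint rule are my own illustrative choices, not part of the exercise. It evaluates the normalising constant of the Beta(α,α) density and the prior mass it puts on that compact as α decreases to zero.

```python
import math


def beta_sym_norm_const(alpha: float) -> float:
    """Normalising constant Gamma(2a)/Gamma(a)^2 of the Beta(a, a) density."""
    # Work on the log scale to stay stable for small alpha.
    return math.exp(math.lgamma(2 * alpha) - 2 * math.lgamma(alpha))


def beta_sym_mass(alpha: float, lo: float = 0.1, hi: float = 0.9,
                  n_grid: int = 10_000) -> float:
    """Midpoint-rule approximation of the Beta(a, a) mass on [lo, hi]."""
    step = (hi - lo) / n_grid
    total = 0.0
    for i in range(n_grid):
        t = lo + (i + 0.5) * step
        total += (t * (1.0 - t)) ** (alpha - 1.0)
    return beta_sym_norm_const(alpha) * total * step


# Both columns shrink towards zero as alpha goes to zero.
for alpha in (1.0, 0.1, 0.01, 0.001):
    print(alpha, beta_sym_norm_const(alpha), beta_sym_mass(alpha))
```

For small α the constant behaves like α/2, so the mass on any fixed compact inside (0,1) vanishes at the same rate.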

## a come-back of the harmonic mean estimator

Posted in Statistics with tags Alan Gelfand, Bayes factors, Bayesian computing, harmonic mean estimator, Max Planck Institute, München, Werner-Heisenberg-Institut on September 6, 2018 by xi'an

**A**re we in for a return of the harmonic mean estimator?! Allen Caldwell and co-authors arXived a new document that Allen also sent me, following a technique that offers similarities with our earlier approach with Darren Wraith, the difference being in the more careful and practical construct of the partition set and use of multiple hypercubes, which is the smart thing. I visited Allen’s group at the Max Planck Institut für Physik (Heisenberg) in München (Garching) in 2015 and we confronted our perspectives on harmonic means at that time. The approach followed in the paper starts from what I would call the canonical Gelfand and Dey (1995) representation with a uniform prior, namely that the integral of an arbitrary non-negative function [or unnormalised density] ƒ can be connected with the integral of the said function ƒ over a smaller set Δ with a finite measure [or volume]. And therefore to simulations from the density ƒ restricted to this set Δ. Which can be recycled by the harmonic mean identity towards producing an estimate of the integral of ƒ over the set Δ. When considering a partition, these integrals sum up to the integral of interest but this is not necessarily the only exploitation one can make of the fundamental identity. The most novel part stands in constructing an adaptive partition based on the sample, made of hypercubes obtained after whitening of the sample. Only keeping points with large enough density and sufficient separation to avoid overlap. (I am unsure a genuine partition is needed.) In order to avoid selection biases the original sample is separated into two groups, used independently. Integrals that stand too far away from the others are removed as well.

This construction may sound a bit daunting in the number of steps it involves and in the poor fit of a Normal to a hypercube or conversely, but it seems to shy away from the number one issue with the basic harmonic mean estimator, the almost certain infinite variance. Although it would be nice to be completely certain this doom is avoided. I still wonder at the degeneracy of the approximation of the integral with the dimension, as well as at other ways of exploiting this always fascinating [if fraught with dangers] representation. And comparing variances.
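To make the restricted identity concrete, here is a minimal one-dimensional Python sketch. It is a toy illustration under my own assumptions and not the algorithm of the paper: there is a single interval Δ instead of an adaptive partition, no whitening, no sample splitting, and the target ƒ is a standard Normal kernel whose integral √(2π) is known. It relies on the fact that, for x simulated from ƒ/Z, the expectation of 1_Δ(x)/ƒ(x) is vol(Δ)/Z.

```python
import math
import random


def f(x: float) -> float:
    """Unnormalised density (toy choice): standard Normal kernel."""
    return math.exp(-0.5 * x * x)


def restricted_harmonic_mean(sample, lo: float, hi: float) -> float:
    """Estimate Z, the integral of f, from draws of f/Z, using Delta = [lo, hi].

    Based on E[1_Delta(x) / f(x)] = vol(Delta) / Z when x ~ f / Z.
    """
    volume = hi - lo
    # Only points falling inside Delta contribute to the harmonic sum.
    inv_sum = sum(1.0 / f(x) for x in sample if lo <= x <= hi)
    return volume * len(sample) / inv_sum


random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(20_000)]
estimate = restricted_harmonic_mean(sample, -1.0, 1.0)
print(estimate, math.sqrt(2.0 * math.pi))
```

Since 1/ƒ is bounded on Δ = [-1, 1], the terms in the sum have finite variance, which is the whole point of restricting to a set of finite volume where ƒ stays away from zero.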

## JASP, a really really fresh way to do stats

Posted in Statistics with tags Bayes factors, Bayesian inference, design, Harold Jeffreys, JASP, tee-shirt, University of Amsterdam on February 1, 2018 by xi'an

## Bayesian spectacles

Posted in Books, pictures, Statistics, University life with tags Amsterdam, Bayes factors, Bayesian Spectacles, blogging, Holland, JASP, non-informative priors, objective Bayes, reference priors, UMPBTs, uniformly most powerful tests, University of Amsterdam on October 4, 2017 by xi'an

E.J. Wagenmakers and his enthusiastic team of collaborators at the University of Amsterdam and in the JASP software designing team have started a blog called Bayesian spectacles, which I find a fantastic title. And not only because I wear glasses. Plus, they got their own illustrator, Viktor Beekman, which sounds like the epitome of sophistication! (Compared with resorting to vacation or cat pictures…)

In a most recent post they addressed the criticisms we made of the 72-author paper on p-values, one of the co-authors being E.J.! Andrew already re-addressed some of the address, but here is a disagreement he left me to chew on on my own [and where the Abandoners are us!]:

Disagreement 2. The Abandoners’ critique the UMPBTs –the uniformly most powerful Bayesian tests– that features in the original paper. This is their right (see also the discussion of the 2013 Valen Johnson PNAS paper), but they ignore the fact that the original paper presented a series of other procedures that all point to the same conclusion: p-just-below-.05 results are evidentially weak. For instance, a cartoon on the JASP blog explains the Vovk-Sellke bound. A similar result is obtained using the upper bounds discussed in Berger & Sellke (1987) and Edwards, Lindman, & Savage (1963). We suspect that the Abandoners’ dislike of Bayes factors (and perhaps their upper bounds) is driven by a disdain for the point-null hypothesis. That is understandable, but the two critiques should not be mixed up. The first question is: Given that we wish to test a point-null hypothesis, do the Bayes factor upper bounds demonstrate that the evidence is weak for p-just-below-.05 results? We believe they do, and in this series of blog posts we have provided concrete demonstrations.

Obviously, this reply calls for an examination of the entire BS blog series, but being short on time at the moment, let me point out that the lower bounds on the Bayes factors showing much more support for H⁰ than a p-value at 0.05 only occur in special circumstances. Even though I spend some time in my book discussing those bounds. Indeed, the [interesting] fact that the lower bounds are larger than the p-values does not hold in full generality. Moving to a two-dimensional normal with potentially zero mean is enough to see the order between lower bound and p-value reverse, as I found [quite] a while ago when trying to expand Berger and Sellke (1987, the same year as I was visiting Purdue where both had a position). I am not sure this feature has been much explored in the literature, as I did not pursue it when I realised the gap was missing in larger dimensions…

I must also point out I do not have the same repulsion for point nulls as Andrew! While considering whether a parameter, say a mean, is exactly zero [or three or whatever] sounds rather absurd when faced with the strata of uncertainty about models, data, procedures, &tc. (even in theoretical physics!), comparing several [and all wrong!] models with or without some parameters for later use still makes sense. And my reluctance in using Bayes factors does not stem from an opposition to comparing models or from the procedure itself, which is quite appealing within a Bayesian framework [thus appealing *per se*!], but rather from the unfortunate impact of the prior [and its tail behaviour] on the quantity and on the delicate calibration of the thing. And on a lack of reference solution [to avoid the O and the N words!]. As exposed in the demise papers. (Whose main version remains in a publishing limbo, the onslaught from the referees proving just too much for me!)
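For readers unfamiliar with the Vovk-Sellke bound mentioned in the quoted disagreement, here is a minimal Python sketch of the standard formula, -e p log(p) for p < 1/e, read as a lower bound on the Bayes factor in favour of the null. The function names and the equal prior weights are my own illustrative assumptions, and the bound only holds over a specific class of alternatives, which is precisely why it cannot be taken as a result in full generality.

```python
import math


def vovk_sellke_bound(p: float) -> float:
    """Lower bound -e * p * log(p) on the Bayes factor in favour of H0 (p < 1/e)."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound only applies for 0 < p < 1/e")
    return -math.e * p * math.log(p)


def min_posterior_prob_null(p: float, prior_null: float = 0.5) -> float:
    """Smallest posterior probability of H0 compatible with the bound."""
    odds = vovk_sellke_bound(p) * prior_null / (1.0 - prior_null)
    return odds / (1.0 + odds)


for p in (0.05, 0.01, 0.005, 0.001):
    print(p, vovk_sellke_bound(p), min_posterior_prob_null(p))
```

At p = 0.05 the bound is about 0.41, that is, a posterior probability of the null of at least 0.29 or so under equal prior weights, which is the sense in which such results are called evidentially weak.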

## priors without likelihoods are like sloths without…

Posted in Books, Statistics with tags Austin, Bayes factors, Bayesian Analysis, identifiability, improper priors, noninformative priors, O'Bayes17, Pierre Simon Laplace, posterior predictive, reference priors, sloth, The American Statistician, The University of Texas at Austin on September 11, 2017 by xi'an

“The idea of building priors that generate reasonable data may seem like an unusual idea…”

**A**ndrew, Dan, and Michael arXived an opinion piece last week entitled “The prior can generally only be understood in the context of the likelihood”. Which connects to the earlier Read Paper of Gelman and Hennig I discussed last year. I cannot state strong disagreement with the positions taken in this piece, actually, in that I do not think prior distributions ever occur as *a given* but are rather chosen as a reference measure to probabilise the parameter space and eventually prioritise regions over others. If anything I find myself even further on the prior agnosticism gradation. (Of course, this lack of disagreement applies to the likelihood understood as a function of both the data and the parameter, rather than of the parameter only, conditional on the data. Priors cannot depend on the data without incurring disastrous consequences!)

“…it contradicts the conceptual principle that the prior distribution should convey only information that is available before the data have been collected.”

The first example is somewhat disappointing in that it revolves, as in so many Bayesian textbooks (since Laplace!), around the [sex ratio] Binomial probability parameter and concludes on the strong or long-lasting impact of the Uniform prior. I do not see much of a contradiction between the use of a Uniform prior and the collection of prior information, if only because there is no standardised way to transfer prior information into prior construction. And more fundamentally because a parameter rarely makes sense by itself, alone, without a model that relates it to potential data. As for instance in a regression model. More, following my epiphany of last semester about the relativity of the prior, I see no damage in the prior being relevant, as I only attach a *relative* meaning to statements based on the posterior. Rather than trying to limit the impact of a prior, we should build assessment tools to measure this impact, for instance by prior predictive simulations. And this is where I come to quite agree with the authors.
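As an illustration of what a prior predictive simulation looks like in this Binomial setting, here is a minimal Python sketch. The number of trials, the Beta(a,b) parametrisation of the prior and the function name are my own assumptions for the sake of the example, not taken from the paper. It draws the probability from the prior, then pseudo-data from the Binomial, and tabulates what the prior considers plausible.

```python
import random
from collections import Counter


def prior_predictive_draws(n_trials: int, a: float, b: float,
                           n_sims: int, rng: random.Random) -> list:
    """Simulate x ~ Binomial(n_trials, theta) with theta ~ Beta(a, b)."""
    draws = []
    for _ in range(n_sims):
        theta = rng.betavariate(a, b)  # draw the parameter from the prior
        # Binomial draw as a sum of n_trials Bernoulli(theta) outcomes.
        x = sum(rng.random() < theta for _ in range(n_trials))
        draws.append(x)
    return draws


rng = random.Random(2017)
draws = prior_predictive_draws(n_trials=10, a=1.0, b=1.0, n_sims=20_000, rng=rng)
freq = Counter(draws)
for x in range(11):
    print(x, freq[x] / len(draws))
```

Under the Uniform prior (a = b = 1) the prior predictive is uniform over the possible counts, so every outcome is deemed equally likely a priori. Changing a and b and watching the table move is one crude way of measuring the impact of the prior rather than trying to remove it.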

“…non-identifiabilities, and near nonidentifiabilities, of complex models can lead to unexpected amounts of weight being given to certain aspects of the prior.”

Another rather straightforward remark is that non-identifiable models see the impact of a prior remain as the sample size grows. And I still see no issue with this fact in a relative approach. When the authors mention (p.7) that purely mathematical priors perform more poorly than weakly informative priors, it is hard to see what they mean by this “performance”.

“…judge a prior by examining the data generating processes it favors and disfavors.”

Besides those points, I completely agree with them about the fundamental relevance of the prior as a generative process, only when the likelihood becomes available. And simulatable. (This point is found in many references, including our response to the American Statistician paper *Hidden dangers of specifying noninformative priors*, with Kaniav Kamary. With the same illustration on a logistic regression.) I also agree with their criticism of the marginal likelihood and Bayes factors as being so strongly impacted by the choice of a prior, if treated as absolute quantities. I also, if more reluctantly and somewhat heretically, see a point in using the posterior predictive for assessing whether a prior is relevant for the data at hand. At least at a conceptual level. I am however less certain about how to handle improper priors based on their recommendations. In conclusion, it would be great to see one [or more] of the authors at O-Bayes 2017 in Austin as I am sure it would spark nice discussions there! (And by the way I have no prior idea on how to conclude the comparison in the title!)