## Archive for Bayesian tests of hypotheses

## Judith’s colloquium at Warwick

Posted in Statistics with tags Bayesian inference, Bayesian nonparametrics, Bayesian tests of hypotheses, colloquium, Hawkes processes, Judith Rousseau, seminar, University of Oxford, University of Warwick on February 21, 2020 by xi'an## Jeffreys priors for hypothesis testing [Bayesian reads #2]

Posted in Books, Statistics, University life with tags Arnold Zellner, Bayes factor, Bayesian tests of hypotheses, CDT, class, classics, Gaussian mixture, improper priors, Jeffreys prior, JRSSB, Kullback-Leibler divergence, Oxford, PhD course, Saint Giles cemetery, Susie Bayarri, Theory of Probability, University of Oxford on February 9, 2019 by xi'anA second (re)visit to a reference paper I gave to my OxWaSP students for the last round of this CDT joint program. Indeed, this may be my first complete read of Susie Bayarri and Gonzalo Garcia-Donato 2008 Series B paper, inspired by Jeffreys’, Zellner’s and Siow’s proposals in the Normal case. *(Disclaimer: I was not the JRSS B editor for this paper.) *Which I saw as a talk at the O’Bayes 2009 meeting in Phillie.

The paper aims at constructing formal rules for objective proper priors in testing embedded hypotheses, in the spirit of Jeffreys’ Theory of Probability “hidden gem” (Chapter 3). The proposal is based on symmetrised versions of the Kullback-Leibler divergence κ between null and alternative used in a transform like an inverse power of 1+κ. With a power large enough to make the prior proper. Eventually multiplied by a reference measure (i.e., the arbitrary choice of a dominating measure.) Can be generalised to any intrinsic loss (not to be confused with an intrinsic prior à la Berger and Pericchi!). Approximately Cauchy or Student’s t by a Taylor expansion. To be compared with Jeffreys’ original prior equal to the derivative of the atan transform of the root divergence (!). A delicate calibration by an effective sample size, lacking a general definition.

At the start the authors rightly insist on having the nuisance parameter v to differ for each model but… as we all often do they relapse back to having the “same ν” in both models for integrability reasons. Nuisance parameters make the definition of the divergence prior somewhat harder. Or somewhat arbitrary. Indeed, as in reference prior settings, the authors work first conditional on the nuisance then use a prior on ν that may be improper by the “same” argument. (Although *conditioning* is not the proper term if the marginal prior on ν is improper.)

The paper also contains an interesting case of the translated Exponential, where the prior is L¹ Student’s t with 2 degrees of freedom. And another one of mixture models albeit in the simple case of a location parameter on one component only.

## John Kruschke on Bayesian assessment of null values

Posted in Books, Kids, pictures, Statistics, University life with tags arXiv, Bayesian tests of hypotheses, Doing Bayesian Data Analysis, HPD region, hypothesis testing, India, John Kruschke, PsyArXiv, ROPE on February 28, 2017 by xi'an**J**ohn Kruschke pointed out to me a blog entry he wrote last December as a follow-up to my own entry on an earlier paper of his. Induced by an X validated entry. Just in case this sounds a wee bit too convoluted for unraveling the threads (!), the central notion there is to replace a point null hypothesis testing [of bad reputation, for many good reasons] with a check whether or not the null value stands within the 95% HPD region [modulo a buffer zone], which offers the pluses of avoiding a Dirac mass at the null value and a long-term impact of the prior tails on the decision, as well as the possibility of a no-decision, with the minuses of replacing the null with a tolerance region around the null and calibrating both the rejection level and the buffer zone. The December blog entry exposes this principle with graphical illustrations familiar to readers of Doing Bayesian Data Analysis.

As I do not want to fall into an infinite regress of mirror discussions, I will not proceed further than referring to my earlier post, which covers my reservations about the proposal. But interested readers may want to check the latest paper by Kruschke and Liddel on that perspective. (With the conclusion that “Bayesian estimation does everything the New Statistics desires, better”.) Available on PsyArXiv, an avatar of arXiv for psychology papers.

## Bayesian parameter estimation versus model comparison

Posted in Books, pictures, Statistics with tags Bayes factors, Bayesian model comparison, Bayesian tests of hypotheses, cross validated, HPD region, John Kruschke, marginal likelihood on December 5, 2016 by xi'an**J**ohn Kruschke [of puppies’ fame!] wrote a paper in Perspectives in Psychological Science a few years ago on the comparison between two Bayesian approaches to null hypotheses. Of which I became aware through a X validated question that seemed to confuse Bayesian parameter estimation with Bayesian hypothesis testing.

“Regardless of the decision rule, however, the primary attraction of using parameter estimation to assess null values is that the an explicit posterior distribution reveals the relative credibility of all the parameter values.” (p.302)

After reading this paper, I realised that Kruschke meant something completely different, namely that a Bayesian approach to null hypothesis testing could operate from the posterior on the corresponding parameter, rather than to engage into formal Bayesian model comparison (null versus the rest of the World). The notion is to check whether or not the null value stands within the 95% [why 95?] HPD region [modulo a buffer zone], which offers the pluses of avoiding a Dirac mass at the null value and a long-term impact of the prior tails on the decision, with the minus of replacing the null with a tolerance region around the null and calibrating the rejection level. This opposition is thus a Bayesian counterpart of running tests on point null hypotheses either by Neyman-Pearson procedures or by confidence intervals. Note that in problems with nuisance parameters this solution requires a determination of the 95% HPD region associated with the marginal on the parameter of interest, which may prove a challenge.

“…the measure provides a natural penalty for vague priors that allow a broad range of parameter values, because a vague prior dilutes credibility across a broad range of parameter values, and therefore the weighted average is also attenuated.” (p. 306)

While I agree with most of the critical assessment of Bayesian model comparison, including Kruschke’s version of Occam’s razor [and Lindley’s paradox] above, I do not understand how Bayesian model comparison fails to return a full posterior on both the model indices [for model comparison] and the model parameters [for estimation]. To state that it does not because the Bayes factor only depends on marginal likelihoods (p.307) sounds unfair if only because most numerical techniques to approximate the Bayes factors rely on preliminary simulations of the posterior. The point that the Bayes factor strongly depends on the modelling of the alternative model is well-taken, albeit the selection of the null in the “estimation” approach does depend as well on this alternative modelling. Which is an issue if one ends up accepting the null value and running a Bayesian analysis based on this null value.

“The two Bayesian approaches to assessing null values can be unified in a single hierarchical model.” (p.308)

Incidentally, the paper briefly considers a unified modelling that can be interpreted as a mixture across both models, but this mixture representation completely differs from ours [where we also advocate estimation to replace testing] since the mixture is at the *likelihood x prior* level, as in O’Neill and Kypriaos.

## Lindley’s paradox as a loss of resolution

Posted in Books, pictures, Statistics with tags AIC, Bayesian inference, Bayesian tests of hypotheses, BIC, Cross Validation, Inferential Models, Jeffreys-Lindley paradox, p-value, pseudo-Bayes factors on November 9, 2016 by xi'an

“The principle of indifference states that in the absence of prior information, all mutually exclusive models should be assigned equal prior probability.”

**C**olin LaMont and Paul Wiggins arxived a paper on Lindley’s paradox a few days ago. The above quote is the (standard) argument for picking (½,½) partition between the two hypotheses, which I object to if only because it does not stand for multiple embedded models. The main point in the paper is to argue about the loss of resolution induced by averaging against the prior, as illustrated by the picture above for the N(0,1) versus N(μ,1) toy problem. What they call resolution is the lowest possible mean estimate for which the null is rejected by the Bayes factor (assuming a rejection for Bayes factors larger than 1). While the detail is missing, I presume the different curves on the lower panel correspond to different choices of L when using U(-L,L) priors on μ… The “Bayesian rejoinder” to the Lindley-Bartlett paradox (p.4) is in tune with my interpretation, namely that as the prior mass under the alternative gets more and more spread out, there is less and less prior support for reasonable values of the parameter, hence a growing tendency to accept the null. This is an illustration of the long-lasting impact of the prior on the posterior probability of the model, because the data cannot impact the tails very much.

“If the true prior is known, Bayesian inference using the true prior is optimal.”

This sentence and the arguments following is meaningless in my opinion as knowing the “true” prior makes the Bayesian debate superfluous. If there was a unique, Nature provided, known prior π, it would loose its original meaning to become part of the (frequentist) model. The argument is actually mostly used in negative, namely that since it is not know we should not follow a Bayesian approach: this is, e.g., the main criticism in Inferential Models. But there is no such thing as a “true” prior! (Or a “true’ model, all things considered!) In the current paper, this pseudo-natural approach to priors is utilised to justify a return to the pseudo-Bayes factors of the 1990’s, when one part of the data is used to stabilise and proper-ise the (improper) prior, and a second part to run the test *per se*. This includes an interesting insight on the limiting cases of partitioning corresponding to AIC and BIC, respectively, that I had not seen before. With the surprising conclusion that “AIC is the derivative of BIC”!

## A new approach to Bayesian hypothesis testing

Posted in Books, Statistics with tags Bayes factor, Bayesian decision theory, Bayesian tests of hypotheses, deviance, improper prior, Kullback-Leibler divergence on September 8, 2016 by xi'an

“The main purpose of this paper is to develop a new Bayesian hypothesis testing approach for the point null hypothesis testing (…) based on the Bayesian deviance and constructed in a decision theoretical framework. It can be regarded as the Bayesian version of the likelihood ratio test.”

**T**his paper got published in *Journal of Econometrics* two years ago but I only read it a few days ago when Kerrie Mengersen pointed it out to me. Here is an interesting criticism of Bayes factors.

“In the meantime, unfortunately, Bayes factors also suffers from several theoretical and practical difficulties. First, when improper prior distributions are used,Bayes factorscontains undefined constants and takes arbitrary values (…) Second, when a proper but vague prior distribution with a large spread is used to represent prior ignorance,Bayes factorstends to favour the null hypothesis. The problem may persist even whenthe sample size is large (…) Third, the calculation ofBayes factorsgenerally requires the evaluation of marginal likelihoods. In many models, the marginal likelihoods may be difficult to compute.”

I completely agree with these points, which are part of a longer list in our testing by mixture estimation paper. The authors also rightly blame the rigidity of the 0-1 loss function behind the derivation of the Bayes factor. An alternative decision-theoretic based on the Kullback-Leibler distance has been proposed by José Bernardo and Raúl Rueda, in a 2002 paper, evaluating the average divergence between the null and the full under the full, with the slight drawback that any nuisance parameter has the same prior under both hypotheses. (Which makes me think of the Savage-Dickey paradox, since everything here seems to take place under the alternative.) And the larger drawback of requiring a lower bound for rejecting the null. (Although it could be calibrated under the null prior predictive.)

This paper suggests using instead the difference of the Bayesian deviances, which is the expected log ratio integrated against the posterior. (With the possible embarrassment of the quantity having no prior expectation since the ratio depends on the data. But after all the evidence or marginal likelihood faces the same “criticism”.) So it is a sort of Bayes factor on the logarithms, with a strong similarity with Bernardo & Rueda’s solution since they are equal in expectation under the marginal. As in Dawid et al.’s recent paper, the logarithm removes the issue with the normalising constant and with the Lindley-Jeffreys paradox. The approach then needs to be calibrated in order to define a decision bound about the null. The asymptotic distribution of the criterion is χ²(p)−p, where p is the dimension of the parameter to be tested, but this sounds like falling back on frequentist tests. And the deadly .05% bounds. I would rather favour a calibration of the criterion using prior or posterior predictives under both models…

## MDL multiple hypothesis testing

Posted in Books, pictures, Statistics, Travel, University life with tags Australia, Bayesian tests of hypotheses, EM algorithm, minimal description length principle, mixtures of distributions, Monash University, Robert Menzies, seminar, statistical tests, Victoria on September 1, 2016 by xi'an

“This formulation reveals an interesting connection between multiple hypothesis testing and mixture modelling with the class labels corresponding to the accepted hypotheses in each test.”

**A**fter my seminar at Monash University last Friday, David Dowe pointed out to me the recent work by Enes Makalic and Daniel Schmidt on minimum description length (MDL) methods for multiple testing as somewhat related to our testing by mixture paper. Work which appeared in the proceedings of the *4th Workshop on Information Theoretic Methods in Science and Engineering (WITMSE-11)*, that took place in Helsinki, Finland, in 2011. Minimal encoding length approaches lead to choosing the model that enjoys the smallest coding length. Connected with, e.g., Rissannen‘s approach. The extension in this paper consists in considering K hypotheses at once on a collection of m datasets (the *multiple* then bears on the datasets rather than on the hypotheses). And to associate an hypothesis index to each dataset. When the objective function is the sum of (generalised) penalised likelihoods [as in BIC], it leads to selecting the “minimal length” model for each dataset. But the authors introduce weights or probabilities for each of the K hypotheses, which indeed then amounts to a mixture-like representation on the exponentiated codelengths. Which estimation by optimal coding was first proposed by Chris Wallace in his book. This approach eliminates the model parameters at an earlier stage, e.g. by maximum likelihood estimation, to return a quantity that only depends on the model index and the data. *In fine*, the purpose of the method differs from ours in that the former aims at identifying an appropriate hypothesis for each group of observations, rather than ranking those hypotheses for the entire dataset by considering the posterior distribution of the weights in the later. The mixture has somehow more of a substance in the first case, where separating the datasets into groups is part of the inference.