uniformly most powerful Bayesian tests???

“The difficulty in constructing a Bayesian hypothesis test arises from the requirement to specify an alternative hypothesis.”

Valen Johnson published (and arXived) a paper in the Annals of Statistics on uniformly most powerful Bayesian tests. It is in line with Valen's earlier writings on the topic and is good-quality mathematical statistics, but I cannot really buy the arguments contained in the paper as being compatible with (my view of) Bayesian tests. A “uniformly most powerful Bayesian test” (acronymed as UMPBT) is defined as

“UMPBTs provide a new form of default, nonsubjective Bayesian tests in which the alternative hypothesis is determined so as to maximize the probability that a Bayes factor exceeds a specified threshold”

which means selecting the prior under the alternative so that the frequentist probability of the Bayes factor exceeding the threshold is maximal, uniformly over the parameter values. This does not sound very Bayesian to me, indeed: it averages over all possible values of the observation x and compares probabilities across all values of the parameter θ rather than integrating against a prior or posterior; it selects the prior under the alternative with the sole purpose of favouring the alternative, meaning its further use once the null is rejected is not considered at all; and it caters to non-Bayesian theories, i.e. it tries to sell Bayesian tools as supplements to p-values and argues the method is objective because the solution satisfies a frequentist coverage property. (At best, this maximisation of the rejection probability reminds me of minimaxity, except there is no clear and generic notion of minimaxity in hypothesis testing.)
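In the one-sided normal-mean illustration (a stylised setting of the kind treated in the paper), the UMPBT point alternative can be worked out in closed form as μ1 = μ0 + σ√(2 log γ/n). The Python sketch below is my own illustration of that closed form and of the frequentist flavour of the resulting rejection region; it is not a reimplementation of the paper's general construction, and all the numbers are invented for the demo.

```python
import numpy as np
from scipy.stats import norm

# One-sided normal-mean illustration (sigma known): H0: mu = mu0 against a
# point alternative mu1 > mu0.  BF10 > gamma is equivalent to the sample mean
# exceeding a cutoff, and the UMPBT choice of mu1 minimises that cutoff,
# giving mu1 = mu0 + sigma*sqrt(2*log(gamma)/n) in this stylised setting.

def bayes_factor_point(xbar, n, mu0, mu1, sigma):
    """Bayes factor of the point alternative mu1 against the point null mu0."""
    return np.exp(n * ((xbar - mu0) ** 2 - (xbar - mu1) ** 2) / (2 * sigma ** 2))

def umpbt_alternative(mu0, sigma, n, gamma):
    """Point alternative maximising P(BF10 > gamma) uniformly in the true mean."""
    return mu0 + sigma * np.sqrt(2 * np.log(gamma) / n)

mu0, sigma, n, gamma = 0.0, 1.0, 50, 3.0
mu1 = umpbt_alternative(mu0, sigma, n, gamma)

# The rejection region {BF10 > gamma} corresponds to a fixed z-threshold,
# sqrt(2*log(gamma)), whatever the sample size: a frequentist-looking test.
z_cut = np.sqrt(2 * np.log(gamma))
xbar_cut = mu0 + sigma * z_cut / np.sqrt(n)
print(f"UMPBT alternative: mu1 = {mu1:.3f}")
print(f"BF10 at the cutoff (should equal gamma): "
      f"{bayes_factor_point(xbar_cut, n, mu0, mu1, sigma):.3f}")
print(f"Implied Type I error of the region: {1 - norm.cdf(z_cut):.3f}")
```

With γ = 3 the implied z-threshold is about 1.48, i.e. a constant Type I error of roughly 7%, whatever the sample size.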

“Unfortunately, subjective Bayesian testing procedures have not been—and will likely never be—generally accepted by the scientific community. In most testing problems, the range of scientific opinion regarding the magnitude of violations from a standard theory is simply too large to make the report of a single, subjective Bayes factor worthwhile. Furthermore, scientific journals have demonstrated an unwillingness to replace the report of a single p-value with a range of subjectively determined Bayes factors or posterior model probabilities.”

I also object to the definition of uniformly most powerful Bayesian tests, starting with the alien notion of a “true” prior density (p.6) that would be misspecified, corresponding to “a point mass concentrated on the true value” for frequentists and, for Bayesians, to the summary of prior information that is “not available”. Again, Bayesians and non-Bayesians alike have no reason to buy this presentation of the prior under the alternative. Nor do I see why one should compare the probability of rejecting H0 in favour of H1 for every value of θ when (a) a prior on H1 is used to define the Bayes factor, (b) the conditioning on the data inherent to the Bayesian approach is lost, (c) the boundary or threshold γ is fixed, and (d) the order thus induced is incomplete (as in minimax problems), hence unlikely to produce a solution except in stylised settings such as the one-dimensional exponential families treated in the paper (and in the classical UMP literature). A more theoretical remark is that the prior behind a uniformly most powerful Bayesian test is quite likely to be atomic, while the natural dominating measure is the Lebesgue measure. A final remark is that those tests are not uniformly most powerful unless one adopts a new definition of UMP tests.

“…the tangible consequence of a Bayesian hypothesis test is often the rejection of one hypothesis in favor of the second (…) It is therefore of some practical interest to determine alternative hypotheses that maximize the probability that the Bayes factor from a test exceeds a specified threshold”.

The question the above quote begets is: why?! In the Bayesian approach, the definition of the alternative hypothesis is paramount. Replacing a genuine alternative with one spawned by the null hypothesis voids the appeal of the approach, turning hypothesis testing into a goodness-of-fit assessment (for which my own, if unimplemented, proposal is to use non-parametric Bayesian modelling as the alternative). And the above argument does not really make a point: why would we look for the alternative that is most against H0? As debated in Spanos (2013) and answered in my reassessment of the Jeffreys-Lindley paradox, there are many alternative values that are more likely than the null. This does not make them of particular interest or bound to support an alternative prior.

“Thus, the posterior probability of the null hypothesis does not converge to 1 as the sample size grows. The null hypothesis is never fully accepted—nor the alternative rejected—when the evidence threshold is held constant as n increases.”

The whole notion of an abstract and fixed threshold γ is also a point of contention. Keeping it fixed leads to the Jeffreys-Lindley paradox. Adopting a golden number like 3 (even though I like to use 3 as my default number!) makes no more sense than using 0.05 or 5σ as the constant bound in frequentist statistics. Even the Neyman-Pearson perspective on tests relies on a Type I error that decreases with the sample size, in order to have both types of error decrease with n. This aspect greatly jeopardises the whole construct of uniformly most powerful Bayesian tests, as they depend on a parameter γ whose choice remains arbitrary, unconnected with a loss function and orthogonal to any kind of prior information. That the “behavior of UMPBTs with fixed evidence thresholds is similar to the Jeffreys-Lindley paradox” (p.11) is not very surprising, because this is the essence of the Jeffreys-Lindley paradox…
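To see this numerically in the same normal-mean sketch as above (equal prior weights on the two hypotheses are assumed for the posterior probability): under H0 the Bayes factor against the UMPBT point alternative reduces to exp{√(2 log γ)·z − log γ} with z standard normal whatever n, so with γ held fixed neither the rejection probability nor the posterior probability of H0 moves with the sample size.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 3.0

# Under H0 in the normal-mean sketch, z = sqrt(n)*(xbar - mu0)/sigma is N(0,1)
# whatever n, and the Bayes factor against the UMPBT point alternative reduces
# to BF10 = exp(sqrt(2*log(gamma))*z - log(gamma)): its distribution is free of n.
for n in (10, 100, 10_000, 1_000_000):
    z = rng.standard_normal(200_000)          # distribution of z under H0, any n
    bf10 = np.exp(np.sqrt(2 * np.log(gamma)) * z - np.log(gamma))
    post_h0 = 1 / (1 + bf10)                  # assumes equal prior weights on H0, H1
    print(f"n={n:>9}  P(BF10 > gamma)={np.mean(bf10 > gamma):.3f}  "
          f"mean posterior P(H0|x)={post_h0.mean():.3f}")
```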

“The simultaneous report of default Bayes factors and p-values may play a pivotal role in dispelling the perception held by many scientists that a p-value of 0.05 corresponds to “significant” evidence against the null hypothesis (…) the report of Bayes factors based upon [UMPBTs] may lead to more realistic interpretations of evidence obtained from scientific studies.”

The paper has a section on the CERN Higgs boson experiment, but I do not see any added value in using a uniformly most powerful Bayesian test there, and no confirmation or refutation of the Higgs discovery comes from this quarter (reminding me of the physicist’s off-the-record remark when we went to discuss Bayes’ theorem on France Culture).

The conclusion I draw from this paper is that the notion proposed by Valen is a purely frequentist one, using the Bayes factor as the test statistic rather than another divergence statistic. This is not enough to turn the whole principle into a Bayesian one, and I do not see how it advances the specification of Bayesian tests.

(Disclaimer: I was not involved at any stage in the editorial processing of the paper!)

9 Responses to “uniformly most powerful Bayesian tests???”

  1. A follow-up paper just came out in PNAS:

    http://m.pnas.org/content/early/2013/10/28/1313476110.full.pdf?with-ds=yes

    One comment concerns the word “equipoise”, used to justify the equal priors on the alternative hypotheses. If one were to adhere to standard clinical-trial vernacular (and the second paper does appeal to a medical/clinical audience), equipoise refers to an average or group state of uncertainty. So, to be consistent in terminology and mathematical representation, the alternative hypothesis should be framed as an ensemble of hypotheses which are on average uncertain.

  2. nothing to do with this, but this acronym is an anagram of Ultra-Trail du Mont Blanc.

  3. Christian: I haven’t read this paper, but your post reminds me of a confusion I keep seeing regarding the status of the alternative hypothesis in the land of best tests and certainly in the severity construal of N-P statistics: It’s not just a single null, there’s a test hypothesis and discrepancies from it. (And I can juggle the test hypothesis and alternatives.) The exception is in the case of so-called “pure” significance tests (and even then, operationalizing requires considering types of alternatives). Moreover, the form of inference of interest is not to the parameter values most likely given data, but to discrepancies that are either well indicated, or well ruled out, with severity. A very different picture of statistical inference emerges. I realize this is not directly related to the paper you discuss, but it has come up before.

    • Deborah: Thanks. My criticism of this paper is not about the point null (which may be relevant in exceptional cases like testing for ESP or for the boundary value of the Hubble constant), but about not caring about the alternative. If the null is rejected, a new prior must be constructed, which is an incoherent requirement…

  4. Dan Simpson Says:

    I know Annals is famous for its somewhat opaque style, but surely someone in the review process should’ve asked what on earth a ‘true prior’ is (and the distribution representing the total amount of information prior to the experiment is so vague as to be useless).

    I also don’t understand what equation (4) means [which is as far as I’ve gotten]. Clearly it is algebraically true, but what does that expectation actually mean? It’s the expectation of the log-Bayes factor w.r.t. the “true marginal likelihood”, but why would that be important?

    Obviously, if you took the expectation with respect to any marginal likelihood, the prior corresponding to that marginal likelihood would maximise the expected weight of evidence, so what’s so special about this one? It seems like circular logic (the true prior is the best because it maximises the expected WOE defined w.r.t. the true prior).
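    To spell out that circularity (a sketch, assuming equation (4) is the expectation of the log Bayes factor under the marginal induced by the “true prior”): write $m_0$, $m_\pi$ and $m_*$ for the marginal likelihoods under the null, under an arbitrary prior $\pi$, and under the “true prior”, respectively. Then

    $$
    \mathbb{E}_{m_*}\!\left[\log\frac{m_\pi(x)}{m_0(x)}\right]
    = \mathbb{E}_{m_*}\!\left[\log\frac{m_*(x)}{m_0(x)}\right]
    - \mathrm{KL}\!\left(m_*\,\|\,m_\pi\right)
    \le \mathbb{E}_{m_*}\!\left[\log\frac{m_*(x)}{m_0(x)}\right],
    $$

    with equality if and only if $m_\pi = m_*$: the true prior maximises the expected weight of evidence only because the expectation is itself taken under the true-prior marginal.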

    • Dan:

      As Christian and I have discussed, there can definitely be such a thing as a “true prior.” All you need is a model in which the parameter theta has been drawn from some distribution or process. The true prior is the distribution of those thetas. This sort of thing occurs, for example, in genetics (where the mixing of genes provides a sampling process) or, more generally, any time you are applying a statistical method repeatedly (for example, spell checking): the true prior is the distribution of true values under all the cases where the method is being applied.
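      A minimal sketch of this point (toy numbers; the normal prior and noise level are invented for the illustration): draw many thetas from a common distribution, observe each with noise, and check that posterior intervals built from that same distribution are calibrated across the repeated applications.

      ```python
      import numpy as np

      rng = np.random.default_rng(1)

      # Toy version of the "repeated application" setting: each case has its own
      # theta drawn from a common distribution (the "true prior"), we observe a
      # noisy measurement, and the posterior built from that prior is calibrated
      # across cases.
      tau, sigma, n_cases = 1.0, 0.5, 100_000
      theta = rng.normal(0.0, tau, n_cases)          # true values across applications
      x = theta + rng.normal(0.0, sigma, n_cases)    # one observation per case

      # Posterior for each case under the true prior: N(post_mean, post_var)
      post_var = 1 / (1 / tau**2 + 1 / sigma**2)
      post_mean = post_var * x / sigma**2
      post_sd = np.sqrt(post_var)

      # Frequency calibration over cases: nominal 90% intervals cover ~90% of thetas
      lo, hi = post_mean - 1.645 * post_sd, post_mean + 1.645 * post_sd
      print(f"empirical coverage of 90% posterior intervals: "
            f"{np.mean((theta > lo) & (theta < hi)):.3f}")
      ```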
