uniformly most powerful Bayesian tests???
“The difficulty in constructing a Bayesian hypothesis test arises from the requirement to specify an alternative hypothesis.”
Valen Johnson published (and arXived) a paper in the Annals of Statistics on uniformly most powerful Bayesian tests. This is in line with Valen's earlier writings on the topic and is good-quality mathematical statistics, but I cannot really buy the arguments contained in the paper as being compatible with (my view of) Bayesian tests. A “uniformly most powerful Bayesian test” (abbreviated UMPBT) is defined as
“UMPBTs provide a new form of default, nonsubjective Bayesian tests in which the alternative hypothesis is determined so as to maximize the probability that a Bayes factor exceeds a specified threshold”
which means selecting the prior under the alternative so that the frequentist probability of the Bayes factor exceeding the threshold is maximal for all values of the parameter. This does not sound very Bayesian to me, for several reasons: it averages over all possible values of the observations x; it compares probabilities for all values of the parameter θ rather than integrating against a prior or posterior; it selects the prior under the alternative with the sole purpose of favouring the alternative, so that its further use once the null is rejected is not considered at all; and it caters to non-Bayesian theories, i.e. it tries to sell Bayesian tools as supplementing p-values and argues the method is objective because the solution enjoys a frequentist coverage property. (At best, this maximisation of the rejection probability reminds me of minimaxity, except there is no clear and generic notion of minimaxity in hypothesis testing.)
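To make the construction concrete, here is a minimal numerical sketch in the simplest setting treated in the paper, a one-sided test of a normal mean with known variance: the rejection region {BF > γ} is a half-line in the sample mean, and the UMPBT alternative is the value μ₁ that minimises its lower boundary, hence maximising the rejection probability for every θ at once. (The grid search and variable names are mine, not the paper's.)

```python
import numpy as np

# One-sided normal test: H0: mu = 0 vs H1: mu = mu1, known sigma.
# BF(xbar) = exp(n*mu1*xbar/sigma^2 - n*mu1^2/(2*sigma^2)), so
# BF > gamma  <=>  xbar > t(mu1) = sigma^2*log(gamma)/(n*mu1) + mu1/2.
# The UMPBT alternative minimises this boundary t(mu1) over mu1 > 0,
# which pushes up P(BF > gamma) simultaneously for every value of mu.

n, sigma, gamma = 25, 1.0, 10.0

def boundary(mu1):
    """Lower boundary of the rejection region {BF > gamma}, in xbar."""
    return sigma**2 * np.log(gamma) / (n * mu1) + mu1 / 2

# Grid search for the minimising alternative...
grid = np.linspace(1e-3, 2.0, 200_000)
mu1_grid = grid[np.argmin(boundary(grid))]

# ...against the closed-form solution mu1 = sigma*sqrt(2*log(gamma)/n).
mu1_closed = sigma * np.sqrt(2 * np.log(gamma) / n)

print(mu1_grid, mu1_closed)  # the two agree
```

Note that the optimal μ₁ shrinks like 1/√n for a fixed γ, a fact that resurfaces below in connection with the Jeffreys-Lindley paradox.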
“Unfortunately, subjective Bayesian testing procedures have not been—and will likely never be—generally accepted by the scientific community. In most testing problems, the range of scientific opinion regarding the magnitude of violations from a standard theory is simply too large to make the report of a single, subjective Bayes factor worthwhile. Furthermore, scientific journals have demonstrated an unwillingness to replace the report of a single p-value with a range of subjectively determined Bayes factors or posterior model probabilities.”
I also object to the definition of the uniformly most powerful Bayesian tests, starting with the alien notion of a “true” prior density (p.6) that would be misspecified, corresponding for frequentists to “a point mass concentrated on the true value” and for Bayesians to the summary of prior information, which is “not available”. Again, Bayesians and non-Bayesians alike have no reason to buy this presentation of the prior under the alternative. Nor do I see why one should compare the probability of rejecting H0 in favour of H1 for every value of θ when (a) a prior on H1 is used to define the Bayes factor, (b) the conditioning on the data inherent to the Bayesian approach is lost, (c) the boundary or threshold γ is fixed, and (d) the order thus induced is incomplete (as in minimax problems), hence unlikely to produce a solution except in stylised settings such as the one-dimensional exponential families treated in the paper (and in the classical UMP literature). A more theoretical remark is that the prior behind a uniformly most powerful Bayesian test is quite likely to be atomic, while the natural dominating measure is the Lebesgue measure. A final remark is that those tests are not uniformly most powerful unless one adopts a new definition of UMP tests.
“…the tangible consequence of a Bayesian hypothesis test is often the rejection of one hypothesis in favor of the second (…) It is therefore of some practical interest to determine alternative hypotheses that maximize the probability that the Bayes factor from a test exceeds a specified threshold”.
The question that the above quote begets is: why?! In the Bayesian approach, the definition of the alternative hypothesis is paramount. Replacing a genuine alternative with one spawned by the null hypothesis voids the appeal of the approach, turning hypothesis testing into a goodness-of-fit assessment (for which my own, if unimplemented, proposal is to use a non-parametric Bayesian model as the alternative). And the above argument does not really make a point: why would we look for the alternative that is most against H0? As debated in Spanos (2013) and answered in my reassessment of the Jeffreys-Lindley paradox, there are many alternative values that are more likely than the null; this does not make them of particular interest or bound to support an alternative prior.
“Thus, the posterior probability of the null hypothesis does not converge to 1 as the sample size grows. The null hypothesis is never fully accepted—nor the alternative rejected—when the evidence threshold is held constant as n increases.”
The whole notion of an abstract and fixed threshold γ is also a point of contention. Keeping it fixed leads to the Jeffreys-Lindley paradox. Assuming a golden number like 3 (even though I like to use the number 3 as my default number!) makes no more sense than using 0.05 or 5σ as the constant bound in frequentist statistics. Even the Neyman-Pearson perspective on tests relies on a Type I error decreasing with the sample size in order to have both types of error decreasing with n. This aspect greatly jeopardises the whole construct of uniformly most powerful Bayesian tests, as they depend on a parameter γ whose choice remains arbitrary, unconnected with a loss function and orthogonal to any kind of prior information. The fact that the “behavior of UMPBTs with fixed evidence thresholds is similar to the Jeffreys-Lindley paradox” (p.11) is not very surprising, because this is the essence of the Jeffreys-Lindley paradox…
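In the one-dimensional normal setting above, the fixed-γ issue can be checked directly: with the UMPBT alternative μ₁ = σ√(2 log γ/n), the rejection boundary on the sample mean works out to the same quantity σ√(2 log γ/n), so the frequentist probability of crossing the threshold under the null is Φ(−√(2 log γ)), free of n. A quick sketch (my own computation, under the known-variance normal assumptions; the particular value γ ≈ 3.87 is roughly the one matching a 0.05 level):

```python
import math

# Under H0, xbar ~ N(0, sigma^2/n).  With the UMPBT alternative
# mu1 = sigma*sqrt(2*log(gamma)/n), the region {BF > gamma} reduces to
# xbar > sigma*sqrt(2*log(gamma)/n), hence
# P(BF > gamma | H0) = 1 - Phi(sqrt(2*log(gamma))): constant in n.

def std_normal_sf(z):
    """Survival function of the standard normal, via erfc."""
    return 0.5 * math.erfc(z / math.sqrt(2))

gamma = 3.87  # roughly exp(1.645^2 / 2)
# n never enters the rejection probability: the threshold never tightens.
alphas = [std_normal_sf(math.sqrt(2 * math.log(gamma))) for n in (10, 100, 10_000)]
print(alphas)  # about 0.05 at every sample size
```

So with a fixed γ the test behaves like a fixed-level frequentist test, and the posterior probability of a true null cannot converge to 1, which is exactly the fixed-threshold reading of the Jeffreys-Lindley paradox.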
“The simultaneous report of default Bayes factors and p-values may play a pivotal role in dispelling the perception held by many scientists that a p-value of 0.05 corresponds to “significant” evidence against the null hypothesis (…) the report of Bayes factors based upon [UMPBTs] may lead to more realistic interpretations of evidence obtained from scientific studies.”
The paper has a section on the CERN Higgs boson experiment, but I do not see any added value in using a uniformly most powerful Bayesian test there, and there is no confirmation or disconfirmation of the Higgs discovery from this quarter (reminding me of the physicist's off-the-record remark when we went to discuss Bayes' theorem on France Culture).
The conclusion I draw from this paper is that the notion proposed by Valen is a purely frequentist one, using the Bayes factor as the test statistic instead of another divergence statistic. This is not enough to turn the whole principle into a Bayesian one, and I do not see how it advances the specification of Bayesian tests.
(Disclaimer: I was not involved at any stage in the editorial processing of the paper!)