The paper by Peter Grünwald, Rianne de Heide and Wouter Koolen on safe testing was read before the Royal Statistical Society at a meeting organized by the Research Section on Wednesday, 24 January 2024, after many years in the making, to the point that several papers building on this initial one have appeared in the meantime, including some submissions to Biometrika and one in the current issue of Statistical Science dedicated to reproducibility and replicability. Joshua Bon and I wrote a discussion that synthesised the following, sometimes rambling, remarks.
Overall, this is a mind-challenging paper, with a definitely original style and content, for which the authors are to be congratulated!
“…p-values are interpreted as indicating amounts of evidence against the null, and their definition does not need to refer to any specific alternative H₁. Exactly the same holds for e-values: the basic interpretation ‘a large e-value provides evidence against H₀’ holds no matter how the e-variable is defined, as long as it satisfies (1). If they are defined relative to H₁ that is close to the actual process generating the data they will grow fast and provide a lot of evidence, but the basic interpretation holds regardless.”
About the entry section, one may ask why a Bayesian would want to test the veracity of a null hypothesis. The debate has been raging since the early days, although Jeffreys spent two chapters of his book on the topic of testing. (In the paper, Jeffreys appears in Example 5, p.14, for his point estimation prior.) From an opposite viewpoint, the construction of e-values and the like in the paper is highly model dependent, but all models are wrong! And, more to the point, both hypotheses may turn out to be wrong in misspecified cases. The notion thus seems, on the contrary, to be very much M-closed, with no idea of what is happening under misspecified models or why rejecting H₀ is the ultimate argument.
When introducing e-values, (1) is hardly a definition per se, since otherwise E≡1 would qualify as an e-value. This is unfortunate as the topic is already confusing enough. Presumably E[E] must be larger than one under H₁, as otherwise the product of e-values would always degenerate to zero (?)
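To fix ideas on this degeneracy, here is a minimal simulation sketch (mine, not the authors'), using the textbook Gaussian likelihood-ratio e-variable E = φ(X; μ₁, 1)/φ(X; 0, 1), whose expectation is exactly one under H₀; the values μ₁ = 0.5, the horizon and the number of replications are arbitrary choices of mine. The running product of such e-values collapses towards zero when the data come from H₀ and only accumulates evidence when the expected log e-value is positive, i.e., under (or near) H₁.

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, n, reps = 0.5, 200, 1_000   # arbitrary alternative mean, horizon, replications

def e_product(mu_true):
    """Product of likelihood-ratio e-values for N(mu1,1) vs N(0,1), data from N(mu_true,1)."""
    x = rng.normal(mu_true, 1.0, size=(reps, n))
    log_e = mu1 * x - mu1**2 / 2           # log of phi(x; mu1,1) / phi(x; 0,1)
    return np.exp(log_e.sum(axis=1))       # product over the n observations

print(np.median(e_product(0.0)))   # under H0: the product collapses towards zero
print(np.median(e_product(mu1)))   # under H1: the product grows and evidence accumulates
```

The trivial choice E≡1 satisfies (1) just as well, but its product never moves, which is presumably why the authors focus on e-variables that grow under H₁.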
The points that
1. e-variables behave well under optional continuation [by a martingale reasoning],
2. they can be interpreted as ‘evidence against the null’ via gambling [unethical!],
3. they in all cases preserve frequentist Type I error guarantees,
4. e-variables turn out to be Bayes factors based on the right Haar prior [rather than sometimes with highly unusual (e.g. degenerate) priors? p.4], and
5. e-variables need more extreme data than p-values in order to reject the null
are rather worthwhile, even though 2. is vague and 3. is firmly frequentist. Any theory involving Haar priors (and, even better, amenability) cannot be all wrong, though, even considering that Haar priors are improper. The optional continuation in 1. is a nice argument from a Bayesian viewpoint, since it has also been used to defend the Bayesian approach. Point 4. brings a formal way to define least favourable priors in the testing sense. One may then wonder at the connection with the solution of Bayarri and García-Donato (Biometrika, 2007). The perspective adopted therein is somewhat the inverse of the more common stance where the prior on H₀ is the starting point [and obviously known]. So, is there any dual version of e-values where this would happen, i.e., leading to deriving the optimal prior on H₁ for a given prior on H₀? (Which would further offer a maximin interpretation.) Theorem 1 indeed sounds like the minimax=maximin result for test settings. (In Corollary 2, why is (10) necessarily a Bayes factor, given the two models?)
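As for the optional continuation point, the Type I error guarantee can be illustrated by a small sketch (again mine, reusing the same simple Gaussian likelihood-ratio e-process with arbitrary settings): rejecting as soon as the running product exceeds 1/α, however often one looks at the data, keeps the frequentist error below α by Ville's (or Markov's) inequality.

```python
import numpy as np

rng = np.random.default_rng(1)
mu1, alpha, n_max, reps = 0.5, 0.05, 500, 20_000   # arbitrary settings

# likelihood-ratio e-process for N(mu1,1) vs N(0,1), with data generated under the null
x = rng.normal(0.0, 1.0, size=(reps, n_max))
log_e_process = np.cumsum(mu1 * x - mu1**2 / 2, axis=1)   # running log product of e-values

# optional continuation: reject the null the first time the product exceeds 1/alpha
rejected = (log_e_process >= -np.log(alpha)).any(axis=1)
print(rejected.mean())   # empirical Type I error, below alpha = 0.05 by Ville's inequality
```

This is the contrast with naively recomputing a p-value at every new observation, where the Type I error is no longer controlled.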
While I first thought that the approach led to finding a proper prior, the “Almost Bayesian Case” [p.17] (ABC!!) comes to justify the use of a “common” improper prior over nuisance parameters under both hypotheses, which, while more justifiable than in the original objective Bayes literature, remains unsatisfactory to me. But I like the notion in 2.2 [p.10] that a prior chosen on H₁ forces one to adopt a particular corresponding prior on H₀, as it defines a form of automated projection that we also considered in Goutis [RIP] and Robert (Biometrika, 1998). Corollary 2 is most interesting as well. However, in the toy example of H₀ being a normal mean lying in (−a,a), the optimal prior on H₀ seems to be a point mass at ±a for any marginal m(y) centred at zero, which is a disappointing outcome when compared with the point-mass null situation. It is another disappointment that the Bayes factor cannot be an e-value since (6) fails to hold, but (1) is not (6) and one could argue that the Bayes factor is an e-value when integrating under the marginals!
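To probe that toy example numerically (with all specifics, such as m₁ being a N(0, τ²) density with τ=2 and a=1, being my own arbitrary choices rather than the authors'), one can take the symmetric two-point prior at ±a as candidate prior on H₀ and check whether E = m₁(Y)/m₀(Y) satisfies the e-value requirement E_θ[E] ≤ 1 over the whole range |θ| ≤ a:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

a, tau = 1.0, 2.0   # arbitrary bound on the null mean and spread of the alternative marginal

def m1(y):
    """Alternative marginal, centred at zero."""
    return norm.pdf(y, 0.0, tau)

def m0(y):
    """Null marginal under the candidate two-point prior at +/- a."""
    return 0.5 * norm.pdf(y, a, 1.0) + 0.5 * norm.pdf(y, -a, 1.0)

def expected_e(theta):
    """E_theta[m1(Y)/m0(Y)] for Y ~ N(theta, 1)."""
    return quad(lambda y: norm.pdf(y, theta, 1.0) * m1(y) / m0(y), -30, 30)[0]

thetas = np.linspace(-a, a, 41)
print(max(expected_e(t) for t in thetas))   # at most one if m1/m0 is a genuine e-variable
```

If the maximum stays below one, the two-point mixture at ±a is at least admissible as a null marginal for this m₁; whether it actually is the reverse information projection the paper calls for is a separate question.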
As a marginal note, the paper taught me about the term (and theme of the) tragedy of the commons, a concept developed by [the neo-Malthusian and eugenicist] Garrett Hardin.
In conclusion, we congratulate the authors on this endeavour, but it remains unclear to us (as Bayesians) (i) how to construct the least favourable prior on H₀ on a general basis, especially from a computational viewpoint, and, more importantly, (ii) whether it is at all of inferential interest [i.e., whether it degenerates into a point mass]. With respect to the sequential directions of the paper, we also wonder at the potential connections with sequential Monte Carlo, for instance towards conducting sequential model choice by efficiently constructing an amalgamated evidence value when the product of Bayes factors is not a Bayes factor (see Buchholz et al., 2023).