whetstone and alum block for Occam’s razor
A strange title if ever there was one! (The whetstone is a natural hard stone used for sharpening steel instruments, like knives, sickles, and scythes; I remember my grandfathers handling one when cutting hay and weeds. Alum is hydrated potassium aluminium sulphate and is used as a blood coagulant. Both items are naturally related to shaving and razors, if not to Occam!) The full title of the paper published by Guido Consonni, Jon Forster and Luca La Rocca in Statistical Science is “The whetstone and the alum block: balanced objective Bayesian comparison of nested models for discrete data”. The paper builds on the notions introduced at the last Valencia meeting by Guido and Luca (and discussed by Judith Rousseau and myself).
Beyond the pun (which forced me to look up “alum stone” on Wikipedia! and may be lost on other non-native readers), the point of the title is to build a prior distribution aimed at the comparison of two nested models such that those models are more sharply distinguished: Occam’s razor would thus cut more keenly when the smaller model is true (hence the whetstone) and less harshly when it is not (hence the alum block)… The solution proposed by the authors is to replace the reference prior on the larger model, π1, with a moment prior à la Johnson and Rossell (2010, JRSS B) and then to turn this moment prior into an intrinsic prior à la Pérez and Berger (2002, Biometrika), making it an “intrinsic moment” prior. The first transform turns π1 into a non-local prior, with the aim of correcting for the imbalanced convergence rates of the Bayes factor under the null and under the alternative (this is the whetstone). The second transform accumulates more mass in the vicinity of the null model (this is the alum block). (While I like the overall perspective on intrinsic priors, the introduction is a wee bit confusing about them, e.g. when it mentions fictitious observations instead of predictives.)
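For readers not familiar with these two ingredients, here is a minimal sketch of the kind of construction involved, as I understand it and in my own generic notation (the baseline function g, the moment order h, and the training sample x are placeholders rather than the paper’s notation): the whetstone step replaces the reference prior π1 with a moment prior
\[
\pi_1^M(\theta) \;\propto\; g(\theta)^{h}\,\pi_1(\theta), \qquad g(\theta)=0 \ \text{for}\ \theta\in\Theta_0,
\]
which vanishes on the null parameter subset Θ0, while the alum-block step turns this moment prior into an expected-posterior (intrinsic) prior
\[
\pi_1^{IM}(\theta) \;=\; \int \pi_1^M(\theta\mid x)\, m_0(x)\,\mathrm{d}x,
\]
where x denotes an imaginary training sample of minimal size and m0 its prior predictive under the null model.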
Being a referee for this paper, I read it in detail (all the more readily as this is one of my favourite research topics!). Further, Guido and I have already engaged in a fruitful discussion since the last Valencia meeting, and the current paper incorporates some of our comments (and replies to others). I find the proposal of the authors clever and interesting, but not completely Bayesian. Overall, the paper provides a clearly novel methodology that calls for further studies…
My first issue is foundational: it is not fully coherent to build priors that depend on the alternative hypothesis. For one thing, they should accommodate a series of alternative hypotheses in cases where several of those are under comparison. For another, they come close to being data-dependent: once the smaller model is rejected, the prior on the larger model is either the original prior or the “intrinsic moment” modified prior. In the first case, this means that the “intrinsic moment” modified prior is not the “true” prior but is simply used for testing purposes, a deconstruction of the Bayesian inferential machine! In the second case, this means that the prior on a model is influenced by a model that is likely to be wrong and, of course, by the data that led to this conclusion. (Of course, one could reply that this is an item of prior information that intrinsic priors incorporate.) A further reservation of mine is that the approach creates an asymmetry between the two models that either has no reason to exist or should be made explicit within the prior or the loss function. (For instance, in our paper with J. Cano and D. Salmerón, both priors were transformed by the same intrinsic principle à la Pérez and Berger.) If the asymmetry is removed, there is no reason to favour the smaller model (except for a vague reference to Occam’s razor): arguments go both ways, in that the smaller model requires less prior information (meaning fewer parameters) and more information (meaning more restrictions). Lastly, the correction à la Johnson and Rossell aims at correcting the imbalanced convergence rates of the Bayes factor under the two models, namely that it goes to zero (when the smaller model holds) more slowly than it goes to infinity (when it does not). I simply fail to see why this should be a relevant argument.
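(For concreteness, here is my recollection of the rate argument in the simplest scalar case, testing θ=θ0 against θ≠θ0 under standard regularity conditions; the orders below are the ones I remember from Johnson and Rossell rather than statements taken from the paper under review: with a local prior, positive at θ0, the Bayes factor in favour of the larger model satisfies
\[
B_{10} = O_p\!\left(n^{-1/2}\right)
\]
when the null holds, while it diverges exponentially fast when the alternative holds; with a moment prior of order k, i.e. with density proportional to (θ−θ0)^{2k} times a local prior, the first rate sharpens to O_p(n^{-k-1/2}) while the exponential rate under the alternative is unaffected.)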
My second issue is methodological: in most nested models, the null hypothesis has zero mass under the alternative prior π1 because it puts hard constraints on (some of) the parameters. I thus find it alarming (from a measure-theoretic perspective) that a result like Result 2.1 assumes the alternative prior π1 to be positive over this zero-measure subset and sees this as crucial (I did not go and check Dawid, 1999). In addition, the paper does not provide a generic way of constructing moment priors, i.e. of finding a baseline function vanishing on the null hypothesis subset. (I obviously object to the claim that “to separate two nested models is a primitive operation, which does not require a decision-theoretic setup”: testing is meaningless outside a decision-theoretic framework!) The paper should at the very least include a general description of the construction of the “intrinsic moment” modified priors, rather than relying solely on examples. The calibration aspects (the choice of the distance power and of the integrated pseudo-data size) are based on graphical evaluations I cannot understand (see the bottom of page 11). The notion of a total weight of evidence is also puzzling, as it averages Bayes factors uniformly over the sampling space, even though I have no obvious weighted alternative to propose.
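(As an illustration of what such a baseline function can look like, in the simplest possible discrete setting, which is my own toy choice and not one of the paper’s examples: to test M0: θ=θ0 against M1: θ∈(0,1) for a binomial proportion, one may take
\[
g(\theta) = (\theta-\theta_0)^2,
\]
which vanishes exactly on the null subset; how to choose such a g in a canonical way when the null is a more complex subset of a multidimensional parameter space is precisely what I would like to see spelled out in general.)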
My third issue is theoretical: the authors advance several arguments to support their construction, but given that the two steps pull in opposite directions (first, the moment prior pushes mass away from the null parameter subset; second, the intrinsic prior brings mass back into the vicinity of the null parameter subset), it is difficult to ascertain generic properties of such a construction. The paper falls short of providing any argument in this direction, concentrating instead on two examples. I am thus uncertain as to how general the presented methodology is.
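To visualise how the two opposed steps interact, here is a small numerical sketch on a toy Bernoulli problem, testing θ=1/2 against a uniform prior on the alternative; the moment order, the training sample size, and the way I chain the two steps are my own illustrative choices, not the authors’ construction for their discrete-data examples:

```python
import numpy as np
from math import comb

# Grid over the Bernoulli parameter theta (toy problem: H0: theta = 1/2)
theta = np.linspace(1e-4, 1 - 1e-4, 2001)
dtheta = theta[1] - theta[0]
theta0 = 0.5

def normalise(density):
    """Rescale a density evaluated on the grid so that it integrates to one."""
    return density / (density.sum() * dtheta)

# Base (local) prior on the larger model: theta ~ Uniform(0, 1)
base = normalise(np.ones_like(theta))

# Whetstone: first-order moment prior, vanishing at the null value theta0
moment = normalise((theta - theta0) ** 2 * base)

# Alum block: expected-posterior (intrinsic-type) step, averaging the
# moment-prior posterior over imaginary training samples of size t
# generated from the null model (t = 5 is an arbitrary illustrative choice)
t = 5
intrinsic_moment = np.zeros_like(theta)
for s in range(t + 1):                        # s successes out of t imaginary trials
    m0 = comb(t, s) * 0.5 ** t                # marginal probability of the sample under H0
    posterior = normalise(theta ** s * (1 - theta) ** (t - s) * moment)
    intrinsic_moment += m0 * posterior

# The combined prior still vanishes at theta0 (the whetstone survives) while
# holding more mass near the null than the plain moment prior (the alum block)
near = np.abs(theta - theta0) < 0.1
for name, dens in [("base", base), ("moment", moment), ("intrinsic moment", intrinsic_moment)]:
    print(f"{name:>17s}: mass within 0.1 of the null = {dens[near].sum() * dtheta:.3f}")
```

On this toy example, the combined prior keeps the zero at θ=1/2 produced by the moment step while reallocating part of the mass towards the null, which is, as far as I can tell, the intended whetstone-and-alum-block behaviour; whether this balance can be characterised in general is exactly the question left open.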