Following the publication of several papers on the topic of integrated evidence (about competing models), Murray Aitkin has now published a book entitled Statistical Inference and I have now finished reading it. While I appreciate the effort made by Murray Aitkin to place his theory within a coherent Bayesian framework, I remain unconvinced of the said coherence, for reasons exposed below.
The main chapters of the book are Chapter 2 about the “Integrated Bayes/likelihood approach” and Chapter 4 about the “Unified analysis of finite populations”, Chapter 7 also containing a new proposal about “Goodness of fit and model diagnostics”. Chapter 1 is a nice introduction to frequentist, likelihood and Bayesian approaches to inference and the four remaining chapters are applications of Murray Aitkin’s principles to various models. The style of the book is quite pleasant although slightly discursive in what I (a Frenchman!) would qualify as an English style in that it often relies on intuition to develop concepts. I also think that the argument of being close to the frequentist decision (aka the p-value) too often serves as a justification in the book (see, e.g., page 43 “the p-value has a direct interpretation as a posterior probability”). As an aside, Murray Aitkin is a strong believer in plotting cdfs rather than densities to provide information about a distribution and hence cdf plots abound throughout the book. (I counted 82 pictures of them.) While the book contains a helpful array of examples and datasets, the captions of the (many) figures are too terse for my taste: the figures are certainly not self-contained and even with the help of the main text they do not always make complete sense.
“This quite small change to standard Bayesian analysis allows a very general approach to a wide range of apparently different inference problems; a particular advantage of the approach is that it can use the same noninformative priors.” page xiii
The “quite small change” advocated by Murray Aitkin (following an earlier proposal by Arthur Dempster in 1974) has been discussed twice on this Og. It consists in considering the likelihood function as a generic function of the parameter that can be considered a posteriori, hence allows for (posterior) mean, variance and quantiles… As argued by the author (Chapter 2, page 21), the “small change” has several appealing features:
- the approach is general and makes it possible to resolve the difficulties with the Bayesian processing of point null hypotheses;
- the approach allows for the use of generic noninformative priors;
- the approach handles more naturally the “vexed question of model fit”;
- the approach is “simple”.
Being still perplexed by the difficulty in juggling noninformative priors AND point null hypotheses (and, correlatively, having qualms about teaching only Bayes factor solutions to my students!), I am obviously interested in pursuing alternative solutions. The above arguments are clearly compelling but…
“A persistent criticism of the posterior likelihood approach (…) has been based on the claim that these approaches are ‘using the data twice’, or are ‘violating temporal coherence’.” page 48
This criticism is not my main reservation about the method because “using the data twice” is not a more clearly defined concept than “Occam’s razor”. When computing a Bayes factor, one could as well argue that the numerator and the denominator each use the data once… Anyway, what I cannot follow is how the “posterior” distribution of the likelihood function is justified from a Bayesian perspective. Murray Aitkin stays away from decision theory (page xiv) so there is no derivation based on a loss function or such. My difficulty with the integrated likelihood idea is (a) that the likelihood function does not exist a priori and (b) that it misses the reference to a joint distribution in the case of model comparison. The case for (a) is debatable, as Murray would presumably dispute the fact that there exists a joint distribution on the likelihood. I still consider the notion of a posterior probability that the likelihood ratio is larger than 1 to be meaningless. The case for (b) is more clearcut in that, when considering two models, a Bayesian analysis does need a joint distribution on the two sets of parameters to reach a decision, even though in the end only one set will be used. This is an old argument of mine (and others!) about Murray Aitkin’s approach. As detailed below, this point is related to the introduction of pseudo-priors by Carlin & Chib (JRSS Series B, 1995) who need arbitrarily defined [proper] distributions on the parameters “that are not”…
“The p-value is equal to the posterior probability that the likelihood ratio, for null hypothesis to alternative, is greater than 1.” page 42
“The posterior probability is p that the posterior probability of H0 is greater than 0.5.” page 43
Those two equivalent statements show that it is difficult to give a Bayesian interpretation to the method, since the two “posterior probabilities” above are incompatible. Indeed, a fundamental Bayesian property is that the posterior probability of an event related to the parameters of the model is not a random quantity but a number. To consider the “posterior probability of the posterior probability” means we are exiting the Bayesian domain, both from a logical and a philosophical viewpoint.
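The numerical equality in the first quote (as opposed to its interpretation) is easy to check in the simplest case. The sketch below assumes a normal sample with known standard deviation, a flat prior on μ and a two-sided test of μ = μ0; the setting and the values of `n`, `sigma`, `mu0` and `xbar` are my own illustrative choices, not taken from the book.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data summary: n observations from N(mu, sigma^2), sigma known.
n, sigma, mu0 = 25, 1.0, 0.0
xbar = 0.3

# Two-sided p-value of the z-test of H0: mu = mu0, i.e. 2 * (1 - Phi(z)).
z = math.sqrt(n) * abs(xbar - mu0) / sigma
p_value = math.erfc(z / math.sqrt(2))

# Under a flat prior, the posterior of mu is N(xbar, sigma^2 / n).
mu = rng.normal(xbar, sigma / math.sqrt(n), size=200_000)

# Log-likelihood ratio of the null value mu0 against each posterior draw mu
# (terms that do not depend on the mean cancel in the ratio).
log_lr = n * ((xbar - mu) ** 2 - (xbar - mu0) ** 2) / (2 * sigma ** 2)

# Monte Carlo estimate of the "posterior probability" that the ratio exceeds 1.
post_prob = float(np.mean(log_lr > 0))

print(f"p-value            : {p_value:.4f}")
print(f"P(LR > 1 | x), MC  : {post_prob:.4f}")
```

The two numbers agree up to Monte Carlo error because the likelihood ratio exceeds 1 exactly when μ is further from the sample mean than μ0 is. That this is an identity between two integrals does not, by itself, make the second quantity a Bayesian posterior probability.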
Once again, the essential chapter in the book is Chapter 2, as Murray Aitkin exposes his (foundational) reasons for choosing this new approach by integrated Bayes/likelihood. His criticism of Bayes factors is based on several points:
1. “Have we really eliminated the uncertainty about the model parameters by integration? The integrated likelihood (…) is the expected value of the likelihood. But what of the prior variance of the likelihood?” (page 47).
2. “Any expectation with respect to the prior implies that the data has not yet been observed (…) So the “integrated likelihood” is the joint distribution of random variables drawn by a two-stage process. (…) The marginal distribution of these random variables is not the same as the distribution of Y (…) and does not bear on the question of the value of θ in that population” (page 47).
3. “We cannot use an improper prior to compute the integrated likelihood. This eliminates the usual improper noninformative priors widely used in posterior inference.” (page 47).
4. “Any parameters in the priors (…) will affect the value of the integrated likelihood and this effect does not disappear with increasing sample size” (page 47).
5. “The Bayes factor is equal to the posterior mean of the likelihood ratio between the models” (meaning under the full model posterior) (page 48).
6. “The Bayes factor diverges as the prior becomes diffuse. (…) This property of the Bayes factor has been known since the Lindley/Bartlett paradox of 1957”.
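The divergence mentioned in the last item of this list can be seen on a toy case. The sketch below assumes a normal mean with known variance, a point null μ = 0 and a N(0, τ²) prior under the alternative; the function names and the data summary (`xbar`, `n`, `sigma2`) are hypothetical choices of mine.

```python
import math

def log_normal_pdf(x, var):
    """Log density at x of a centred normal with variance var."""
    return -0.5 * math.log(2 * math.pi * var) - x ** 2 / (2 * var)

def bayes_factor_01(xbar, n, sigma2, tau2):
    """B01 for H0: mu = 0 against H1: mu ~ N(0, tau2), with xbar ~ N(mu, sigma2 / n)."""
    s = sigma2 / n
    # Marginal of xbar is N(0, s) under H0 and N(0, s + tau2) under H1.
    return math.exp(log_normal_pdf(xbar, s) - log_normal_pdf(xbar, s + tau2))

xbar, n, sigma2 = 0.3, 25, 1.0      # hypothetical data summary
taus2 = [1.0, 1e2, 1e4, 1e6]        # increasingly diffuse priors
factors = [bayes_factor_01(xbar, n, sigma2, t2) for t2 in taus2]

for t2, b in zip(taus2, factors):
    print(f"tau^2 = {t2:>9.0f}   B01 = {b:.2f}")
```

With the data held fixed, the Bayes factor in favour of the null keeps growing with τ², which is the behaviour the quote refers to.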
While the difficulty 3. with improper priors is real, and while the impact of the prior modelling (4.) may have a lingering effect, the other points can be easily rejected on the ground that the posterior distribution of the likelihood is meaningless. This argument is anticipated by Murray Aitkin who protests on pages 48-49 that, given point 5., the posterior distribution must be “meaningful”, since the posterior mean is “meaningful” (!), but the interpretation of the Bayes factor as a “posterior mean” is only an interpretation of an existing integral, it does not give any validation to the analysis. (It could as well be considered as a prior mean, despite depending on the observation x!) Note also that point 2. above sounds more like an argument against the integrated likelihood/Bayes perspective. And that the representation mentioned in point 5. pertains to the Savage-Dickey paradox exposed in a recent paper of ours.
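For reference, the identity behind point 5. takes one line in the simplest setting, a point null θ = θ0 tested against an alternative with a proper prior π. The notation (f, π, m, B01) is mine and not necessarily the book's.

```latex
\mathbb{E}\left[\frac{f(x\mid\theta_0)}{f(x\mid\theta)}\,\middle|\,x\right]
  = \int \frac{f(x\mid\theta_0)}{f(x\mid\theta)}\,
         \frac{f(x\mid\theta)\,\pi(\theta)}{m(x)}\,\mathrm{d}\theta
  = \frac{f(x\mid\theta_0)}{m(x)}
  = B_{01}(x),
\qquad
m(x) = \int f(x\mid\theta)\,\pi(\theta)\,\mathrm{d}\theta .
```

Since the likelihood cancels inside the integral, the same quantity is equally the prior mean of the constant f(x|θ0)/m(x), which is the sense in which the “posterior mean” reading is only one interpretation of an existing integral.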
In the case of unrelated models, the argument against using posterior distributions of the likelihoods and of related terms is that the approach leads to using, in parallel, simulations from the posteriors under each model. As discussed in an earlier post, this is not possible. The book suggests running model comparison via the distribution of the likelihood ratio values

$$L_i(\theta_i|x) \big/ L_k(\theta_k|x)$$

where the $\theta_i$’s and $\theta_k$’s are drawn from the respective posteriors. This seems very close to Steve Scott’s (JASA, 2002) and to Peter Congdon’s (CSDA, 2006) mistaken solutions analysed in our Bayesian Analysis paper with Jean-Michel Marin, in that MCMC runs are run for each model separately and the samples are gathered together to produce either the posterior expectation (in Scott’s case) or the posterior distribution (for the current book) of

$$p_i L_i(\theta_i|x) \Big/ \sum_k p_k L_k(\theta_k|x)$$

which do not correspond to genuine Bayesian solutions (see our Bayesian Analysis paper). Again, this is not so much because the data x is used repeatedly in this process (since reversible jump MCMC produces as well separate samples from the different posteriors) as because of the lack of a common ground or referential that is needed in the Bayesian framework. This means, e.g., that the integrated likelihood/Bayes technology is producing samples from the product of the posteriors (a product that clearly is not defined in a Bayesian framework) instead of using pseudo-priors as in Carlin and Chib (Series B, 1995), i.e. considering a joint posterior on $(\theta_1,\theta_2)$, which is [proportional to]

$$p_1 m_1(x) \pi_1(\theta_1|x) \pi_2(\theta_2) + p_2 m_2(x) \pi_2(\theta_2|x) \pi_1(\theta_1).$$
This of course makes a difference in the outcome, as shown in the R graph produced in an earlier post, which compares the distribution of the likelihood ratio under the true posterior and under the product of posteriors. (Again, this is inherently the same flaw found in the reasoning leading to Scott’s and Congdon’s solutions.) Section 2.9.2 also relates the problems with the Bayes factor to the harmonic mean identity (Newton and Raftery, JRSS Series B, 1994), while the latter is a mere (unstable) computational tool to “avoid the prior integral problem”.
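To give a feel for the size of the difference, here is a minimal sketch in Python (not the R code of the earlier post). It assumes two toy normal models with known variances, N(0, 1) priors on both means, equal prior model weights, and each prior reused as the pseudo-prior of its own parameter; all of these choices and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_lik(x, theta, sigma2):
    """Log-likelihood of i.i.d. N(theta, sigma2) data, one value per theta."""
    theta = np.asarray(theta)[:, None]
    terms = -0.5 * np.log(2 * np.pi * sigma2) - (x - theta) ** 2 / (2 * sigma2)
    return terms.sum(axis=1)

def posterior(x, sigma2):
    """Mean and variance of theta | x under the (assumed) N(0, 1) prior."""
    var = 1.0 / (len(x) / sigma2 + 1.0)
    return var * x.sum() / sigma2, var

def log_marginal(x, sigma2):
    """Closed-form log m(x) for the same conjugate normal model."""
    n, xbar = len(x), x.mean()
    s = ((x - xbar) ** 2).sum()
    v = sigma2 / n + 1.0          # prior predictive variance of xbar
    return (-0.5 * n * np.log(2 * np.pi * sigma2) - s / (2 * sigma2)
            + 0.5 * np.log(2 * np.pi * sigma2 / n)
            - 0.5 * np.log(2 * np.pi * v) - xbar ** 2 / (2 * v))

x = rng.normal(0.5, 1.0, size=20)   # toy data
s1, s2 = 1.0, 4.0                   # known variances of models 1 and 2
T = 10_000
(m1, v1), (m2, v2) = posterior(x, s1), posterior(x, s2)

# (i) Product of posteriors: independent draws from each model's posterior.
llr_product = (log_lik(x, rng.normal(m1, np.sqrt(v1), T), s1)
               - log_lik(x, rng.normal(m2, np.sqrt(v2), T), s2))

# (ii) Joint posterior with priors as pseudo-priors: pick a model with its
# posterior probability, draw its parameter from its posterior and the
# other model's parameter from its N(0, 1) prior.
w1 = 1.0 / (1.0 + np.exp(log_marginal(x, s2) - log_marginal(x, s1)))
in_m1 = rng.random(T) < w1
th1 = np.where(in_m1, rng.normal(m1, np.sqrt(v1), T), rng.normal(0.0, 1.0, T))
th2 = np.where(in_m1, rng.normal(0.0, 1.0, T), rng.normal(m2, np.sqrt(v2), T))
llr_joint = log_lik(x, th1, s1) - log_lik(x, th2, s2)

print("P(M1 | x) =", round(float(w1), 3))
print("sd of log LR, product of posteriors:", llr_product.std())
print("sd of log LR, joint posterior:      ", llr_joint.std())
```

In the joint version one of the two parameters is always drawn from its prior, so the log likelihood ratio is far more dispersed than under the product of posteriors. The pseudo-priors are arbitrary, and other choices would give other distributions.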
“Without a specific alternative, the best we can do is to make posterior probability statements about μ and transfer these to the posterior distribution of the likelihood ratio.” page 42
“There cannot be strong evidence in favor of a point null hypothesis against a general alternative hypothesis.” page 44
Once Murray Aitkin has set the principle of using the posterior distribution of the likelihood ratio (or rather of the divergence difference since this is at least symmetric in both hypotheses), there is a whole range of output available to him including confidence intervals on the difference, for checking whether or not they contain zero. This is appealing but (a) is not Bayesian for reasons exposed above, (b) is not parameterisation invariant, (c) relies once again on an arbitrary confidence level.
“The posterior has a nonintegrable spike at zero. This is equivalent to assigning zero prior probability to these unobserved values.” page 98
I will not discuss here Chapter 4, “a diversion from the main theme of the book” (page 91), except to criticise the use of Haldane’s prior (which does not allow for empty cells in a contingency table) and the corresponding (mathematically false) semantic switch reproduced above. Similarly, Chapter 7 “points out difficulties with the posterior predictive distribution” (page 172) that I refrain from addressing here (namely, that the predictive lacks variability, page 173), despite some puzzling statements like the one that, a marginal distribution (meaning density) being “the mean of the conditional distribution, (…) there is more information about this distribution in the variance of the conditional distribution averaged over X” (page 176). (For one thing, not every density is in L2, i.e. square-integrable…) Chapter 8 covers mixtures of distributions and relates to our (critical) 2006 Bayesian Analysis paper on DIC, but does not end up as critical despite the variety of DIC criteria. (This part on mixtures is also related to a chapter Murray Aitkin contributed to our volume on mixtures.)