hidden dangers of noninformative priors
Last year, John Seaman (III), John Seaman (Jr.), and James Stamey published a paper in The American Statistician with the title Hidden dangers of specifying noninformative priors. (It does not seem to be freely available on-line.) I gave it to my PhD students to read, with the goal of writing a critical reply to the authors. In the meantime, here are my own two cents on the paper.
“Applications typically employ Markov chain Monte Carlo (MCMC) methods to obtain posterior features, resulting in the need for proper priors, even when the modeler prefers that priors be relatively noninformative.” (p.77)
Apart from the above quote, which confuses proper priors with proper posteriors (maybe as the result of a contagious BUGS!), and which is used to focus solely and sort-of inappropriately on proper priors, there is no hard fact to bite in, but rather a collection of soft decisions and options that end up weakly supporting the authors’ thesis. (Obviously, following an earlier post, there is no such thing as a “noninformative” prior.) The paper is centred on four examples where a particular choice of (“noninformative”) prior leads to peaked or informative priors on some transform(s) of the parameters. Note that there is no definition provided for informative, non-informative, diffuse priors, except those found in BUGS with “extremely large variance” (p.77). (The quote below seems to settle on a uniform prior if one understands the “likely” as evaluated through the posterior density.) The argument of the authors is that “if parameters with diffuse proper priors are subsequently transformed, the resulting induced priors can, of course, be far from diffuse, possibly resulting in unintended influence on the posterior of the transformed parameters” (p.77).
“…a prior is informative to the degree it renders some values of the quantity of interest more likely than others.” (p.77)
The first example is about a one-covariate logistic regression. The first surprising choice is one of an identical prior on both the intercept and the regression coefficients, instead of, say, a g-prior that would rescale the coefficients according to the variation of the corresponding covariate. Since x corresponds to age, the second part of the regression varies 50 times more. When plotting the resulting logistic pdf across a few thousand simulations from the prior, the functions mostly end up as constant functions with values 0 or 1. This is not particularly realistic, since the predicted phenomenon is the occurrence of coronary heart disease. The prior is thus using the wrong scale: the simulated pdfs should have a reasonable behaviour over the range (20,100) of the covariate x, for instance by focussing on a -5 log-odds ratio at age 20 and a +5 log-odds ratio at age 100, leading to the comparison pictured below. Furthermore, the fact that the coefficient of x may be negative is also ignoring a basic issue about the model and answers the later (dishonest) criticism that “the [prior] probability is 0.5 that the ED50 is negative” (p.78). Using a flat prior in this example is just fine and would avoid criticisms about the prior behaviour, since this behaviour is then meaningless from a probabilistic viewpoint.
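To make the scale issue concrete, here is a minimal simulation sketch. The common prior standard deviation of 10 is my own assumption for illustration (it is not the value used in the paper), and the function names are hypothetical. It draws the intercept and the slope from the same normal prior, checks how many of the induced probability curves are essentially stuck at 0 or 1 over the age range (20,100), and evaluates the prior probability that ED50 = -intercept/slope is negative.

```python
import numpy as np

def sigmoid(eta):
    # Numerically stable logistic function (no overflow for large |eta|).
    return 0.5 * (1.0 + np.tanh(0.5 * eta))

def simulate_diffuse_logistic_prior(n_draws=5000, prior_sd=10.0,
                                    age_lo=20.0, age_hi=100.0, seed=0):
    """Simulate from an identical N(0, prior_sd^2) prior on both coefficients.

    prior_sd = 10 is an illustrative assumption, not the paper's setting.
    Returns the fraction of curves that are nearly constant at 0 or 1 over
    [age_lo, age_hi] and the fraction of draws with a negative ED50.
    """
    rng = np.random.default_rng(seed)
    beta0 = rng.normal(0.0, prior_sd, n_draws)  # intercept
    beta1 = rng.normal(0.0, prior_sd, n_draws)  # coefficient of age

    # The linear predictor is monotone in age, so each curve over the
    # age range is bracketed by its values at the two endpoints.
    p_lo = sigmoid(beta0 + beta1 * age_lo)
    p_hi = sigmoid(beta0 + beta1 * age_hi)

    near_zero = (p_lo < 0.01) & (p_hi < 0.01)
    near_one = (p_lo > 0.99) & (p_hi > 0.99)
    frac_constant = float(np.mean(near_zero | near_one))

    # ED50 is the age at which the probability equals 0.5.
    ed50 = -beta0 / beta1
    frac_ed50_negative = float(np.mean(ed50 < 0))
    return frac_constant, frac_ed50_negative

frac_constant, frac_ed50_negative = simulate_diffuse_logistic_prior()
print(f"curves stuck near 0 or 1: {frac_constant:.3f}")
print(f"prior P(ED50 < 0):        {frac_ed50_negative:.3f}")
```

Under this assumed prior, the overwhelming majority of simulated curves are flat at 0 or 1, and the probability of a negative ED50 sits near one half simply because the two coefficients have independent symmetric priors, so they share the same sign half of the time.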
“…in a more complicated model, it may be hard to determine the sample size beyond which induced prior influence on the posterior is negligible.” (p.79)
There is also the undercurrent in the paper (not that under!) that Bayesian inference should look like MLE inference and that if it does not then something is wrong. If the MLE outcome is “right”, there is indeed no point in running a Bayesian analysis. Strange argument. (Example 2.4 uses the MLE of the evenness as “the true evenness”, p.80.)
The second example is inspired by Barnard, McCulloch and Meng (2000, Statistica Sinica) estimating a covariance matrix with a proper hyperprior on regression coefficient variances that results in a peaked prior on the covariances. The paper falls short of demonstrating a clear impact on the posterior inference. And the solution (p.82) of using another proper prior resulting in a wider dispersion requires a prior knowledge of how wide is wide enough.
Example 2.3 is about a confusing model, inspired by Cowles (2002, Statistics in Medicine), and inferring about surrogate endpoints. If I understand correctly the model, there are two models under comparison, one with the surrogate and one without. What puzzles me is that the quantity of interest, the proportion of treatment effect, involves parameters from both models. Even if this can be turned into a meaningful quantity, the criticism that the “proportion” may take values outside (0,1) is rather dead-born as it suffices to impose a joint prior that ensures the ratio stays within (0,1). Which is the solution proposed by the authors (pp.82-83).
The fourth and last example concentrates on estimating a Shannon entropy for a vector of eight probabilities. Using a uniform (Dirichlet) prior induces a prior on the relative entropy that is concentrated on (0.5,1). Since there is nothing special about the uniform (!), re-running the evaluation with a Jeffreys prior Dir(½,½,…,½) reduces this feature, which anyway is a characteristic of the prior distribution, not of the posterior distribution which accounts for the data. The authors actually propose to use (p.83) a Dir(¼,¼,…,¼) prior, presumably on the basis that the induced prior on the evenness is then centred close to 0.5.
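The induced prior on the evenness is easy to check by simulation. The sketch below assumes that evenness means the Shannon entropy divided by its maximum log(8), which is my reading of "relative entropy" here, and the function name is hypothetical. It compares the three symmetric Dirichlet priors mentioned above.

```python
import numpy as np

def evenness_prior_draws(alpha, k=8, n_draws=20000, seed=0):
    """Draw from the prior induced on the evenness by a symmetric Dir(alpha,...,alpha).

    Evenness is taken to be the Shannon entropy divided by log(k), an
    assumption about the definition used in the paper.
    """
    rng = np.random.default_rng(seed)
    p = rng.dirichlet(np.full(k, alpha), size=n_draws)  # shape (n_draws, k)

    # Compute p * log(p) with the convention 0 * log(0) = 0, since small
    # alpha values can produce components that underflow to zero.
    plogp = np.zeros_like(p)
    positive = p > 0
    plogp[positive] = p[positive] * np.log(p[positive])

    entropy = -plogp.sum(axis=1)
    return entropy / np.log(k)

for alpha in (1.0, 0.5, 0.25):
    e = evenness_prior_draws(alpha)
    print(f"alpha={alpha:<4}  mean evenness={e.mean():.3f}  "
          f"P(evenness > 0.5)={np.mean(e > 0.5):.3f}")
```

As a sanity check on the simulation, the prior mean of the entropy under a symmetric Dirichlet is available in closed form as ψ(kα+1) − ψ(α+1), which for k=8 gives evenness means of roughly 0.83, 0.71 and 0.55 for α = 1, ½ and ¼ respectively.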
“A related solution to this problem is to specify a joint prior for meaningful summaries of the parameters in the sampling model. Then the induced prior on the original parameters can be computed.” (p.81)
I thus find the level of the criticism found in the paper rather superficial, as it either relies on a specific choice of a proper prior distribution or on ignoring basic prior information. The paper concludes with recommendations for prior checks. Again, no deficient hardware: the recommendations are mostly sensible if expressing the fact that some prior information is almost always available on some quantities of interest, as translated in the above quote. My only point of contention is the repeated reference to MLE, since it implies assessing/building the prior from the data… The most specific (if related to the above) recommendation is to use conditional mean priors as exposed in Christensen et al. (2010). (I did not spot this notion in my review of two years ago.) For instance, in the first (logistic) example, this meant putting a prior on the cdfs at age 40 and age 60. The authors picked a uniform in both cases, which sounds inconsistent with the presupposed shape of the probability function.
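For the record, here is a small sketch of how a conditional means prior translates into a prior on the logistic coefficients. It assumes the two uniform priors on the probabilities at ages 40 and 60 are independent, which is my assumption rather than a detail confirmed above, and the function names are hypothetical. The coefficients follow from solving the two logit equations.

```python
import numpy as np

def logit(p):
    # Log-odds of a probability p in (0, 1).
    return np.log(p) - np.log1p(-p)

def induced_coefficients(p_a, p_b, age_a=40.0, age_b=60.0):
    """Solve logit(p_a) = b0 + b1 * age_a and logit(p_b) = b0 + b1 * age_b."""
    beta1 = (logit(p_b) - logit(p_a)) / (age_b - age_a)
    beta0 = logit(p_a) - beta1 * age_a
    return beta0, beta1

# Independent uniform priors on the two probabilities (an assumption).
rng = np.random.default_rng(0)
p40 = rng.uniform(size=10000)
p60 = rng.uniform(size=10000)
beta0, beta1 = induced_coefficients(p40, p60)

# Share of prior draws for which the probability decreases with age.
frac_decreasing = float(np.mean(beta1 < 0))
print(f"prior P(slope < 0) = {frac_decreasing:.3f}")
```

Under this reading, half of the prior mass goes to probability curves that decrease with age, which is one way of seeing why a uniform at both ages sits uneasily with the presupposed increasing shape.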
“…it is more natural for experts to think in terms of observables than parameters…” (p.81)
In conclusion, there is nothing pathologically wrong with either this paper or the use of “noninformative” priors! Looking at induced priors on more intuitive transforms of the original parameters is a commendable suggestion, provided some intuition or prior information is indeed available on those. Using a collection of priors, including reference or invariant priors, helps as well. And (in connection with the above quote) looking at the induced dataset by simulating from the corresponding predictive cannot hurt.