hidden dangers of noninformative priors

Last year, John Seaman (III), John Seaman (Jr.), and James Stamey published a paper in The American Statistician with the title Hidden dangers of specifying noninformative priors. (It does not seem to be freely available on-line.) I gave it to my PhD students to read, with the goal of writing a critical reply to the authors. In the meantime, here are my own two cents on the paper.

“Applications typically employ Markov chain Monte Carlo (MCMC) methods to obtain posterior features, resulting in the need for proper priors, even when the modeler prefers that priors be relatively noninformative.” (p.77)

Apart from the above quote, which confuses proper priors with proper posteriors (maybe as the result of a contagious BUGS!), and which is used to focus solely and sort-of inappropriately on proper priors, there is no hard fact to bite into, but rather a collection of soft decisions and options that end up weakly supporting the authors’ thesis. (Obviously, following an earlier post, there is no such thing as a “noninformative” prior.) The paper is centred on four examples where a particular choice of (“noninformative”) prior leads to peaked or informative priors on some transform(s) of the parameters. Note that there is no definition provided for informative, non-informative, diffuse priors, except those found in BUGS with “extremely large variance” (p.77). (The quote below seems to settle on a uniform prior if one understands the “likely” as evaluated through the posterior density.) The argument of the authors is that “if parameters with diffuse proper priors are subsequently transformed, the resulting induced priors can, of course, be far from diffuse, possibly resulting in unintended influence on the posterior of the transformed parameters” (p.77).

“…a prior is informative to the degree it renders some values of the quantity of interest more likely than others.” (p.77)

The first example is about a one-covariate logistic regression. The first surprising choice is one of an identical prior on both the intercept and the regression coefficients, instead of, say, a g-prior that would rescale the coefficients according to the variation of the corresponding covariate. Since x corresponds to age, the second part of the regression varies 50 times more. When plotting the resulting logistic probability function across a few thousand simulations from the prior, the functions mostly end up as constant functions with values 0 or 1. This is not particularly realistic, since the predicted phenomenon is the occurrence of coronary heart disease. The prior is thus using the wrong scale: the simulated functions should have a reasonable behaviour over the range (20,100) of the covariate x, for instance by focussing on a -5 log-odds ratio at age 20 and a +5 log-odds ratio at 100, leading to the comparison pictured below. Furthermore, allowing the coefficient of x to be negative also ignores a basic issue about the model, and answers the later (dishonest) criticism that “the [prior] probability is 0.5 that the ED50 is negative” (p.78). Using a flat prior in this example is just fine and would avoid criticisms about the prior behaviour, since this behaviour is then meaningless from a probabilistic viewpoint.
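The prior simulation described above can be sketched in a few lines. This is only an illustration under stated assumptions: independent N(0, 10²) priors on the intercept and the slope (the standard deviation of 10 is a hypothetical value chosen here, not the one used in the paper), and a grid of ages between 20 and 100. The function name `simulate_prior_curves` is likewise made up for this sketch. It reports the share of prior draws whose probability curve is essentially stuck at 0 or at 1 over the whole age range, and the share of draws with a negative ED50 = -β0/β1.

```python
import math
import random


def inv_logit(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def simulate_prior_curves(n_draws=5000, prior_sd=10.0,
                          ages=(20, 40, 60, 80, 100), seed=1):
    """Draw (b0, b1) from the assumed prior and summarise the induced curves.

    prior_sd = 10 is an assumption made for illustration only.
    Returns (share of near-constant curves, share of negative ED50 values).
    """
    rng = random.Random(seed)
    n_flat = 0
    n_negative_ed50 = 0
    for _ in range(n_draws):
        b0 = rng.gauss(0.0, prior_sd)  # intercept
        b1 = rng.gauss(0.0, prior_sd)  # coefficient of age
        probs = [inv_logit(b0 + b1 * age) for age in ages]
        # A curve counts as "flat" if it stays below 0.01 or above 0.99
        # at every age on the grid.
        if all(p < 0.01 for p in probs) or all(p > 0.99 for p in probs):
            n_flat += 1
        # ED50 is the age at which the probability equals 0.5.
        if b1 != 0.0 and -b0 / b1 < 0.0:
            n_negative_ed50 += 1
    return n_flat / n_draws, n_negative_ed50 / n_draws


flat_share, negative_ed50_share = simulate_prior_curves()
print("share of near-constant curves:", flat_share)
print("share of negative ED50 values:", negative_ed50_share)
```

Rescaling the slope prior by the spread of the covariate (in the spirit of a g-prior) amounts to lowering `prior_sd` for `b1` only; that variant is left out to keep the sketch short.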

“…in a more complicated model, it may be hard to determine the sample size beyond which induced prior influence on the posterior is negligible.” (p.79)

There is also the undercurrent in the paper (not that under!) that Bayesian inference should look like MLE inference and that if it does not then something is wrong. If the MLE outcome is “right”, there is indeed no point in running a Bayesian analysis. Strange argument. (Example 2.4 uses the MLE of the evenness as “the true evenness”, p.80.)

The second example is inspired by Barnard, McCulloch and Meng (2000, Statistica Sinica) estimating a covariance matrix with a proper hyperprior on regression coefficient variances that results in a peaked prior on the covariances. The paper falls short of demonstrating a clear impact on the posterior inference. And the solution (p.82) of using another proper prior resulting in a wider dispersion requires prior knowledge of how wide is wide enough.

Example 2.3 is about a confusing model, inspired by Cowles (2002, Statistics in Medicine), and inferring about surrogate endpoints. If I understand the model correctly, there are two models under comparison, one with the surrogate and one without. What puzzles me is that the quantity of interest, the proportion of treatment effect, involves parameters from both models. Even if this can be turned into a meaningful quantity, the criticism that the “proportion” may take values outside (0,1) is rather dead-born as it suffices to impose a joint prior that ensures the ratio stays within (0,1). Which is the solution proposed by the authors (pp.82-83).

The fourth and last example concentrates on estimating a Shannon entropy for a vector of eight probabilities. Using a uniform (Dirichlet) prior induces a prior on the relative entropy that is concentrated on (0.5,1). Since there is nothing special about the uniform (!), re-running the evaluation with a Jeffreys prior Dir(½,½,…,½) reduces this feature, which anyway is a characteristic of the prior distribution, not of the posterior distribution which accounts for the data. The authors actually propose to use (p.83) a Dir(¼,¼,…,¼) prior, presumably on the basis that the induced prior on the evenness is then centred close to 0.5.
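The induced prior on the evenness is easy to check by simulation. The sketch below assumes that "relative entropy" and "evenness" both mean the Shannon entropy divided by its maximum log 8, and it draws from symmetric Dirichlet priors by normalising independent gamma variates. The function names are hypothetical and the number of draws is arbitrary.

```python
import math
import random


def evenness(p):
    # Shannon entropy divided by its maximum log(k), with 0 * log(0) taken as 0.
    h = -sum(x * math.log(x) for x in p if x > 0.0)
    return h / math.log(len(p))


def induced_evenness(alpha, k=8, n_draws=10000, seed=1):
    """Simulate the evenness induced by a symmetric Dir(alpha, ..., alpha) prior."""
    rng = random.Random(seed)
    values = []
    while len(values) < n_draws:
        # A Dirichlet draw is a vector of independent Gamma(alpha, 1)
        # variates divided by their sum.
        g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(g)
        if total > 0.0:  # guard against numerical underflow for small alpha
            values.append(evenness([x / total for x in g]))
    return values


for alpha in (1.0, 0.5, 0.25):
    draws = induced_evenness(alpha)
    mean = sum(draws) / len(draws)
    share_upper = sum(1 for v in draws if v > 0.5) / len(draws)
    print("alpha =", alpha,
          "| mean evenness =", round(mean, 3),
          "| share above 0.5 =", round(share_upper, 3))
```

The three values of `alpha` correspond to the uniform, Jeffreys and Dir(¼,…,¼) priors discussed above; a histogram of `draws` gives the same kind of picture as the induced priors the paper discusses.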

“A related solution to this problem is to specify a joint prior for meaningful summaries of the parameters in the sampling model. Then the induced prior on the original parameters can be computed.” (p.81)

I thus find the level of the criticism found in the paper rather superficial, as it either relies on a specific choice of a proper prior distribution or on ignoring basic prior information. The paper concludes with recommendations for prior checks. Again, no deficient hardware: the recommendations are mostly sensible if expressing the fact that some prior information is almost always available on some quantities of interest, as translated in the above quote. My only point of contention is the repeated reference to MLE, since it implies assessing/building the prior from the data… The most specific (if related to the above) recommendation is to use conditional mean priors as exposed in Christensen et al. (2010). (I did not spot this notion in my review of two years ago.) For instance, in the first (logistic) example, this meant putting a prior on the cdfs at age 40 and age 60. The authors picked a uniform in both cases, which sounds inconsistent with the presupposed shape of the probability function.

“…it is more natural for experts to think in terms of observables than parameters…” (p.81)

In conclusion, there is nothing pathologically wrong with either this paper or the use of “noninformative” priors! Looking at induced priors on more intuitive transforms of the original parameters is a commendable suggestion, provided some intuition or prior information is indeed available on those. Using a collection of priors incl. reference or invariant priors helps as well. And (in connection with the above quote) looking at the induced dataset by simulating from the corresponding predictive cannot hurt.

12 Responses to “hidden dangers of noninformative priors”

  1. […] Robert reviewed on line a paper that was critical of non-informative priors. Among the points that were discussed by him and other […]

  2. […] Gelman’s new favorite example of the hidden dangers of noninformative priors is the following. If we observe data y ~ N(theta,1) and get y=1, then this is consistent with being […]

  3. Xi’an: I am sure you are not in the intended audience for this paper, which would be folks who do Bayes mostly in BUGS and likely would have a hard time understanding your post.

    “Looking at induced priors on more intuitive transforms of the original parameters is a commendable suggestion” – yes it is commendable and easy to do using simulation but seldom is suggested (at least just using simulation).

    I put that suggestion in here
    http://andrewgelman.com/2011/05/14/missed_friday_t/
    but gave up on trying to get it in a journal given editors’ comments like “agree the paper’s methods would be helpful but there is not enough technical innovation to justify publication in the journal”.

    These commendable suggestions are hard to get into journals and when I read this paper I largely skipped over anything other than the simulation stuff. I think the authors over-extended their technical abilities just to get the paper published.

    • Quite a strong claim, you may want to look at the aims and scope of the journal.

      Anyway, the point of the paper is to set “noninformative priors” on the quantities of interest, not just flat priors on the parameters …

      • X: Could you be more explicit as I do not get the point of this comment? iX

      • And, do you see the point of “I think the authors over-extended their technical abilities just to get the paper published.”?

        Perhaps the paper is not revolutionary, but it appears to contain sensible recommendations, as you mention. Also, the statistical literature is full of misuses of “noninformative” and “vague” priors. Therefore, a paper containing some simple guidelines helps more than it hurts.

  4. A few months ago I had to deal with a biomarker project that involved the analysis of a two-way table: detectable – undetectable vs. Condition A – Condition B.

    Seemed like a fair game for logistic regression and soooo trivial to fit in BUGS, until we (the group) asked ourselves what priors to put on the coefficient of the log-odds of the marker being detectable in Condition A. The dnorm(0,1.0E-6) loses its appeal once you realize you would be putting a prior you would never use, had the problem been formulated not as a regression one, but as the independent specifications for the probability of the marker being expressed in Condition A and Condition B.

    It seemed that everyone was very comfy in putting a non-informative Beta(1,1) prior on the probability of the marker being detected in A and an (independent) Beta(1,1) prior on the detection probability of the marker in Condition B, yet no-one was comfortable at all in even guessing what the non-informative priors would look like in the logistic regression formulation of the problem.

    So we took a step back and said the hell with the BUGS legacy: we will specify the logistic regression priors by transforming our non-informative Beta priors for the probability of detection in the two conditions. The answer, surprising at first, is that the priors in the transformed formulation of the problem were no-where near the dnorm(0.0,1.0E-6) default (anyone wants to take a shot?).

    Bottom line: non-informativeness is in the eyes of the beholder. If there is a formulation of your problem that you are comfortable reasoning about, choose priors that best correspond to your state of knowledge (or ignorance) in that formulation/parameterization. But don’t expect these non-informative priors of yours to map to non-informative “folklore” priors in a different parameterization.

    • thanks: what you show here is that it is very rare to be in a state of complete and absolute ignorance and thus that with the focus on entities one can build intuition about, it is manageable to get mildly informative priors….

      • Of course the methods section in the journal will read something like: “we used non-informative priors to model our prior beliefs for the detectability of the biomarker in Condition A and B” (a true statement), yet these priors look pretty darn informative in the log-odds scale. (JAGS and BUGS samplers worked much better as an added bonus)

        I think Keith’s journal experience shows we have a long way to go before we accept that the word “non-informative” is misleading and that a very useful line of research for us doing applied Bayesian analyses is exactly the priors induced through transformations of parameters.

        I could even go so far to suggest to Keith to try one of the Epidemiology Journals (or even Stat Med) as a venue for him to disseminate this sort of research because it is of immense practical importance.

    • A shot: Roughly dnorm(mu = 0, sigma = 2)?
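The transformation discussed in this thread can be checked by simulation. The sketch below assumes the regression is written as logit(p) = b0 + b1 × 1(Condition A), with Condition B as the baseline; this parameterization and the function name are assumptions, since the comment does not spell them out. With independent Beta(1,1) priors on the two detection probabilities, the intercept b0 = logit(p_B) follows a standard logistic distribution, with standard deviation π/√3, and the coefficient b1 = logit(p_A) − logit(p_B) is the difference of two independent standard logistic variates, with standard deviation π√(2/3). For comparison, dnorm(0, 1.0E-6) in BUGS is parameterized by precision, hence a standard deviation of 1000.

```python
import math
import random
import statistics


def logit(p):
    return math.log(p / (1.0 - p))


def induced_logistic_priors(n_draws=50000, seed=1):
    """Push independent Beta(1,1) priors on (p_A, p_B) through the logit.

    Assumes Condition B is the baseline level of the regression.
    Returns the simulated standard deviations of (intercept, coefficient).
    """
    rng = random.Random(seed)
    intercepts = []
    coefficients = []
    while len(intercepts) < n_draws:
        p_a = rng.betavariate(1.0, 1.0)
        p_b = rng.betavariate(1.0, 1.0)
        # Skip the (practically impossible) boundary draws to keep logit finite.
        if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
            continue
        intercepts.append(logit(p_b))
        coefficients.append(logit(p_a) - logit(p_b))
    return statistics.stdev(intercepts), statistics.stdev(coefficients)


sd_intercept, sd_coefficient = induced_logistic_priors()
print("simulated sd of intercept:", round(sd_intercept, 3),
      "| theory:", round(math.pi / math.sqrt(3.0), 3))
print("simulated sd of coefficient:", round(sd_coefficient, 3),
      "| theory:", round(math.pi * math.sqrt(2.0 / 3.0), 3))
```

The induced priors are not normal, so a dnorm with a matching standard deviation is only a rough stand-in for them.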
