on noninformative priors
A few weeks ago, Larry Wasserman posted on Normal Deviate an entry on noninformative priors as a lost cause for statistics. I first reacted rather angrily to this post, then decided against posting my reply. After a relaxing week in Budapest, and with the prospect of the upcoming summer break, I went back to the post and edited it towards more constructive goals… The post also got discussed by Andrew and Entsophy, generating in each case a heap of heated discussions. (Enjoy your summer, winter is coming!)
Although Larry wrote he wanted to refrain from posting only on Bayesian statistics, he does seem attracted to them like a moth to a candle… This time, it is about the “lost cause of noninformative priors”. While Larry is 200% entitled to post about whatever he likes or dislikes, the post does not really bring new fuel to the debate, if debate there is. First, I think everyone agrees that there is no such thing as a noninformative prior or a prior representing ignorance. (To quote from Jeffreys: “A prior probability used to express ignorance is merely the formal statement of ignorance” (ToP, VIII, §8.1).) Every prior brings something into the game and this is reflected in the posterior inference. Sometimes, the impact is enormous and we may be unaware of it. Take for instance Bayesian nonparametrics. It is thus essential to keep this in mind. (And to keep calm!) Which does not mean we should not use them. Indeed, noninformative priors are a way of setting a reference measure, from which one can start evaluating the impact of picking this or that prior. Just a measure. (No-one gets emotional when hearing the Lebesgue measure mentioned, right?!) And if the reference prior is a σ-finite measure, one cannot even give a meaning to the prior probability of events like θ>0. This reference measure is required to set the Bayesian crank turning, here or there depending on one’s prior beliefs or information. If we reject those reference priors in favour of accepting only the cases when the prior is provided along with the data and the model, I think everyone is a Bayesian. Even Feller. Even Larry (?).
Second, there is alas too much pathos or unintended meaning put in names like noninformative, ignorance, objective, &tc. And this may be the major message in Larry’s post. We should call those reference priors Linear A priors in reference to the mostly undeciphered Minoan alphabet. Or whatever name with no emotional content whatsoever in order not to drive people crazy. Noninformative is not even a word, to start with… And I dunno how to define ignorance in a mathematical manner.
Once more in connection with the EMS 2013 meeting in Budapest, I do not see why one should object more to reference priors than to the so-called “subjective” priors, as the former provide a baseline against which to test the latter, using e.g. Xiao Li’s approach. I am actually much more annoyed by the use of a specific proper prior in a statistical analysis when this prior is neither justified nor assessed in terms of robustness. And I see nothing wrong in establishing either asymptotic or frequentist properties about some procedures connected with some of those reference priors: I became a Bayesian this way, after all.
Anyway, have a nice (end of the) summer if you are in the Northern Hemisphere, and expect delays (or snapshots!) on the ‘Og for the coming fortnight…
August 12, 2013 at 1:41 am
Christian
You misunderstood my post.
I have nothing against reference priors.
I just don’t think there exists a prior which
is truly “noninformative.”
I suspect you agree with me on this.
Larry
ps. we missed you at JSM
August 12, 2013 at 1:13 pm
Yes, then we completely agree!!! And I am sorry I missed JSM and you and Montréal: I had to squeeze family vacations between EMS in Budapest and WSC in Hong Kong…!
August 7, 2013 at 5:13 am
Prof Neal/Robert
A very interesting exchange!
It connects to something that I was thinking about recently and is well captured by a classic Neal (1997) paper:
“In typical applications, the constant part of the covariance and possibly the jitter part (if present) would have fixed values, but the available prior information would not be sufficient to fix the other hyperparameters in the covariance function. These hyperparameters will often be given prior distributions that are fairly vague, but they should not be improper, since this will often produce an improper posterior.”
I consider that Toto is right in that if an improper prior implies an improper posterior then the prior must be informative (even if the Gamma example is poor). This would suggest that priors are informative for GP hyperparameters. It makes intuitive sense to me that (say) the amount of smoothing cannot be set purely on the basis of the data, but in machine learning papers the concept of an informative prior is almost completely absent.
I put this down to a few things…. the influence of Jaynes in machine learning, the use of hierarchical models that can give the impression that everything is data driven, the plain difficulty of elicitation (particularly when combined with other already difficult tasks), cultural barriers in getting published…
Another side is that elicitation perhaps makes less sense if you see yourself as an algorithm developer, i.e. the “just use something convenient” attitude criticised by Robert here:
“I am actually much more annoyed by the use of a specific proper prior in a statistical analysis when this prior is neither justified nor assessed in terms of robustness.” It seems this might be more understandable if the goal is to develop an algorithm… As Olivier Cappe says: “Choosing the Prior is a very important issue in some contexts, but for information processing one generally sticks to conjugate prior families, tuning them to be somewhat noninformative, without being too careful about what the term exactly means.”
Anyway, I am interested in your thoughts on whether subjective priors are useful in machine learning and, if so, why they are so rarely used…
Cheers,
David
August 9, 2013 at 6:24 am
I think that things like subjective priors are used all the time in machine learning. Not always formally, in a Bayesian framework, but in one way or another prior knowledge is used to constrain how the machine learning method operates.
Note that subjective prior knowledge needn’t be very specific. It is often enough to just constrain some quantities to within a few orders of magnitude. And one can use such a vague prior even when you have more specific prior knowledge, if it’s just not worth the effort to formalize that specific knowledge. The problem comes when you try to go from this sort of practical reason for using vague, or even improper, priors to thinking that they can be justified as something other than a pragmatic short-cut.
August 5, 2013 at 2:15 am
I was pretty bummed by Larry’s post as well. It confirmed a suspicion I’ve long had that no amount of mathematical expertise can make up for just not “getting” Bayesian statistics.
I can sum up my objection to Wasserman’s post succinctly. Larry claimed it was a major issue that a distribution P(x) could be uninformative about x but be highly informative about f=F(x).
Not only is this not a problem, it’s highly desirable and a key ingredient in the practical success of statistics. Statistics is full of functions f=F(x) which have the property that the values of f are largely insensitive to x. When that happens it’s perfectly possible to be uninformed about x and still be highly informed about f.
For example if F(x) ~ 0 for almost all x, then even if little is known about x, I can still say that f~0. A real life example is the process of averaging data points. Even if we know little about the errors, it’s still true that over a reasonable domain the average of the errors will be ~0. It’s for precisely that reason that averages get used so much in statistics!
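The averaging example can be checked numerically. The sketch below is only an illustration under assumptions that are not in the comment: the errors are taken to be independent, centred at zero and of finite variance, and the three error distributions, the sample size n and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only
n = 10_000                      # number of errors being averaged (assumed)

# Three quite different zero-centred error laws (illustrative choices).
error_models = {
    "uniform(-1, 1)": lambda size: rng.uniform(-1.0, 1.0, size),
    "laplace(0, 1)": lambda size: rng.laplace(0.0, 1.0, size),
    "student-t, 5 df": lambda size: rng.standard_t(5, size),
}

# Individual errors are spread over a wide range, yet f = average(errors)
# lands close to 0 whichever of the three laws generated them.
averages = {name: draw(n).mean() for name, draw in error_models.items()}
for name, avg in averages.items():
    print(f"{name:>16}: average error = {avg:+.4f}")
```

Knowing which law produced the errors changes the average very little, which is the sense in which one can be poorly informed about x and well informed about f. The conclusion does rely on the zero-centring and finite-variance assumptions stated above.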
August 5, 2013 at 12:45 am
Improper priors are also used to justify the use of “vague priors”, to some extent, whenever they produce a proper posterior distribution. For instance, when people use the $Gamma(\epsilon,\epsilon)$ for scale parameters, $\epsilon\approx 0$, it is important to check that the limit prior ($\epsilon\rightarrow 0$) produces a proper posterior. It makes no sense to call this prior “vague” if its limit leads to an improper posterior.
August 6, 2013 at 8:39 pm
No, what happens to the posterior using a Gamma(epsilon,epsilon) prior when epsilon goes to zero is not important, because you should not be using a Gamma(epsilon,epsilon) prior anyway. These priors are just ridiculous.
If you don’t believe me, just try sampling 100000 X values from Gamma(0.01,0.01) and plotting a histogram of log(X). Note how it extends to very extreme values in one direction, but to only very modest values (a few hundred) in the other. People who think this is a “vague” prior are deluding themselves. And no, it doesn’t help to use Gamma(0.001,0.001).
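A minimal sketch of this experiment, assuming Gamma(0.01,0.01) means shape 0.01 and rate 0.01 (the usual BUGS convention, which the comment does not spell out). Sampling X directly and taking logs underflows to 0 in double precision for such a small shape, so the helper `sample_log_gamma` (a name introduced here for illustration) draws log(X) through the identity X = Y·U^(1/shape), with Y ~ Gamma(shape+1) and U uniform on (0,1). Quantiles stand in for the histogram.

```python
import numpy as np

def sample_log_gamma(shape, rate, size, rng):
    """Draw log(X) for X ~ Gamma(shape, rate) without underflow.

    Uses X = Y * U**(1/shape) with Y ~ Gamma(shape + 1, 1) and U ~ Uniform(0, 1],
    then rescales by the rate: log(X) = log(Y) + log(U)/shape - log(rate).
    """
    y = rng.gamma(shape + 1.0, 1.0, size)
    u = 1.0 - rng.random(size)  # values in (0, 1], so log(u) is finite
    return np.log(y) + np.log(u) / shape - np.log(rate)

rng = np.random.default_rng(0)  # arbitrary seed
log_x = sample_log_gamma(0.01, 0.01, 100_000, rng)

# Summarise the two tails of log(X).
for q in (0.0, 0.01, 0.25, 0.5, 0.75, 0.99, 1.0):
    print(f"quantile {q:4.2f}: log(X) = {np.quantile(log_x, q):10.2f}")
print(f"largest X drawn: {np.exp(log_x.max()):.1f}")
```

The lower quantiles of log(X) run to minus several hundred, while the largest X stays in the hundreds, so log(X) stops below about 7. Passing `log_x` to a histogram routine gives the picture described above.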
The fact that hundreds of papers have been published using Gamma(0.001,0.001) priors shows the dangers of the whole “non-informative prior” business. There is no substitute for actually thinking about what your prior should be.
As for changing the name from “non-informative prior” to “reference prior”, it sounds more reasonable, but how many papers are there that present results with a “reference prior” and then go on to present results for an informative prior, using the results with the reference prior as a useful reference point? If you present only results with the reference prior, that’s like publishing an experimental physics paper that only gives results on the standard kilogram stored in Paris. I suspect that whatever it’s called, people actually think (incorrectly) that their improper priors are “non-informative”, not a “reference”.
August 7, 2013 at 12:58 am
You raise a valid but entirely different point. My point is that priors of this sort are intended to be approximations of something that people usually call “noninformative priors”. The same can be said about a Normal(0,10000000) for a location parameter, which is supposed to be an approximation to a flat prior on the entire real line. Then, the least you should do is to check that those “noninformative priors” (the limit ones) lead to a proper posterior in order to justify such an approximation. Whether they are reasonable or not, that’s a different question.
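For the location case the approximation can be checked in closed form. The sketch below assumes a setting the comment does not specify: observations from a Normal(θ, σ²) with σ² known, a Normal(0, τ²) prior on θ, and 10000000 read as the prior variance τ². The function `normal_posterior`, the simulated data and the seed are illustrative. Under the flat prior the posterior is proper, Normal(x̄, σ²/n), and the conjugate posterior should approach it as τ² grows.

```python
import numpy as np

def normal_posterior(x, sigma2, tau2):
    """Posterior mean and variance of theta for x_i ~ N(theta, sigma2), theta ~ N(0, tau2)."""
    n = len(x)
    precision = n / sigma2 + 1.0 / tau2      # data precision + prior precision
    mean = (np.sum(x) / sigma2) / precision  # the prior mean 0 contributes nothing
    return mean, 1.0 / precision

rng = np.random.default_rng(0)               # arbitrary seed
sigma2 = 1.0                                 # known sampling variance (assumed)
x = rng.normal(3.0, np.sqrt(sigma2), size=20)

# Flat-prior limit (tau2 -> infinity): posterior is N(xbar, sigma2 / n).
flat_mean, flat_var = x.mean(), sigma2 / len(x)
print(f"flat prior   : mean = {flat_mean:.6f}, var = {flat_var:.6f}")

for tau2 in (1.0, 1e2, 1e7):
    m, v = normal_posterior(x, sigma2, tau2)
    print(f"tau2 = {tau2:<8g}: mean = {m:.6f}, var = {v:.6f}")
```

Here the limit posterior is proper, so the Normal(0,10000000) prior is a legitimate stand-in for the flat one. The same check on a limit prior whose posterior is improper would have no such target to converge to.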
August 9, 2013 at 6:17 am
Yes, your general point is correct. I went off on a bit of a tangent regarding the Gamma priors in particular…