Yes, then we completely agree!!! And I am sorry I missed JSM and you and Montréal: I had to squeeze family vacations between ESM in Budapest and WSC in Hong Kong…!

]]>You misunderstood my post.

I have nothing against reference priors.

I just don’t think there exists a prior which is truly “noninformative.”

I suspect you agree with me on this.

Larry

ps. we missed you at JSM

I think that things like subjective priors are used all the time in machine learning. Not always formally, in a Bayesian framework, but in one way or another prior knowledge is used to constrain how the machine learning method operates.

Note that subjective prior knowledge needn’t be very specific. It is often enough to just constrain some quantities to within a few orders of magnitude. And one can use such a vague prior even when one has more specific prior knowledge, if it’s just not worth the effort to formalize that specific knowledge. The problem comes when you try to go from this sort of practical reason for using vague, or even improper, priors to thinking that they can be justified as something other than a pragmatic short-cut.

]]>Yes, your general point is correct. I went off on a bit of a tangent regarding the Gamma priors in particular…

]]>A very interesting exchange!

It connects to something that I was thinking about recently and is well captured by a classic Neal (1997) paper:

“In typical applications, the constant part of the covariance and possibly the jitter part (if present) would have fixed values, but the available prior information would not be sufficient to fix the other hyperparameters in the covariance function. These hyperparameters will often be given prior distributions that are fairly vague, but they should not be improper, since this will often produce an improper posterior.”

I consider that Toto is right in that if an improper prior implies an improper posterior then the prior must be informative (even if the Gamma example is poor). This would suggest that priors are informative for GP hyperparameters. It makes intuitive sense to me that (say) the amount of smoothing cannot be set purely on the basis of the data, but in machine learning papers the concept of an informative prior is almost completely absent.

I put this down to a few things: the influence of Jaynes in machine learning, the use of hierarchical models that can give the impression that everything is data driven, the plain difficulty of elicitation (particularly when combined with other already difficult tasks), cultural barriers in getting published…

Another side is that elicitation perhaps makes less sense if you see yourself as an algorithm developer, i.e. the “just use something convenient” approach criticised by Robert here:

“I am actually much more annoyed by the use of a specific proper prior in a statistical analysis when this prior is neither justified nor assessed in terms of robustness.” It seems this might be more understandable if the goal is to develop an algorithm… As Olivier Cappe says, “Choosing the Prior is a very important issue in some contexts, but for information processing one generally sticks to conjugate prior families, tuning them to be somewhat noninformative, without being too careful about what the term exactly means”.

Anyway, I am interested in your thoughts on whether subjective priors are useful in machine learning and, if so, why they are so rarely used…

Cheers,

David

You raise a valid but entirely different point. My point is that priors of this sort are intended to be approximations of something that people usually call “noninformative priors”. The same can be said about a Normal(0,10000000) for a location parameter, which is supposed to be an approximation to a flat prior on the entire real line. Then, the least you should do is to check that those “noninformative priors” (the limit ones) lead to a proper posterior in order to justify such an approximation. Whether they are reasonable or not, that’s a different question.
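
To make the location-parameter case concrete, here is a rough sketch of the benign situation in which the limiting posterior is proper. It assumes a normal mean with known data variance, and it assumes that the second argument of Normal(0,10000000) is a variance; neither assumption is stated above, and the data summary (ybar, n, sigma2) is made up for illustration.

```python
def normal_mean_posterior(ybar, n, sigma2, prior_var):
    """Posterior (mean, variance) of a normal mean with known data variance
    sigma2, under a Normal(0, prior_var) prior (prior_var assumed a variance)."""
    precision = 1.0 / prior_var + n / sigma2
    mean = (n * ybar / sigma2) / precision
    return mean, 1.0 / precision


def flat_prior_posterior(ybar, n, sigma2):
    """Posterior (mean, variance) under the improper flat prior on the real line."""
    return ybar, sigma2 / n


# Made-up data summary, for illustration only.
ybar, n, sigma2 = 2.5, 20, 1.0

for prior_var in (1.0, 1e3, 1e7):
    mean, var = normal_mean_posterior(ybar, n, sigma2, prior_var)
    print(f"prior variance {prior_var:>10.0e}: mean={mean:.8f}, var={var:.8f}")

flat_mean, flat_var = flat_prior_posterior(ybar, n, sigma2)
print(f"flat-prior limit         : mean={flat_mean:.8f}, var={flat_var:.8f}")
```

Here the flat-prior posterior exists, so the huge-variance normal really is approximating something. The check only makes sense when that limit is itself proper.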

]]>No, what happens to the posterior using a Gamma(epsilon,epsilon) prior when epsilon goes to zero is not important, because you should not be using a Gamma(epsilon,epsilon) prior anyway. These priors are just ridiculous.

If you don’t believe me, just try sampling 100000 X values from Gamma(0.01,0.01) and plotting a histogram of log(X). Note how it extends to very extreme values in one direction, but to only very modest values (a few hundred) in the other. People who think this is a “vague” prior are deluding themselves. And no, it doesn’t help to use Gamma(0.001,0.001).
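
For anyone who wants to try this, here is a minimal sketch using only the Python standard library. It assumes the shape/rate convention for Gamma(0.01,0.01), as in BUGS-style software, and it prints quantiles of log10(X) rather than drawing a histogram; the function name and the seed are just for illustration.

```python
import math
import random


def log10_gamma_summary(shape=0.01, rate=0.01, n=100_000, seed=0):
    """Draw n values from Gamma(shape, rate) and summarise log10(X)."""
    rng = random.Random(seed)
    scale = 1.0 / rate  # random.gammavariate takes (shape, scale), not rate
    draws = [rng.gammavariate(shape, scale) for _ in range(n)]

    # Some draws are so small that they underflow to exactly 0.0 in double
    # precision; count them separately, since log10(0) is undefined.
    positive = [x for x in draws if x > 0.0]
    logs = sorted(math.log10(x) for x in positive)

    def quantile(p):
        return logs[min(len(logs) - 1, int(p * len(logs)))]

    return {
        "underflowed_to_zero": n - len(positive),
        "min": logs[0],
        "q05": quantile(0.05),
        "median": quantile(0.50),
        "q95": quantile(0.95),
        "max": logs[-1],
    }


summary = log10_gamma_summary()
for name, value in summary.items():
    print(f"{name:>20}: {value}")
```

The lower quantiles of log10(X) run off to enormously negative values, while the maximum stays at a modest positive number, which is the lopsidedness described above.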

The fact that hundreds of papers have been published using Gamma(0.001,0.001) priors shows the dangers of the whole “non-informative prior” business. There is no substitute for actually thinking about what your prior should be.

As for changing the name from “non-informative prior” to “reference prior”, it sounds more reasonable, but how many papers are there that present results with a “reference prior” and then go on to present results for an informative prior, using the results with the reference prior as a useful reference point? If you present only results with the reference prior, that’s like publishing an experimental physics paper that only gives results on the standard kilogram stored in Paris. I suspect that whatever it’s called, people actually think (incorrectly) that their improper priors are “non-informative”, not a “reference”.

]]>I can sum up my objection to Wasserman’s post succinctly. Larry claimed it was a major issue that a distribution P(x) could be uninformative about x but be highly informative about f=F(x).

Not only is this not a problem, it’s highly desirable and a key ingredient in the practical success of statistics. Statistics is full of functions f=F(x) which have the property that the values of f are largely insensitive to x. When that happens it’s perfectly possible to be uninformed about x and still be highly informed about f.

For example if F(x) ~ 0 for almost all x, then even if little is known about x, I can still say that f~0. A real life example is the process of averaging data points. Even if we know little about the errors, it’s still true that over a reasonable domain the average of the errors will be ~0. It’s for precisely that reason that averages get used so much in statistics!
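
A quick simulation sketch of the averaging example, using only the Python standard library. The three error models are arbitrary illustrative choices, and the sketch assumes independent errors with zero mean and finite variance, which is one way of reading “a reasonable domain”.

```python
import random


def average_error(sampler, n=10_000, seed=0):
    """Average of n independent errors drawn by sampler(rng)."""
    rng = random.Random(seed)
    return sum(sampler(rng) for _ in range(n)) / n


# Three quite different zero-mean error models (illustrative assumptions).
error_models = {
    "uniform(-1, 1)": lambda rng: rng.uniform(-1.0, 1.0),
    "gaussian(0, 1)": lambda rng: rng.gauss(0.0, 1.0),
    "shifted exponential": lambda rng: rng.expovariate(1.0) - 1.0,
}

averages = {name: average_error(sampler) for name, sampler in error_models.items()}
for name, avg in averages.items():
    print(f"{name:>20}: average error = {avg:+.4f}")
```

Whichever of these models generated the errors, the average lands close to zero, so knowing little about the individual errors still says a lot about their average.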

]]>