Following an earlier post on the American Statistician 2013 paper by Seaman III and co-authors, Hidden dangers of specifying noninformative priors, my PhD student Kaniav Kamary wrote a paper re-analysing the examples processed by those authors and concluding to the stability of the posterior distributions of the parameters and to the effect of the noninformative prior being essentially negligible. (This is the very first paper quoting verbatim from the ‘Og!) Kaniav logically submitted the paper to the American Statistician.
Archive for BUGS
At MCMSki IV, I attended (and chaired) a session where Martyn Plummer presented some developments on cut models. As I was not sure I had gotten the idea [although this happened to be one of those few sessions where the flu had not yet completely taken over!] and as I wanted to check about a potential explanation for the lack of convergence discussed by Martyn during his talk, I decided to (re)present the talk at our “MCMSki decompression” seminar at CREST. Martyn sent me his slides and also kindly pointed out to the relevant section of the BUGS book, reproduced above. (Disclaimer: do not get me wrong here, the title is a pun on the infamous “drill, baby, drill!” and not connected in any way to Martyn’s talk or work!)
I cannot say I get the idea any clearer from this short explanation in the BUGS book, although it gives a literal meaning to the word “cut”. From this description I only understand that a cut is the removal of an edge in a probabilistic graph, however there must/may be some arbitrariness in building the wrong conditional distribution. In the Poisson-binomial case treated in Martyn’s case, I interpret the cut as simulating from
hence loosing some of the information about φ… Now, this cut version is a function of φ and θ that can be fed to a Metropolis-Hastings algorithm. Assuming we can handle the posterior on φ and the conditional on θ given φ. If we build a Gibbs sampler instead, we face a difficulty with the normalising constant m(y|φ). Said Gibbs sampler thus does not work in generating from the “cut” target. Maybe an alternative borrowing from the rather large if disparate missing constant toolbox. (In any case, we do not simulate from the original joint distribution.) The natural solution would then be to make a independent proposal on φ with target the posterior given z and then any scheme that preserves the conditional of θ given φ and y; “any” is rather wistful thinking at this stage since the only practical solution that I see is to run a Metropolis-Hasting sampler long enough to “reach” stationarity… I also remain with a lingering although not life-threatening question of whether or not the BUGS code using cut distributions provide the “right” answer or not. Here are my five slides used during the seminar (with a random walk implementation that did not diverge from the true target…):
Already on the final day..! And still this frustration in being unable to attend three sessions at once… Andrew Gelman started the day with a non-computational talk that broached on themes that are familiar to readers of his blog, on the misuse of significance tests and on recommendations for better practice. I then picked the Scaling and optimisation of MCMC algorithms session organised by Gareth Roberts, with optimal scaling talks by Tony Lelièvre, Alex Théry and Chris Sherlock, while Jochen Voss spoke about the convergence rate of ABC, a paper I already discussed on the blog. A fairly exciting session showing that MCMC’ory (name of a workshop I ran in Paris in the late 90′s!) is still well and alive!
After the break (sadly without the ski race!), the software round-table session was something I was looking for. The four softwares covered by this round-table were BUGS, JAGS, STAN, and BiiPS, each presented according to the same pattern. I would have like to see a “battle of the bands”, illustrating pros & cons for each language on a couple of models & datasets. STAN got the officious prize for cool tee-shirts (we should have asked the STAN team for poster prize tee-shirts). And I had to skip the final session for a flu-related doctor appointment…
I called for a BayesComp meeting at 7:30, hoping for current and future members to show up and discuss the format of the future MCMski meetings, maybe even proposing new locations on other “sides of the Italian Alps”! But (workshop fatigue syndrome?!), no-one showed up. So anyone interested in discussing this issue is welcome to contact me or David van Dyk, the new BayesComp program chair.
[Those are comments sent yesterday by Shravan Vasishth in connection with my post. Since they are rather lengthy, I made them into a post. Shravan is also the author of The foundations of Statistics and we got in touch through my review of the book . I may address some of his points later, but, for now, I find the perspective of a psycholinguist quite interesting to hear.]
Christian, Is the problem for you that the p-value, however low, is only going to tell you the probability of your data (roughly speaking) assuming the null is true, it’s not going to tell you anything about the probability of the alternative hypothesis, which is the real hypothesis of interest.
However, limiting the discussion to (Bayesian) hierarchical models (linear mixed models), which is the type of model people often fit in repeated measures studies in psychology (or at least in psycholinguistics), as long as the problem is about figuring out P(θ>0) or P(θ>0), the decision (to act as if θ>0) is going to be the same regardless of whether one uses p-values or a fully Bayesian approach. This is because the likelihood is going to dominate in the Bayesian model.
Andrew has objected to this line of reasoning by saying that making a decision like θ>0 is not a reasonable one in the first place. That is true in some cases, where the result of one experiment never replicates because of study effects or whatever. But there are a lot of effects which are robust and replicable, and where it makes sense to ask these types of questions.
One central issue for me is: in situations like these, using a low p-value to make such a decision is going to yield pretty similar outcomes compared to doing inference using the posterior distribution. The machinery needed to do a fully Bayesian analysis is very intimidating; you need to know a lot, and you need to do a lot more coding and checking than when you fit an lmer type of model.
It took me 1.5 to 2 years of hard work (=evenings spent not reading novels) to get to the point that I knew roughly what I was doing when fitting Bayesian models. I don’t blame anyone for not wanting to put their life on hold to get to such a point. I find the Bayesian method attractive because it actually answers the question I really asked, namely is θ>0 or θ<0? This is really great, I don’t have beat around the bush any more! (there; I just used an exclamation mark). But for the researcher unwilling (or more likely: unable) to invest the time into the maths and probability theory and the world of BUGS, the distance between a heuristic like a low p-value and the more sensible Bayesian approach is not that large.
Last year, John Seaman (III), John Seaman (Jr.), and James Stamey published a paper in The American Statistician with the title Hidden dangers of specifying noninformative priors. (It does not seem to be freely available on-line.) I gave it to read to my PhD students, meaning to read towards the goal of writing a critical reply to the authors. In the meanwhile, here are my own two-cents on the paper.
“Applications typically employ Markov chain Monte Carlo (MCMC) methods to obtain posterior features, resulting in the need for proper priors, even when the modeler prefers that priors be relatively noninformative.” (p.77)
Apart from the above quote, which confuses proper priors with proper posteriors (maybe as the result of a contagious BUGS!), and which is used to focus solely and sort-of inappropriately on proper priors, there is no hard fact to bite in, but rather a collection of soft decisions and options that end up weakly supporting the authors’ thesis. (Obviously, following an earlier post, there is no such thing as a “noninformative” prior.) The paper is centred on four examples where a particular choice of (“noninformative”) prior leads to peaked or informative priors on some transform(s) of the parameters. Note that there is no definition provided for informative, non-informative, diffuse priors, except those found in BUGS with “extremely large variance” (p.77). (The quote below seems to settle on a uniform prior if one understands the “likely” as evaluated through the posterior density.) The argument of the authors is that “if parameters with diffuse proper priors are subsequently transformed, the resulting induced priors can, of course, be far from diffuse, possibly resulting in unintended influence on the posterior of the transformed parameters” (p.77).