messages from Harvard

As in Bristol two months ago, where I joined the statistics reading in the morning, I had the opportunity to discuss the paper on testing via mixtures prior to my talk with a group of Harvard graduate students. Which concentrated on the biasing effect of the Bayes factor against the more complex hypothesis/model. Arguing [if not in those terms!] that Occam’s razor was too sharp. With a neat remark that decomposing the log Bayes factor as

log(p¹(y¹,H))+log(p²(y²|y¹,H))+…

meant that the first marginal was immensely and uniquely impacted by the prior modelling, hence very likely to be very small for a larger model H, which would then take forever to recover from. And asking why there was such a difference with cross-validation

log(p¹(y¹|y⁻¹,H))+log(p²(y²|y⁻²,H))+…

where the leave-one out posterior predictor is indeed more stable. While the later leads to major overfitting in my opinion, I never spotted the former decomposition which does appear as a strong and maybe damning criticism of the Bayes factor in terms of long-term impact of the prior modelling.

Other points made during the talk or before when preparing the talk:

  1. additive mixtures are but one encompassing model, geometric mixtures could be fun too, if harder to process (e.g., missing normalising constant). Or Zellner’s mixtures (with again the normalising issue);
  2. if the final outcome of the “test” is the posterior on α itself, the impact of the hyper-parameter on α is quite relative since this posterior can be calibrated by simulation against limiting cases (α=0,1);
  3. for the same reason the different rate of accumulation near zero and one  when compared with a posterior probability is hardly worrying;
  4. what I see as a fundamental difference in processing improper priors for Bayes factors versus mixtures is not perceived as such by everyone;
  5. even a common parameter θ on both models does not mean both models are equally weighted a priori, which relates to an earlier remark in Amsterdam about the different Jeffreys priors one can use;
  6. the MCMC output also produces a sample of θ’s which behaviour is obviously different from single model outputs. It would be interesting to study further the behaviour of those samples, which are not to be confused with model averaging;
  7. the mixture setting has nothing intrinsically Bayesian in that the model can be processed in other ways.

One Response to “messages from Harvard”

  1. Maxime Rischard Says:

    Hi Christian,
    I’m glad you enjoyed your time at Harvard, and thrilled that you found my question about cross-validation thought-provoking. Thank you for pointing out Aki Vehtari’s 2012 survey of Bayesian predictive methods for model assessment, selection and comparison. I’ve been reading his follow-up paper (Comparison of Bayesian predictive methods for model selection), which has answered a lot of my questions, and sets out the problem really nicely. They demonstrate how CV can overfit when selecting between a huge number of models with a small dataset, which makes sense. While they make much less of a point of it, their results also do seem to show that the more stable methods that they recommend (like choosing the maximum-a-posteriori model) have a tendency to *underfit*, which I think is possibly related to the conversation we had.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s