## messages from Harvard

**A**s in Bristol two months ago, where I joined the statistics reading group in the morning, I had the opportunity to discuss the paper on testing via mixtures with a group of Harvard graduate students prior to my talk. The discussion concentrated on the bias of the Bayes factor against the more complex hypothesis/model, the students arguing [if not in those terms!] that Occam’s razor was too sharp. They made the neat remark that decomposing each log marginal likelihood entering the Bayes factor as

$$\log p_1(y_1\mid H)+\log p_2(y_2\mid y_1,H)+\cdots$$

meant that the first marginal was immensely and uniquely impacted by the prior modelling, hence very likely to be very small for a larger model H, a deficit that would then take forever to recover from. They also asked why there was such a difference with cross-validation

$$\log p_1(y_1\mid y_{-1},H)+\log p_2(y_2\mid y_{-2},H)+\cdots$$

where the leave-one-out posterior predictive is indeed more stable. While the latter leads to major overfitting in my opinion, I had never spotted the former decomposition, which does appear as a strong and maybe damning criticism of the Bayes factor in terms of the long-term impact of the prior modelling.
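To make the contrast between the two decompositions concrete, here is a minimal sketch on a toy conjugate model. Everything in it is an assumption chosen for illustration and not taken from the discussion: observations y_i ~ N(θ, 1), a prior θ ~ N(0, τ²), the helper names, and the ten hand-picked data points. The sequential terms sum to the log marginal likelihood, while the leave-one-out terms each condition on all the other observations.

```python
import numpy as np

def log_normal_pdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def log_predictive(y_new, y_seen, tau):
    """Log posterior predictive of y_new given y_seen.

    Assumed toy model: y_i ~ N(theta, 1), theta ~ N(0, tau^2).
    With no data seen, this reduces to the prior predictive N(0, tau^2 + 1).
    """
    precision = 1.0 / tau**2 + len(y_seen)   # posterior precision of theta
    post_mean = np.sum(y_seen) / precision   # posterior mean of theta
    return log_normal_pdf(y_new, post_mean, 1.0 / precision + 1.0)

def sequential_terms(y, tau):
    """Terms log p(y_i | y_1, ..., y_{i-1}); their sum is the log marginal likelihood."""
    return np.array([log_predictive(y[i], y[:i], tau) for i in range(len(y))])

def loo_terms(y, tau):
    """Terms log p(y_i | y_{-i}), the leave-one-out cross-validation version."""
    return np.array([log_predictive(y[i], np.delete(y, i), tau) for i in range(len(y))])

# Hand-picked illustrative data, roughly centred near 1.
y = np.array([1.2, 0.4, 1.9, 0.7, 1.1, 2.3, -0.2, 1.5, 0.9, 1.4])

for tau in (1.0, 100.0):
    seq = sequential_terms(y, tau)
    loo = loo_terms(y, tau)
    print(f"tau={tau:6.1f}  first term={seq[0]:8.3f}  "
          f"sequential sum={seq.sum():8.3f}  LOO sum={loo.sum():8.3f}")
```

In this toy setting, widening the prior from τ=1 to τ=100 lowers the first sequential term by several log units, and that gap carries through almost unchanged to the full sum, whereas the leave-one-out sum barely moves.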

Other points made during the talk, or beforehand while preparing it:

- additive mixtures are but one encompassing model; geometric mixtures could be fun too, if harder to process (e.g., because of the missing normalising constant). Or Zellner’s mixtures (with again the normalising issue);
- if the final outcome of the “test” is the posterior on α itself, the impact of the hyper-parameter on α is quite limited, since this posterior can be calibrated by simulation against the limiting cases (α=0,1);
- for the same reason, the different rates of accumulation near zero and one, when compared with a posterior probability, are hardly worrying;
- what I see as a fundamental difference in processing improper priors for Bayes factors versus mixtures is not perceived as such by everyone;
- even a common parameter θ shared by both models does not mean both models are equally weighted a priori, which relates to an earlier remark in Amsterdam about the different Jeffreys priors one can use;
- the MCMC output also produces a sample of θ’s whose behaviour is obviously different from single-model outputs. It would be interesting to study further the behaviour of those samples, which are not to be confused with model averaging;
- the mixture setting has nothing intrinsically Bayesian in that the model can be processed in other ways.
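
The point above about calibrating the posterior on α by simulation can be sketched as follows. This is only an illustration under strong simplifying assumptions that are not in the discussion itself: two fixed, parameter-free components f1 = N(0,1) and f2 = N(2,1), a Beta(a0, a0) prior on α with a0 = 0.5, and a plain data-augmentation Gibbs sampler. The function names are hypothetical.

```python
import numpy as np

def normal_pdf(x, mean):
    """Density of N(mean, 1) evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2 * np.pi)

def gibbs_alpha(y, a0=0.5, n_iter=2000, seed=0):
    """Gibbs draws of the weight alpha in alpha*f1 + (1-alpha)*f2.

    Assumptions for this sketch: f1 = N(0,1) and f2 = N(2,1) are fully
    known, and alpha has a Beta(a0, a0) prior.
    """
    rng = np.random.default_rng(seed)
    f1 = normal_pdf(y, 0.0)
    f2 = normal_pdf(y, 2.0)
    alpha = 0.5
    draws = np.empty(n_iter)
    for t in range(n_iter):
        # Allocate each observation to component 1 with its posterior probability.
        p1 = alpha * f1 / (alpha * f1 + (1 - alpha) * f2)
        n1 = np.sum(rng.random(len(y)) < p1)
        # Conjugate update of the weight given the allocations.
        alpha = rng.beta(a0 + n1, a0 + len(y) - n1)
        draws[t] = alpha
    return draws

# Calibration against the two limiting cases: data simulated from f1 (alpha=1)
# and from f2 (alpha=0), with the first 500 draws discarded as burn-in.
rng = np.random.default_rng(1)
draws_from_f1 = gibbs_alpha(rng.normal(0.0, 1.0, size=200))[500:]
draws_from_f2 = gibbs_alpha(rng.normal(2.0, 1.0, size=200))[500:]
print("posterior mean of alpha, data from f1:", draws_from_f1.mean())
print("posterior mean of alpha, data from f2:", draws_from_f2.mean())
```

Comparing the posterior on α obtained from the actual data with these two simulated reference posteriors gives a sense of scale for what "close to 0" or "close to 1" means under a given a0.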

April 3, 2016 at 9:25 pm

Hi Christian,

I’m glad you enjoyed your time at Harvard, and thrilled that you found my question about cross-validation thought-provoking. Thank you for pointing out Aki Vehtari’s 2012 survey of Bayesian predictive methods for model assessment, selection and comparison. I’ve been reading his follow-up paper (Comparison of Bayesian predictive methods for model selection), which has answered a lot of my questions, and sets out the problem really nicely. They demonstrate how CV can overfit when selecting between a huge number of models with a small dataset, which makes sense. While they make much less of a point of it, their results also do seem to show that the more stable methods that they recommend (like choosing the maximum-a-posteriori model) have a tendency to *underfit*, which I think is possibly related to the conversation we had.