## back to Ockham’s razor

“All in all, the Bayesian argument for selecting the MAP model as the single ‘best’ model is suggestive but not compelling.”

Last month, Jonty Rougier and Carey Priebe arXived a paper on Ockham’s factor, with a generalisation of a prior distribution acting as a regulariser, R(θ). They call on the late David MacKay to argue that the evidence involves the correct penalising factor, although they acknowledge that his central argument is not absolutely convincing, being based on a first-order Laplace approximation to the posterior distribution and hence “dubious”. The current approach stems from the candidate’s formula that is already at the core of Sid Chib’s method. The log evidence then decomposes as the maximum log-likelihood minus the log of the posterior-to-prior ratio at the MAP estimator, a term called the flexibility.
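As a reminder of where such a decomposition comes from, here is a sketch in notation of my own choosing (f for the likelihood, π for the prior and posterior, m for the evidence; none of these symbols are taken from the paper). The candidate’s formula is just Bayes’ theorem rearranged, and it holds at every value of θ:

$$
m(x) = \frac{f(x\mid\theta)\,\pi(\theta)}{\pi(\theta\mid x)}
\quad\Longrightarrow\quad
\log m(x) = \log f(x\mid\theta) \;-\; \log\frac{\pi(\theta\mid x)}{\pi(\theta)}\,.
$$

Evaluating both sides at a point estimate $\hat\theta$ turns the first term into a measure of fit and the second into a penalty, the log posterior-to-prior ratio at $\hat\theta$.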

“Defining model complexity as flexibility unifies the Bayesian and Frequentist justifications for selecting a single model by maximizing the evidence.”

While they bring forward rational arguments to consider this as a measure of model complexity, it remains at an informal level in that other functions of this ratio could be used as well. This is especially hard for non-Bayesians to accept in that it (seriously) depends on the choice of the prior distribution, as all transforms of the evidence would. I am thus skeptical about the reception of the argument by frequentists…
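To make the dependence on the prior concrete, here is a minimal sketch in a toy conjugate setting that I am assuming purely for illustration (normal observations with known variance and a normal prior on the mean; this is not the setting of the paper, and all function names are hypothetical). It computes the log posterior-to-prior ratio at the MAP estimator, checks the candidate’s formula against the closed-form evidence, and shows how the penalty moves with the prior variance `tau2`.

```python
import math
import numpy as np


def normal_logpdf(x, mean, var):
    """Log density of a univariate N(mean, var) at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def posterior_params(x, sigma2, mu0, tau2):
    """Posterior mean and variance of theta when x_i ~ N(theta, sigma2)
    and theta ~ N(mu0, tau2). The posterior mean is also the MAP."""
    n = len(x)
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)
    post_mean = post_var * (mu0 / tau2 + float(np.sum(x)) / sigma2)
    return post_mean, post_var


def log_likelihood(x, theta, sigma2):
    """Log-likelihood of the sample at theta."""
    terms = -0.5 * (np.log(2 * np.pi * sigma2) + (x - theta) ** 2 / sigma2)
    return float(np.sum(terms))


def flexibility(x, theta, sigma2, mu0, tau2):
    """Log of the posterior-to-prior density ratio at theta."""
    post_mean, post_var = posterior_params(x, sigma2, mu0, tau2)
    return normal_logpdf(theta, post_mean, post_var) - normal_logpdf(theta, mu0, tau2)


def log_evidence_candidate(x, theta, sigma2, mu0, tau2):
    """Candidate's formula: log-likelihood minus log posterior/prior ratio."""
    return log_likelihood(x, theta, sigma2) - flexibility(x, theta, sigma2, mu0, tau2)


def log_evidence_direct(x, sigma2, mu0, tau2):
    """Closed form: marginally x ~ N(mu0 * 1, sigma2 * I + tau2 * 1 1^T)."""
    n = len(x)
    cov = sigma2 * np.eye(n) + tau2 * np.ones((n, n))
    resid = x - mu0
    _, logdet = np.linalg.slogdet(cov)
    quad = float(resid @ np.linalg.solve(cov, resid))
    return -0.5 * (n * math.log(2 * math.pi) + logdet + quad)


# Simulated data; the seed, sample size and prior variances are arbitrary.
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=20)
sigma2, mu0 = 1.0, 0.0

for tau2 in (1.0, 100.0, 10000.0):
    theta_map, _ = posterior_params(x, sigma2, mu0, tau2)
    flex = flexibility(x, theta_map, sigma2, mu0, tau2)
    log_ev = log_evidence_candidate(x, theta_map, sigma2, mu0, tau2)
    print(f"tau2={tau2:>8.0f}  flexibility={flex:6.3f}  log evidence={log_ev:8.3f}")
```

In this toy model the penalty keeps growing as the prior gets vaguer while the data stay the same, which is the sort of prior sensitivity a frequentist is likely to balk at.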

August 6, 2019 at 9:36 am

You are quite right: perfect unification between the pseudo-Bayesian and Frequentist approaches requires that if ‘fit’ is measured by the maximum of the log likelihood, then ‘complexity’ must be precisely our definition (‘flexibility’), in order to have a decomposition of the form log evidence = max log likelihood – complexity penalty.

(I am obliged to write ‘pseudo-Bayesian’ because maximizing the evidence to choose the hyper-parameters is only sorta Bayesian — and in my own practice I prefer to integrate out.)

Analysts who are ‘regularizers’ have no particular desire to conform to a pseudo-Bayesian approach. So regularizers can penalize max log likelihood in any way that they think is appropriate; for example they can use 2 * flexibility as their complexity penalty.

(This reminds me of the situation in gambling. The Kelly criterion is optimal for long-term growth of your fund, but it is also quite conservative, and so some gamblers will use 2 * Kelly, or even more. Of course they are doing this because their gambling is about more than just increasing the size of the fund.)

My extreme subjectivism enables me to accept “it feels right to me” as someone’s defense of their modelling choices. But, as I write in ‘Confidence in Risk Assessments’ (doi:10.1111/rssa.12445), in the ‘bazaar of experts’, clients and their auditors might require a little more than this in order to select their expert. So I hope that complexity = flexibility might catch on a little: it is an attractive choice where there is no compelling reason to select a particular complexity penalty, because it has ‘cross tribe’ appeal.

Also, flexibility is asymptotically BIC in the Linear Model, although, as we say in the paper, it is better to estimate the evidence directly, than to approximate it with a BIC penalty, which misses the distinction between the nominal number of parameters and the effective number of parameters.

July 31, 2019 at 10:57 am

May I suggest editing the blog entry to leave a [pointer](https://arxiv.org/abs/1906.11592) to the paper?

July 31, 2019 at 3:02 pm

Oops, thanks!