a unified treatment of predictive model comparison

“Applying various approximation strategies to the relative predictive performance derived from predictive distributions in frequentist and Bayesian inference yields many of the model comparison techniques ubiquitous in practice, from predictive log loss cross validation to the Bayesian evidence and Bayesian information criteria.”

Michael Betancourt (Warwick) just arXived a paper formalising predictive model comparison in an almost Bourbakian sense! Meaning that he adopts therein a very general representation of the issue, with minimal assumptions on the data generating process (excluding a specific metric and obviously the choice of a testing statistic). He opts for an M-open perspective, meaning that this generating process stands outside the hypothetical statistical model or, in Lindley’s terms, a small world. Within this paradigm, the only way to assess the fit of a model seems to be through the predictive performance of that model, using for instance an f-divergence like the Kullback-Leibler divergence, with the true generating process as the reference. I think this however puts a restriction on the choice of small worlds, as the probability measure on that small world has to be absolutely continuous wrt the true data generating process for the distance to be finite. While there are arguments in favour of absolutely continuous small worlds, this assumes a knowledge about the true process that we simply cannot gather. Ignoring this difficulty, a relative Kullback-Leibler divergence can be defined in terms of an almost arbitrary reference measure. But as it still relies on the true measure, its evaluation proceeds via cross-validation “tricks” like jackknife and bootstrap. However, on the Bayesian side, using the prior predictive links the Kullback-Leibler divergence with the marginal likelihood. And Michael argues further that the posterior predictive can be seen as the unifying tool behind information criteria like DIC and WAIC (widely applicable information criterion). Which does not convince me of the utility of those criteria as model selection tools, as there is too much freedom in the way approximations are used and a potential for using the data several times.
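To make the absolute continuity issue and the role of the relative divergence concrete, here is a minimal Python sketch on a three-outcome toy space. Everything in it is a hypothetical illustration rather than anything taken from the paper: the distributions `truth`, `model_a`, `model_b`, `model_c`, the helper names, and the assumption that the divergence is taken as KL(truth || model), the direction that matches expected log loss. Under that assumption the divergence is infinite as soon as a model puts zero mass on an outcome the truth can produce, while the difference between two finite divergences only involves the expected log predictive densities and can thus be estimated from draws of the true process.

```python
import math
import random


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if q_i == 0.0:
            return math.inf  # p is not absolutely continuous wrt q
        total += p_i * math.log(p_i / q_i)
    return total


def relative_log_score(sample, q1, q2):
    """Sample average of log q1(x) - log q2(x) over observed outcomes x."""
    return sum(math.log(q1[x]) - math.log(q2[x]) for x in sample) / len(sample)


# Hypothetical "true" process; in practice it is unknown and only sampled.
truth = [0.5, 0.3, 0.2]
model_a = [0.4, 0.4, 0.2]
model_b = [0.6, 0.2, 0.2]
model_c = [0.5, 0.5, 0.0]  # zero mass on an outcome that does occur

# The divergence to model_c is infinite, whatever its fit elsewhere.
kl_c = kl_divergence(truth, model_c)

# KL(truth || b) - KL(truth || a) = E_truth[log a - log b]:
# the entropy of the truth cancels, so draws from the truth suffice.
exact_gap = kl_divergence(truth, model_b) - kl_divergence(truth, model_a)

rng = random.Random(0)
sample = rng.choices(range(3), weights=truth, k=20000)
estimated_gap = relative_log_score(sample, model_a, model_b)

print("KL(truth || model_c):", kl_c)
print("exact gap:", exact_gap, "estimated gap:", estimated_gap)
```

The Monte Carlo estimate never evaluates `truth` itself, it only averages log predictive densities over draws from it, which is the quantity that cross-validation style estimators approximate when the draws are held-out data.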

5 Responses to “a unified treatment of predictive model comparison”

  1. betanalpha Says:

    Yup — Cameron-Martin is not a forgiving theorem. But you can get around this by marginalizing over the finite-dimensional hyperparameters, which is why a fully Bayesian approach is so important in practice. Too bad those hyperparameter posteriors are so incredibly nasty!

  2. betanalpha Says:

    Ah, yes, the absolute continuity struck me a few days after I submitted the draft; my subconscious runs far too slowly! I’m curious whether absolute continuity is a requirement for all of the model comparison strategies, or whether they can be derived without it. In other words, has everyone been implicitly assuming absolute continuity by applying these methods?

    And I completely agree about the ultimate utility of many of these criteria — the preponderance of assumptions implicit in them is exactly what motivated the paper in the first place. Personally, I’ll embrace uncertainty and stick to qualitative judgements (and visual posterior predictive checks) and avoid even trying to make any definite claims.

    • There always seems to be an assumption in the misspecification stuff that the prior has sufficient mass in a Kullback-Leibler ball around the true generating measure. I suspect that when this isn’t true you still get some notion of convergence (perhaps of nice functionals), but if this assumption isn’t met, I’m not sure any general theory is known.

      • betanalpha Says:

        Yeah, if no element of the small world is absolutely continuous with respect to the true data generating process then any of these predictive scores will be infinite, so all of the comparison approaches will fail one way or another. If anything this gives formal support to the intuition of keeping one’s model as general as possible and never placing zero probability on any neighborhood of parameter space.

      • But it says *very* bad things about non-parametric models. Most infinite-dimensional probability measures are singular (if x(.) is a Gaussian Process, the laws of x(.) and (1+eps)*x(.) are mutually singular).
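The singularity claim in the last comment can be illustrated numerically. The sketch below is a hedged illustration that takes standard Brownian motion on [0, 1] as the Gaussian process, which is an assumption since the comment does not name a specific process; the names `brownian_increments`, `quadratic_variation`, `eps` and `n_steps` are likewise hypothetical. It relies on the classical fact that the sum of squared increments of Brownian motion over a fine grid concentrates around 1, hence around (1+eps)^2 for the rescaled path. A single statistic of the path therefore tells the two laws apart with a probability that tends to one as the grid is refined, which is what mutual singularity looks like in practice.

```python
import math
import random


def brownian_increments(n_steps, rng):
    """Increments of standard Brownian motion on [0, 1] over a regular grid."""
    dt = 1.0 / n_steps
    return [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n_steps)]


def quadratic_variation(increments):
    """Sum of squared increments, a discretised quadratic variation."""
    return sum(d * d for d in increments)


rng = random.Random(1)
eps = 0.1
n_steps = 50000

dx = brownian_increments(n_steps, rng)
dx_scaled = [(1.0 + eps) * d for d in dx]  # increments of (1 + eps) * x(.)

qv = quadratic_variation(dx)                # concentrates around 1
qv_scaled = quadratic_variation(dx_scaled)  # concentrates around (1 + eps)**2

# A threshold halfway between the two limits separates the two laws.
threshold = 0.5 * (1.0 + (1.0 + eps) ** 2)
print("QV of x:", qv, "QV of (1+eps)*x:", qv_scaled, "threshold:", threshold)
```

Since the standard deviation of the discretised quadratic variation shrinks like the square root of 2/`n_steps`, the separation becomes sharper as the grid is refined, and any log density ratio between the two laws degenerates accordingly.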
