## comparison of Bayesian predictive methods for model selection

> “Dupuis and Robert (2003) proposed choosing the simplest model with enough explanatory power, for example 90%, but did not discuss the effect of this threshold for the predictive performance of the selected models. We note that, in general, the relative explanatory power is an unreliable indicator of the predictive performance of the submodel.”

Juho Piironen and Aki Vehtari arXived a survey of Bayesian model selection methods that is a sequel to the extensive survey of Vehtari and Ojanen (2012). Because most of the methods described in this survey stem from Kullback-Leibler proximity calculations, it includes some description of our posterior projection method with Costas Goutis and Jérôme Dupuis. We indeed did not consider prediction in our papers and even failed to include a consistency result, as was pointed out to me by my discussant at a model choice meeting in Cagliari, in … 1999! Still, I remain fond of the notion of defining a prior on the embedding model and of deducing priors on the parameters of the submodels by Kullback-Leibler projections. It obviously relies on the notion that the embedding model is “true” and that the submodels are only approximations. In the simulation experiments included in this survey, the projection method “performs best in terms of the predictive ability” (p.15) and “is much less vulnerable to the selection induced bias” (p.16).
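For a Gaussian linear model, the projection step has a particularly simple form: minimizing the Kullback-Leibler divergence from the full model to a submodel amounts to regressing the full-model fitted values on the submodel covariates. Here is a minimal numpy sketch of that idea, with point estimates standing in for full posterior draws; the toy data, the function name `kl_project`, and all variable names are mine, not the survey's:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# "Full model" fit, standing in for the posterior mean of the embedding model
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
mu_full = X @ beta_hat  # full-model fitted values

def kl_project(mu_full, X_sub):
    """Project the full-model fit onto a submodel: for Gaussian linear
    models, minimizing the KL divergence from the full model reduces to
    regressing the full-model fitted values on the submodel covariates."""
    b, *_ = np.linalg.lstsq(X_sub, mu_full, rcond=None)
    return b, X_sub @ b

# Project onto the submodel keeping only the first two covariates
b_sub, mu_sub = kl_project(mu_full, X[:, :2])
```

The point of the construction is that the submodel parameters are not re-estimated from the data but deduced from the full-model fit, so the embedding model remains the reference throughout.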

Reading the other parts of the survey, I also came round to the view that model averaging makes much more sense than model choice in predictive terms. Sounds obvious stated that way, but it took me a while to reach this conclusion. Now, with our mixture representation, model averaging also comes as a natural consequence of the modelling, a point presumably not stressed enough in the current version of the paper. On the other hand, the MAP model now strikes me as artificial and tied to a very rudimentary loss function, one that neither accounts for the final purpose(s) of the model nor connects to the “all models are wrong” theorem.
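To make the averaging-versus-choice contrast concrete, here is a toy numpy sketch of a model-averaged prediction: two nested linear models are fitted, scored, and their predictions mixed rather than picking the single best model. The BIC-based weights are only a crude stand-in for proper posterior model probabilities, and the data and names are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(-2.0, 2.0, n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=n)

# Two candidate models: intercept only, and intercept plus slope
designs = [np.ones((n, 1)), np.column_stack([np.ones(n), x])]

def fit_and_bic(X, y):
    """OLS fit plus a BIC score, used here as a rough stand-in for the
    marginal likelihood of each model."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n
    loglik = -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
    return beta, -2.0 * loglik + X.shape[1] * np.log(n)

betas, bics = zip(*(fit_and_bic(X, y) for X in designs))
bics = np.array(bics)

# Approximate posterior model probabilities from the BIC scores
w = np.exp(-0.5 * (bics - bics.min()))
w /= w.sum()

# Model-averaged prediction at x = 1: a mixture over models rather than
# a commitment to the single MAP model
x_new = 1.0
preds = np.array([betas[0][0], betas[1][0] + betas[1][1] * x_new])
y_bma = float(w @ preds)
```

When one model dominates, the average collapses to the MAP choice anyway; the averaging only matters, and helps, when the weights are genuinely spread out.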

April 10, 2015 at 11:17 am

It seems there’s some overlap of ideas here with the difference between LASSO and ridge regression: do we want to assign some models (sets of covariates / explanatory variables) exactly zero weight (choose the MAP model), or do we want to keep all covariates, even those with near-zero coefficients (average over all models), loosely speaking?

I also note an overlap in that I’ve seen people use the LASSO or ridge regression to first identify the optimum set (or sets) of covariates, then perform a second round of fitting to find their optimum coefficients as a way to avoid the under-estimation of coefficients produced by the shrinkage term. This seems similar to the separate derivation of the parameter priors post-weighting in your mixture of models scheme.
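Loosely, both points in the comment above, zeroing out versus shrinking, and the two-stage refit, can be seen in the orthonormal-design special case, where the LASSO reduces to soft-thresholding the OLS coefficients while ridge applies a uniform shrinkage. A toy numpy sketch, with made-up coefficients and penalty:

```python
import numpy as np

def soft_threshold(z, lam):
    """LASSO solution for an orthonormal design: shrink OLS coefficients
    towards zero and set the small ones exactly to zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

beta_ols = np.array([2.0, -0.3, 0.05, 1.2])
lam = 0.5

beta_lasso = soft_threshold(beta_ols, lam)  # some entries exactly zero
beta_ridge = beta_ols / (1.0 + lam)         # all entries shrunk, none zero

# Two-stage "refit" idea: keep the LASSO-selected support, then undo the
# shrinkage by restoring the OLS values on that support
support = beta_lasso != 0
beta_refit = np.where(support, beta_ols, 0.0)
```

The refit step is exactly the second round of fitting described above: selection is done by the penalized fit, but the retained coefficients are re-estimated without the penalty to avoid their under-estimation.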

April 9, 2015 at 1:12 pm

I guess the best reason to do selection for predictive inference is the “data science” limit, where you just have too many predictors to feasibly use them all.

But otherwise, I think my favourite bit of your mixture as model choice paper is that you get good predictions for free.

I wonder if you get better predictions than you do with more traditional Bayesian model averaging, given that your mixture weights have a polynomial learning rate versus the exponential learning rate of Bayes factors.