Catching up faster by switching sooner
Here is our discussion (with Nicolas Chopin) of the Read Paper of last Wednesday by T. van Erven, P. Grünwald and S. de Rooij (Centrum voor Wiskunde en Informatica, Amsterdam), entitled "Catching up faster by switching sooner: a predictive approach to adaptive estimation with an application to the Akaike information criterion–Bayesian information criterion dilemma". The paper is still open to written discussions, to be published in Series B. Even though the topic is quite tangential to our interests, the fact that the authors work within a Bayesian framework called for the following (my main contribution being to point out that the procedure is not Bayesian, since it fails to incorporate the switch in the predictive (6) and hence uses the same data for all models under competition…):
Figure 1 - Bayes factors of Model 2 vs. Model 1 (gray line) and Model 3 vs. Model 1 (dark line), plotted against the number of observations, i.e. of iterations, when comparing three stochastic volatility models; see Chopin et al. (2011) for full details.
This paper is an interesting attempt at a particularly important problem. We nonetheless believe more classical tools should be used instead if models are truly relevant in the inference conducted by the authors: Figure 1, reproduced from Chopin et al. (2011), plots [against time] the Bayes factors of Models 2 and 3 vs. Model 1, where all models are state-space models of increasing complexity, fitted to some real data. In this context, one often observes that more complex models need more time to "assert themselves". On the other hand, even prediction based on Bayesian model averaging (BMA) is a very challenging computational problem (the only generic solution currently being the SMC² algorithm of the aforementioned paper), and we believe that the currently proposed predictive strategy will remain too computationally expensive for practical use with nonlinear state-space models.
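To make the notion of a Bayes factor plotted against the number of observations concrete, here is a minimal sketch on a toy conjugate pair of models. It is not the stochastic volatility setting of Chopin et al. (2011), which requires SMC². Everything in it is an assumption chosen for illustration: Model 1 is a fixed N(0,1), Model 2 is N(mu,1) with prior mu ~ N(0, tau2), and the data are simulated from N(0.3,1). The running log Bayes factor is accumulated from one-step-ahead predictive densities and checked against its closed form.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def sequential_log_bayes_factor(data, tau2=1.0):
    """Running log Bayes factor of M2 vs. M1 after each observation.

    M1 (illustrative): x_i ~ N(0, 1), no free parameter.
    M2 (illustrative): x_i ~ N(mu, 1) with prior mu ~ N(0, tau2).
    """
    log_bf = 0.0
    running_sum = 0.0
    path = []
    for n, x in enumerate(data):
        # M1 predictive does not depend on the past.
        lp1 = log_normal_pdf(x, 0.0, 1.0)
        # M2 posterior on mu after n observations is
        # N(running_sum / post_prec, 1 / post_prec).
        post_prec = 1.0 / tau2 + n
        post_mean = running_sum / post_prec
        # One-step-ahead predictive adds the unit observation variance.
        lp2 = log_normal_pdf(x, post_mean, 1.0 + 1.0 / post_prec)
        log_bf += lp2 - lp1
        running_sum += x
        path.append(log_bf)
    return path


def closed_form_log_bayes_factor(data, tau2=1.0):
    """Log Bayes factor of M2 vs. M1 from the marginal likelihoods."""
    n = len(data)
    s = sum(data)
    return (-0.5 * math.log(1 + n * tau2)
            + 0.5 * tau2 * s * s / (1 + n * tau2))


rng = random.Random(0)  # arbitrary seed, for reproducibility only
data = [rng.gauss(0.3, 1.0) for _ in range(500)]  # simulated, not real data
path = sequential_log_bayes_factor(data)
```

Plotting `path` against the observation index gives the analogue of Figure 1 for this toy pair. With a small true mean, the richer model typically starts by being penalised for its prior spread before the evidence accumulates in its favour.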
For other classes of models, since the provable methods put forward by this paper are based on "frozen strategies", which are hard to defend from a modelling perspective, and since the more reasonable "basic switch" strategy seems to perform as well numerically, we would be curious to see how the proposed methods compare to predictive distributions obtained from genuine Bayesian models. A true change-point model, for instance, would generate a coherent prediction strategy, which is not equivalent to the basic switch strategy. (Indeed, for one thing, the proposal made by the authors uses the whole past to compute the switching probabilities, rather than allocating the proper portion of the data to the relevant model. In this sense, the proposal is "using the data [at least] twice" in a pseudo-Bayesian setting, similar to that of Aitkin, 1991.) More generally, the authors seem to focus on situations where the true generative process belongs to a non-parametric class, and the completed set of models is an infinite sequence of richer and richer (but also more and more complex) parametric models, which is a very sensible set-up in practice. We then wonder whether it would make more sense to set the prior distribution over the switch parameter s in such a way that (a) switches only occur from one model to another model of greater complexity and (b) the number of switches is infinite.
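As a purely illustrative sketch of what a prior satisfying (a) and (b) could look like, the snippet below simulates a switch path in which the model index can only increase and every sojourn is almost surely finite, so that switches never stop as the horizon grows. The specific choices are our assumptions and not the authors' prior: moves are always to the next model, and sojourn times are geometric with a mean that grows with the model index. The function name `sample_switch_path` is hypothetical.

```python
import random


def sample_switch_path(horizon, rng):
    """Model index used at times 1..horizon under a toy monotone-switch prior.

    Assumptions (illustrative only):
      * the chain starts in model 1 and always moves to the next model,
        so switches only go towards greater complexity, as in (a);
      * the sojourn in model k is geometric with success probability
        1 / (k + 1), hence finite almost surely, so the number of switches
        is infinite over an infinite horizon, as in (b).
    """
    path = []
    model = 1
    while len(path) < horizon:
        p_switch = 1.0 / (model + 1)
        sojourn = 1
        # Stay in the current model until a switch is drawn.
        while rng.random() > p_switch:
            sojourn += 1
        path.extend([model] * sojourn)
        model += 1
    return path[:horizon]


switch_path = sample_switch_path(200, random.Random(1))  # arbitrary seed
```

Letting the expected sojourn grow with the model index reflects the remark above that more complex models need more time to assert themselves; any other increasing schedule would serve the same purpose.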