PAC-Bayesians
Yesterday, I took part in the thesis defence of James Ridgway [soon to move to the University of Bristol] at Université Paris-Dauphine. While I have already commented on his joint paper with Nicolas on the Pima Indians, I had not read in any depth another paper in the thesis, “On the properties of variational approximations of Gibbs posteriors”, written jointly with Pierre Alquier and Nicolas Chopin.
PAC stands for probably approximately correct. The approach starts with an empirical form of posterior, called the Gibbs posterior, where the log-likelihood is replaced with an empirical error rescaled by a factor λ. This factor, called the learning rate, is what Peter Grünwald (2012) optimises in his SafeBayes approach so as to obtain the (Kullback) closest approximation to the true unknown distribution. In the paper of James, Pierre and Nicolas, there is no visible Bayesian perspective, since the pseudo-posterior is used to define a randomised estimator that achieves optimal oracle bounds when λ is of order n. The purpose of the paper is rather to produce an efficient approximation to the Gibbs posterior, by using variational Bayes techniques, and to derive point estimators, with the added appeal that the approximation also achieves the oracle bounds. (Surprisingly, the authors do not leave the Pima Indians alone, as they use this benchmark for a ranking model.) Since there is no discussion of the choice of the learning rate λ, as opposed to Bissiri et al. (2013) which I discussed around Bayes 250, I have difficulty perceiving the possible impact of this representation on Bayesian analysis, except maybe as an ABC device, as suggested by Christophe Andrieu.
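For concreteness, here is a minimal sketch of the two objects discussed above, in standard PAC-Bayes notation (the precise loss ℓ and scaling conventions used in the paper may differ): the Gibbs posterior exponentiates an empirical risk r_n in place of the log-likelihood, and the variational step picks the closest member of a tractable family F in Kullback-Leibler divergence.

\[
\hat{\pi}_{n,\lambda}(\mathrm{d}\theta) \;\propto\; \exp\{-\lambda\, r_n(\theta)\}\,\pi(\mathrm{d}\theta),
\qquad
r_n(\theta) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell(\theta, Z_i),
\]
\[
\hat{\rho}_\lambda \;=\; \arg\min_{\rho \in \mathcal{F}} \; \mathrm{KL}\big(\rho \,\big\|\, \hat{\pi}_{n,\lambda}\big)
\;=\; \arg\min_{\rho \in \mathcal{F}} \Big\{ \lambda\, \mathbb{E}_{\rho}\big[r_n(\theta)\big] + \mathrm{KL}\big(\rho \,\big\|\, \pi\big) \Big\},
\]

where the second equality follows from expanding the KL divergence and dropping the normalising constant of the Gibbs posterior, i.e. the usual ELBO-style identity exploited by variational Bayes.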
September 22, 2015 at 11:44 pm
I really liked this paper, mainly because VB is amazing for finding the centre of a complex posterior and that’s all that is needed for PAC-Bayes.
I agree with you that it doesn’t say anything about the impact on Bayesian analysis, but I don’t think that’s a downside. PAC-Bayes is explicitly trying to find just one thing (rather than the everything that Bayes aims for), so they’re not compatible ideologies.
September 22, 2015 at 11:19 am
Thanks, xi’an, for discussing the paper. I would like to add that the goal of the paper is to show that variational approximations of Gibbs posteriors can achieve the same rate of convergence as the posterior itself (and to show the conditions under which this holds).
As discussed in the paper, we use cross-validation to choose λ, as it remains the “go to” method for general notions of risk… Also note that the PAC methodology has its origins in papers well before [Bissiri et al., 2013]; in particular see [Shawe-Taylor and Williamson, 1997; McAllester, 1998; Catoni, 2004].
Concerning the Pima Indians, they do not appear as the sole justification for the algorithms (as in some papers …). The algorithms are also tested on additional datasets with more covariates (in the case of classification) or more individuals when that is the computational issue (as for AUC ranking). Also note that the Pima Indians data is a very noisy dataset, making it trivial to sample from the posterior (probit/logit likelihood) but relatively hard as a classification task.