JSM 2015 [day #2]
Today, at JSM 2015 in Seattle, I attended several Bayesian sessions, having sadly missed the Dennis Lindley memorial session yesterday, as it clashed with my own session. In the first morning session, on Bayesian model choice, David Rossell (Warwick) defended non-local priors à la Johnson (& Rossell) as having better frequentist properties. Although I appreciate the concept of eliminating a neighbourhood of the null in the alternative prior, even from a Bayesian viewpoint, since it forces us to declare explicitly when the null is no longer acceptable, I find the asymptotic motivation for the prior less commendable and open to arbitrary choices that may lead to huge variations in the numerical value of the Bayes factor. Another talk, by Jin Wang, combined spike and slab priors with EM, bootstrap, and random forests for variable selection. But I could not fathom what the intended properties of the method were… besides returning another type of MAP estimate.
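To fix ideas, here is a minimal sketch of the moment (pMOM) non-local prior of Johnson & Rossell, whose density vanishes at the null value and hence excludes a neighbourhood of it, in contrast with a local Gaussian prior; the scale below is purely illustrative:

```python
import numpy as np
from scipy.stats import norm

def pmom_pdf(theta, tau=1.0):
    """Moment (pMOM) non-local prior density: theta^2 / tau * N(theta; 0, tau).
    It is exactly zero at theta = 0, unlike a local N(0, tau) prior."""
    return theta ** 2 / tau * norm.pdf(theta, loc=0.0, scale=np.sqrt(tau))

theta = np.linspace(-3, 3, 7)
print(np.round(pmom_pdf(theta), 3))             # vanishes at the null theta = 0
print(np.round(norm.pdf(theta, scale=1.0), 3))  # local prior stays positive at zero
```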
The second Bayesian session of the morn was mostly centred on sparsity and penalisation, with Carlos Carvalho and Rob McCulloch discussing a two-step method that goes through a standard posterior construction on the saturated model before using a utility function to select the pertinent variables. Separation of utility from prior was a novel concept for me, if not for Jay Kadane, who objected to Rob a few years ago that he put in the prior what should be in the utility… New for me because I always considered the product prior × utility as the main brick in building the Bayesian edifice… Following Herman Rubin’s motto! Veronika Rocková linked with this post-LASSO perspective by studying spike & slab priors based on Laplace priors. While Veronika’s goal was to achieve sparsity and consistency, this modelling made me wonder at the potential equivalent in our mixtures-for-testing approach. I concluded that having a mixture of two priors could be translated into a mixture over the sample with two different parameters, each with a different prior. A different topic, namely multiple testing, was treated by Jim Berger, who showed convincingly, in my opinion, that a Bayesian approach provides a significant advantage.
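As an illustration of this spike & slab construction based on Laplace priors, here is a minimal sketch, with purely illustrative penalty values rather than Veronika’s, of the mixture prior density and of the conditional probability that a coefficient is allocated to the slab rather than to the spike:

```python
import numpy as np

def laplace_pdf(beta, lam):
    """Double-exponential (Laplace) density with rate lam."""
    return 0.5 * lam * np.exp(-lam * np.abs(beta))

def spike_slab_laplace(beta, theta=0.2, lam_spike=20.0, lam_slab=1.0):
    """Spike & slab mixture of two Laplace densities: a concentrated spike
    (large lam_spike) and a diffuse slab (small lam_slab), with prior slab
    probability theta.  Returns the prior density and P(slab | beta)."""
    spike = (1 - theta) * laplace_pdf(beta, lam_spike)
    slab = theta * laplace_pdf(beta, lam_slab)
    return spike + slab, slab / (spike + slab)

beta = np.array([0.0, 0.05, 0.2, 1.0, 3.0])
dens, p_slab = spike_slab_laplace(beta)
print(np.round(dens, 3))
print(np.round(p_slab, 3))   # small coefficients are attributed to the spike
```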
In the afternoon, finalists of the ISBA Savage Award presented their PhD work, both in the theory and methods section and in the applications section. Besides Veronika Rocková’s work on a Bayesian approach to factor analysis, with a remarkable resolution via a non-parametric Indian buffet prior and a variable selection interpretation that avoids MCMC difficulties, Vinayak Rao wrote his thesis on MCMC methods for jump processes with a finite number of observations, using a highly convincing completion scheme that creates independence between blocks and which reminded me of the Papaspiliopoulos et al. (2005) trick for continuous-time processes. I do wonder at the potential impact of this method for processing coalescent trees in population genetics. Two talks, by Masanao Yajima and Christine Peterson, dealt with inference on graphical models, inferring the structure of a sparse graph by Bayesian methods, with applications to protein networks and, once again, a spike & slab prior in Christine’s work. The last talk, by Sayantan Banerjee, was connected to most others in this Savage session in that it also dealt with sparsity, when estimating a large covariance matrix. (It is always interesting to try to spot tendencies in awards and conferences: following the Bayesian non-parametric era, are we now entering the Bayesian sparsity era? We will see if this is the case at ISBA 2016!) And the winner is..?! We will know tomorrow night! In the meanwhile, congrats to my friends Sudipto Banerjee, Igor Prünster, Sylvia Richardson, and Judith Rousseau, who were named IMS Fellows tonight.
March 30, 2016 at 7:11 pm
Chris, regarding the EM algorithm presented by Jin at JSM, you can check our recent paper at: http://arxiv.org/abs/1603.04360.
October 19, 2015 at 4:26 pm
[…] There is a rapidly growing literature on shrinkage priors for such models, just look at Polson and Scott (2012), Caron and Doucet (2008), Carvalho, Polson, and Scott (2010) among many, many others, or simply have a look at the program of the last BNP conference. There is also a growing literature on theoretical properties of some of these priors. The Horseshoe prior was studied in van der Pas, Kleijn, and van der Vaart (2014), an extension of the Horseshoe was then studied in Ghosh and Chakrabarti (2015), and recently, the spike and slab Lasso was studied in Rocková (2015) (see also Xian ’Og) […]
September 17, 2015 at 6:56 pm
Well Dan, your points are critical, but they are certainly the type of issues one should be considering, hence appreciated! :-) One big advantage is assessing uncertainty (as Terrance argues); of course if p >> n one cannot do that exactly, but there are various approximations that often work reasonably well in practice (variational Bayes, local searches, split-and-conquer, etc.), and anyway it’s definitely better than ignoring uncertainty altogether (note that there’s frequentist work on confidence intervals for penalized likelihood, but as far as I’m aware this is still quite challenging). This is not just a philosophical discussion: in real applications one really wants to know if a variable that was not selected has a chance of actually being important (e.g. for designing follow-up studies). Another advantage of “classical” Bayesian variable selection (i.e. based on posterior probabilities on models rather than shrinkage) is that it typically returns more accurate solutions (in terms of selecting the “right” variables, though the precise meaning of this would take a much longer discussion); this has been investigated theoretically but is also often seen in practice. The result is not unique to the Bayesian paradigm, e.g. L0 likelihood penalties also have very good properties (returning smaller models with the same or better predictive properties than, say, L1).
Given these theoretical & practical advantages, the real question is: can we do it in practice? My personal experience is that, while a full model search is not feasible, finding local modes on the model space is no harder computationally than finding local modes for penalized likelihoods / continuous posteriors, and in practice it often returns a better solution (in terms of model selection). That is, for some reason the idea that “classical” Bayesian model selection is unfeasible has spread, but in my experience that is not the case. I would say that this is an interesting open research question, but definitely not an unfeasible one. I’ve done quite a lot of applied work and can tell you that (if careful) you can get pretty good results in practice that you just won’t get with most shrinkage approaches.
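To make the “classical” route a bit more concrete, here is a minimal sketch of what I mean, say with Zellner’s g-prior (fixed g), a uniform prior over models, and simulated data, enumerating a small model space and turning Bayes factors into posterior model probabilities:

```python
import itertools
import numpy as np

def gprior_log_bf(y, X, gamma, g):
    """Log Bayes factor of model 'gamma' (a tuple of column indices) against
    the intercept-only null, under Zellner's g-prior with fixed g
    (closed-form expression in terms of the model's R-squared)."""
    n = len(y)
    p_gam = len(gamma)
    if p_gam == 0:
        return 0.0
    Xg = np.column_stack([np.ones(n), X[:, list(gamma)]])
    resid = y - Xg @ np.linalg.lstsq(Xg, y, rcond=None)[0]
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 0.5 * (n - 1 - p_gam) * np.log1p(g) \
        - 0.5 * (n - 1) * np.log1p(g * (1.0 - r2))

rng = np.random.default_rng(0)
n, p, g = 100, 5, 100          # purely illustrative sizes and prior scale
X = rng.standard_normal((n, p))
y = 1.5 * X[:, 0] - 2.0 * X[:, 2] + rng.standard_normal(n)

models = [m for k in range(p + 1) for m in itertools.combinations(range(p), k)]
log_bf = np.array([gprior_log_bf(y, X, m, g) for m in models])
post = np.exp(log_bf - log_bf.max())
post /= post.sum()             # uniform prior over the 2^p models
for m, pr in sorted(zip(models, post), key=lambda t: -t[1])[:5]:
    print(m, round(pr, 3))     # top models by posterior probability
```

For larger p the full enumeration would of course be replaced by a local or stochastic search over the model space, which is exactly the point about local modes above.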
Arrgh, I’m afraid I wrote too long an answer, apologies. Re Xian’s comment, I completely agree that sensitivity to prior parameters is an issue for any model selection prior (local or non-local). However, results are often only sensitive to big changes in the prior (e.g. taking things to infinity as in the Jeffreys-Lindley-Bartlett paradox); moderate changes usually have little impact (Dawid has an interesting 1995 paper on that, “the trouble with Bayes factors”). In most real applications the range of “reasonable” prior parameter values is relatively small, e.g. in practice I check that the range of parameter values where I’m putting prior mass is reasonable (effect sizes beta/sigma between 0.2 and 2, odds or hazard ratios between 1.5 and 3, or whatever). I realize this is a bit subjective, but again small changes to these numbers don’t affect results very much, even for moderately large n. A perhaps more “objective” strategy I personally like is to follow the “unit information prior” philosophy and set default prior parameters to match the entropy / variance of the UIP; again, it often works surprisingly well. Another interesting option, pursued by Val Johnson, is calibration via frequentist type I error; this might make special sense when false positives cost us something (though I realize we’re sinning & intermingling the prior with the utility). Anyway, just my biased personal ramblings. :-)
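For instance, the prior-mass check I mentioned could look like the sketch below, which computes, for a few hypothetical prior scales, the mass that a normal slab and a moment (pMOM) non-local prior put on standardized effects beta/sigma between 0.2 and 2:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def normal_mass(tau, lo=0.2, hi=2.0):
    """Prior mass a N(0, tau) slab on beta/sigma puts on lo < |beta/sigma| < hi."""
    sd = np.sqrt(tau)
    return 2 * (norm.cdf(hi, scale=sd) - norm.cdf(lo, scale=sd))

def pmom_mass(tau, lo=0.2, hi=2.0):
    """Same check for a moment (pMOM) non-local prior, theta^2/tau * N(theta; 0, tau)."""
    dens = lambda t: t ** 2 / tau * norm.pdf(t, scale=np.sqrt(tau))
    return 2 * quad(dens, lo, hi)[0]

for tau in (0.1, 0.25, 0.5, 1.0):   # hypothetical prior scales to compare
    print(tau, round(normal_mass(tau), 2), round(pmom_mass(tau), 2))
```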
September 17, 2015 at 8:56 pm
Mixing prior and utility is not a sin in my opinion..! Thanks for the detailed answer, worthy of a post of its own!
September 18, 2015 at 1:14 am
Thanks for your answer David!
I guess my question would then be: if local modes are the computable quantity, how does this differ from penalised likelihood? (There may be a conceptual difference, but I’d still struggle not to see local MAP estimation as careful penalised likelihood.)
I guess when I say Bayesian Sparsity I mean “full posterior analysis for very high dimensional models with a priori mass on sparse signals”
My fear with something like VB is that the resulting intervals (which will not be the credible intervals) are not interpretable, and so it’s only the location of the prior that is meaningful.
Or, to put it differently, simply not ignoring uncertainty is not enough to make the inference more valid than one that only provides a mode. In some sense, I feel that not providing uncertainty is better than providing too narrow (or otherwise misleading) uncertainty.
(I’ve not seen high-dimensional [bigger than, say, 5e5] models where the computed interval is meaningful, but that doesn’t mean that such a thing doesn’t exist! And if it does I’d love to see it!)
I guess that I see Bayes as a means to an end (meaningful inference) rather than an end in itself. So I want to see some reason to go to the massive inconvenience of a Bayesian analysis before I do it.
So I still feel like I don’t understand the aim of the Bayesian sparsity community. It might be that I’m making a false distinction between aiming for “sparsity” and “posterior sparsity”. And this is certainly not the corner of statistics in which I spend my time (which, incidentally, does not have totally different problems).
But even after chewing on your excellent response, I’m still having problems with Bayesian sparsity. (But, as always, there’s no reason to expect the problem isn’t just me.)
August 11, 2015 at 9:14 am
I’m struggling with Bayesian sparsity, to be honest. I’m just not sure what it gives you over non-Bayesian versions. It’s fun maths, but to consider full posterior inference for p >> n is a pipe dream at the moment unless n is comically small. (None of the “big data Bayes” things deal with this structure, so it’s basically unimplementable.)
I feel like I’m probably being uncharitable here, but I’m not seeing a practical advantage to this work (which, let’s face it, has been going on for a while). But this is one of those situations where I’d rather be wrong…
August 27, 2015 at 9:06 pm
The large size of the model space is what suggests a Bayesian approach. The LASSO and its variants will return an answer that represents some local mode, but almost surely not the global mode. So the returned model is probably not robust, and the standard errors well under-estimate the true uncertainty of the joint distribution over the coefficients. Of course it is hard, and maybe not yet well resolved, how to more reliably (under a fixed sample size, not asymptotically) assign high posterior probability to the set of models close to the truth. Of course, no method can sample or assign a probability to every point in the model space. The hope is to have a method that finds the high posterior region. That said, at least the Bayesian approach is actually contemplating the true uncertainty.