Archive for spike-and-slab prior

JSM 2018 [#3]

Posted in Mountains, Statistics, Travel, University life with tags , , , , , , , , , , , , , , on August 1, 2018 by xi'an

As I skipped day #2 for climbing, here I am on day #3, attending JSM 2018, with a [fully Canadian!] session on (conditional) copula (where Bruno Rémillard talked of copulas for mixed data, with unknown atoms, which sounded like an impossible target!), and another on four highlights from Bayesian Analysis, (the journal), with Maria Terres defending the (often ill-considered!) spectral approach within Bayesian analysis, modelling spectral densities (Fourier transforms of correlations functions, not probability densities), an advantage compared with MCAR modelling being the automated derivation of dependence graphs. While the spectral ghost did not completely dissipate for me, the use of DIC that she mentioned at the very end seems to call for investigation as I do not know of well-studied cases of complex dependent data with clearly specified DICs. Then Chris Drobandi was speaking of ABC being used for prior choice, an idea I vaguely remember seeing quite a while ago as a referee (or another paper!), paper in BA that I missed (and obviously did not referee). Using the same reference table works (for simple ABC) with different datasets but also different priors. I did not get first the notion that the reference table also produces an evaluation of the marginal distribution but indeed the entire simulation from prior x generative model gives a Monte Carlo representation of the marginal, hence the evidence at the observed data. Borrowing from Evans’ fringe Bayesian approach to model choice by prior predictive check for prior-model conflict. I remain sceptic or at least agnostic on the notion of using data to compare priors. And here on using ABC in tractable settings.

The afternoon session was [a mostly Australian] Advanced Bayesian computational methods,  with Robert Kohn on variational Bayes, with an interesting comparison of (exact) MCMC and (approximative) variational Bayes results for some species intensity and the remark that forecasting may be much more tolerant to the approximation than estimation. Making me wonder at a possibility of assessing VB on the marginals manageable by MCMC. Unless I miss a complexity such that the decomposition is impossible. And Antonietta Mira on estimating time-evolving networks estimated by ABC (which Anto first showed me in Orly airport, waiting for her plane!). With a possibility of a zero distance. Next talk by Nadja Klein on impicit copulas, linked with shrinkage properties I was unaware of, including the case of spike & slab copulas. Michael Smith also spoke of copulas with discrete margins, mentioning a version with continuous latent variables (as I thought could be done during the first session of the day), then moving to variational Bayes which sounds quite popular at JSM 2018. And David Gunawan made a presentation of a paper mixing pseudo-marginal Metropolis with particle Gibbs sampling, written with Chris Carter and Robert Kohn, making me wonder at their feature of using the white noise as an auxiliary variable in the estimation of the likelihood, which is quite clever but seems to get against the validation of the pseudo-marginal principle. (Warning: I have been known to be wrong!)

estimation versus testing [again!]

Posted in Books, Statistics, University life with tags , , , , , , , , , , on March 30, 2017 by xi'an

The following text is a review I wrote of the paper “Parameter estimation and Bayes factors”, written by J. Rouder, J. Haff, and J. Vandekerckhove. (As the journal to which it is submitted gave me the option to sign my review.)

The opposition between estimation and testing as a matter of prior modelling rather than inferential goals is quite unusual in the Bayesian literature. In particular, if one follows Bayesian decision theory as in Berger (1985) there is no such opposition, but rather the use of different loss functions for different inference purposes, while the Bayesian model remains single and unitarian.

Following Jeffreys (1939), it sounds more congenial to the Bayesian spirit to return the posterior probability of an hypothesis H⁰ as an answer to the question whether this hypothesis holds or does not hold. This however proves impossible when the “null” hypothesis H⁰ has prior mass equal to zero (or is not measurable under the prior). In such a case the mathematical answer is a probability of zero, which may not satisfy the experimenter who asked the question. More fundamentally, the said prior proves inadequate to answer the question and hence to incorporate the information contained in this very question. This is how Jeffreys (1939) justifies the move from the original (and deficient) prior to one that puts some weight on the null (hypothesis) space. It is often argued that the move is unnatural and that the null space does not make sense, but this only applies when believing very strongly in the model itself. When considering the issue from a modelling perspective, accepting the null H⁰ means using a new model to represent the model and hence testing becomes a model choice problem, namely whether or not one should use a complex or simplified model to represent the generation of the data. This is somehow the “unification” advanced in the current paper, albeit it does appear originally in Jeffreys (1939) [and then numerous others] rather than the relatively recent Mitchell & Beauchamp (1988). Who may have launched the spike & slab denomination.

I have trouble with the analogy drawn in the paper between the spike & slab estimate and the Stein effect. While the posterior mean derived from the spike & slab posterior is indeed a quantity drawn towards zero by the Dirac mass at zero, it is rarely the point in using a spike & slab prior, since this point estimate does not lead to a conclusion about the hypothesis: for one thing it is never exactly zero (if zero corresponds to the null). For another thing, the construction of the spike & slab prior is both artificial and dependent on the weights given to the spike and to the slab, respectively, to borrow expressions from the paper. This approach thus leads to model averaging rather than hypothesis testing or model choice and therefore fails to answer the (possibly absurd) question as to which model to choose. Or refuse to choose. But there are cases when a decision must be made, like continuing a clinical trial or putting a new product on the market. Or not.

In conclusion, the paper surprisingly bypasses the decision-making aspect of testing and hence ends up with a inconclusive setting, staying midstream between Bayes factors and credible intervals. And failing to provide a tool for decision making. The paper also fails to acknowledge the strong dependence of the Bayes factor on the tail behaviour of the prior(s), which cannot be [completely] corrected by a finite sample, hence its relativity and the unreasonableness of a fixed scale like Jeffreys’ (1939).

variational Bayes for variable selection

Posted in Books, Statistics, University life with tags , , , , , , , on March 30, 2016 by xi'an

Lake Agnes, Canadian Rockies, July 2007Xichen Huang, Jin Wang and Feng Liang have recently arXived a paper where they rely on variational Bayes in conjunction with a spike-and-slab prior modelling. This actually stems from an earlier paper by Carbonetto and Stephens (2012), the difference being in the implementation of the method, which is less Gibbs-like for the current paper. The approach is not fully Bayesian in that, not only an approximate (variational) representation is used for the parameters of interest (regression coefficient and presence-absence indicators) but also the nuisance parameters are replaced with MAPs. The variational approximation on the regression parameters is an independent product of spike-and-slab distributions. The authors show the approximate approach is consistent in both frequentist and Bayesian terms (under identifiability assumptions). The method is undoubtedly faster than MCMC since it shares many features with EM but I still wonder at the Bayesian interpretability of the outcome, which writes out as a product of estimated spike-and-slab mixtures. First, the weights in the mixtures are estimated by EM, hence fixed. Second, the fact that the variational approximation is a product is confusing in that the posterior distribution on the regression coefficients is unlikely to produce posterior independence.

JSM 2015 [day #2]

Posted in Books, R, Statistics, Travel, University life with tags , , , , , , , , , , , , , on August 11, 2015 by xi'an

Today, at JSM 2015, in Seattle, I attended several Bayesian sessions, having sadly missed the Dennis Lindley memorial session yesterday, as it clashed with my own session. In the morning sessions on Bayesian model choice, David Rossell (Warwick) defended non-local priors à la Johnson (& Rossell) as having better frequentist properties. Although I appreciate the concept of eliminating a neighbourhood of the null in the alternative prior, even from a Bayesian viewpoint since it forces us to declare explicitly when the null is no longer acceptable, I find the asymptotic motivation for the prior less commendable and open to arbitrary choices that may lead to huge variations in the numerical value of the Bayes factor. Another talk by Jin Wang merged spike and slab with EM with bootstrap with random forests in variable selection. But I could not fathom what the intended properties of the method were… Besides returning another type of MAP.

The second Bayesian session of the morn was mostly centred on sparsity and penalisation, with Carlos Carvalho and Rob McCulloch discussing a two step method that goes through a standard posterior  construction on the saturated model, before using a utility function to select the pertinent variables. Separation of utility from prior was a novel concept for me, if not for Jay Kadane who objected to Rob a few years ago that he put in the prior what should be in the utility… New for me because I always considered the product prior x utility as the main brick in building the Bayesian edifice… Following Herman Rubin’s motto! Veronika Rocková linked with this post-LASSO perspective by studying spike & slab priors based on Laplace priors. While Veronicka’s goal was to achieve sparsity and consistency, this modelling made me wonder at the potential equivalent in our mixtures for testing approach. I concluded that having a mixture of two priors could be translated in a mixture over the sample with two different parameters, each with a different prior. A different topic, namely multiple testing, was treated by Jim Berger, who showed convincingly in my opinion that a Bayesian approach provides a significant advantage.

In the afternoon finalists of the ISBA Savage Award presented their PhD work, both in the theory and  methods section and in the application section. Besides Veronicka Rocková’s work on a Bayesian approach to factor analysis, with a remarkable resolution via a non-parametric Indian buffet prior and a variable selection interpretation that avoids MCMC difficulties, Vinayak Rao wrote his thesis on MCMC methods for jump processes with a finite number of observations, using a highly convincing completion scheme that created independence between blocks and which reminded me of the Papaspiliopoulos et al. (2005) trick for continuous time processes. I do wonder at the potential impact of this method for processing the coalescent trees in population genetics. Two talks dealt with inference on graphical models, Masanao Yajima and  Christine Peterson, inferring the structure of a sparse graph by Bayesian methods.  With applications in protein networks. And with again a spike & slab prior in Christine’s work. The last talk by Sayantan Banerjee was connected to most others in this Savage session in that it also dealt with sparsity. When estimating a large covariance matrix. (It is always interesting to try to spot tendencies in awards and conferences. Following the Bayesian non-parametric era, are we now entering the Bayesian sparsity era? We will see if this is the case at ISBA 2016!) And the winner is..?! We will know tomorrow night! In the meanwhile, congrats to my friends Sudipto Banerjee, Igor Prünster, Sylvia Richardson, and Judith Rousseau who got nominated IMS Fellows tonight.

a day for comments

Posted in Mountains, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , , , , , , , on April 21, 2014 by xi'an

As I was flying over Skye (with [maybe] a first if hazy perspective on the Cuillin ridge!) to Iceland, three long sets of replies to some of my posts appeared on the ‘Og:

Thanks to them for taking the time to answer my musings…

 

shrinkage-thresholding MALA for Bayesian variable selection

Posted in Statistics, University life with tags , , , , , on March 10, 2014 by xi'an

IMG_2515Amandine Shreck along with her co-authors Gersende Fort, Sylvain LeCorff, and Eric Moulines, all from Telecom Paristech, has undertaken to revisit the problem of large p small n variable selection. The approach they advocate mixes Langevin algorithms with trans-model moves with shrinkage thresholding. The corresponding Markov sampler is shown to be geometrically ergodic, which may be a première in that area. The paper was arXived in December but I only read it on my flight to Calgary, not overly distracted by the frozen plains of Manitoba and Saskatchewan. Nor by my neighbour watching Hunger Games II.)

A shrinkage-thresholding operator is defined as acting on the regressor matrix towards producing sparse versions of this matrix. (I actually had trouble picturing the model until Section 2.2 where the authors define the multivariate regression model, making the regressors a matrix indeed. With a rather unrealistic iid Gaussian noise. And with an unknown number of relevant rows, hence a varying dimension model. Note that this is a strange regression in that the regression coefficients are known and constant across all models.) Because the Langevin algorithm requires a gradient to operate, the log target is divided between a differentiable and a non-differentiable parts, the later accommodating the Dirac masses in the dominating measure. The new MALA moves involve applying the above shrinkage-thresholding operator to a regular Langevin proposal, hence moving to sub-spaces and sparser representations.

The thresholding functions are based on positive part operators, which means that the Markov chain does not visit some neighbourhoods of zero in the embedding and in the sparser spaces. In other words, the proposal operates between models of varying dimensions without further ado because the point null hypotheses are replaced with those neighbourhoods. Hence it is not exactly simulating from the “original” posterior, which may be a minor caveat or not. Not if defining the neighbourhoods is driven by an informed or at least spelled-out choice of a neighbourhood of zero where the coefficients are essentially identified with zero. The difficulty is then in defining how close is close enough. Especially since the thresholding functions seem to all depend on a single number which does not depend on the regressor matrix. It would be interesting to see if the g-prior version could be developed as well… Actually, I would have also included a dose of g-prior in the Langevin move, rather than using an homogeneous normal noise.

The paper contains a large experimental part where the performances of the method are evaluated on various simulated datasets. It includes a comparison with reversible jump MCMC, which slightly puzzles me: (a) I cannot see from the paper whether or not the RJMCMC is applied to the modified (thresholded) posterior, as a regular RJMCMC would not aim at the same target, but the appendix does not indicate a change of target; (b) the mean error criterion for which STMALA does better than RJMCMC is not defined, but the decrease of this criterion along iterations seems to indicate that convergence has not yet occured, since it does not completely level up after 3 10⁵ iterations.

I must have mentioned it in another earlier post, but I find somewhat ironical to see those thresholding functions making a comeback after seeing the James-Stein and smooth shrinkage estimators taking over the then so-called pre-test versions in the 1970’s (Judge and Bock, 1978) and 1980’s. There are obvious reasons for this return, moving away from quadratic loss being one.

workshop a Padova (#2)

Posted in Books, Statistics with tags , , , , , , , , on March 23, 2013 by xi'an

DSC_4739This morning session at the workshop Recent Advances in statistical inference: theory and case studies was a true blessing for anyone working in Bayesian model choice! And it did give me ideas to complete my current paper on the Jeffreys-Lindley paradox, and more. Attending the talks in the historical Gioachino Rossini room of the fabulous Café Pedrocchi with the Italian spring blue sky as a background surely helped! (It is only beaten by this room of Ca’Foscari overlooking the Gran Canale where we had a workshop last Fall…)

First, Phil Dawid gave a talk on his current work with Monica Musio (who gave a preliminary talk on this in Venezia last fall) on the use of new score functions to compare statistical models. While the regular Bayes factor is based on the log score, comparing the logs of the predictives at the observed data, different functions of the predictive q can be used, like the Hyvärinen score

S(x,q)=\Delta\sqrt{q(x)}\big/\sqrt{q(x)}

which offers the immense advantage of being independent of the normalising constant and hence can also be used for improper priors. As written above, a very deep finding that could at last allow for the comparison of models based on improper priors without requiring convoluted constructions (see below) to make the “constants meet”. I first thought the technique was suffering from the same shortcoming as Murray Aitkin’s integrated likelihood, but I eventually figured out (where) I was wrong!

The second talk was given by Ed George, who spoke on his recent research with Veronika Rocková dealing with variable selection via an EM algorithm that proceeds much much faster to the optimal collection of variables, when compared with the DMVS solution of George and McCulloch (JASA, 1993). (I remember discussing this paper with Ed in Laramie during the IMS meeting in the summer of 1993.) This resurgence of the EM algorithm in this framework is both surprising (as the missing data structure represented by the variable indicators could have been exploited much earlier) and exciting, because it opens a new way to explore the most likely models in this variable selection setting and to eventually produce the median model of Berger and Barbieri (Annals of Statistics, 2004). In addition, this approach allows for a fast comparison of prior modellings on the missing variable indicators, showing in some examples a definitive improvement brought by a Markov random field structure. Given that it also produces a marginal posterior density on the indicators, values of hyperparameters can be assessed, escaping the Jeffreys-Lindley paradox (which was clearly a central piece of today’s talks and discussions). I would like to see more details on the MRF part, as I wonder which structure is part of the input and which one is part of the inference.

The third talk of the morning was Susie Bayarri’s, about a collection of desiderata or criteria for building an objective prior in model comparison and achieving a manageable closed-form solution in the case of the normal linear model. While I somehow disagree with the information criterion, which states that the divergence of the likelihood ratio should imply a corresponding divergence of the Bayes factor. While I definitely agree with the invariance argument leading to using the same (improper) prior over parameters common to models under comparison, this may sound too much of a trick to outsiders, especially when accounting for the score solution of Dawid and Musio. Overall, though, I liked the outcome of a coherence reference solution for linear models that could clearly be used as a default in this setting, esp. given the availability of an R package called BayesVarSel. (Even if I also like our simpler solution developped in the incoming edition of Bayesian Core, also available in the bayess R package!) In his discussion, Guido Consonni highlighted the philosophical problem of considering “common paramaters”, a perspective I completely subscribe to, even though I think all that matters is the justification of having a common prior over formally equivalent parameters, even though this may sound like a pedantic distinction to many!

Due to a meeting of the scientific committee of the incoming O’Bayes 2013 meeting (in Duke, December, more about this soon!), whose most members were attending this workshop, I missed the beginning of Alan Aggresti’s talk and could not catch up with the central problem he was addressing (the pianist on the street outside started pounding on his instrument as if intent to break it apart!). A pity as problems with contingency tables are certainly of interest to me… By the end of Alan’s talk, I wished someone would shoot the pianist playing outside (even though he was reasonably gifted) as I had gotten a major headache from his background noise. Following Noel Cressie’s talk proved just as difficult, although I could see his point in comparing very diverse predictors for big Data problems without much of a model structure and even less of a  and I decided to call the day off, despite wishing to stay for Eduardo Gutiérrez-Pena’s talk on conjugate predictives and entropies which definitely interested me… Too bad really (blame the pianist!)