Jonathan Harrison and Ruth Baker (Oxford University) arXived this morning a paper on the optimal combination of summaries for ABC in the sense of deriving the proper weights in an Euclidean distance involving all the available summaries. The idea is to find the weights that lead to the maximal distance between prior and posterior, in a way reminiscent of Bernardo’s (1979) maximal information principle. Plus a sparsity penalty à la Lasso. The associated algorithm is sequential in that the weights are updated at each iteration. The paper does not get into theoretical justifications but considers instead several examples with limited numbers of both parameters and summary statistics. Which may highlight the limitations of the approach in that handling (and eliminating) a large number of parameters may prove impossible this way, when compared with optimisation methods like random forests. Or summary-free distances between empirical distributions like the Wasserstein distance.
Archive for Lasso
As CHANCE book editor, I received the other day from Oxford University Press acts from an École de Physique des Houches on Statistical Physics, Optimisation, Inference, and Message-Passing Algorithms that took place there in September 30 – October 11, 2013. While it is mostly unrelated with Statistics, and since Igor Caron already reviewed the book a year and more ago, I skimmed through the few chapters connected to my interest, from Devavrat Shah’s chapter on graphical models and belief propagation, to Andrea Montanari‘s denoising and sparse regression, including LASSO, and only read in some detail Manfred Opper’s expectation propagation chapter. This paper made me realise (or re-realise as I had presumably forgotten an earlier explanation!) that expectation propagation can be seen as a sort of variational approximation that produces by a sequence of iterations the distribution within a certain parametric (exponential) family that is the closest to the distribution of interest. By writing the Kullback-Leibler divergence the opposite way from the usual variational approximation, the solution equates the expectation of the natural sufficient statistic under both models… Another interesting aspect of this chapter is the connection with estimating normalising constants. (I noticed a slight typo on p.269 in the final form of the Kullback approximation q() to p().
Today, at JSM 2015, in Seattle, I attended several Bayesian sessions, having sadly missed the Dennis Lindley memorial session yesterday, as it clashed with my own session. In the morning sessions on Bayesian model choice, David Rossell (Warwick) defended non-local priors à la Johnson (& Rossell) as having better frequentist properties. Although I appreciate the concept of eliminating a neighbourhood of the null in the alternative prior, even from a Bayesian viewpoint since it forces us to declare explicitly when the null is no longer acceptable, I find the asymptotic motivation for the prior less commendable and open to arbitrary choices that may lead to huge variations in the numerical value of the Bayes factor. Another talk by Jin Wang merged spike and slab with EM with bootstrap with random forests in variable selection. But I could not fathom what the intended properties of the method were… Besides returning another type of MAP.
The second Bayesian session of the morn was mostly centred on sparsity and penalisation, with Carlos Carvalho and Rob McCulloch discussing a two step method that goes through a standard posterior construction on the saturated model, before using a utility function to select the pertinent variables. Separation of utility from prior was a novel concept for me, if not for Jay Kadane who objected to Rob a few years ago that he put in the prior what should be in the utility… New for me because I always considered the product prior x utility as the main brick in building the Bayesian edifice… Following Herman Rubin’s motto! Veronika Rocková linked with this post-LASSO perspective by studying spike & slab priors based on Laplace priors. While Veronicka’s goal was to achieve sparsity and consistency, this modelling made me wonder at the potential equivalent in our mixtures for testing approach. I concluded that having a mixture of two priors could be translated in a mixture over the sample with two different parameters, each with a different prior. A different topic, namely multiple testing, was treated by Jim Berger, who showed convincingly in my opinion that a Bayesian approach provides a significant advantage.
In the afternoon finalists of the ISBA Savage Award presented their PhD work, both in the theory and methods section and in the application section. Besides Veronicka Rocková’s work on a Bayesian approach to factor analysis, with a remarkable resolution via a non-parametric Indian buffet prior and a variable selection interpretation that avoids MCMC difficulties, Vinayak Rao wrote his thesis on MCMC methods for jump processes with a finite number of observations, using a highly convincing completion scheme that created independence between blocks and which reminded me of the Papaspiliopoulos et al. (2005) trick for continuous time processes. I do wonder at the potential impact of this method for processing the coalescent trees in population genetics. Two talks dealt with inference on graphical models, Masanao Yajima and Christine Peterson, inferring the structure of a sparse graph by Bayesian methods. With applications in protein networks. And with again a spike & slab prior in Christine’s work. The last talk by Sayantan Banerjee was connected to most others in this Savage session in that it also dealt with sparsity. When estimating a large covariance matrix. (It is always interesting to try to spot tendencies in awards and conferences. Following the Bayesian non-parametric era, are we now entering the Bayesian sparsity era? We will see if this is the case at ISBA 2016!) And the winner is..?! We will know tomorrow night! In the meanwhile, congrats to my friends Sudipto Banerjee, Igor Prünster, Sylvia Richardson, and Judith Rousseau who got nominated IMS Fellows tonight.
Following last week read of Hartigan and Wong’s 1979 K-Means Clustering Algorithm, my Master students in the Reading Classics Seminar course, listened today to Agnė Ulčinaitė covering Rob Tibshirani‘s original LASSO paper Regression shrinkage and selection via the lasso in JRSS Series B. Here are her (Beamer) slides
Again not the easiest paper in the list, again mostly algorithmic and requiring some background on how it impacted the field. Even though Agnė also went through the Elements of Statistical Learning by Hastie, Friedman and Tibshirani, it was hard to get away from the paper to analyse more widely the importance of the paper, the connection with the Bayesian (linear) literature of the 70’s, its algorithmic and inferential aspects, like the computational cost, and the recent extensions like Bayesian LASSO. Or the issue of handling n<p models. Remember that one of the S in LASSO stands for shrinkage: it was quite pleasant to hear again about ridge estimators and Stein’s unbiased estimator of the risk, as those were themes of my Ph.D. thesis… (I hope the students do not get discouraged by the complexity of those papers: there were fewer questions and fewer students this time. Next week, the compass will move to the Bayesian pole with a talk on Lindley and Smith’s 1973 linear Bayes paper by one of my PhD students.)
As indicated a few weeks ago, we have received very encouraging reviews from Bayesian Analysis about our [Gilles Celeux, Mohammed El Anbari, Jean-Michel Marin and myself] our comparative study of Bayesian and non-Bayesian variable selections procedures (“Regularization in regression: comparing Bayesian and frequentist methods in a poorly informative situation“) to Bayesian Analysis. We have just rearXived and resubmitted it with additional material and hope this is the last round. (I must acknowledge a limited involvement at this final stage of the paper. Had I had more time available, I would have liked to remove the numerous tables and turn them into graphs…)
The conference in honour of Larry Brown was quite exciting, with lots of old friends gathered in Philadelphia and lots of great talks either recollecting major works of Larry and coauthors or presenting fairly interesting new works. Unsurprisingly, a large chunk of the talks was about admissibility and minimaxity, with John Hartigan starting the day re-reading Larry masterpiece 1971 paper linking admissibility and recurrence of associated processes, a paper I always had trouble studying because of both its depth and its breadth! Bill Strawderman presented a new if classical minimaxity result on matrix estimation and Anirban DasGupta some large dimension consistency results where the choice of the distance (total variation versus Kullback deviance) was irrelevant. Ed George and Susie Bayarri both presented their recent work on g-priors and their generalisation, which directly relate to our recent paper on that topic. On the afternoon, Holger Dette showed some impressive mathematics based on Elfving’s representation and used in building optimal designs. I particularly appreciated the results of a joint work with Larry presented by Robert Wolpert where they classified all Markov stationary infinitely divisible time-reversible integer-valued processes. It produced a surprisingly small list of four cases, two being trivial.. The final talk of the day was about homology, which sounded a priori rebutting, but Robert Adler made it extremely entertaining, so much that I even failed to resent the powerpoint tricks! The next morning, Mark Low gave a very emotional but also quite illuminating about the first results he got during his PhD thesis at Cornell (completing the thesis when I was using Larry’s office!). Brenda McGibbon went back to the three truncated Poisson papers she wrote with Ian Johnstone (via gruesome 13 hour bus rides from Montréal to Ithaca!) and produced an illuminating explanation of the maths at work for moving from the Gaussian to the Poisson case in a most pedagogical and enjoyable fashion. Larry Wasserman explained the concepts at work behind the lasso for graphs, entertaining us with witty acronyms on the side!, and leaving out about 3/4 of his slides! (The research group involved in this project produced an R package called huge.) Joe Eaton ended up the morning with a very interesting result showing that using the right Haar measure as a prior leads to a matching prior, then showing why the consequences of the result are limited by invariance itself. Unfortunately, it was then time for me to leave and I will miss (in both meanings of the term) the other half of the talks. Especially missing Steve Fienberg’s talk for the third time in three weeks! Again, what I appreciated most during those two days (besides the fact that we were all reunited on the very day of Larry’s birthday!) was the pain most speakers went to to expose older results in a most synthetic and intuitive manner… I also got new ideas about generalising our parallel computing paper for random walk Metropolis-Hastings algorithms and for optimising across permutation transforms.
After a huge delay, since the project started in 2006 and was first presented in Banff in 2007 (as well as included in the Bayesian Core), Gilles Celeux, Mohammed El Anbari, Jean-Michel Marin, and myself have eventually completed our paper on using hyper-g priors variable selection and regularisation in linear models . The redaction of this paper was mostly delayed due to the publication of the 2007 JASA paper by Feng Liang, Rui Paulo, German Molina, Jim Berger, and Merlise Clyde, Mixtures of g-priors for Bayesian variable selection. We had indeed (independently) obtained very similar derivations based on hypergeometric function representations but, once the above paper was published, we needed to add material to our derivation and chose to run a comparison study between Bayesian and non-Bayesian methods for a series of simulated and true examples. It took a while to Mohammed El Anbari to complete this simulation study and even longer for the four of us to convene and agree on the presentation of the paper. The only difference between Liang et al.’s (2007) modelling and ours is that we do not distinguish between the intercept and the other regression coefficients in the linear model. On the one hand, this gives us one degree of freedom that allows us to pick an improper prior on the variance parameter. On the other hand, our posterior distribution is not invariant under location transforms, which was a point we heavily debated in Banff… The simulation part shows that all “standard” Bayesian solutions lead to very similar decisions and that they are much more parsimonious than regularisation techniques.
Two other papers posted on arXiv today address the model choice issue. The first one by Bruce Lindsay and Jiawei Liu introduces a credibility index, and the second one by Bazerque, Mateos, and Giannakis considers group-lasso on splines for spectrum cartography.