Archive for variable selection

more and more control variates

Posted in Statistics on October 5, 2018 by xi'an

A few months ago, François Portier and Johan Segers arXived a paper on a question that has always puzzled me, namely how to add control variates to a Monte Carlo estimator and when to stop adding them if needed! The paper is called Monte Carlo integration with a growing number of control variates. It is related to the earlier Oates, Girolami and Chopin (2017), which I remember discussing with Chris when he was in Warwick. The puzzling issue of control variates is [for me] that, while the optimal weight always decreases the variance of the resulting estimate, implementing the method in practice may increase the actual variance. Glynn and Szechtman at MCqMC 2000 identify six different ways of creating the estimate, depending on how the covariance matrix, denoted P(hh’), is estimated, with only one version integrating constant functions and the control variates exactly. This version also happens to be a solution to an empirical likelihood maximisation under the empirical constraints imposed by the control variates. Another interesting feature is that, when the number m of control variates grows with the number n of simulations, the asymptotic variance goes to zero, meaning that the control variate estimator converges at a faster rate.

Creating an infinite sequence of control variates sounds unachievable in a realistic situation. Legendre polynomials are used in the paper, but is there a generic and cheap way of getting these? And … control variate selection, anyone?!
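To make the mechanics concrete, here is a minimal numpy sketch of the control variate construction, assuming a toy one-dimensional integrand on [0,1] and using shifted Legendre polynomials (whose integrals over [0,1] are exactly zero) as control variates, with the weight estimated by ordinary least squares as in one of the regression versions above; this is only an illustration, not the estimator studied in the paper.

import numpy as np
from numpy.polynomial.legendre import Legendre

rng = np.random.default_rng(0)

def h(x):
    # toy integrand on [0,1]; the target is E[h(U)] = e - 1 with U uniform on [0,1]
    return np.exp(x)

n, m = 10_000, 8   # number of simulations and of control variates
x = rng.uniform(size=n)

# shifted Legendre polynomials P_k(2x-1), k = 1..m: each integrates to zero on [0,1]
controls = np.column_stack([Legendre.basis(k)(2 * x - 1) for k in range(1, m + 1)])

# plain Monte Carlo estimate
mc = h(x).mean()

# control-variate estimate: regress h(x) on the controls by OLS;
# the fitted intercept is then the control-variate estimator of the integral
design = np.column_stack([np.ones(n), controls])
coef, *_ = np.linalg.lstsq(design, h(x), rcond=None)
cv = coef[0]

print(f"plain MC: {mc:.6f}   with controls: {cv:.6f}   truth: {np.e - 1:.6f}")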

variational Bayes for variable selection

Posted in Books, Statistics, University life on March 30, 2016 by xi'an

[Lake Agnes, Canadian Rockies, July 2007]

Xichen Huang, Jin Wang and Feng Liang have recently arXived a paper where they rely on variational Bayes in conjunction with a spike-and-slab prior modelling. This actually stems from an earlier paper by Carbonetto and Stephens (2012), the difference being in the implementation of the method, which is less Gibbs-like in the current paper. The approach is not fully Bayesian in that not only is an approximate (variational) representation used for the parameters of interest (regression coefficients and presence-absence indicators), but the nuisance parameters are also replaced with MAP estimates. The variational approximation on the regression parameters is an independent product of spike-and-slab distributions. The authors show the approximate approach is consistent in both frequentist and Bayesian terms (under identifiability assumptions). The method is undoubtedly faster than MCMC, since it shares many features with EM, but I still wonder at the Bayesian interpretability of the outcome, which writes out as a product of estimated spike-and-slab mixtures. First, the weights in the mixtures are estimated by EM, hence fixed. Second, the fact that the variational approximation is a product is confusing in that the posterior distribution on the regression coefficients is unlikely to exhibit posterior independence.
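For readers curious about what such a variational approximation looks like in practice, here is a minimal sketch of mean-field coordinate-ascent updates for a linear regression with a spike-and-slab prior, in the spirit of Carbonetto and Stephens (2012); the noise variance, slab variance and prior inclusion probability are taken as known here, whereas the papers estimate the nuisance parameters (by MAP or EM).

import numpy as np

def vb_spike_slab(X, y, sigma2=1.0, slab_var=1.0, prior_incl=0.1, iters=100):
    """Mean-field variational approximation for y = X b + e with a
    spike-and-slab prior on each coefficient b_j (coordinate ascent)."""
    n, p = X.shape
    xtx = (X ** 2).sum(axis=0)                 # X_j' X_j for each column
    s2 = sigma2 / (xtx + sigma2 / slab_var)    # variances of q(b_j | included)
    mu = np.zeros(p)                           # means of q(b_j | included)
    alpha = np.full(p, prior_incl)             # inclusion probabilities q(g_j = 1)
    r = X @ (alpha * mu)                       # current fitted values
    logit_prior = np.log(prior_incl / (1 - prior_incl))
    for _ in range(iters):
        for j in range(p):
            r -= X[:, j] * (alpha[j] * mu[j])          # leave coefficient j out
            mu[j] = s2[j] / sigma2 * X[:, j] @ (y - r)
            logit = (logit_prior
                     + 0.5 * np.log(s2[j] / slab_var)
                     + 0.5 * mu[j] ** 2 / s2[j])
            alpha[j] = 1.0 / (1.0 + np.exp(-logit))
            r += X[:, j] * (alpha[j] * mu[j])          # put it back
    return alpha, mu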

at CIRM [#2]

Posted in Mountains, pictures, Running, Statistics, Travel, University life on March 2, 2016 by xi'an

Sylvia Richardson gave a great talk yesterday on clustering applied to variable selection, which first raised [in me] the usual worry about the lack of a background model for clustering. But the way she used this notion meant there was an infinite Dirichlet process mixture model behind it. This is quite novel [at least for me!] in that it addresses the covariates and not the observations themselves. I still wonder at the meaning of the clusters as, if I understood properly, the dependent variable is not involved in the clustering. Check her R package PReMiuM for a practical implementation of the approach. Later, Adeline Samson showed us the results of using pMCMC versus particle Gibbs for diffusion processes, where (a) pMCMC was behaving much worse than particle Gibbs and (b) EM required very few particles and Metropolis-Hastings steps to achieve convergence, when compared with posterior approximations.

Today Pierre Druilhet explained to the audience of the summer school his measure theoretic approach [I discussed a while ago] to the limit of proper priors via q-vague convergence, with the paradoxical phenomenon that a Be(n⁻¹,n⁻¹) converges to a sum of two Dirac masses when the parameter space is [0,1] but to Haldane’s prior when the space is (0,1)! He also explained why the Jeffreys-Lindley paradox vanishes when considering different measures [with an illustration that came from my Statistica Sinica 1993 paper]. Pierre concluded with the above opposition between two Bayesian paradigms, a [sort of] tale of two sigma [fields]! Not that I necessarily agree with the first paradigm that priors are supposed to have generated the actual parameter. If only because it mechanistically excludes all improper priors…
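As a quick numerical check of the first phenomenon (and only of that, not of the q-vague construction itself), the mass of a Be(n⁻¹,n⁻¹) distribution does pile up near 0 and 1 as n grows, hinting at the limiting sum of two Dirac masses on [0,1]:

import numpy as np
from scipy import stats

eps = 1e-3
for n in (1, 10, 100, 1000):
    a = 1.0 / n
    # P( X < eps or X > 1 - eps ) for X ~ Be(1/n, 1/n)
    mass_at_edges = stats.beta.cdf(eps, a, a) + stats.beta.sf(1 - eps, a, a)
    print(f"n = {n:5d}: mass within {eps} of {{0,1}}: {mass_at_edges:.3f}")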

Darren Wilkinson talked about yeast, which is orders of magnitude more exciting than it sounds, because this is Bayesian big data analysis in action! With significant (and hence impressive) results based on stochastic dynamic models. And massive variable selection techniques. Scala, Haskell, Frege, OCaml were [functional] languages he mentioned that I had never heard of before! And Daniel Rudolf concluded the [intense] second day of this Bayesian week at CIRM with a description of his convergence results for (rather controlled) noisy MCMC algorithms.

projection predictive input variable selection

Posted in Books, Statistics, University life on November 2, 2015 by xi'an

Juho Piironen and Aki Vehtari just arXived a paper on variable selection that relates to two projection papers we wrote in the 1990s with Costas Goutis (who died near Seattle in a diving accident in July 1996) and Jérôme Dupuis… Except that they move to the functional space of Gaussian processes. The covariance function in a Gaussian process is indeed based on a distance between observations, which are themselves defined as vectors of inputs, some of which matter for the kernel value and some of which do not. When rescaling the distance with “length-scales” for all variables, one could think that non-significant variates have very small scales and hence bypass the need for variable selection, but this is not the case as those coefficients react poorly to non-linearities in the variates… The paper thus builds a projective structure from a reference model involving all input variables.
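For illustration, here is a minimal sketch of a squared-exponential covariance function with one length-scale per input (automatic relevance determination), which is the kind of rescaled distance alluded to above; the kernel choice and the numbers are purely illustrative, not the paper's setup.

import numpy as np

def ard_se_kernel(X1, X2, lengthscales, sigma_f=1.0):
    """Squared-exponential kernel with one length-scale per input dimension:
    k(x, x') = sigma_f^2 * exp(-0.5 * sum_d (x_d - x'_d)^2 / l_d^2)."""
    Z1 = X1 / lengthscales          # rescale each input by its length-scale
    Z2 = X2 / lengthscales
    sq = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f ** 2 * np.exp(-0.5 * sq)

# an input whose relevance is negligible (very large length-scale in this
# parameterisation) barely affects the kernel value at all
X = np.random.default_rng(1).normal(size=(5, 3))
K = ard_se_kernel(X, X, lengthscales=np.array([1.0, 1.0, 100.0]))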

“…adding some irrelevant inputs is not disastrous if the model contains a sparsifying prior structure, and therefore, one can expect to lose less by using all the inputs than by trying to differentiate between the relevant and irrelevant ones and ignoring the uncertainty related to the left-out inputs.”

While I of course appreciate this avatar of our original idea (with some borrowing from McCulloch and Rossi, 1992), the paper reminds me of some of the discussions and doubts we had about the role of the reference or super model that “anchors” the projections, as there is no reason for that reference model to be a better one. An iterative process where the selected submodel becomes the reference for the next iteration could enjoy better performance. When I first presented this work in Cagliari, in the late 1990s, one comment was that the method had no theoretical guarantee like consistency, which is correct if the minimum distance does not evolve (how quickly?!) with the sample size n. I also remember the difficulty Jérôme and I had in figuring out a manageable forward-backward exploration of the (huge) set of acceptable subsets of variables. Random walk exploration and RJMCMC are unlikely to solve this problem.

O’Bayes 2015 [day #3]

Posted in Statistics, Travel, University life, Wines on June 5, 2015 by xi'an

The third day of the meeting was a good illustration of the diversity of the themes [says a member of the scientific committee!], from “traditional” O’Bayes talks on reference priors by the father of all reference priors (!), José Bernardo, re-examinations of expected posterior priors, properties of Bayes factors, or new versions of the Lindley-Jeffreys paradox, to the radically different approach of Simpson et al. presented by Håvard Rue. I was obviously most interested in expected posterior priors!, with the new notion brought in by Dimitris Fouskakis, Ioannis Ntzoufras and David Draper of a lower impact of the minimal sample on the resulting prior, achieved by the trick of a lower (than one) power of the likelihood. Since this change seemed to go beyond the “minimal” in minimal sample size, I am somewhat puzzled that this can be achieved, but the normal example shows this is indeed possible. The next difficulty is then in calibrating this power, as I do not see any intuitive justification for a specific power. The central talk of the day was in my opinion Håvard’s, as it challenged most tenets of the Objective Bayes approach, presented in a most eager tone, even though it did not generate particularly heated comments from the audience. I have already discussed here an earlier version of this paper and I keep on thinking this proposal for PC priors is a major breakthrough in the way we envision priors and their derivation. I was thus sorry to hear the paper had not been selected as a Read Paper by the Royal Statistical Society, as it would have nicely suited an open discussion, but I hope it will find another outlet that allows for a discussion! As an aside, Håvard discussed the case of a Student’s t degree of freedom as particularly challenging for prior construction, albeit I would have instead analysed the problem from a model choice perspective (on an unusually continuous space of models).

As this conference day had a free evening, I took the tram with friends to the town beach and we had a fantastic [if hurried] dinner in a small bodega [away from the uninspiring beach front] called Casa Montaña, a place decorated with huge barrels, offering amazing tapas and wines, a perfect finale to my Spanish trip. Too bad we had to vacate the dinner room for the next batch of customers…


projective covariate selection

Posted in Mountains, pictures, Statistics, Travel, University life on October 28, 2014 by xi'an

While I was in Warwick, Dan Simpson [newly arrived from Norway on a postdoc position] mentioned to me he had attended a talk by Aki Vehtari in Norway where my early work with Jérôme Dupuis on projective priors was used. He gave me the link to this paper by Peltola, Havulinna, Salomaa and Vehtari that indeed refers to the idea that a prior on a given Euclidean space defines priors by projections on all subspaces, despite the zero measure of all those subspaces. (This notion first appeared in a joint paper with my friend Costas Goutis, who alas died in a diving accident a few months later.) The projection further allowed for a simple expression of the Kullback-Leibler deviance between the corresponding models and for a Pythagorean theorem on the additivity of the deviances between embedded models. The weakest spot of this approach of ours was, in my opinion and unsurprisingly, deciding when a submodel was too far from the full model. The lack of explanatory power introduced therein had no absolute scale, and later discussions led me to think that the bound should depend on the sample size to ensure consistency. (The recent paper by Nott and Leng that expanded on this projection has now appeared in CSDA.)
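To give the flavour of the geometry, here is a small sketch in the Gaussian linear model with known variance, where projecting the reference fit onto a submodel reduces to least squares, the Kullback-Leibler deviance is a scaled squared distance between fits, and the deviances of nested submodels add up; this is only an illustration of the Pythagorean property, not the general construction of the papers.

import numpy as np

rng = np.random.default_rng(2)
n, sigma2 = 200, 1.0
X = rng.normal(size=(n, 4))                      # full set of covariates
y = X @ np.array([2.0, -1.0, 0.0, 0.0]) + rng.normal(scale=np.sqrt(sigma2), size=n)

def proj_fit(X_sub, target):
    """Least-squares projection of `target` onto the span of X_sub's columns."""
    coef, *_ = np.linalg.lstsq(X_sub, target, rcond=None)
    return X_sub @ coef

def kl_gauss(mu1, mu2, sigma2):
    """KL divergence between N(mu1, sigma2 I) and N(mu2, sigma2 I)."""
    return ((mu1 - mu2) ** 2).sum() / (2 * sigma2)

full_fit = proj_fit(X, y)                        # reference (full) model fit
fit_12 = proj_fit(X[:, :2], full_fit)            # submodel with covariates {1,2}
fit_1 = proj_fit(X[:, :1], full_fit)             # nested submodel with covariate {1}

# Pythagorean additivity for nested submodels {1} within {1,2} within the full model
lhs = kl_gauss(full_fit, fit_1, sigma2)
rhs = kl_gauss(full_fit, fit_12, sigma2) + kl_gauss(fit_12, fit_1, sigma2)
print(lhs, rhs)   # the two numbers agree up to numerical error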

“Specifically, the models with subsets of covariates are found by maximizing the similarity of their predictions to this reference as proposed by Dupuis and Robert [12]. Notably, this approach does not require specifying priors for the submodels and one can instead focus on building a good reference model. Dupuis and Robert (2003) suggest choosing the size of the covariate subset based on an acceptable loss of explanatory power compared to the reference model. We examine using cross-validation based estimates of predictive performance as an alternative.” T. Peltola et al.

The paper also connects with the Bayesian Lasso literature, concluding on the horseshoe prior being more informative than the Laplace prior. It applies the selection approach to identify biomarkers with predictive performance in a study of diabetic patients. The authors rank models according to their (log) predictive density at the observed data, using cross-validation to avoid exploiting the data twice. On the MCMC front, the paper implements the NUTS version of HMC with Stan.
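As an illustration of the ranking criterion (and only of that), here is a minimal sketch of scoring covariate subsets by a cross-validated log predictive density, using plain K-fold splits and Gaussian plug-in predictive densities instead of the full Bayesian (NUTS/Stan) machinery of the paper.

import numpy as np
from scipy import stats

def cv_log_pred_density(X, y, folds=5, seed=0):
    """K-fold cross-validated log predictive density of a Gaussian linear
    model, with plug-in (least-squares) estimates in place of the posterior."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    total = 0.0
    for fold in np.array_split(idx, folds):
        train = np.setdiff1d(idx, fold)
        coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        resid = y[train] - X[train] @ coef
        sigma = resid.std(ddof=X.shape[1])
        total += stats.norm.logpdf(y[fold], X[fold] @ coef, sigma).sum()
    return total

# rank candidate covariate subsets by their held-out log predictive density
rng = np.random.default_rng(3)
X = rng.normal(size=(150, 3))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=150)
subsets = {"{1}": [0], "{1,2}": [0, 1], "{1,2,3}": [0, 1, 2]}
scores = {name: cv_log_pred_density(X[:, cols], y) for name, cols in subsets.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))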

top model choice week (#3)

Posted in Statistics, University life on June 19, 2013 by xi'an

[La Défense and Maison-Lafitte from my office, Université Paris-Dauphine, Nov. 05, 2011]

To conclude this exciting week, there will be a final seminar by Veronika Rocková (Erasmus University) on Friday, June 21, at 11am at ENSAE in Room 14. Here is her abstract:

11am: Fast Dynamic Posterior Exploration for Factor Augmented Multivariate Regression by Veronika Rockova

Advancements in high-throughput experimental techniques have facilitated the availability of diverse genomic data, which provide complementary information regarding the function and organization of gene regulatory mechanisms. The massive accumulation of data has increased demands for more elaborate modeling approaches that combine the multiple data platforms. We consider a sparse factor regression model, which augments the multivariate regression approach by adding a latent factor structure, thereby allowing for dependent patterns of marginal covariance between the responses. In order to enable the identification of parsimonious structure, we impose spike and slab priors on the individual entries in the factor loading and regression matrices. The continuous relaxation of the point mass spike and slab enables the implementation of a rapid EM inferential procedure for dynamic posterior model exploration. This is accomplished by considering a nested sequence of spike and slab priors and various factor space cardinalities. Identified candidate models are evaluated by a conditional posterior model probability criterion, permitting trans-dimensional comparisons. Patterned sparsity manifestations such as an orthogonal allocation of zeros in factor loadings are facilitated by structured priors on the binary inclusion matrix. The model is applied to a problem of integrating two genomic datasets, where expression of microRNAs is related to the expression of genes with an underlying connectivity pathway network.
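As a side note on the continuous spike-and-slab idea appearing in the abstract, here is a small sketch of the conditional (E-step) probability that a coefficient comes from the slab rather than the spike, for a mixture of two Laplace densities; the specific densities and values are illustrative and not taken from the talk.

import numpy as np

def laplace_pdf(beta, lam):
    """Density of a Laplace(0, 1/lam) distribution, (lam / 2) * exp(-lam * |beta|)."""
    return 0.5 * lam * np.exp(-lam * np.abs(beta))

def slab_probability(beta, lam_spike=20.0, lam_slab=1.0, theta=0.5):
    """Conditional probability that `beta` was drawn from the slab component
    of the continuous spike-and-slab prior
        (1 - theta) * Laplace(lam_spike) + theta * Laplace(lam_slab),
    i.e. the E-step weight in an EM-type posterior exploration."""
    slab = theta * laplace_pdf(beta, lam_slab)
    spike = (1 - theta) * laplace_pdf(beta, lam_spike)
    return slab / (slab + spike)

# small coefficients are attributed to the spike, larger ones to the slab
print(slab_probability(np.array([0.01, 0.1, 0.5, 2.0])))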