Archive for regularisation

efficient MCMC sampling

Posted in Statistics with tags , , , on June 24, 2019 by xi'an

Maxime Vono, Daniel Paulin and Arnaud Doucet recently arXived a paper about a regularisation technique that allows for efficient sampling from a complex posterior which potential function factorises as a large sum of transforms of linear projections of the parameter θ

U(\theta)=\sum_i U_i(A_i\theta)

The central idea in the paper [which was new to me] is to introduce auxiliary variates for the different terms in the sum, replacing the projections in the transforms, with an additional regularisation forcing these auxiliary variates to be as close as possible from the corresponding projection

U(\theta,\mathbf z)=\sum_i U_i(z_i)+\varrho^{-1}||z_i-A_i\theta||^2

This is only an approximation to the true target but it enjoys the possibility to run a massive Gibbs sampler in quite a reduced dimension. As the variance ρ of the regularisation term goes to zero the marginal posterior on the parameter θ converges to the true posterior. The authors manage to achieve precise convergence rates both in total variation and in Wasserstein distance.

From a practical point of view, only judging from the logistic example, it is hard to fathom how much this approach improves upon other approaches (provided they still apply) as the impact of the value of ρ should be assessed on top of the convergence of the high-dimensional Gibbs sampler. Or is there an annealing version in the pipe-line? While parallelisation is a major argument, it also seems that the Gibbs sampler need a central monitoring for each new simulation of θ. Unless some asynchronous version can be implemented.

JSM 2018 [#4½]

Posted in Statistics, University life with tags , , , , , , , , on August 10, 2018 by xi'an

As I wrote my previous blog entry on JSM2018 before the sessions, I did not have the chance to comment on our mixture session, which I found most interesting!, with new entries on the topic and a great discussion by Bettina Grün. Including the important call for linking weights with the other parameters, as both groups being independent does not make sense when the number of components is uncertain. (Incidentally our paper with Kaniav kamary and Kate Lee does create a dependence.) The talk by Deborah Kunkel was about anchored mixture estimation, a joint work with Mario Peruggia, another arXival that I had missed.

The notion of anchoring found in this paper is to allocate specific observations to specific components. These observations are thus anchored to these components. Among other things, this modification of the sampling model implies a removal of the unidentifiability problem. Hence formally of the label-switching or lack thereof issue. (Although, as Peter Green repeatedly mentioned, visualising the parameter space as a point process eliminates the issue.) This idea is somewhat connected with the constraint Jean Diebolt and I imposed in our 1990 mixture paper, namely that no component would have less than two observations allocated to it, but imposing which ones are which of course reduces drastically the complexity of the model. Another (related) aspect of anchoring is that the observations that are anchored to the components act as parts of the prior model, modifying the initial priors (which can then become improper as in our 1990 paper). The difficulty of the anchoring approach is to find observations to anchor in an unsupervised setting. The paper proceeds by optimising the allocations, which somewhat turns the prior into a data-dependent prior since all observations are used to set the anchors and then used again for the standard Bayesian processing. In that respect, I would rather follow the sequential procedure developed by Nicolas Chopin and Florian Pelgrin, where the number of components grows by steps with the number of observations.


JSM 2018 [#1]

Posted in Mountains, Statistics, Travel, University life with tags , , , , , , , , , , on July 30, 2018 by xi'an

As our direct flight from Paris landed in the morning in Vancouver,  we found ourselves in the unusual situation of a few hours to kill before accessing our rental and where else better than a general introduction to deep learning in the first round of sessions at JSM2018?! In my humble opinion, or maybe just because it was past midnight in Paris time!, the talk was pretty uninspiring in missing the natural question of the possible connections between the construction of a prediction function and statistics. Watching improving performances at classifying human faces does not tell much more than creating a massively non-linear function in high dimensions with nicely designed error penalties. Most of the talk droned about neural networks and their fitting by back-propagation and the variations on stochastic gradient descent. Not addressing much rather natural (?) questions about choice of functions at each level, of the number of levels, of the penalty term, or regulariser, and even less the reason why no sparsity is imposed on the structure, despite the humongous number of parameters involved. What came close [but not that close] to sparsity is the notion of dropout, which is a sort of purely automated culling of the nodes, and which was new to me. More like a sort of randomisation that turns the optimisation criterion in an average. Only at the end of the presentation more relevant questions emerged, presenting unsupervised learning as density estimation, the pivot being the generative features of (most) statistical models. And GANs of course. But nonetheless missing an explanation as to why models with massive numbers of parameters can be considered in this setting and not in standard statistics. (One slide about deterministic auto-encoders was somewhat puzzling in that it seemed to repeat the “fiducial mistake”.)

ISBA 2016 [#5]

Posted in Mountains, pictures, Running, Statistics, Travel with tags , , , , , , , , , , , , , on June 18, 2016 by xi'an

from above Forte Village, Santa Magherita di Pula, Sardinia, June 17, 2016On Thursday, I started the day by a rather masochist run to the nearby hills, not only because of the very hour but also because, by following rabbit trails that were not intended for my size, I ended up being scratched by thorns and bramble all over!, but also with neat views of the coast around Pula.  From there, it was all downhill [joke]. The first morning talk I attended was by Paul Fearnhead and about efficient change point estimation (which is an NP hard problem or close to). The method relies on dynamic programming [which reminded me of one of my earliest Pascal codes about optimising a dam debit]. From my spectator’s perspective, I wonder[ed] at easier models, from Lasso optimisation to spline modelling followed by testing equality between bits. Later that morning, James Scott delivered the first Bayarri Lecture, created in honour of our friend Susie who passed away between the previous ISBA meeting and this one. James gave an impressive coverage of regularisation through three complex models, with the [hopefully not degraded by my translation] message that we should [as Bayesians] focus on important parts of those models and use non-Bayesian tools like regularisation. I can understand the practical constraints for doing so, but optimisation leads us away from a Bayesian handling of inference problems, by removing the ascertainment of uncertainty…

Later in the afternoon, I took part in the Bayesian foundations session, discussing the shortcomings of the Bayes factor and suggesting the use of mixtures instead. With rebuttals from [friends in] the audience!

This session also included a talk by Victor Peña and Jim Berger analysing and answering the recent criticisms of the Likelihood principle. I am not sure this answer will convince the critics, but I won’t comment further as I now see the debate as resulting from a vague notion of inference in Birnbaum‘s expression of the principle. Jan Hannig gave another foundation talk introducing fiducial distributions (a.k.a., Fisher’s Bayesian mimicry) but failing to provide a foundational argument for replacing Bayesian modelling. (Obviously, I am definitely prejudiced in this regard.)

The last session of the day was sponsored by BayesComp and saw talks by Natesh Pillai, Pierre Jacob, and Eric Xing. Natesh talked about his paper on accelerated MCMC recently published in JASA. Which surprisingly did not get discussed here, but would definitely deserve to be! As hopefully corrected within a few days, when I recoved from conference burnout!!! Pierre Jacob presented a work we are currently completing with Chris Holmes and Lawrence Murray on modularisation, inspired from the cut problem (as exposed by Plummer at MCMski IV in Chamonix). And Eric Xing spoke about embarrassingly parallel solutions, discussed a while ago here.

kernel approximate Bayesian computation for population genetic inferences

Posted in Statistics, University life with tags , , , , on May 22, 2012 by xi'an

A new posting about ABC on arXiv by Shigeki Nakagome, Kenji Fukumizu, and Shuhei Mano entitled kernel approximate Bayesian computation for population genetic inferences argues about an improvement brought by the use of reproducing kernel Hilbert space (RKHS) perspective in ABC methodology, when compared with more standard ABC relying on a rather arbitrary choice of summary statistics and metric. However, I feel that the paper does not substantially defend this point, only using a simulation experiment to compare mean square errors. In particular, the claim of consistency is unsubstantiated, as is the counterpoint that “conventional ABC did not have consistency” (page 14) [and several papers, including the just published Read Paper by Fearnhead and Prangle, claim the opposite]. Furthermore, a considerable amount of space is taken in the paper by the description of the existing ABC algorithms, while the complete version of the new kernel ABC-RKHS algorithm is missing. In particular, the coverage of kernel Bayes is too sketchy to be comprehensible [at least to me] without additional study. Actually, I do not get the notion of kernel Bayes’ rule, which seems defined only in terms of expectations

\mathbb{E}[f(\theta)|s]=\sum_i w_i f(\theta_i),

where the weights are the ridge-like matrix

w_i=\sum_j (\mathbf{G}_S + n\epsilon_n \mathbf{I}_n)^{-1}_{ij}k(s_i,s_j)

where the parameter is generated from the prior, the data s is generated from the sampling distribution, and the matrix GS is made of the k(si,sj)‘s. The surrounding Hilbert space presentation does not seem particularly relevant, esp. in population genetics… I am also under the impression that the choice of the kernel function k(.,.) is as important as the choice of the metric in regular ABC, although this is not discussed in the paper, since it implies [among other things] the choice of a metric. The implementation uses a Gaussian kernel and an Euclidean metric, which involves assumptions on the homogeneous nature of the components of the summary statistics or of the data. Similarly, the “regularization” parameter εn needs to be calibrated and the paper is unclear about this, apparently picking the parameter that “showed the smallest MSEs” (page 10), which cannot be called a calibration. (There is a rather unimportant proposition about concentration of information on page 6 which proof relies on two densities being ordered, see top of page 7.)