## sampling and imbalanced

Deborshee Sen, Matthias Sachs, Jianfeng Lu and David Dunson have recently arXived a sub-sampling paper for classification (logistic) models where some covariates or some responses are imbalanced. It relies on a PDMP, namely the zig-zag sampler, to preserve the correct invariant distribution (as already mentioned in an earlier post on the zig-zag sampler and in a recent Annals paper by Joris Bierkens, Paul Fearnhead, and Gareth Roberts (Warwick)). The current paper is thus an improvement on the above, using (non-uniform) importance sub-sampling across observations and simpler upper bounds for the Poisson process, which makes for a rather practical form of Poisson thinning. It also proposes unbiased estimates of the sub-sample log-posterior as well as stratified sub-sampling.
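To make the thinning step concrete, here is a minimal Python sketch of my own, not the authors' algorithm. It assumes a one-dimensional logistic regression with a flat prior, made-up data `x` and `y`, and importance weights proportional to |x_j|, which is only one possible non-uniform choice. All function names are hypothetical. It compares the constant bound obtained under uniform sub-sampling with the one obtained under these weights, then simulates a first switching time by thinning.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def subsampled_rate(xi, theta, x, y, j, p_j):
    """Switching-rate estimate from observation j alone (flat prior assumed)."""
    grad_j = x[j] * (sigmoid(x[j] * xi) - y[j])  # j-th term of d/dxi of -log-likelihood
    return max(0.0, theta * grad_j / p_j)        # importance-weighted, positive part


def bounds(x):
    """Constant upper bounds on subsampled_rate under two sub-sampling schemes."""
    abs_x = [abs(v) for v in x]
    uniform = len(x) * max(abs_x)   # p_j = 1/n
    importance = sum(abs_x)         # p_j proportional to |x_j| (assumed weights)
    return uniform, importance


def first_switch_time(xi, theta, x, y, probs, bound, rng, t_max=100.0):
    """Poisson thinning: propose at constant rate `bound`, accept w.p. rate/bound."""
    t = 0.0
    while t < t_max:
        t += rng.expovariate(bound)                        # next candidate time
        j = rng.choices(range(len(x)), weights=probs)[0]   # sub-sample one observation
        rate = subsampled_rate(xi + theta * t, theta, x, y, j, probs[j])
        if rng.random() * bound < rate:                    # thinning acceptance step
            return t
    return math.inf  # no switch before t_max


# Toy imbalanced covariate: one large value dominates (made-up data).
x = [0.1, 0.2, 0.1, 5.0]
y = [0, 0, 1, 1]
uniform_bound, importance_bound = bounds(x)
probs = [abs(v) / importance_bound for v in x]
rng = random.Random(0)
t_switch = first_switch_time(0.0, 1.0, x, y, probs, importance_bound, rng)
```

On this toy example the uniform bound is 20 against about 5.4 for the weighted one, so far fewer candidate times are proposed only to be rejected. The gap widens as the covariate gets more imbalanced, since the uniform bound scales with the largest |x_j| times the sample size.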

I idly wondered if the zig-zag sampler could itself be improved by not switching the bouncing directions at random, since directions associated with almost certainly null coefficients should be neglected as much as possible, but the intensity functions associated with the directions do incorporate this feature. The catch is that this requires computing the intensities for all directions, which weighs most heavily when facing many covariates.

Thinking of the logistic regression model itself, it is sort of frustrating that something so close to an exponential family causes so many headaches! Formally, it is an exponential family but the normalising constant is rather unwieldy, especially when there are many observations and many covariates. The Polya-Gamma completion is a way around this, but it proves highly costly when the dimension is large…
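To spell out the exponential family remark, in notation that is mine rather than the paper's (responses $y_i\in\{0,1\}$, covariate vectors $x_i\in\mathbb{R}^p$, coefficient vector $\beta$), the logistic likelihood can be written as

$$
\prod_{i=1}^n \frac{\exp(y_i x_i^\top\beta)}{1+\exp(x_i^\top\beta)}
= \exp\Big\{ \beta^\top \sum_{i=1}^n y_i x_i \;-\; \sum_{i=1}^n \log\big(1+e^{x_i^\top\beta}\big) \Big\}
$$

The sufficient statistic $\sum_i y_i x_i$ is computed once, but the second sum, which plays the role of the log-normalising term, involves every covariate vector and requires $n$ inner products in dimension $p$ each time $\beta$ changes.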

June 22, 2019 at 12:38 pm

Nicholas Galbraith, a former MSc student at Oxford and now a PhD student at Columbia, developed an “informed” subsampling version of Zig-Zag for logistic regression in his Master's thesis. He also did a few comparisons to the bouncy particle sampler. The thesis is on my website: http://www.stats.ox.ac.uk/~doucet/Galbraith_MScThesis.pdf

It’d be interesting to compare the method presented in this paper to those methods.

June 22, 2019 at 4:39 am

I think the Polya-Gamma completion will have the same problem, no? If the data are imbalanced, isn't the problem fundamental to all latent variable algorithms?