Amandine Shreck along with her co-authors Gersende Fort, Sylvain LeCorff, and Eric Moulines, all from Telecom Paristech, has undertaken to revisit the problem of large p small n variable selection. The approach they advocate mixes Langevin algorithms with trans-model moves with shrinkage thresholding. The corresponding Markov sampler is shown to be geometrically ergodic, which may be a première in that area. The paper was arXived in December but I only read it on my flight to Calgary, not overly distracted by the frozen plains of Manitoba and Saskatchewan. Nor by my neighbour watching Hunger Games II.)
A shrinkage-thresholding operator is defined as acting on the regressor matrix towards producing sparse versions of this matrix. (I actually had trouble picturing the model until Section 2.2 where the authors define the multivariate regression model, making the regressors a matrix indeed. With a rather unrealistic iid Gaussian noise. And with an unknown number of relevant rows, hence a varying dimension model. Note that this is a strange regression in that the regression coefficients are known and constant across all models.) Because the Langevin algorithm requires a gradient to operate, the log target is divided between a differentiable and a non-differentiable parts, the later accommodating the Dirac masses in the dominating measure. The new MALA moves involve applying the above shrinkage-thresholding operator to a regular Langevin proposal, hence moving to sub-spaces and sparser representations.
The thresholding functions are based on positive part operators, which means that the Markov chain does not visit some neighbourhoods of zero in the embedding and in the sparser spaces. In other words, the proposal operates between models of varying dimensions without further ado because the point null hypotheses are replaced with those neighbourhoods. Hence it is not exactly simulating from the “original” posterior, which may be a minor caveat or not. Not if defining the neighbourhoods is driven by an informed or at least spelled-out choice of a neighbourhood of zero where the coefficients are essentially identified with zero. The difficulty is then in defining how close is close enough. Especially since the thresholding functions seem to all depend on a single number which does not depend on the regressor matrix. It would be interesting to see if the g-prior version could be developed as well… Actually, I would have also included a dose of g-prior in the Langevin move, rather than using an homogeneous normal noise.
The paper contains a large experimental part where the performances of the method are evaluated on various simulated datasets. It includes a comparison with reversible jump MCMC, which slightly puzzles me: (a) I cannot see from the paper whether or not the RJMCMC is applied to the modified (thresholded) posterior, as a regular RJMCMC would not aim at the same target, but the appendix does not indicate a change of target; (b) the mean error criterion for which STMALA does better than RJMCMC is not defined, but the decrease of this criterion along iterations seems to indicate that convergence has not yet occured, since it does not completely level up after 3 10⁵ iterations.
I must have mentioned it in another earlier post, but I find somewhat ironical to see those thresholding functions making a comeback after seeing the James-Stein and smooth shrinkage estimators taking over the then so-called pre-test versions in the 1970′s (Judge and Bock, 1978) and 1980′s. There are obvious reasons for this return, moving away from quadratic loss being one.