## shrinkage-thresholding MALA for Bayesian variable selection

**A**mandine Schreck, along with her co-authors Gersende Fort, Sylvain Le Corff, and Eric Moulines, all from Télécom ParisTech, has undertaken to revisit the problem of large p, small n variable selection. The approach they advocate mixes Langevin algorithms and trans-model moves with shrinkage thresholding. The corresponding Markov sampler is shown to be geometrically ergodic, which may be a première in that area. The paper was arXived in December but I only read it on my flight to Calgary, not overly distracted by the frozen plains of Manitoba and Saskatchewan. (Nor by my neighbour watching Hunger Games II.)

**A** shrinkage-thresholding operator is defined as acting on the regressor matrix towards producing sparse versions of this matrix. (I actually had trouble picturing the model until Section 2.2, where the authors define the multivariate regression model, making the regressors a matrix indeed. With a rather unrealistic iid Gaussian noise. And with an unknown number of relevant rows, hence a varying-dimension model. Note that this is a strange regression in that the regression coefficients are known and constant across all models.) Because the Langevin algorithm requires a gradient to operate, the log target is split into a differentiable part and a non-differentiable part, the latter accommodating the Dirac masses in the dominating measure. The new MALA moves involve applying the above shrinkage-thresholding operator to a regular Langevin proposal, hence moving to sub-spaces and sparser representations.
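The move just described can be sketched in a few lines. This is a minimal illustration of the idea, not the paper's implementation: the smooth-part gradient `grad_log_smooth`, the step size `sigma`, and the thresholding level `lam` are placeholder choices, and a row-wise soft-thresholding stands in for the shrinkage-thresholding operators of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_threshold_rows(x, lam):
    """Row-wise soft-thresholding: shrink each row towards zero and
    set rows with norm below lam exactly to zero (sparsity)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return scale * x

def stmala_proposal(B, grad_log_smooth, sigma, lam):
    """One shrinkage-thresholded Langevin proposal (illustrative).

    B: current coefficient matrix (p x q),
    grad_log_smooth: gradient of the differentiable part of the log target,
    sigma: Langevin step size, lam: thresholding level.
    """
    drift = B + 0.5 * sigma**2 * grad_log_smooth(B)
    langevin = drift + sigma * rng.standard_normal(B.shape)
    # thresholding the Langevin draw lands exactly on sparse sub-spaces
    return soft_threshold_rows(langevin, lam)

# toy example: standard Gaussian smooth part, small initial coefficients
grad = lambda B: -B  # gradient of -||B||^2/2
B0 = 0.1 * rng.standard_normal((10, 3))
prop = stmala_proposal(B0, grad, sigma=0.1, lam=0.5)
```

Accepting or rejecting such a proposal then requires the Metropolis-Hastings ratio with the non-symmetric proposal kernel, which is where the measure-theoretic care of the paper (Dirac masses at zero rows) comes in.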

**T**he thresholding functions are based on positive-part operators, which means that the Markov chain does not visit some neighbourhoods of zero in the embedding space and in the sparser spaces. In other words, the proposal operates between models of varying dimensions without further ado because the point null hypotheses are replaced with those neighbourhoods. Hence it is *not* exactly simulating from the “original” posterior, which may or may not be a minor caveat. Not if defining the neighbourhoods is driven by an informed, or at least spelled-out, choice of a neighbourhood of zero where the coefficients are essentially identified with zero. The difficulty is then in deciding how close is close enough, especially since the thresholding functions all seem to depend on a single number which does not depend on the regressor matrix. It would be interesting to see if a g-prior version could be developed as well… Actually, I would also have included a dose of g-prior in the Langevin move, rather than using a homogeneous normal noise.

**T**he paper contains a large experimental part where the performance of the method is evaluated on various simulated datasets. It includes a comparison with reversible jump MCMC, which slightly puzzles me: (a) I cannot see from the paper whether or not the RJMCMC is applied to the modified (thresholded) posterior, as a regular RJMCMC would not aim at the same target, and the appendix does not indicate a change of target; (b) the mean error criterion for which STMALA does better than RJMCMC is not defined, and the decrease of this criterion along iterations seems to indicate that convergence has not yet occurred, since it does not completely level off after 3·10⁵ iterations.

**I** must have mentioned it in an earlier post, but I find it somewhat ironic to see those thresholding functions making a comeback, after the James-Stein and smooth shrinkage estimators took over the then so-called pre-test versions in the 1970s (Judge and Bock, 1978) and 1980s. There are obvious reasons for this return, moving away from quadratic loss being one.

April 18, 2014 at 2:40 pm

Dear Christian,

First of all, we would like to thank you for the attention you paid to our paper and for all your fruitful remarks. We will take them into consideration when writing the next version of the article. That said, we have several comments to clarify some details.

First, you say that, because of the thresholding operator, the Shrinkage-Thresholding MALA algorithm does not sample exactly from the target distribution. This is true for the hard thresholding operator (Section 3.2), which avoids the shrinkage of all the active rows (but which cannot draw rows with a norm lower than a given threshold). The two other operators, namely the L2,1 proximal operator and the soft thresholding function, do not suffer from this flaw and propose new rows with norms as close to zero as needed. This is illustrated in Figure 1. Therefore, in the numerical section, both RJMCMC and STMALA target the right distribution.
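The distinction made here is already visible on scalars. A small sketch (with an arbitrary threshold lam = 1, not a value from the paper) shows that soft thresholding can output values arbitrarily close to zero, while hard thresholding leaves a gap of magnitudes in (0, lam] that is never reached:

```python
import numpy as np

def soft(x, lam):
    """Soft thresholding: shrink magnitudes by lam, zero out anything smaller.
    Outputs can be arbitrarily close to (but distinct from) zero."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def hard(x, lam):
    """Hard thresholding: keep values with magnitude above lam unchanged,
    zero out the rest. No output ever has magnitude in (0, lam]."""
    return np.where(np.abs(x) > lam, x, 0.0)

lam = 1.0
x = np.array([-2.0, -1.001, 0.5, 1.001, 2.0])
s = soft(x, lam)  # shrunk: the 1.001 entries map to roughly +/-0.001
h = hard(x, lam)  # kept or killed: 0.5 maps to 0, the rest are untouched
```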

In the numerical section, Figure 5 displays the error obtained when the algorithms are used to estimate the activation probabilities (defined in equation (18)). In addition, the ordinate axis of this figure (as well as that of Figure 10, for example) is on a logarithmic scale, which explains why there is no plateau in the curves even though the algorithms do converge.

Once again, many thanks for your remarks which will help us to write an improved version of the article.

Best regards,

Amandine Schreck and her co-authors.