## importance tempering and variable selection

Posted in Books, Statistics with tags , , , , , , , , on November 6, 2018 by xi'an

As reading and commenting the importance tempering for variable selection paper by Giacomo Zanella (previously Warwick) and Gareth Roberts (Warwick) has been on my to-do list for quite a while, the fact that Giacomo presented this work at CIRM Bayesian Masterclass last week was the right nudge to write this post.

The starting point for the method is to simulate from a tempered version of a Gibbs sampler, selecting the component [of the parameter vector θ] according to an importance weight that is the inverse of the conditional posterior to the complementary power. That is, the inverse of the importance weight. This approach differs from classical (MCMC) tempering in that it does not target the original distribution. Hence it produces a weighted sample, whose computing time is of the order of the dimension of θ, even though the tempered simulation of a single conditional can reduce the variance of the estimator. The method is generalisable to any collection of one-component proposal/importance distributions, with the assumption that they have fatter tails that the true conditionals. The resulting Markov chain is reversible with respect to another stationary measure made of the original distribution multiplied by the normalisation factor of the importance weights but this ensures that weighted averages converge to the right quantity. Interestingly so because the powered conditionals are not necessarily coherent from a Gibbsic perspective.

The method is applied to Bayesian [spike-and-slab] variable selection of variables, the importance selection of a subset of covariates being restricted to changing one index at a time. I did not understand first how the computation of the normalising constant avoids involving 2-to-the-power-p terms until Giacomo explained to me that the constant was only computed for conditionals. The complexity gets down from O(|γ|²) to O(|γ|p), where |γ| is the number of variables. Another question I had was about the tempering power β, which selection remains a wee bit of an art!

## rethinking the ESS

Posted in Statistics with tags , , , , , , , , , on September 14, 2018 by xi'an

Following Victor Elvira‘s visit to Dauphine, one and a half year ago, where we discussed the many defects of ESS as a default measure of efficiency for importance sampling estimators, and then some more efforts (mostly from Victor!) to formalise these criticisms, Victor, Luca Martino and I wrote a paper on this notion, now arXived. (Victor most kindly attributes the origin of the paper to a 2010 ‘Og post on the topic!) The starting thread of the (re?)analysis of this tool introduced by Kong (1992) is that the ESS used in the literature is an approximation to the “true” ESS, generally unavailable. Approximation that is pretty crude and hence impacts the relevance of using it as the assessment tool for comparing importance sampling methods. In the paper, we re-derive (with the uttermost precision) the resulting approximation and list the many assumptions that [would] validate this approximation. The resulting drawbacks are many, from the absurd property of always being worse than direct sampling, to being independent from the target function and from the sample per se. Since only importance weights matter. This list of issues is not exactly brand new, but we think it is worth signaling given the fact that this approximation has been widely used in the last 25 years, due to its simplicity, as a practical rule of thumb [!] in a wide variety of importance sampling methods. In continuation of the directions drafted in Martino et al. (2017), we also indicate some alternative notions of importance efficiency. Note that this paper does not cover the use of ESS for MCMC algorithms, where it is somewhat more legit, if still too rudimentary to really catch convergence or lack thereof! [Note: I refrained from the post title resinking the ESS…]

## nested sampling when prior and likelihood clash

Posted in Books, Statistics with tags , , , , , , , , , on April 3, 2018 by xi'an

A recent arXival by Chen, Hobson, Das, and Gelderblom makes the proposal of a new nested sampling implementation when prior and likelihood disagree, making simulations from the prior inefficient. The paper holds the position that a single given prior is used over and over all datasets that come along:

“…in applications where one wishes to perform analyses on many thousands (or even millions) of different datasets, since those (typically few) datasets for which the prior is unrepresentative can absorb a large fraction of the computational resources.” Chen et al., 2018

My reaction to this situation, provided (a) I want to implement nested sampling and (b) I realise there is a discrepancy, would be to resort to an importance sampling resolution, as we proposed in our Biometrika paper with Nicolas. Since one objection [from the authors] is that identifying outlier datasets is complicated (it should not be when the likelihood function can be computed) and time-consuming, sequential importance sampling could be implemented.

“The posterior repartitioning (PR) method takes advantage of the fact that nested sampling makes use of the likelihood L(θ) and prior π(θ) separately in its exploration of the parameter space, in contrast to Markov chain Monte Carlo (MCMC) sampling methods or genetic algorithms which typically deal solely in terms of the product.” Chen et al., 2018

The above salesman line does not ring a particularly convincing chime in that nested sampling is about as myopic as MCMC since based on the similar notion of a local proposal move, starting from the lowest likelihood argument (the minimum likelihood estimator!) in the nested sample.

“The advantage of this extension is that one can choose (π’,L’) so that simulating from π’ under the constraint L'(θ) > l is easier than simulating from π under the constraint L(θ) > l. For instance, one may choose an instrumental prior π’ such that Markov chain Monte Carlo steps adapted to the instrumental constrained prior are easier to implement than with respect to the actual constrained prior. In a similar vein, nested importance sampling facilitates contemplating several priors at once, as one may compute the evidence for each prior by producing the same nested sequence, based on the same pair (π’,L’), and by simply modifying the weight function.” Chopin & Robert, 2010

Since the authors propose to switch to a product (π’,L’) such that π’.L’=π.L, the solution appears like a special case of importance sampling, with the added drwaback that when π’ is not normalised, its normalised constant must be estimated as well. (With an extra nested sampling implementation?) Furthermore, the advocated solution is to use tempering, which is not so obvious as it seems in small dimensions. As the mass does not always diffuse to relevant parts of the space. A more “natural” tempering would be to use a subsample in the (sub)likelihood for nested sampling and keep the remainder of the sample for weighting the evaluation of the evidence.

## and another one on nested sampling

Posted in Books, Statistics with tags , , , on May 2, 2017 by xi'an

The same authors as those of the paper discussed last week arXived a paper on dynamic nested sampling.

“We propose modifying the nested sampling algorithm by dynamically varying the number of “live points” in order to maximise the accuracy of a calculation for some number of posterior sample.”

Some of the material is actually quite similar to the previous paper (to the point I had to check they were not the same paper). The authors rightly point out that the main source of variation in the nested sampling approximation is due to the Monte Carlo variability in the estimated volume of the level sets.

The main notion in that paper is that it is acceptable to have a varying number of “live” points in nested sampling provided the weights are correctly accordingly. Adding more of those points as a new “thread” in a region where the likelihood changes rapidly. Addition may occur at any level of the likelihood, in fact, and is determined  in the paper by an importance weight being in the upper tail of the importance weights… While the description is rather vague [for instance I do not get the notation in (9)] and the criteria for adding threads somewhat arbitrary, I find interesting that several passes at different precision levels can improve the efficiency of the nested approximation at a given simulation cost. Remains the issue of whether or not this is a sufficient perk for attracting users of other simulation techniques to the nested galaxy…

## importance sampling and necessary sample size

Posted in Books, Statistics with tags , , , , , on September 7, 2016 by xi'an

Daniel Sanz-Alonso arXived a note yesterday where he analyses importance sampling from the point of view of empirical distributions. With the difficulty that unnormalised importance sampling estimators are not associated with an empirical distribution since the sum of the weights is not one. For several f-divergences, he obtains upper bounds on those divergences between the empirical cdf and a uniform version, D(w,u), which translate into lower bounds on the importance sample size. I however do not see why this divergence between a weighted sampled and the uniformly weighted version is relevant for the divergence between the target and the proposal, nor how the resulting Monte Carlo estimator is impacted by this bound. A side remark [in the paper] is that those results apply to infinite variance Monte Carlo estimators, as in the recent paper of Chatterjee and Diaconis I discussed earlier, which also discussed the necessary sample size.