## importance tempering and variable selection

Posted in Books, Statistics with tags , , , , , , , , on November 6, 2018 by xi'an

As reading and commenting the importance tempering for variable selection paper by Giacomo Zanella (previously Warwick) and Gareth Roberts (Warwick) has been on my to-do list for quite a while, the fact that Giacomo presented this work at CIRM Bayesian Masterclass last week was the right nudge to write this post.

The starting point for the method is to simulate from a tempered version of a Gibbs sampler, selecting the component [of the parameter vector θ] according to an importance weight that is the inverse of the conditional posterior to the complementary power. That is, the inverse of the importance weight. This approach differs from classical (MCMC) tempering in that it does not target the original distribution. Hence it produces a weighted sample, whose computing time is of the order of the dimension of θ, even though the tempered simulation of a single conditional can reduce the variance of the estimator. The method is generalisable to any collection of one-component proposal/importance distributions, with the assumption that they have fatter tails that the true conditionals. The resulting Markov chain is reversible with respect to another stationary measure made of the original distribution multiplied by the normalisation factor of the importance weights but this ensures that weighted averages converge to the right quantity. Interestingly so because the powered conditionals are not necessarily coherent from a Gibbsic perspective.

The method is applied to Bayesian [spike-and-slab] variable selection of variables, the importance selection of a subset of covariates being restricted to changing one index at a time. I did not understand first how the computation of the normalising constant avoids involving 2-to-the-power-p terms until Giacomo explained to me that the constant was only computed for conditionals. The complexity gets down from O(|γ|²) to O(|γ|p), where |γ| is the number of variables. Another question I had was about the tempering power β, which selection remains a wee bit of an art!

## rethinking the ESS

Posted in Statistics with tags , , , , , , , , , on September 14, 2018 by xi'an

Following Victor Elvira‘s visit to Dauphine, one and a half year ago, where we discussed the many defects of ESS as a default measure of efficiency for importance sampling estimators, and then some more efforts (mostly from Victor!) to formalise these criticisms, Victor, Luca Martino and I wrote a paper on this notion, now arXived. (Victor most kindly attributes the origin of the paper to a 2010 ‘Og post on the topic!) The starting thread of the (re?)analysis of this tool introduced by Kong (1992) is that the ESS used in the literature is an approximation to the “true” ESS, generally unavailable. Approximation that is pretty crude and hence impacts the relevance of using it as the assessment tool for comparing importance sampling methods. In the paper, we re-derive (with the uttermost precision) the resulting approximation and list the many assumptions that [would] validate this approximation. The resulting drawbacks are many, from the absurd property of always being worse than direct sampling, to being independent from the target function and from the sample per se. Since only importance weights matter. This list of issues is not exactly brand new, but we think it is worth signaling given the fact that this approximation has been widely used in the last 25 years, due to its simplicity, as a practical rule of thumb [!] in a wide variety of importance sampling methods. In continuation of the directions drafted in Martino et al. (2017), we also indicate some alternative notions of importance efficiency. Note that this paper does not cover the use of ESS for MCMC algorithms, where it is somewhat more legit, if still too rudimentary to really catch convergence or lack thereof! [Note: I refrained from the post title resinking the ESS…]

## nested sampling when prior and likelihood clash

Posted in Books, Statistics with tags , , , , , , , , , on April 3, 2018 by xi'an

A recent arXival by Chen, Hobson, Das, and Gelderblom makes the proposal of a new nested sampling implementation when prior and likelihood disagree, making simulations from the prior inefficient. The paper holds the position that a single given prior is used over and over all datasets that come along:

“…in applications where one wishes to perform analyses on many thousands (or even millions) of different datasets, since those (typically few) datasets for which the prior is unrepresentative can absorb a large fraction of the computational resources.” Chen et al., 2018

My reaction to this situation, provided (a) I want to implement nested sampling and (b) I realise there is a discrepancy, would be to resort to an importance sampling resolution, as we proposed in our Biometrika paper with Nicolas. Since one objection [from the authors] is that identifying outlier datasets is complicated (it should not be when the likelihood function can be computed) and time-consuming, sequential importance sampling could be implemented.

“The posterior repartitioning (PR) method takes advantage of the fact that nested sampling makes use of the likelihood L(θ) and prior π(θ) separately in its exploration of the parameter space, in contrast to Markov chain Monte Carlo (MCMC) sampling methods or genetic algorithms which typically deal solely in terms of the product.” Chen et al., 2018

The above salesman line does not ring a particularly convincing chime in that nested sampling is about as myopic as MCMC since based on the similar notion of a local proposal move, starting from the lowest likelihood argument (the minimum likelihood estimator!) in the nested sample.

“The advantage of this extension is that one can choose (π’,L’) so that simulating from π’ under the constraint L'(θ) > l is easier than simulating from π under the constraint L(θ) > l. For instance, one may choose an instrumental prior π’ such that Markov chain Monte Carlo steps adapted to the instrumental constrained prior are easier to implement than with respect to the actual constrained prior. In a similar vein, nested importance sampling facilitates contemplating several priors at once, as one may compute the evidence for each prior by producing the same nested sequence, based on the same pair (π’,L’), and by simply modifying the weight function.” Chopin & Robert, 2010

Since the authors propose to switch to a product (π’,L’) such that π’.L’=π.L, the solution appears like a special case of importance sampling, with the added drwaback that when π’ is not normalised, its normalised constant must be estimated as well. (With an extra nested sampling implementation?) Furthermore, the advocated solution is to use tempering, which is not so obvious as it seems in small dimensions. As the mass does not always diffuse to relevant parts of the space. A more “natural” tempering would be to use a subsample in the (sub)likelihood for nested sampling and keep the remainder of the sample for weighting the evaluation of the evidence.

## and another one on nested sampling

Posted in Books, Statistics with tags , , , on May 2, 2017 by xi'an

The same authors as those of the paper discussed last week arXived a paper on dynamic nested sampling.

“We propose modifying the nested sampling algorithm by dynamically varying the number of “live points” in order to maximise the accuracy of a calculation for some number of posterior sample.”

Some of the material is actually quite similar to the previous paper (to the point I had to check they were not the same paper). The authors rightly point out that the main source of variation in the nested sampling approximation is due to the Monte Carlo variability in the estimated volume of the level sets.

The main notion in that paper is that it is acceptable to have a varying number of “live” points in nested sampling provided the weights are correctly accordingly. Adding more of those points as a new “thread” in a region where the likelihood changes rapidly. Addition may occur at any level of the likelihood, in fact, and is determined  in the paper by an importance weight being in the upper tail of the importance weights… While the description is rather vague [for instance I do not get the notation in (9)] and the criteria for adding threads somewhat arbitrary, I find interesting that several passes at different precision levels can improve the efficiency of the nested approximation at a given simulation cost. Remains the issue of whether or not this is a sufficient perk for attracting users of other simulation techniques to the nested galaxy…

## importance sampling and necessary sample size

Posted in Books, Statistics with tags , , , , , on September 7, 2016 by xi'an

Daniel Sanz-Alonso arXived a note yesterday where he analyses importance sampling from the point of view of empirical distributions. With the difficulty that unnormalised importance sampling estimators are not associated with an empirical distribution since the sum of the weights is not one. For several f-divergences, he obtains upper bounds on those divergences between the empirical cdf and a uniform version, D(w,u), which translate into lower bounds on the importance sample size. I however do not see why this divergence between a weighted sampled and the uniformly weighted version is relevant for the divergence between the target and the proposal, nor how the resulting Monte Carlo estimator is impacted by this bound. A side remark [in the paper] is that those results apply to infinite variance Monte Carlo estimators, as in the recent paper of Chatterjee and Diaconis I discussed earlier, which also discussed the necessary sample size.

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , , on February 24, 2015 by xi'an

Today was the final session of our Reading Classics Seminar for the academic year 2014-2015. I have not reported on this seminar much so far because it has had starting problems, namely hardly any student present on the first classes and therefore several re-starts until we reached a small group of interested students. And this is truly The End for this enjoyable experiment as this is the final year for my TSI Master at Paris-Dauphine, as it will become integrated within the new MASH Master next year.

As a last presentation for the entire series, my student picked John Skilling’s Nested Sampling, not that it was in my list of “classics”, but he had worked on the paper in a summer project and was thus reasonably fluent with the topic. As he did a good enough job (!), here are his slides.

Some of the questions that came to me during the talk were on how to run nested sampling sequentially, both in the data and in the number of simulated points, and on incorporating more deterministic moves in order to remove some of the Monte Carlo variability. I was about to ask about (!) the Hamiltonian version of nested sampling but then he mentioned his last summer internship on this very topic! I also realised during that talk that the formula (for positive random variables)

$\int_0^\infty(1-F(x))\text{d}x = \mathbb{E}_F[X]$

does not require absolute continuity of the distribution F.

## particle efficient importance sampling

Posted in Statistics with tags , , , , , , on October 15, 2013 by xi'an

Marcel Scharth and Robert Kohn just arXived a new article entitled “particle efficient importance sampling“. What is—the efficiency—about?! The spectacular diminution in variance—(the authors mention a factor of 6,000 when compared with regular particle filters!—in a stochastic volatility simulation study.

If I got the details right, the improvement stems from a paper by Richard and Zhang (Journal of  Econometrics, 2007). In a state-space/hidden Markov model setting, (non-sequential) importance sampling tries to approximate the smoothing distribution one term at a time, ie p(xt|xt-1,y1:n), but Richard and Zhang (2007) modify the target by looking at

p(yt|xt)p(xt|xt-1(xt-1,y1:n),

where the last term χ(xt-1,y1:n) is the normalising constant of the proposal kernel for the previous (in t-1) target, k(xt-1|xt-2,y1:n). This kernel is actually parameterised as k(xt-1|xt-2,at(y1:n)) and the EIS algorithm optimises those parameters, one term at a time. The current paper expands Richard and Zhang (2007) by using particles to approximate the likelihood contribution and reduce the variance once the “optimal” EIS solution is obtained. (They also reproduce Richard’s and Zhang’s tricks of relying on the same common random numbers.

This approach sounds like a “miracle” to me, in the sense(s) that (a) the “normalising constant” is far from being uniquely defined (and just as far from being constant in the parameter at) and (b) it is unrelated with the target distribution (except for the optimisation step). In the extreme case when the normalising constant is also constant… in at, this step clearly is useless. (This also opens the potential for an optimisation in the choice of χ(xt-1,y1:n)…)

The simulation study starts from a univariate stochastic volatility model relying on two hidden correlated AR(1) models. (There may be a typo in the definition in Section 4.1, i.e. a Φi missing.) In those simulations, EIS brings a significant variance reduction when compared with standard particle filters and particle EIS further improves upon EIS by a factor of 2 to 20 (in the variance). I could not spot in the paper which choice had been made for χ()… which is annoying as I gathered from my reading that it must have a strong impact on the efficiency attached to the name of the method!