Paret’oothed importance sampling and infinite variance [guest post]
[Here are some comments sent to me by Aki Vehtari following the previous posts.]
The following is mostly based on our arXived paper with Andrew Gelman and the references mentioned there.
Koopman, Shephard, and Creal (2009) proposed a sample-based estimate of the number of existing moments, obtained by fitting a generalized Pareto distribution to the tail of the weight distribution. The number of existing moments is less than 1/k (when k>0), where k is the shape parameter of the generalized Pareto distribution.
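For readers who want to see the mechanics, here is a minimal sketch of that tail fit. It uses SciPy's maximum-likelihood GPD fit and a fixed 20% tail fraction, which are my simplifications for illustration; the function name tail_shape_khat is mine, and the estimator we actually use in the paper (in the style of Zhang and Stephens) differs in its details.

```python
import numpy as np
from scipy import stats

def tail_shape_khat(weights, tail_frac=0.2):
    # Fit a generalized Pareto distribution to the largest importance
    # weights and return the shape estimate khat; moments of order
    # alpha exist (roughly) for alpha < 1/khat.
    w = np.sort(np.asarray(weights, dtype=float))
    m = int(np.ceil(tail_frac * w.size))   # how many weights form the tail
    mu = w[-m - 1]                         # threshold just below the tail
    k, _, sigma = stats.genpareto.fit(w[-m:] - mu, floc=0)
    return k

# Exp(1) proposal for an Exp(1/2) target: w(x) = 0.5*exp(x/2), true k = 1/2.
rng = np.random.default_rng(0)
x = rng.exponential(1.0, size=10_000)
print(tail_shape_khat(0.5 * np.exp(x / 2)))  # should land near 0.5
```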
When k<1/2, the variance exists and the central limit theorem holds. Chen and Shao (2004) show further that the rate of convergence to normality is faster when higher moments exist. When 1/2≤k<1, the variance does not exist (but the mean does), the generalized central limit theorem holds, and we may assume the rate of convergence is faster when k is closer to 1/2.
In the example with “Exp(1) proposal for an Exp(1/2) target”, k=1/2 and we are truly on the border.
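For the record, that k=1/2 follows from a one-line moment computation (writing Exp(λ) for the rate-λ exponential):

```latex
% q(x) = e^{-x} (proposal), p(x) = \tfrac12 e^{-x/2} (target), so the
% weight is w(x) = p(x)/q(x) = \tfrac12 e^{x/2} and
\[
  \mathbb{E}_q\bigl[w(X)^{\alpha}\bigr]
  = 2^{-\alpha}\int_0^{\infty} e^{(\alpha/2-1)x}\,\mathrm{d}x
  < \infty \iff \alpha < 2 ,
\]
% i.e. exactly the moments of order \alpha < 2 = 1/k exist, so k = 1/2
% and the variance (\alpha = 2) sits precisely on the boundary.
```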
In our experiments in the arXived paper and in Vehtari, Gelman, and Gabry (2015), we have observed that Pareto smoothed importance sampling (PSIS) usually converges well also when k>1/2, as long as k stays close to 1/2 (say k<0.7). But when k is close to 1 (say 0.7<k<1), the convergence is much worse and both naïve importance sampling and PSIS are unreliable.
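For concreteness, a minimal sketch of the smoothing step follows, reusing the GPD fit above. The actual PSIS procedure in the paper differs in details (tail-size rule, the k estimator, truncation), so treat this as an illustration rather than a reference implementation.

```python
def psis_smooth(weights, tail_frac=0.2):
    # Pareto smoothed importance sampling, schematically: fit a GPD to
    # the m largest weights and replace them by quantiles of the fitted
    # distribution at the plotting positions (z - 1/2)/m.
    w = np.asarray(weights, dtype=float)
    order = np.argsort(w)
    m = int(np.ceil(tail_frac * w.size))
    tail_idx = order[-m:]                     # indices of the m largest weights
    mu = w[order[-m - 1]]                     # tail threshold
    k, _, sigma = stats.genpareto.fit(w[tail_idx] - mu, floc=0)
    probs = (np.arange(1, m + 1) - 0.5) / m
    smoothed = mu + stats.genpareto.ppf(probs, k, loc=0, scale=sigma)
    w_new = w.copy()
    w_new[tail_idx] = np.minimum(smoothed, w.max())  # never exceed the raw max
    return w_new, k
```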
Two figures are attached, which show the results comparing IS and PSIS in the Exp(1/2) and Exp(1/10) examples. The results were computed by repeating 1000 times a simulation with 10,000 draws each. We can see the bad performance of IS in both examples, as you also illustrated. In the Exp(1/2) case, PSIS is also able to produce much more stable results. In the Exp(1/10) case, PSIS is able to reduce the variance of the estimate, but not enough to avoid a big bias.
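A rough reconstruction of the Exp(1/2) experiment, with the psis_smooth sketch above, using the self-normalized estimate of the target mean E[X]=2 (the attached figures may show a different functional):

```python
is_est, psis_est = [], []
for _ in range(1000):                    # 1000 repetitions, 10,000 draws each
    x = rng.exponential(1.0, size=10_000)
    w = 0.5 * np.exp(x / 2)              # weights for the Exp(1/2) target
    w_s, khat = psis_smooth(w)
    is_est.append(np.sum(w * x) / np.sum(w))
    psis_est.append(np.sum(w_s * x) / np.sum(w_s))
print(np.std(is_est), np.std(psis_est))  # PSIS spread should be clearly smaller
```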
It would be interesting to have more theoretical justification for why infinite variance is not such a big problem when k is close to 1/2 (e.g., how the convergence rate is related to the number of existing fractional moments).
I guess that max ω[t] / ∑ ω[t] in Chatterjee and Diaconis has some connection to the tail shape parameter of the generalized Pareto distribution, but it is likely to be much noisier, as it depends on the maximum value alone instead of on a larger number of tail samples as in the approach by Koopman, Shephard, and Creal (2009).

A third figure shows an example where the variance is finite, with “an Exp(1) proposal for an Exp(1/1.9) target”, which corresponds to k≈0.475 < 1/2. Although the variance is finite, we are close to the border and the performance of basic IS is bad. There is no sharp change in the practical behaviour with a finite number of draws when going from finite to infinite variance. Thus, I think it is not enough to focus on the discrete number of moments; the Pareto shape parameter k, for example, gives us more information. Koopman, Shephard, and Creal (2009) also estimated the Pareto shape k, but they formed a hypothesis test of whether the variance is finite, thus discretising the information in k and assuming that finite variance is enough for good performance.
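Reusing the sketches above, the two diagnostics can be put side by side on a single batch of draws:

```python
x = rng.exponential(1.0, size=10_000)
w = 0.5 * np.exp(x / 2)
print(w.max() / w.sum())   # Chatterjee-Diaconis ratio: driven by one order statistic
print(psis_smooth(w)[1])   # khat: estimated from the ~2000 tail weights
```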
November 18, 2015 at 1:22 pm
[…] Last week Xi’an blogged about an arXiv paper by Chatterjee and Diaconis which considers the proper sample size in an importance sampling setting with infinite variance. I commented Xi’an’s posting and the end result was my guest blog posting in Xi’an’s og. […]
November 18, 2015 at 8:30 am
Hi Dan, after MCMC got popular, the popularity of IS went down (except in particle filtering) due to the difficulties in finding good proposal distributions and in checking reliability (finite variance, etc.). One reason for the renewed interest in IS is the availability of better distributional approximations (variational, EP), which can be used to obtain better proposal distributions. There are several examples where importance sampling saves a lot of time, and we now also have a better way to estimate and control reliability (PSIS). In leave-one-out cross-validation the speed-up is n-fold, where n is the number of observations (see http://arxiv.org/abs/1507.04544). We have good results in other applications to be published (arXived) soon. Other researchers are also using different variations of IS more and more.
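To make the n-fold claim concrete, here is a toy sketch with an invented normal-mean model (not from the linked paper), reusing the psis_smooth sketch from the guest post: a single set of posterior draws serves all n leave-one-out folds through the raw ratios 1/p(y_i|θ).

```python
rng = np.random.default_rng(2)
y = rng.normal(0.3, 1.0, size=50)                        # observations
theta = rng.normal(y.mean(), 1 / np.sqrt(y.size), 4000)  # posterior draws, flat prior
log_lik = stats.norm.logpdf(y[None, :], loc=theta[:, None])  # (S, n) pointwise
loo_i = np.empty(y.size)
for i in range(y.size):
    w, _ = psis_smooth(np.exp(-log_lik[:, i]))  # raw LOO ratios 1/p(y_i | theta)
    w /= w.sum()
    loo_i[i] = np.log(np.sum(w * np.exp(log_lik[:, i])))
print(loo_i.sum())  # PSIS-LOO estimate of expected log predictive density
```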
November 18, 2015 at 11:56 am
Thanks Aki – that makes a lot of sense!
November 17, 2015 at 1:19 pm
Ok. I think I need someone to educate me here…
I always thought that importance sampling was basically a neat trick that was pedagogically useful (it shows everyone what the ||f||^2 term in the variance actually means). But in practice, there was little chance of getting a decent importance distribution for real problems.
But all of these people are putting serious work into it, so what is it that I’m missing? What type of problem is the target here?
November 17, 2015 at 2:28 pm
Uh?! Everything Monte Carlo based “is” importance sampling, Dan, so do you question the relevance of simulation as a whole?!
November 18, 2015 at 12:59 am
Monte Carlo is a special case of importance sampling where none of these issues really arise. So it’s the whole “change the base measure to improve the computational performance” aspect that I don’t understand the real application of.
If the base measure is the Lebesgue measure and you’re hitting infinite-variance problems, I would wonder why you are working with L^p (p<2) functions in a statistical context. (It may come up in a problem with lags – the natural space to consider functional Volterra equations is some perverted version of L^1, for example.)
November 18, 2015 at 10:12 am
Uh?! What is so special with square integrable integrands? I may be interested in the expectation of a function that is not square integrable, this does not seem so unrealistic, does it?!
November 18, 2015 at 11:55 am
I guess the interest is that most functions are L^2 and almost all of the theory that I’m aware of is L^2 (there are good reasons for this – L^1 is a terrible terrible space of functions).
Perhaps the only exception is low-dimensional QMC, which usually lives in spaces of bounded (H-K) variation, which is not unlike the space of functions with integrable mixed first derivatives.
So basically, while it may not be unrealistic to want the expectation of a function that is only integrable, I’m not sure we currently (or at the increment of time before these two papers) have any (sampling-based) methods for doing that. I guess I was asking if this was a “build it and they will come”-type situation or if there was a demand in the community.