Pareto smoothed importance sampling and infinite variance [guest post]
The following is mostly based on our arXived paper with Andrew Gelman and the references mentioned there.
Koopman, Shephard, and Creal (2009) proposed a sample-based estimate of the existence of moments, using a generalized Pareto distribution fitted to the tail of the weight distribution. The number of existing moments is less than 1/k (when k>0), where k is the shape parameter of the generalized Pareto distribution.
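To make the idea concrete, here is a minimal sketch of estimating the tail shape k from the largest importance weights. It is not the estimator used by Koopman, Shephard, and Creal or in our paper: for brevity it swaps in the simpler Hill estimator, which targets the same tail index when the weights are heavy tailed (k>0). The Exp(1) proposal, the Exp(1/2) target, the 20% tail fraction, and all function names are assumptions made only for this illustration.

```python
import math
import random


def importance_weights(rate, n_draws, rng):
    """Weights p(x)/q(x) for an Exp(rate) target and an Exp(1) proposal."""
    draws = [rng.expovariate(1.0) for _ in range(n_draws)]
    # p(x) = rate * exp(-rate * x) and q(x) = exp(-x)
    return [rate * math.exp((1.0 - rate) * x) for x in draws]


def hill_tail_shape(weights, tail_fraction=0.2):
    """Hill estimate of the tail shape k from the largest weights (assumed choice)."""
    ordered = sorted(weights, reverse=True)
    n_tail = min(max(2, int(tail_fraction * len(ordered))), len(ordered) - 1)
    threshold = ordered[n_tail]  # largest weight that is not in the tail
    logs = [math.log(w / threshold) for w in ordered[:n_tail]]
    return sum(logs) / n_tail


rng = random.Random(1)
weights = importance_weights(rate=0.5, n_draws=10_000, rng=rng)
k_hat = hill_tail_shape(weights)
print(f"estimated k = {k_hat:.2f}, so moments exist only below order {1 / k_hat:.2f}")
```

For this particular proposal and target the weights are exactly Pareto distributed with k = 1/2, so the estimate should land near 0.5. With real posterior weights the tail is only approximately Pareto, and the choice of the tail fraction matters more.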
When k<1/2, the variance exists and the central limit theorem holds. Chen and Shao (2004) show further that the rate of convergence to normality is faster when higher moments exist. When 1/2≤k<1, the variance does not exist (but the mean does), the generalized central limit theorem holds, and we may assume that the rate of convergence is faster when k is closer to 1/2.
In our experiments in the arXived paper and in Vehtari, Gelman, and Gabry (2015), we have observed that Pareto smoothed importance sampling (PSIS) usually converges well even when k>1/2, as long as k is close to 1/2 (let’s say k<0.7). But if k<1 and k is close to 1 (let’s say k>0.7), the convergence is much worse and both naïve importance sampling and PSIS are unreliable.
Two figures are attached, which show the results comparing IS and PSIS in the Exp(1/2) and Exp(1/10) examples. The results were computed by repeating 1000 times a simulation with 10000 samples in each. We can see the bad performance of IS in both examples, as you also illustrated. In the Exp(1/2) case, PSIS is able to produce much more stable results. In the Exp(1/10) case, PSIS is able to reduce the variance of the estimate, but this is not enough to avoid a big bias.
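A scaled-down version of this kind of repeated simulation can be sketched as follows. It covers plain IS only (no Pareto smoothing) and it does not reproduce the figures. The Exp(1) proposal, the choice of estimand (the integral of the target density, which is exactly 1), the reduced number of repetitions and draws, and all names are assumptions made for the illustration.

```python
import math
import random
import statistics


def is_estimate_of_one(rate, n_draws, rng):
    """Plain IS estimate of the integral of the Exp(rate) density (exactly 1)."""
    total = 0.0
    for _ in range(n_draws):
        x = rng.expovariate(1.0)                    # draw from the Exp(1) proposal
        total += rate * math.exp((1.0 - rate) * x)  # weight p(x) / q(x)
    return total / n_draws


def repeat_simulation(rate, n_reps=200, n_draws=2_000, seed=0):
    """Repeat the IS estimate n_reps times (far fewer than in the post)."""
    rng = random.Random(seed)
    return [is_estimate_of_one(rate, n_draws, rng) for _ in range(n_reps)]


summary = {}
for rate in (1 / 2, 1 / 10):
    estimates = repeat_simulation(rate)
    cuts = statistics.quantiles(estimates, n=20)  # cut points at 5%, 10%, ..., 95%
    summary[rate] = {
        "median": statistics.median(estimates),
        "q05": cuts[0],
        "q95": cuts[-1],
    }
    print(
        f"Exp({rate:g}) target: median {summary[rate]['median']:.2f}, "
        f"5%-95% range {cuts[0]:.2f} to {cuts[-1]:.2f} (true value 1)"
    )
```

Under these assumptions the typical estimate in the Exp(1/10) case sits well below the true value of 1, because the rare huge weights that carry most of the mass are usually missing from a finite sample.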
It would be interesting to have more theoretical justification for why infinite variance is not such a big problem when k is close to 1/2 (e.g. how the convergence rate is related to the number of fractional moments).
I guess that max ω[t] / ∑ ω[t] in Chatterjee and Diaconis has some connection to the tail shape parameter of the generalized Pareto distribution, but it is likely to be much noisier, as it depends on the maximum value instead of a larger number of tail samples as in the approach by Koopman, Shephard, and Creal (2009).
A third figure shows an example where the variance is finite, with “an Exp(1) proposal for an Exp(1/1.9) target”, which corresponds to k≈0.475 < 1/2. Although the variance is finite, we are close to the border and the performance of basic IS is bad. There is no sharp change in the practical behaviour with a finite number of draws when going from finite variance to infinite variance. Thus, I think it is not enough to focus on the discrete number of moments; the Pareto shape parameter k, for example, gives us more information. Koopman, Shephard, and Creal (2009) also estimated the Pareto shape k, but they formed a hypothesis test of whether the variance is finite, thus discretising the information in k and assuming that finite variance is enough to get good performance.
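To get a feeling for how noisy the max ω[t] / ∑ ω[t] statistic mentioned above can be, the following sketch recomputes it over independent replications for three targets. It assumes an Exp(1) proposal throughout, in which case the weights for an Exp(rate) target are exactly Pareto distributed with k = 1 − rate. The sample sizes and names are hypothetical choices for the illustration.

```python
import math
import random
import statistics


def max_weight_share(rate, n_draws, rng):
    """Largest weight divided by the sum of weights, for one set of draws."""
    weights = [
        rate * math.exp((1.0 - rate) * rng.expovariate(1.0))  # p(x) / q(x)
        for _ in range(n_draws)
    ]
    return max(weights) / sum(weights)


rng = random.Random(2)
shares = {}
for rate in (1 / 1.9, 1 / 2, 1 / 10):
    k = 1.0 - rate  # analytic tail shape under the assumed Exp(1) proposal
    shares[rate] = [max_weight_share(rate, 2_000, rng) for _ in range(100)]
    print(
        f"k = {k:.2f}: max/sum ranges from {min(shares[rate]):.3f} "
        f"to {max(shares[rate]):.3f} (median {statistics.median(shares[rate]):.3f})"
    )
```

The ratio grows with k, but within one value of k it can vary a lot from replication to replication, since it is driven by a single draw.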