Daniel Sanz-Alonso arXived a note yesterday where he analyses importance sampling from the point of view of empirical distributions. With the difficulty that unnormalised importance sampling estimators are not associated with an empirical distribution since the sum of the weights is not one. For several f-divergences, he obtains upper bounds on those divergences between the empirical cdf and a uniform version, D(w,u), which translate into lower bounds on the importance sample size. I however do not see why this divergence between a weighted sampled and the uniformly weighted version is relevant for the divergence between the target and the proposal, nor how the resulting Monte Carlo estimator is impacted by this bound. A side remark [in the paper] is that those results apply to infinite variance Monte Carlo estimators, as in the recent paper of Chatterjee and Diaconis I discussed earlier, which also discussed the necessary sample size.
Archive for infinite variance estimators
Today I attended Persi Diaconis’ de Finetti’s ISBA Lecture and not only because I was an invited discussant, by all means!!! Persi was discussing his views on Bayesian numerical analysis. As already expressed in his 1988 paper. Which now appears as a foundational precursor to probabilistic numerics. And which is why I had a very easy time in preparing my discussion as I mostly borrowed from my NIPS slides. With some degree of legitimacy since I was already a discussant there. Anyway, here is the most novel slide in the discussion, built upon my realisation that the principle behind nested sampling is fairly generic for integral approximation, rather than being restricted to marginal likelihood approximation.
Among many interesting things, Persi’s talk made me think anew about infinite variance importance sampling. And about the paper by Souraj Chatterjee and Persi that I discussed a few months ago. In that some regularisation of those “useless” importance estimates can stem from prior modelling. Not as an aside, let me add I am very grateful to the ISBA 2016 organisers and to the chair of the de Finetti lecture committee for their invitation to discuss this talk!
As I was still musing about the posts of last week around infinite variance importance sampling and its potential corrections, I wondered at whether or not there was a fundamental difference between “just” having a [finite] variance and “just” having none. In conjunction with Aki’s post. To get a better feeling, I ran a quick experiment with Exp(1) as the target and Exp(a) as the importance distribution. When estimating E[X]=1, the above graph opposes a=1.95 to a=2.05 (variance versus no variance, bright yellow versus wheat), a=2.95 to a=3.05 (third moment versus none, bright yellow versus wheat), and a=3.95 to a=4.05 (fourth moment versus none, bright yellow versus wheat). The graph below is the same for the estimation of E[exp(X/2)]=2, which has an integrand that is not square integrable under the target. Hence seems to require higher moments for the importance weight. Hard to derive universal theories from those two graphs, however they show that protection against sudden drifts in the estimation sequence. As an aside [not really!], apart from our rather confidential Confidence bands for Brownian motion and applications to Monte Carlo simulation with Wilfrid Kendall and Jean-Michel Marin, I do not know of many studies that consider the sequence of averages time-wise rather than across realisations at a given time and still think this is a more relevant perspective for simulation purposes.
The following is mostly based on our arXived paper with Andrew Gelman and the references mentioned there.
Koopman, Shephard, and Creal (2009) proposed to make a sample based estimate of the existence of the moments using generalized Pareto distribution fitted to the tail of the weight distribution. The number of existing moments is less than 1/k (when k>0), where k is the shape parameter of generalized Pareto distribution.
When k<1/2, the variance exists and the central limit theorem holds. Chen and Shao (2004) show further that the rate of convergence to normality is faster when higher moments exist. When 1/2≤k<1, the variance does not exist (but mean exists), the generalized central limit theorem holds, and we may assume the rate of convergence is faster when k is closer to 1/2.
In our experiments in the arXived paper and in Vehtari, Gelman, and Gabry (2015), we have observed that Pareto smoothed importance sampling (PSIS) usually converges well also with k>1/2 but k close to 1/2 (let’s say k<0.7). But if k<1 and k is close to 1 (let’s say k>0.7) the convergence is much worse and both naïve importance sampling and PSIS are unreliable.
Two figures are attached, which show the results comparing IS and PSIS in the Exp(1/2) and Exp(1/10) examples. The results were computed with repeating 1000 times a simulation with 10000 samples in each. We can see the bad performance of IS in both examples as you also illustrated. In Exp(1/2) case, PSIS is also to produce much more stable results. In Exp(1/10) case, PSIS is able to reduce the variance of the estimate, but it is not enough to avoid a big bias.
It would be interesting to have more theoretical justification why infinite variance is not so big problem if k is close to 1/2 (e.g. how the convergence rate is related to the amount of fractional moments).
I guess that max ω[t] / ∑ ω[t] in Chaterjee and Diaconis has some connection to the tail shape parameter of the generalized Pareto distribution, but it is likely to be much noisier as it depends on the maximum value instead of a larger number of tail samples as in the approach by Koopman, Shephard, and Creal (2009).A third figure shows an example where the variance is finite, with “an Exp(1) proposal for an Exp(1/1.9) target”, which corresponds to k≈0.475 < 1/2. Although the variance is finite, we are close to the border and the performance of basic IS is bad. There is no sharp change in the practical behaviour with a finite number of draws when going from finite variance to infinite variance. Thus, I think it is not enough to focus on the discrete number of moments, but for example, the Pareto shape parameter k gives us more information. Koopman, Shephard, and Creal (2009) also estimated the Pareto shape k, but they formed a hypothesis test whether the variance is finite and thus discretising the information in k, and assuming that finite variance is enough to get good performance.
Sourav Chatterjee and Persi Diaconis arXived yesterday an exciting paper where they study the proper sample size in an importance sampling setting with no variance. That’s right, with no variance. They give as a starting toy example the use of an Exp(1) proposal for an Exp(1/2) target, where the importance ratio exp(x/2)/2 has no ξ order moment (for ξ≥2). So the infinity in the variance is somehow borderline in this example, which may explain why the estimator could be considered to “work”. However, I disagree with the statement “that a sample size a few thousand suffices” for the estimator of the mean to be close to the true value, that is, 2. For instance, the picture I drew above is the superposition of 250 sequences of importance sampling estimators across 10⁵ iterations: several sequences show huge jumps, even for a large number of iterations, which are characteristic of infinite variance estimates. Thus, while the expected distance to the true value can be closely evaluated via the Kullback-Leibler divergence between the target and the proposal (which by the way is infinite when using a Normal as proposal and a Cauchy as target), there are realisations of the simulation path that can remain far from the true value and this for an arbitrary number of simulations. (I even wonder if, for a given simulation path, waiting long enough should
not lead to those unbounded jumps.) The first result is frequentist, while the second is conditional, i.e., can occur for the single path we have just simulated… As I taught in class this very morning, I thus remain wary about using an infinite variance estimator. (And not only in connection with the harmonic mean quagmire. As shown below by the more extreme case of simulating an Exp(1) proposal for an Exp(1/10) target, where the mean is completely outside the range of estimates.) Wary, then, even though I find the enclosed result about the existence of a cut-off sample size associated with this L¹ measure quite astounding. Continue reading Last week A while ago, Aki Vehtari and Andrew Gelman arXived a paper on self-normalised importance sampling estimators, Pareto smoothed importance sampling. That I commented almost immediately and then sat on, waiting for the next version. Since the two A’s are still working on that revision, I eventually decided to post the comments, before a series of posts on the same issue. Disclaimer: the above quote from and picture of Pareto are unrelated with the paper!
A major drawback with importance samplers is that they can produce infinite variance estimators. Aki and Andrew compare in this study the behaviour of truncated importance weights, following a paper of Ionides (2008) Andrew and I had proposed as a student project last year, project that did not conclude. The truncation is of order √S, where S is the number of simulation, rescaled by the average weight (which should better be the median weight in the event of infinite variance weights). While this truncation leads to finite variance, it also induces a possibly far from negligible bias, bias that the paper suggests to reduce via a Pareto modelling of the largest or extreme weights. Three possible conclusions come from the Pareto modelling and the estimation of the Pareto shape k. If k<½, there is no variance issue and truncation is not necessary; if ½<k<1, the estimator has a mean but no variance, and if k>1, it does not even has a mean. The latter case sounds counter-intuitive since the self-normalised importance sampling estimator is the ratio of an estimate of a finite integral by an estimate of a positive constant… Aki and Andrew further use the Pareto estimation to smooth out the largest weights as estimated quantiles. They also eliminate the largest weights when k comes close to 1 or higher values.
On a normal toy example, simulated with too small a variance, the method is seen to reduce the variability if not the bias. In connection with my above remark, k does never appear as significantly above 1 in this example. A second toy example uses a shifted t distribution as proposal. This setting should not induce a infinite variance problem since the inverse of a t density remains integrable under a normal distribution, but the variance grows with the bias in the t proposal and the Pareto index k as well, exceeding the boundary value 1 in the end. Similar behaviour is observed on a multidimensional example.
The issue I have with this approach is the same I set to Andrew last year, namely why would one want to use a poor importance sampler and run the risk of ending up with a worthless approximation? Detecting infinite variance estimation is obviously an essential first step step to produce reliable approximation but a second step would to seek a substitute for the proposal in an automated manner, possibly by increasing the tails of the original one, or in running a reparameterisation of the original problem with the same proposal. Towards thinner tails of the target. Automated sounds unrealistic, obviously, but so does trusting an infinite variance estimate. If worse comes to worse, we should acknowledge and signal that the current sampler cannot be trusted. As in statistical settings, we should be able to state we cannot produce a satisfactory solution (and hence need more data or different models).
while simulating from the associated density, g. While seemingly unaware of the (huge) literature in the area, he re-derived [a version of] the harmonic mean estimate by considering the [inverted importance sampling] identity
when α is a probability density and by using for α the uniform over the whole range of the simulations from g. This choice of α obviously leads to an estimator with infinite variance when the support of g is unbounded, but the idea can be easily salvaged by using instead another uniform distribution, for instance on an highest density region, as we studied in our papers with Darren Wraith and Jean-Michel Marin. (Unfortunately, the originator of the question does not seem any longer interested in the problem.)