Archive for Pareto smoothed importance sampling

Conformal Bayesian Computation

Posted in Books, pictures, Statistics, University life with tags , , , , , on July 8, 2021 by xi'an

Edwin Fong and Chris Holmes (Oxford) just wrote a paper on Bayesian scalable methods from a M-open perspective. Borrowing from the conformal prediction framework of Vovk et al. (2005) to achieve frequentist coverage for prediction intervals. The method starts with the choice of a conformity measure that measures how well each observation in the sample agrees with the sample. Which is exchangeable and hence leads to a rank statistic from which a p-value can be derived. Which is the empirical cdf associated with the observed conformities. Following Vovk et al. (2005) and Wasserman (2011) Edwin and Chris note that the Bayesian predictive itself acts like a conformity measure. Predictive that can itself be approximated by MCMC and importance sampling (possibly smoothed by Pareto). The paper also expands the setting to partial exchangeable models, renamed group conformal predictions. While reluctant to engage into turning Bayesian solutions into frequentist ones, I can see some worth in deriving both in order to expose discrepancies and hence signal possible issues with models and priors.

improved importance sampling via iterated moment matching

Posted in Statistics with tags , , , , on August 1, 2019 by xi'an

Topi Paananen, Juho Piironen, Paul-Christian Bürkner and Aki Vehtari have recently arXived a work on constructing an adapted importance (sampling) distribution. The beginning is more a review than a new contribution, covering the earlier work by Vehtari, Gelman  and Gabri (2017): estimating the Pareto rate for the importance weight distribution helps in assessing whether or not this distribution allows for a (necessary) second moment. In case it does not (seem to), the authors propose an affine transform of the importance distribution, using the earlier sample to match the first two moments of the distribution. Or of the targeted function. Adaptation that is controlled by the same Pareto rate technique, as in the above picture (from the paper). Predicting a natural objection as to the poor performances of the earlier samples, the paper suggests to use robust estimators of these moments, for instance via Pareto smoothing. It also suggests using multiple importance sampling as a way to regularise and robustify the estimates. While I buy the argument of fitting the target moments to achieve a better fit of the importance sampling, I remain unclear as to why an affine transform would change the (poor) tail behaviour of the importance sampler. Hence why it would apply in full generality. An alternative could consist in finding appropriate Box-Cox transforms, although the difficulty would certainly increase with the dimension.

did variational Bayes work?

Posted in Books, Statistics with tags , , , , , , , , , on May 2, 2019 by xi'an

An interesting ICML 2018 paper by Yuling Yao, Aki Vehtari, Daniel Simpson, and Andrew Gelman I missed last summer on [the fairly important issue of] assessing the quality or lack thereof of a variational Bayes approximation. In the sense of being near enough from the true posterior. The criterion that they propose in this paper relates to the Pareto smoothed importance sampling technique discussed in an earlier post and which I remember discussing with Andrew when he visited CREST a few years ago. The truncation of the importance weights of prior x likelihood / VB approximation avoids infinite variance issues but induces an unknown amount of bias. The resulting diagnostic is based on the estimation of the Pareto order k. If the true value of k is less than ½, the variance of the associated Pareto distribution is finite. The paper suggests to conclude at the worth of the variational approximation when the estimate of k is less than 0.7, based on the empirical assessment of the earlier paper. The paper also contains a remark on the poor performances of the generalisation of this method to marginal settings, that is, when the importance weight is the ratio of the true and variational marginals for a sub-vector of interest. I find the counter-performances somewhat worrying in that Rao-Blackwellisation arguments make me prefer marginal ratios to joint ratios. It may however be due to a poor approximation of the marginal ratio that reflects on the approximation and not on the ratio itself. A second proposal in the paper focus on solely the point estimate returned by the variational Bayes approximation. Testing that the posterior predictive is well-calibrated. This is less appealing, especially when the authors point out the “dissadvantage is that this diagnostic does not cover the case where the observed data is not well represented by the model.” In other words, misspecified situations. This potential misspecification could presumably be tested by comparing the Pareto fit based on the actual data with a Pareto fit based on simulated data. Among other deficiencies, they point that this is “a local diagnostic that will not detect unseen modes”. In other words, what you get is what you see.

IMS workshop [day 5]

Posted in Books, pictures, Statistics, Travel with tags , , , , , , , , on September 3, 2018 by xi'an

The last day of the starting workshop [and my last day in Singapore] was a day of importance [sampling] with talks by Matti Vihola opposing importance sampling and delayed acceptance and particle MCMC, related to several papers of his that I missed. To be continued in the coming weeks at the IMS, which is another reason to regret having to leave that early [as my Parisian semester starts this Monday with an undergrad class at 8:30!]

And then a talk by Joaquín Miguez on stabilizing importance sampling by truncation which reminded me very much of the later work by Andrew Gelman and Aki Vehtari on Pareto smoothed importance sampling, with further operators adapted to sequential settings and the similar drawback that when the importance sampler is poor, i.e., when the simulated points are all very far from the centre of mass, no amount of fudging with the weights will bring the points closer. AMIS made an appearance as a reference method, to be improved by this truncation of the weights, a wee bit surprising as it should bring the large weights of the earlier stages down.

Followed by an almost silent talk by Nick Whiteley, who having lost his voice to the air conditioning whispered his talk in the microphone. Having once faced a lost voice during an introductory lecture to a large undergraduate audience, I could not but completely commiserate for the hardship of the task. Although this made the audience most silent and attentive. His topic was the Viterbi process and its parallelisation, by using a truncated horizon (presenting connection with overdamped Langevin, eg Durmus and Moulines and Dalalyan).

And due to a pressing appointment with my son and his girlfriend [who were traveling through Singapore on that day] for a chili crab dinner on my way to the airport, I missed the final talk by Arnaud Doucet, where he was to reconsider PDMP algorithms without the continuous time layer, a perspective I find most appealing!

Overall, this was a quite diverse and rich [starting] seminar, backed by the superb organisation of the IMS and the smooth living conditions on the NUS campus [once I had mastered the bus routes], which would have made much more sense for me as part of a longer stay, which is actually what happened the previous time I visited the IMS (in 2005), again clashing with my course schedule at home… And as always, I am impressed with the city-state of Singapore, for the highly diverse food scene in particular, but also this [maybe illusory] impression of coexistence between communities. And even though the ecological footprint could certainly be decreased, measures to curb car ownership (with a 150% purchase tax) and use (with congestion charges).

Paret’oothed importance sampling and infinite variance [guest post]

Posted in Kids, pictures, R, Statistics, University life with tags , , , , , , on November 17, 2015 by xi'an

IS_vs_PSIS_k09[Here are some comments sent to me by Aki Vehtari in the sequel of the previous posts.]

The following is mostly based on our arXived paper with Andrew Gelman and the references mentioned  there.

Koopman, Shephard, and Creal (2009) proposed to make a sample based estimate of the existence of the moments using generalized Pareto distribution fitted to the tail of the weight distribution. The number of existing moments is less than 1/k (when k>0), where k is the shape parameter of generalized Pareto distribution.

When k<1/2, the variance exists and the central limit theorem holds. Chen and Shao (2004) show further that the rate of convergence to normality is faster when higher moments exist. When 1/2≤k<1, the variance does not exist (but mean exists), the generalized central limit theorem holds, and we may assume the rate of convergence is faster when k is closer to 1/2.

In the example with “Exp(1) proposal for an Exp(1/2) target”, k=1/2 and we are truly on the border. IS_vs_PSIS_k05

In our experiments in the arXived paper and in Vehtari, Gelman, and Gabry (2015), we have observed that Pareto smoothed importance sampling (PSIS) usually converges well also with k>1/2 but k close to 1/2 (let’s say k<0.7). But if k<1 and k is close to 1 (let’s say k>0.7) the convergence is much worse and both naïve importance sampling and PSIS are unreliable.

Two figures are attached, which show the results comparing IS and PSIS in the Exp(1/2) and Exp(1/10) examples. The results were computed with repeating 1000 times a simulation with 10000 samples in each. We can see the bad performance of IS in both examples as you also illustrated. In Exp(1/2) case, PSIS is also to produce much more stable results. In Exp(1/10) case, PSIS is able to reduce the variance of the estimate, but it is not enough to avoid a big bias.

It would be interesting to have more theoretical justification why infinite variance is not so big problem if k is close to 1/2 (e.g. how the convergence rate is related to the amount of fractional moments).

I guess that max ω[t] / ∑ ω[t] in Chaterjee and Diaconis has some connection to the tail shape parameter of the generalized Pareto distribution, but it is likely to be much noisier as it depends on the maximum value instead of a larger number of tail samples as in the approach by Koopman, Shephard, and Creal (2009).IS_vs_PSIS_exp19A third figure shows an example where the variance is finite, with “an Exp(1) proposal for an Exp(1/1.9) target”, which corresponds to k≈0.475 < 1/2. Although the variance is finite, we are close to the border and the performance of basic IS is bad. There is no sharp change in the practical behaviour with a finite number of draws when going from finite variance to infinite variance. Thus, I think it is not enough to focus on the discrete number of moments, but for example, the Pareto shape parameter k gives us more information. Koopman, Shephard, and Creal (2009) also estimated the Pareto shape k, but they formed a hypothesis test whether the variance is finite and thus discretising the information in k, and assuming that finite variance is enough to get good performance.