Archive for importance sampling

unbiased product of expectations

Posted in Books, Statistics, University life on August 5, 2019 by xi'an

While I was not involved in any way, or even aware of this research, Anthony Lee, Simone Tiberi, and Giacomo Zanella have an incoming paper in Biometrika, partly written while all three authors were at the University of Warwick. The purpose is to design an efficient manner to approximate the product of n unidimensional expectations (or integrals) all computed against the same reference density, which is not a real constraint. A neat remark that motivates the method in the paper is that an improved estimator can be connected with the permanent of the n x N matrix A made of the values of the n functions computed at N different simulations from the reference density, an estimator that involves N!/(N-n)! terms rather than N to the power n. Since the permanent is NP-hard to compute, a manageable alternative uses random draws from constrained permutations that are reasonably easy to simulate. Especially since, the estimator recycling most of the particles, it requires a much smaller value of N, essentially N=O(n) in this scenario instead of O(n²) with the basic Monte Carlo solution, for a similar variance.
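For intuition about the recycling, here is a minimal sketch (in no way the constrained-permutation estimator of the paper) that draws a single pool of N simulations from the reference density and averages products over random injections; since each product only involves distinct, hence independent, draws, the estimator remains unbiased for the product of expectations:

```python
import numpy as np

rng = np.random.default_rng(0)

def product_of_expectations(funcs, sampler, N, n_draws=200):
    """Recycle one pool of N reference draws across the n factors: each
    random injection picks n *distinct* draws, so the resulting product of
    function values is unbiased for the product of expectations.
    (A minimal sketch, not the permanent-based estimator of the paper.)"""
    n = len(funcs)
    X = sampler(N)                                  # N draws from the reference density
    A = np.array([f(X) for f in funcs])             # n x N matrix of function values
    estimates = np.empty(n_draws)
    for t in range(n_draws):
        idx = rng.choice(N, size=n, replace=False)  # a random injection {1..n} -> {1..N}
        estimates[t] = np.prod(A[np.arange(n), idx])
    return estimates.mean()

# toy check: X ~ N(0,1) and f_i(x) = x^2 + i, so the target is prod_i (1 + i)
funcs = [lambda x, i=i: x**2 + i for i in range(1, 6)]
print(product_of_expectations(funcs, rng.standard_normal, N=50))  # unbiased, if noisy for small N
print(np.prod([1.0 + i for i in range(1, 6)]))                    # exact value, 720
```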

This framework offers many applications in latent variable models, including pseudo-marginal MCMC, of course, but also ABC, since the ABC posterior based on getting each simulated observation close enough to the corresponding actual observation fits this pattern (although the dependence on the chosen ordering of the data is an issue that can make the example somewhat artificial).

improved importance sampling via iterated moment matching

Posted in Statistics on August 1, 2019 by xi'an

Topi Paananen, Juho Piironen, Paul-Christian Bürkner and Aki Vehtari have recently arXived a work on constructing an adapted importance (sampling) distribution. The beginning is more a review than a new contribution, covering the earlier work by Vehtari, Gelman and Gabry (2017): estimating the Pareto rate for the importance weight distribution helps in assessing whether or not this distribution allows for a (necessary) second moment. In case it does not (seem to), the authors propose an affine transform of the importance distribution, using the earlier sample to match the first two moments of the distribution, or of the targeted function, an adaptation that is again controlled by the same Pareto rate technique. Anticipating a natural objection as to the poor performances of the earlier samples, the paper suggests using robust estimators of these moments, for instance via Pareto smoothing. It also suggests using multiple importance sampling as a way to regularise and robustify the estimates. While I buy the argument of fitting the target moments to achieve a better fit of the importance sampling, I remain unclear as to why an affine transform would change the (poor) tail behaviour of the importance sampler, hence why it would apply in full generality. An alternative could consist in finding appropriate Box-Cox transforms, although the difficulty would certainly increase with the dimension.
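As a toy rendering of the moment-matching idea (leaving aside the Pareto-k monitoring, the Pareto-smoothed moment estimates, and the multiple importance sampling of the paper, and with the caveat that the early iterations may suffer from the very weight instability the method is fighting), here is a sketch where a Gaussian proposal is repeatedly shifted and rescaled to match the self-normalised estimates of the target moments:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    # toy target: N(3, 2^2), known only up to a constant
    return -0.5 * ((x - 3.0) / 2.0) ** 2

def moment_matching_is(n_iter=5, n_draws=5_000):
    """Iterated affine moment matching for importance sampling, in sketch
    form: shift and rescale a Gaussian proposal so that its first two
    moments match the self-normalised estimates of the target moments.
    The paper adds Pareto-k control and robust (Pareto-smoothed) moment
    estimates on top of this basic loop."""
    mu, sigma = 0.0, 1.0                        # initial proposal N(0, 1)
    for _ in range(n_iter):
        x = mu + sigma * rng.standard_normal(n_draws)
        log_q = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma)
        log_w = log_target(x) - log_q           # log importance weights
        w = np.exp(log_w - log_w.max())
        w /= w.sum()                            # self-normalised weights
        mu = np.sum(w * x)                      # weighted first moment
        sigma = np.sqrt(np.sum(w * (x - mu) ** 2))  # weighted second moment
    return mu, sigma

print(moment_matching_is())                     # should move towards (3, 2)
```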

sampling and imbalanced

Posted in Statistics on June 21, 2019 by xi'an

Deborshee Sen, Matthias Sachs, Jianfeng Lu and David Dunson have recently arXived a sub-sampling paper for classification (logistic) models where some covariates or some responses are imbalanced. With a PDMP, namely zig-zag, used towards preserving the correct invariant distribution (as already mentioned in an earlier post on the zig-zag zampler and in a recent Annals paper by Joris Bierkens, Paul Fearnhead, and Gareth Roberts (Warwick)). The current paper is thus an improvement on the above, using (non-uniform) importance sub-sampling across observations and simpler upper bounds for the Poisson process, a rather practical form of Poisson thinning, and proposing unbiased estimates of the sub-sample log-posterior as well as stratified sub-sampling.
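As a reminder of what Poisson thinning amounts to, here is a generic sketch (not the sub-sampled bounds of the paper, where each velocity component of the zig-zag process comes with its own switching intensity):

```python
import numpy as np

rng = np.random.default_rng(2)

def next_event_by_thinning(rate, bound, t0=0.0):
    """Poisson thinning: simulate the first event of an inhomogeneous
    Poisson process with intensity rate(t) <= bound by proposing events
    from the homogeneous bounding process and accepting each with
    probability rate(t) / bound. In zig-zag, one such intensity drives
    the switching of each velocity component, and the point of the paper
    is to use cheap sub-sampled bounds."""
    t = t0
    while True:
        t += rng.exponential(1.0 / bound)       # candidate event from the bound
        if rng.uniform() < rate(t) / bound:     # thinning step
            return t

# toy intensity bounded by 2
print(next_event_by_thinning(lambda t: 1.0 + np.sin(t) ** 2, bound=2.0))
```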

I idly wondered if the zig-zag sampler could itself be improved by not switching the bouncing directions at random, since directions associated with almost certainly null coefficients should be neglected as much as possible, but the intensity functions associated with the directions already incorporate this feature. Except that they require computing the intensities for all directions, which becomes burdensome when facing many covariates.

Thinking of the logistic regression model itself, it is sort of frustrating that something so close to an exponential family causes so many headaches! Formally, it is an exponential family, but the normalising constant is rather unwieldy, especially when there are many observations and many covariates. The Pólya-Gamma completion is a way around this, but it proves highly costly when the dimension is large…

MCMC importance samplers for intractable likelihoods

Posted in Books, pictures, Statistics on May 3, 2019 by xi'an

Jordan Franks just posted on arXiv his PhD dissertation at the University of Jyväskylä, where he discusses several of his works:

  1. M. Vihola, J. Helske, and J. Franks. Importance sampling type estimators based on approximate marginal MCMC. Preprint arXiv:1609.02541v5, 2016.
  2. J. Franks and M. Vihola. Importance sampling correction versus standard averages of reversible MCMCs in terms of the asymptotic variance. Preprint arXiv:1706.09873v4, 2017.
  3. J. Franks, A. Jasra, K. J. H. Law and M. Vihola. Unbiased inference for discretely observed hidden Markov model diffusions. Preprint arXiv:1807.10259v4, 2018.
  4. M. Vihola and J. Franks. On the use of ABC-MCMC with inflated tolerance and post-correction. Preprint arXiv:1902.00412, 2019.

focusing on accelerated approximate MCMC (in the sense of pseudo-marginal MCMC) and delayed acceptance (as in our recently accepted paper), comparing delayed acceptance with MCMC importance sampling, to the advantage of the latter, and discussing the choice of the tolerance sequence for ABC-MCMC. (Although I did not gather from the thesis itself the target of the improvement discussed.)
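To illustrate the importance sampling correction at the core of the first entries, here is a bare-bones sketch (not the estimators of the papers, which cover pseudo-marginal and delayed-acceptance versions and compare the resulting asymptotic variances): a random-walk Metropolis chain targets a cheap approximation and its output is reweighted by the ratio of the target to the approximation.

```python
import numpy as np

rng = np.random.default_rng(3)

def is_corrected_mcmc(log_target, log_approx, n_iter=10_000, step=1.0):
    """Importance-sampling correction of an approximate MCMC run, in
    sketch form: a random-walk Metropolis chain targets the cheap
    approximation and its output is reweighted by target/approximation,
    here to estimate a posterior mean."""
    x, lp = 0.0, log_approx(0.0)
    xs, log_ws = [], []
    for _ in range(n_iter):
        y = x + step * rng.standard_normal()
        lq = log_approx(y)
        if np.log(rng.uniform()) < lq - lp:     # Metropolis step on the approximation
            x, lp = y, lq
        xs.append(x)
        log_ws.append(log_target(x) - log_approx(x))  # correction weight
    xs, log_ws = np.array(xs), np.array(log_ws)
    w = np.exp(log_ws - log_ws.max())
    return np.sum(w * xs) / w.sum()             # self-normalised estimate of E[X]

# toy case: exact target N(1, 1), approximation N(0, 1.5^2)
log_target = lambda x: -0.5 * (x - 1.0) ** 2
log_approx = lambda x: -0.5 * (x / 1.5) ** 2
print(is_corrected_mcmc(log_target, log_approx))    # close to 1
```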

did variational Bayes work?

Posted in Books, Statistics on May 2, 2019 by xi'an

An interesting ICML 2018 paper by Yuling Yao, Aki Vehtari, Daniel Simpson, and Andrew Gelman I missed last summer on [the fairly important issue of] assessing the quality, or lack thereof, of a variational Bayes approximation, in the sense of it being near enough to the true posterior. The criterion that they propose in this paper relates to the Pareto smoothed importance sampling technique discussed in an earlier post, which I remember discussing with Andrew when he visited CREST a few years ago. The truncation of the importance weights of prior x likelihood / VB approximation avoids infinite variance issues but induces an unknown amount of bias. The resulting diagnostic is based on the estimation of the Pareto order k: if the true value of k is less than ½, the variance of the associated Pareto distribution is finite. The paper suggests concluding in favour of the variational approximation when the estimate of k is less than 0.7, based on the empirical assessment of the earlier paper. The paper also contains a remark on the poor performances of the generalisation of this method to marginal settings, that is, when the importance weight is the ratio of the true and variational marginals for a sub-vector of interest. I find these counter-performances somewhat worrying in that Rao-Blackwellisation arguments make me prefer marginal ratios to joint ratios. It may however be due to a poor approximation of the marginal ratio that reflects on the approximation and not on the ratio itself. A second proposal in the paper focuses solely on the point estimate returned by the variational Bayes approximation, testing that the posterior predictive is well-calibrated. This is less appealing, especially when the authors point out that the “disadvantage is that this diagnostic does not cover the case where the observed data is not well represented by the model”, in other words, misspecified situations. This potential misspecification could presumably be tested by comparing the Pareto fit based on the actual data with a Pareto fit based on simulated data. Among other deficiencies, they point out that this is “a local diagnostic that will not detect unseen modes”. In other words, what you get is what you see.
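For the record, a crude rendering of the k-hat diagnostic, using a plain Hill estimate of the tail shape of the importance ratios instead of the Pareto smoothed fit of Vehtari et al. (so only a stand-in for the actual procedure), with a made-up poor and a made-up decent "variational" fit of a Gaussian target:

```python
import numpy as np

rng = np.random.default_rng(4)

def khat(log_p, log_q, draws, tail_frac=0.1):
    """Crude tail-shape diagnostic for an approximation q of a target p:
    a Hill estimate of the Pareto shape k of the importance ratios p/q,
    computed on the largest weights. This only illustrates the k < 0.7
    rule of thumb; the paper relies on the full PSIS fit."""
    log_w = log_p(draws) - log_q(draws)
    w = np.sort(np.exp(log_w - log_w.max()))    # normalised for stability
    m = max(5, int(tail_frac * len(w)))         # number of tail weights used
    tail = w[-m:]
    return np.mean(np.log(tail / tail[0]))      # Hill estimator of the shape

draws = rng.standard_normal(10_000)             # draws from q = N(0, 1)
log_q = lambda x: -0.5 * x ** 2
log_p_far = lambda x: -0.5 * (x / 5.0) ** 2     # much wider target: q is poor
log_p_ok = lambda x: -0.5 * ((x - 0.5) / 0.9) ** 2  # close target: q is fine
print(khat(log_p_far, log_q, draws))            # well above 0.7
print(khat(log_p_ok, log_q, draws))             # well below 0.7
```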

Gibbs clashes with importance sampling

Posted in pictures, Statistics on April 11, 2019 by xi'an

In an X validated question, an interesting proposal was made: at each (component-wise) step of a Gibbs sampler, replace simulation from the exact full conditional with simulation from an alternate density and weight the resulting simulation with a term made of a product of (a) the previous weight, (b) the ratio of the true conditional over the substitute at the new value, and (c) the inverse ratio at the earlier value of the same component. Which does not work, for several reasons:

  1. the reweighting is doomed by its very propagation in that it keeps multiplying ratios with expectation one, which means an almost sure degeneracy of the weights;
  2. the weights are computed for a previous value that has not been generated from the same proposal and is anyway already properly weighted;
  3. due to the change in dimension produced by Gibbs, the actual target is the full conditional, which involves an intractable normalising constant;
  4. there is no guarantee for the weights to have finite variance, esp. when the proposal has thinner tails than the target.

as can be readily checked by a quick simulation experiment, as sketched below. The funny thing is that a proper importance weight can be constructed when envisioning the sequence of Gibbs steps as a Metropolis proposal (in the dimension of the target).
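Here is a minimal version of such an experiment, reduced to the degeneracy of point 1 (not a full implementation of the flawed weighted Gibbs scheme): multiplying mean-one density ratios over the sweeps, with an arbitrary pair of Gaussian densities standing in for the true conditional and its substitute, collapses the effective sample size.

```python
import numpy as np

rng = np.random.default_rng(5)

n_particles, n_sweeps = 1_000, 100
log_w = np.zeros(n_particles)
for _ in range(n_sweeps):
    x = rng.standard_normal(n_particles)        # draws from p = N(0, 1)
    # ratio q/p with q = N(0, 0.8^2), of expectation one under p
    log_w += -0.5 * (x / 0.8) ** 2 - np.log(0.8) + 0.5 * x ** 2
w = np.exp(log_w - log_w.max())
ess = w.sum() ** 2 / (w ** 2).sum()             # effective sample size
print(f"ESS after {n_sweeps} sweeps: {ess:.1f} out of {n_particles}")
```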

Bayesian inference with intractable normalizing functions

Posted in Books, Statistics on December 13, 2018 by xi'an

In the latest September issue of JASA I received a few days ago, I spotted a review paper by Jaewoo Park & Murali Haran on intractable normalising constants Z(θ). There have been many proposals for solving this problem, as well as several surveys, some conferences, and even a book. The current survey focuses on MCMC solutions, from auxiliary variable approaches to likelihood approximation algorithms (albeit without ABC entries, even though the 2006 auxiliary variable solutions of Møller et al. and of Murray et al. do simulate pseudo-observations and hence…). This includes the MCMC approximations to auxiliary sampling proposed by Faming Liang and co-authors across several papers. And the paper Yves Atchadé, Nicolas Lartillot and I wrote ten years ago on an adaptive MCMC targeting Z(θ) and using stochastic approximation à la Wang-Landau. Park & Haran stress the relevance of using sufficient statistics in this approach towards fighting computational costs, which makes me wonder if an ABC version could be envisioned. The paper also includes pseudo-marginal techniques like Russian Roulette (once spelled Roullette) and noisy MCMC as proposed in Alquier et al. (2016). These methods are compared on three examples: (1) the Ising model, (2) a social network model, the Florentine business dataset used in our original paper, and a larger one where most methods prove too costly, and (3) an attraction-repulsion point process model. In conclusion, an interesting survey, taking care to spell out the calibration requirements and the theoretical validation, albeit of course depending on the chosen benchmarks.
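As a reminder of the auxiliary variable approach, here is a sketch of one exchange-type move in the spirit of Murray et al. (2006), under the simplifying assumptions of a flat prior, a symmetric random-walk proposal, and an available perfect sampler from the model, plus a toy Normal example where the normalising constant is pretended unknown:

```python
import numpy as np

rng = np.random.default_rng(6)

def exchange_step(theta, y, log_unnorm, simulate, step=0.5):
    """One exchange-type move: an auxiliary dataset simulated from the
    model at the proposed value cancels the intractable normalising
    constants Z(θ) in the Metropolis ratio. Flat prior, symmetric
    random-walk proposal and a perfect sampler simulate(θ) are all
    assumed in this sketch."""
    theta_prop = theta + step * rng.standard_normal()
    y_aux = simulate(theta_prop)                # pseudo-observations at θ'
    log_alpha = (log_unnorm(y, theta_prop) - log_unnorm(y, theta)
                 + log_unnorm(y_aux, theta) - log_unnorm(y_aux, theta_prop))
    return theta_prop if np.log(rng.uniform()) < log_alpha else theta

# toy usage: pretend the N(θ, 1) normalising constant is unknown
log_unnorm = lambda y, th: -0.5 * (y - th) ** 2
simulate = lambda th: th + rng.standard_normal()
theta, y_obs, chain = 0.0, 1.3, []
for _ in range(5_000):
    theta = exchange_step(theta, y_obs, log_unnorm, simulate)
    chain.append(theta)
print(np.mean(chain))       # posterior mean close to y_obs under the flat prior
```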