Archive for AMIS

online approximate Bayesian learning

Posted in Statistics on September 25, 2020 by xi'an

My friends and coauthors Matthieu Gerber and Randal Douc have just arXived a massive paper on online approximate Bayesian learning, namely the handling of the posterior distribution on the parameters of a state-space model, which remains a challenge to this day… Their starting point is the iterated batch importance sampling (IBIS) algorithm that Nicolas introduced in his PhD thesis (Chopin, 2002). The online (“by online we mean that the memory and computational requirement to process each observation is finite and bounded uniformly in t”) method they construct comes with the guarantee that the approximate posterior converges to the (pseudo-)true value of the parameter as the sample size grows to infinity, where the sequence of approximations is a Cesàro mixture of initial approximations with Gaussian or t priors, AMIS-like. (I am somewhat uncertain about the notion of a sequence of priors used in this setup. Another funny feature is the necessity to consider a fat-tailed t prior from time to time in this sequence!) The sequence is in turn approximated by a particle filter. The computational cost of this IBIS is roughly in O(NT), depending on the regeneration rate.
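To make the IBIS starting point concrete, here is a minimal sketch on a toy model, assuming i.i.d. N(θ,1) observations with a N(0,10²) prior rather than a state-space model. It is not the algorithm of the paper: the model, the function names, the ESS threshold and the single random-walk Metropolis move are all assumptions made for illustration. The move step revisits all past observations, which is precisely why plain IBIS is not online in the sense quoted above.

```python
import math
import random


def log_post(theta, ys, prior_sd):
    """Unnormalised log-posterior: N(0, prior_sd^2) prior, N(theta, 1) observations."""
    return -0.5 * (theta / prior_sd) ** 2 - 0.5 * sum((y - theta) ** 2 for y in ys)


def ibis(ys, n_particles=500, prior_sd=10.0, seed=1):
    """Toy IBIS: reweight at each observation, resample and move when the ESS is low."""
    rng = random.Random(seed)
    thetas = [rng.gauss(0.0, prior_sd) for _ in range(n_particles)]
    logw = [0.0] * n_particles
    for t, y in enumerate(ys, start=1):
        # Reweight each particle by the incremental likelihood of y_t.
        logw = [lw - 0.5 * (y - th) ** 2 for lw, th in zip(logw, thetas)]
        top = max(logw)
        w = [math.exp(lw - top) for lw in logw]
        ess = sum(w) ** 2 / sum(v * v for v in w)
        if ess < n_particles / 2:
            # Resample, then rejuvenate with one random-walk Metropolis step
            # targeting the posterior given y_1..y_t (this uses all past data).
            thetas = rng.choices(thetas, weights=w, k=n_particles)
            logw = [0.0] * n_particles
            mean = sum(thetas) / n_particles
            scale = math.sqrt(sum((th - mean) ** 2 for th in thetas) / n_particles)
            past = ys[:t]
            for j, th in enumerate(thetas):
                prop = th + scale * rng.gauss(0.0, 1.0)
                diff = log_post(prop, past, prior_sd) - log_post(th, past, prior_sd)
                if rng.random() < math.exp(min(0.0, diff)):
                    thetas[j] = prop
    top = max(logw)
    w = [math.exp(lw - top) for lw in logw]
    return sum(wi * th for wi, th in zip(w, thetas)) / sum(w)


data_rng = random.Random(0)
ys = [data_rng.gauss(1.0, 1.0) for _ in range(100)]
estimate = ibis(ys)
# Conjugate posterior mean, available in closed form for this toy model.
exact_mean = sum(ys) / (len(ys) + 1.0 / 10.0 ** 2)
print(estimate, exact_mean)
```

In this conjugate setting the particle estimate of the posterior mean can be compared with the exact value, which is the only reason for choosing such a simple model.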

one bridge further

Posted in Books, R, Statistics, University life on June 30, 2020 by xi'an

Jackie Wong, Jon Forster (Warwick) and Peter Smith have just published a paper in Statistics & Computing on bridge sampling bias and improvement by splitting.

“… known to be asymptotically unbiased, bridge sampling technique produces biased estimates in practical usage for small to moderate sample sizes (…) the estimator yields positive bias that worsens with increasing distance between the two distributions. The second type of bias arises when the approximation density is determined from the posterior samples using the method of moments, resulting in a systematic underestimation of the normalizing constant.”

Recall that bridge sampling is based on a double trick with two samples x and y from two (unnormalised) densities f and g that are swapped in a ratio

\frac{m \sum_{i=1}^n g(x_i)\omega(x_i)}{n \sum_{i=1}^m f(y_i)\omega(y_i)}

of unbiased estimators of the inverse normalising constants. Hence biased. All the more so the less similar these two densities are. Special cases for ω include importance sampling [unbiased] and reciprocal importance sampling. Since the optimal version of the bridge weight ω is the inverse of the mixture of f and g, it makes me wonder at the performance of using both samples top and bottom, since as an aggregated sample, they also come from the mixture, as in the multiple importance sampler of Owen & Zhou (2000). However, a quick try with a positive Normal versus an Exponential with rate 2 does not show an improvement in using both samples top and bottom (even when using the perfectly normalised versions)


at least in terms of bias… Surprisingly (!) the bias almost vanishes for very different sample sizes either in favour of f or in favour of g. This may be a form of genuine defensive sampling, who knows?! At the very least, this ensures a finite variance for all weights. (The splitting approach introduced in the paper is a natural solution to create independence between the first sample and the second density. This reminded me of our two parallel chains in AMIS.)
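For readers who want to play with the ratio above, here is a minimal sketch with the same pair of densities, a positive Normal for f and an Exponential with rate 2 for g. It is not the experiment reported in this post: the geometric bridge ω = 1/√(fg), the sample sizes and the seed are assumptions chosen for illustration, and the optimal (mixture) bridge is not implemented since it requires the unknown ratio.

```python
import math
import random


def f(x):
    """Unnormalised positive (half-)Normal density on x > 0."""
    return math.exp(-0.5 * x * x)


def g(x):
    """Unnormalised Exponential(rate 2) density on x > 0."""
    return math.exp(-2.0 * x)


def geometric_bridge(x):
    """One possible bridge weight (an assumption of this sketch): 1 / sqrt(f g)."""
    return 1.0 / math.sqrt(f(x) * g(x))


def bridge_ratio(xs, ys, omega):
    """Average of g*omega over the f-sample divided by average of f*omega over the g-sample."""
    top = sum(g(x) * omega(x) for x in xs) / len(xs)
    bottom = sum(f(y) * omega(y) for y in ys) / len(ys)
    return top / bottom


rng = random.Random(42)
n = m = 20000
xs = [abs(rng.gauss(0.0, 1.0)) for _ in range(n)]   # sample from f
ys = [rng.expovariate(2.0) for _ in range(m)]       # sample from g
estimate = bridge_ratio(xs, ys, geometric_bridge)
# Ratio of normalising constants Z_g / Z_f = (1/2) / sqrt(pi/2).
truth = 0.5 / math.sqrt(math.pi / 2.0)
print(estimate, truth)
```

Changing n and m, or swapping in another ω, gives a quick way to look at how the bias of the ratio behaves as the two sample sizes become unbalanced.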

sequential neural likelihood estimation as ABC substitute

Posted in Books, Kids, Statistics, University life on May 14, 2020 by xi'an

A JMLR paper by Papamakarios, Sterratt, and Murray (Edinburgh), first presented at the AISTATS 2019 meeting, on a new form of likelihood-free inference, away from non-zero tolerance and from the distance-based versions of ABC, following earlier papers by Iain Murray and co-authors in the same spirit. Which I got pointed to during the ABC workshop in Vancouver. At the time I had no idea as to what autoregressive flows meant. We were supposed to hold a reading group in Paris-Dauphine on this paper last week, unfortunately cancelled as a coronaviral precaution… Here are some notes I had prepared for the meeting that did not take place.

“A simulator model is a computer program, which takes a vector of parameters θ, makes internal calls to a random number generator, and outputs a data vector x.”

Just the usual generative model then.

“A conditional neural density estimator is a parametric model q(.|φ) (such as a neural network) controlled by a set of parameters φ, which takes a pair of datapoints (u,v) and outputs a conditional probability density q(u|v,φ).”

Less usual, in that the outcome is guaranteed to be a probability density.

“For its neural density estimator, SNPE uses a Mixture Density Network, which is a feed-forward neural network that takes x as input and outputs the parameters of a Gaussian mixture over θ.”

In which theoretical sense would it improve upon classical or Bayesian density estimators? Where are the error evaluation, the optimal rates, the sensitivity to the dimension of the data? of the parameter?

“Our new method, Sequential Neural Likelihood (SNL), avoids the bias introduced by the proposal, by opting to learn a model of the likelihood instead of the posterior.”

I do not get the argument in that the final outcome (of using the approximation within an MCMC scheme) remains biased since the likelihood is not the exact likelihood. Where is the error evaluation? Note that in the associated Algorithm 1, the learning set is enlarged on each round, as in AMIS, rather than set back to the empty set ∅ on each round.

“…given enough simulations, a sufficiently flexible conditional neural density estimator will eventually approximate the likelihood in the support of the proposal, regardless of the shape of the proposal. In other words, as long as we do not exclude parts of the parameter space, the way we propose parameters does not bias learning the likelihood asymptotically. Unlike when learning the posterior, no adjustment is necessary to account for our proposing strategy.”

This is a rather vague statement, with the only support being that the Monte Carlo approximation to the Kullback-Leibler divergence does converge to its actual value, i.e. a direct application of the Law of Large Numbers! But an interesting point I informally made a (long) while ago is that all that matters is the estimate of the density at x⁰. Or at the value of the statistic at x⁰. The masked auto-encoder density estimator is based on a sequence of bijections with a lower-triangular Jacobian matrix, meaning the conditional density estimate is available in closed form. Which makes it sound like a form of neurotic variational Bayes solution.
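The quoted claim about proposals can be checked on a toy case. The sketch below assumes a simulator x|θ ~ N(θ,1) and replaces the neural density estimator with a linear-Gaussian conditional model q(x|θ) = N(a + bθ, s²) fitted by maximum likelihood; both are assumptions made to keep the example short, and the conditional family is correctly specified here, which is the favourable case. The same conditional fit is run under two different proposals on θ.

```python
import random


def simulate(theta, rng):
    """Toy simulator (an assumption of this sketch): x | theta ~ N(theta, 1)."""
    return rng.gauss(theta, 1.0)


def fit_gaussian_conditional(thetas, xs):
    """Maximum-likelihood fit of q(x | theta) = N(a + b*theta, s2), i.e. least squares."""
    n = len(thetas)
    mean_t = sum(thetas) / n
    mean_x = sum(xs) / n
    s_tt = sum((t - mean_t) ** 2 for t in thetas)
    s_tx = sum((t - mean_t) * (x - mean_x) for t, x in zip(thetas, xs))
    b = s_tx / s_tt
    a = mean_x - b * mean_t
    s2 = sum((x - a - b * t) ** 2 for t, x in zip(thetas, xs)) / n
    return a, b, s2


rng = random.Random(0)
n = 5000
proposals = {
    "standard Normal": lambda: rng.gauss(0.0, 1.0),
    "Uniform(-1, 3)": lambda: rng.uniform(-1.0, 3.0),
}
fits = {}
for name, draw in proposals.items():
    thetas = [draw() for _ in range(n)]
    xs = [simulate(t, rng) for t in thetas]
    fits[name] = fit_gaussian_conditional(thetas, xs)
    print(name, fits[name])
```

Both fits should land near (a, b, s²) = (0, 1, 1), with the obvious caveat from the quote: nothing is learned about the likelihood outside the support of the proposal, and a misspecified conditional family would make the fit depend on where the θ's are drawn.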

The paper also links with ABC (too costly?), other parametric approximations to the posterior (like Gaussian copulas and variational likelihood-free inference), synthetic likelihood, Gaussian processes, noise contrastive estimation… With experiments involving some of the above. But the experiments involve rather smooth models with relatively few parameters.

“A general question is whether it is preferable to learn the posterior or the likelihood (…) Learning the likelihood can often be easier than learning the posterior, and it does not depend on the choice of proposal, which makes learning easier and more robust (…) On the other hand, methods such as SNPE return a parametric model of the posterior directly, whereas a further inference step (e.g. variational inference or MCMC) is needed on top of SNL to obtain a posterior estimate”

A fair point in the conclusion. Which also mentions the curse of dimensionality (both for parameters and observations) and the possibility to work directly with summaries.

Getting back to the earlier and connected Masked autoregressive flow for density estimation paper, by Papamakarios, Pavlakou and Murray:

“Viewing an autoregressive model as a normalizing flow opens the possibility of increasing its flexibility by stacking multiple models of the same type, by having each model provide the source of randomness for the next model in the stack. The resulting stack of models is a normalizing flow that is more flexible than the original model, and that remains tractable.”

Which makes it sound like a sort of a neural network in the density space. Optimised by Kullback-Leibler minimisation to get asymptotically close to the likelihood. But a form of Bayesian indirect inference in the end, namely an MLE on a pseudo-model, using the estimated model as a proxy in Bayesian inference…
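As a reminder of why the stack remains tractable, here is a minimal two-dimensional sketch of two stacked affine autoregressive layers, with a swap of the coordinates in between. The shift and log-scale functions are hand-picked rather than learned by a masked network, so this is only an illustration of the change of variables, not the MAF of the paper; all function names are assumptions. Each layer has a lower-triangular Jacobian, so the log-density is the base Normal log-density plus the sum of the log-diagonal terms.

```python
import math


def affine_ar_layer(v, shift, log_scale):
    """Leave v[0] unchanged and standardise v[1] given v[0].

    The Jacobian is lower-triangular with diagonal (1, exp(-log_scale(v[0]))),
    so its log-determinant is simply -log_scale(v[0]).
    """
    s = log_scale(v[0])
    out = (v[0], (v[1] - shift(v[0])) * math.exp(-s))
    return out, -s


def log_density(x):
    """Closed-form log-density of the two-layer stack with a standard Normal base."""
    u, logdet1 = affine_ar_layer(x, math.sin, lambda t: 0.5 * math.tanh(t))
    u = (u[1], u[0])  # swap coordinates between layers (unit Jacobian)
    z, logdet2 = affine_ar_layer(u, lambda t: 0.5 * t, lambda t: 0.3 * math.cos(t))
    log_base = -0.5 * (z[0] ** 2 + z[1] ** 2) - math.log(2.0 * math.pi)
    return log_base + logdet1 + logdet2


# Midpoint-rule check that the resulting density integrates to (about) one.
half_width = 12.0
k = 240
h = 2.0 * half_width / k
total = 0.0
for i in range(k):
    for j in range(k):
        x = (-half_width + (i + 0.5) * h, -half_width + (j + 0.5) * h)
        total += math.exp(log_density(x))
total *= h * h
print(total)
```

The numerical integral is only a sanity check that the stacked transform still defines a proper density; in the actual method the shifts and scales are network outputs optimised by Kullback-Leibler minimisation.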

nested sampling via SMC

Posted in Books, pictures, Statistics on April 2, 2020 by xi'an

“We show that by implementing a special type of [sequential Monte Carlo] sampler that takes two importance sampling paths at each iteration, one obtains an analogous SMC method to [nested sampling] that resolves its main theoretical and practical issues.”

A paper by Queenslanders Robert Salomone, Leah South, Chris Drovandi and Dirk Kroese that I had missed (and that Grégoire recovered after we discussed this possibility with our Master students). On using SMC in nested sampling. What are the difficulties mentioned in the above quote?

  1. Dependence between the simulated samples, since only the offending particle is moved by one or several MCMC steps. (And MultiNest is not a foolproof solution.)
  2. The error due to quadrature is hard to evaluate, with parallelised versions aggravating the error.
  3. There is a truncation error due to the stopping rule when the exact maximum of the likelihood function is unknown.

Not mentioning the Monte Carlo error, of course, which should remain at the √n level.
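To see where the quadrature and truncation terms enter, here is a minimal sketch of plain nested sampling (not the SMC version of the paper) on a one-dimensional toy problem, assuming a Uniform(0,1) prior and a likelihood exp(-θ²/2σ²) that decreases in θ. With this choice the constrained prior can be simulated exactly, so the first difficulty (dependence from MCMC moves) does not arise; the model, the number of live points and the fixed number of iterations are all assumptions.

```python
import math
import random

SIGMA = 0.1


def likelihood(theta):
    """Toy likelihood (an assumption of this sketch), decreasing in theta on (0, 1)."""
    return math.exp(-0.5 * (theta / SIGMA) ** 2)


def nested_sampling(n_live=500, n_iter=5000, seed=3):
    """Plain nested sampling with the deterministic prior masses X_i = exp(-i / n_live)."""
    rng = random.Random(seed)
    live = [rng.random() for _ in range(n_live)]
    z, x_prev = 0.0, 1.0
    for i in range(1, n_iter + 1):
        # The lowest likelihood corresponds to the largest theta.
        worst = live.index(max(live))
        x = math.exp(-i / n_live)
        # Quadrature step: likelihood level times the (approximate) shell mass.
        z += likelihood(live[worst]) * (x_prev - x)
        x_prev = x
        # Exact draw from the prior constrained to a higher likelihood.
        live[worst] = rng.uniform(0.0, live[worst])
    # Truncation term: remaining prior mass times the average live likelihood.
    z += x_prev * sum(likelihood(t) for t in live) / n_live
    return z


estimate = nested_sampling()
# Exact evidence: integral of the likelihood over (0, 1).
truth = SIGMA * math.sqrt(math.pi / 2.0) * math.erf(1.0 / (SIGMA * math.sqrt(2.0)))
print(estimate, truth)
```

The deterministic masses exp(-i/N) stand in for the random shrinkage of the prior volume, which is the source of the quadrature error, and the last line of the function is the stopping-rule correction whose size is unknown when the maximum of the likelihood is not available.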

“Nested Sampling is a special type of adaptive SMC algorithm, where weights are assigned in a suboptimal way.”

The above remark is somewhat obvious for a fixed sequence of likelihood levels and a set of particles at each (ring) level, moved by a Markov kernel with the right stationary target. Constrained to move within the ring, which may prove delicate in complex settings. Such a non-adaptive version is however not realistic and hence both the level sets and the stopping rule need be selected from the existing simulation, respectively as a quantile of the observed likelihood and as a failure to modify the evidence approximation, an adaptation that is a Catch 22, as we already found in the AMIS paper. (AMIS stands for adaptive multiple importance sampling.) To escape the quandary, the authors use both an auxiliary variable (to avoid atoms) and two importance sampling sequences (as in AMIS). And only a single particle with non-zero incremental weight for the (upper level) target. As the full details are a bit fuzzy to me, I hope I can experiment with my (quarantined) students on the full implementation of the method.

“Such cases asides, the question whether SMC is preferable using the TA or NS approach is really one of whether it is preferable to sample (relatively) easy distributions subject to a constraint or to sample potentially difficult distributions.”

A question (why not regular SMC?) I was indeed considering until coming to the conclusion section but did not find it treated in the paper. There is little discussion on the computing requirements either, as it seems the method is more time-consuming than a regular nested sampler. (On the personal side, I appreciated very much their “special thanks to Christian Robert, whose many blog posts on NS helped influence this work, and played a large part in inspiring it.”)

a new rule for adaptive importance sampling

Posted in Books, Statistics on March 5, 2019 by xi'an

Art Owen and Yi Zhou have arXived a short paper on the combination of importance sampling estimators. Which connects somehow with the talk about multiple estimators I gave at ESM last year in Helsinki. And our earlier AMIS combination. The paper however makes two important assumptions to reach optimal weighting, which is inversely proportional to the variance:

  1. the estimators are uncorrelated if dependent;
  2. the variance of the k-th estimator is of order a (negative) power of k.

The latter is puzzling when considering a series of estimators, in that k appears to act as a sample size (as in AMIS), the power is usually unknown but also there is no reason for the power to be the same for all estimators. The authors propose to use ½ as the default, both because this is the standard Monte Carlo rate and because the loss in variance is then minimal, being 12% larger.
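The size of the loss can be checked numerically under the two assumptions above. The sketch below assumes uncorrelated estimators with Var(μ̂ₖ) = σ²k⁻ʸ for an unknown y in [0,1] (this parametrisation and the function name are assumptions of the sketch) and compares weights proportional to kˣ with the optimal weights proportional to kʸ. The variance of the weighted average is σ²Σk²ˣ⁻ʸ/(Σkˣ)², against σ²/Σkʸ for the optimal choice.

```python
def variance_ratio(y, x=0.5, n_stages=100_000):
    """Variance with weights k**x divided by the variance with optimal weights k**y.

    Assumes uncorrelated estimators with Var(mu_k) proportional to k**(-y).
    """
    ks = range(1, n_stages + 1)
    sum_w = sum(k ** x for k in ks)
    sum_w2_var = sum(k ** (2 * x - y) for k in ks)
    sum_opt = sum(k ** y for k in ks)
    return sum_w2_var * sum_opt / sum_w ** 2


for y in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(y, variance_ratio(y))
```

For a large number of stages the ratio behaves like 2.25/((2−y)(1+y)), which equals one at y = ½ and is largest at the endpoints y = 0 and y = 1, where it reaches 9/8 under these assumptions.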

As an aside, Art Owen also wrote an invited discussion, “the unreasonable effectiveness of Monte Carlo”, of “Probabilistic Integration: A Role in Statistical Computation?” by François-Xavier Briol, Chris Oates, Mark Girolami (Warwick), Michael Osborne and Dino Sejdinovic, to appear in Statistical Science, a discussion that contains a wealth of smart and enlightening remarks. Like the analogy between pseudo-random number generators [which work unreasonably well!] versus true random numbers and Bayesian numerical integration versus non-random functions. Or the role of advanced bootstrapping when assessing the variability of Monte Carlo estimates (citing a paper of his from 1992). Also pointing to an intriguing MCMC paper by Michael Lavine and Jim Hodges to appear in The American Statistician.