## nested sampling via SMC

Posted in Books, pictures, Statistics with tags , , , , , , , , , , , , on April 2, 2020 by xi'an

“We show that by implementing a special type of [sequential Monte Carlo] sampler that takes two im-portance sampling paths at each iteration, one obtains an analogous SMC method to [nested sampling] that resolves its main theoretical and practical issues.”

A paper by Queenslander Robert Salomone, Leah South, Chris Drovandi and Dirk Kroese that I had missed (and recovered by Grégoire after we discussed this possibility with our Master students). On using SMC in nested sampling. What are the difficulties mentioned in the above quote?

1. Dependence between the simulated samples, since only the offending particle is moved by one or several MCMC steps. (And MultiNest is not a foolproof solution.)
2. The error due to quadrature is hard to evaluate, with parallelised versions aggravating the error.
3. There is a truncation error due to the stopping rule when the exact maximum of the likelihood function is unknown.

Not mentioning the Monte Carlo error, of course, which should remain at the √n level.

“Nested Sampling is a special type of adaptive SMC algorithm, where weights are assigned in a suboptimal way.”

The above remark is somewhat obvious for a fixed sequence of likelihood levels and a set of particles at each (ring) level. moved by a Markov kernel with the right stationary target. Constrained to move within the ring, which may prove delicate in complex settings. Such a non-adaptive version is however not realistic and hence both the level sets and the stopping rule need be selected from the existing simulation, respectively as a quantile of the observed likelihood and as a failure to modify the evidence approximation, an adaptation that is a Catch 22! as we already found in the AMIS paper.  (AMIS stands for adaptive mixture importance sampling.) To escape the quandary, the authors use both an auxiliary variable (to avoid atoms) and two importance sampling sequences (as in AMIS). And only a single particle with non-zero incremental weight for the (upper level) target. As the full details are a bit fuzzy to me, I hope I can experiment with my (quarantined) students on the full implementation of the method.

“Such cases asides, the question whether SMC is preferable using the TA or NS approach is really one of whether it is preferable to sample (relatively) easy distributions subject to a constraint or to sample potentially difficult distributions.”

A question (why not regular SMC?) I was indeed considering until coming to the conclusion section but did not find it treated in the paper. There is little discussion on the computing requirements either, as it seems the method is more time-consuming than a regular nested sample. (On the personal side,  I appreciated very much their “special thanks to Christian Robert, whose many blog posts on NS helped influence this work, and played a large partin inspiring it.”)

## ensemble rejection sampling

Posted in Statistics with tags , , , on March 25, 2020 by xi'an

George Deligiannidis, Arnaud Doucet and Sylvain Rubenthaler have constructed a form of Rao-Blackwellised estimate out of a regular rejection sampler. Doubly surprisingly as turning importance sampling into regular sampling plus  gaining over the standard accept-reject estimate. They call their approach ensemble rejection sampling. This is done by seeing the N-sample created from the proposal as an importance sampler, exploiting the importance weights towards estimating the (intractable) normalising constant of the target density, and creating an upper bound on this estimate Ẑ. That depends on the current value X from the N-sample under consideration for acceptance as

Z⁺=Ẑ+{max(w)-w(X)}/N

with a probability Ẑ/Z⁺ to accept X. The amazing result is that the X thus marginaly produced is distributed from the target! Meaning that this is a case for a self-normalised importance sampling distribution producing an exact simulation from the target. While this cannot produce an iid sample, it can be exploited to produce unbiased estimators of expectations under the target. Without even resampling and at a linear cost in the sample size N.

The method can be extended to the dynamic (state-space) case. At a cost of O(N²T) as first observed by Radford Neal. However, the importance sample seems to be distributed from a product of proposals that do not account for the previous particles. But maybe accounting for the observations. While the result involves upper bounds on the dynamic importance weights, the capacity to deliver exact simulations remains a major achievement, in my opinion.

## Hastings at 50, from a Metropolis

Posted in Kids, pictures, Running, Travel with tags , , , , , , , , , , , , , , , , , , , , , , on January 4, 2020 by xi'an

A weekend trip to the quaint seaside city of Le Touquet Paris-Plage, facing the city of Hastings on the other side of the Channel, 50 miles away (and invisible on the pictures!), during and after a storm that made for a fantastic watch from our beach-side rental, if less for running! The town is far from being a metropolis, actually, but it got its added surname “Paris-Plage” from British investors who wanted to attract their countrymen in the late 1800s. The writers H.G. Wells and P.G. Wodehouse lived there for a while. (Another type of tourist, William the Conqueror, left for Hastings in 1066 from a wee farther south, near Saint-Valéry-sur-Somme.)

And the coincidental on-line publication in Biometrika of a 50 year anniversary paper, The Hastings algorithm at fifty by David Dunson and James Johndrow. More of a celebration than a comprehensive review, with focus on scalable MCMC, gradient based algorithms, Hamiltonian Monte Carlo, nonreversible Markov chains, and interesting forays into approximate Bayes. Which makes for a great read for graduate students and seasoned researchers alike!

## sampling-importance-resampling is not equivalent to exact sampling [triste SIR]

Posted in Books, Kids, Statistics, University life with tags , , , , , , on December 16, 2019 by xi'an

Following an X validated question on the topic, I reassessed a previous impression I had that sampling-importance-resampling (SIR) is equivalent to direct sampling for a given sample size. (As suggested in the above fit between a N(2,½) target and a N(0,1) proposal.)  Indeed, when one produces a sample

$x_1,\ldots,x_n \stackrel{\text{i.i.d.}}{\sim} g(x)$

and resamples with replacement from this sample using the importance weights

$f(x_1)g(x_1)^{-1},\ldots,f(x_n)g(x_n)^{-1}$

the resulting sample

$y_1,\ldots,y_n$

is neither “i.” nor “i.d.” since the resampling step involves a self-normalisation of the weights and hence a global bias in the evaluation of expectations. In particular, if the importance function g is a poor choice for the target f, meaning that the exploration of the whole support is imperfect, if possible (when both supports are equal), a given sample may well fail to reproduce the properties of an iid example ,as shown in the graph below where a Normal density is used for g while f is a Student t⁵ density:

## Mallows model with intractable constant

Posted in Books, pictures, Statistics with tags , , , , , , , , on November 21, 2019 by xi'an

The paper Probabilistic Preference Learning with the Mallows Rank Model by Vitelli et al. was published last year in JMLR which may be why I missed it. It brings yet another approach to the perpetual issue of intractable  normalising constants. Here, the data is made of rankings of n objects by N experts, with an assumption of a latent ordering ρ acting as “mean” in the Mallows model. Along with a scale α, both to be estimated, and indeed involving an intractable normalising constant in the likelihood that only depends on the scale α because the distance is right-invariant. For instance the Hamming distance used in coding. There exists a simplification of the expression of the normalising constant due to the distance only taking a finite number of values, multiplied by the number of cases achieving a given value. Still this remains a formidable combinatoric problem. Running a Gibbs sampler is not an issue for the parameter ρ as the resulting Metropolis-Hastings-within-Gibbs step does not involve the missing constant. But it poses a challenge for the scale α, because the Mallows model cannot be exactly simulated for most distances. Making the use of pseudo-marginal and exchange algorithms presumably impossible. The authors use instead an importance sampling approximation to the normalising constant relying on a pseudo-likelihood version of Mallows model and a massive number (10⁶ to 10⁸) of simulations (in the humongous set of N-sampled permutations of 1,…,n). The interesting point in using this approximation is that the convergence result associated with pseudo-marginals no long applies and that the resulting MCMC algorithm converges to another limiting distribution. With the drawback that this limiting distribution is conditional to the importance sample. Various extensions are found in the paper, including a mixture of Mallows models. And an round of applications, including one on sushi preferences across Japan (fatty tuna coming almost always on top!). As the authors note, a very large number of items like n>10⁴ remains a challenge (or requires an alternative model).

## distilling importance

Posted in Books, Statistics, University life with tags , , , , , , , , , , on November 13, 2019 by xi'an

As I was about to leave Warwick at the end of last week, I noticed a new arXival by Dennis Prangle, distilling importance sampling. In connection with [our version of] population Monte Carlo, “each step of [Dennis’] distilled importance sampling method aims to reduce the Kullback Leibler (KL) divergence from the distilled density to the current tempered posterior.”  (The introduction of the paper points out various connections with ABC, conditional density estimation, adaptive importance sampling, X entropy, &tc.)

“An advantage of [distilled importance sampling] over [likelihood-free] methods is that it performs inference on the full data, without losing information by using summary statistics.”

A notion used therein I had not heard before is the one of normalising flows, apparently more common in machine learning and in particular with GANs. (The slide below is from Shakir Mohamed and Danilo Rezende.) The  notion is to represent an arbitrary variable as the bijective transform of a standard variate like a N(0,1) variable or a U(0,1) variable (calling the inverse cdf transform). The only link I can think of is perfect sampling where the representation of all simulations as a function of a white noise vector helps with coupling.

I read a blog entry by Eric Jang on the topic (who produced this slide among other things) but did not emerge much the wiser. As the text instantaneously moves from the Jacobian formula to TensorFlow code… In Dennis’ paper, it appears that the concept is appealing for quickly producing samples and providing a rich family of approximations, especially when neural networks are included as transforms. They are used to substitute for a tempered version of the posterior target, validated as importance functions and aiming at being the closest to this target in Kullback-Leibler divergence. With the importance function interpretation, unbiased estimators of the gradient [in the parameter of the normalising flow] can be derived, with potential variance reduction. What became clearer to me from reading the illustration section is that the prior x predictive joint can also be modeled this way towards producing reference tables for ABC (or GANs) much faster than with the exact model. (I came across several proposals of that kind in the past months.) However, I deem mileage should vary depending on the size and dimension of the data. I also wonder at the connection between the (final) distribution simulated by distilled importance [the least tempered target?] and the ABC equivalent.

## Why do we draw parameters to draw from a marginal distribution that does not contain the parameters?

Posted in Statistics with tags , , , , , , , on November 3, 2019 by xi'an

A revealing question on X validated of a simulation concept students (and others) have trouble gripping with. Namely using auxiliary variates to simulate from a marginal distribution, since these auxiliary variables are later dismissed and hence appear to them (students) of no use at all. Even after being exposed to the accept-reject algorithm. Or to multiple importance sampling. In the sense that a realisation of a random variable can be associated with a whole series of densities in an importance weight, all of them being valid (but some more equal than others!).