Archive for Wasserstein distance

the buzz about nuzz

Posted in Books, Mountains, pictures, Statistics on April 6, 2020 by xi'an

“…expensive in these terms, as for each root, Λ(x(s),v) (at the cost of one epoch) has to be evaluated for each root finding iteration, for each node of the numerical integral…”

When using the ZigZag sampler, the main (?) difficulty lies in producing the velocity switches, as these occur as interarrival times of an inhomogeneous Poisson process. When the rate of this process cannot be integrated out in an analytical manner, the only generic approach I know of is Poisson thinning, obtained by finding an integrable upper bound on this rate, generating from this dominating process and subsampling. Finding the bound is however far from straightforward and may anyway result in an inefficient sampler. This new paper by Simon Cotter, Thomas House and Filippo Pagani makes several proposals to simplify this simulation, Nuzz standing for numerical ZigZag. Even better (!), their approach is based on what they call the Sellke construction, with Tom Sellke being a probabilist and statistician at Purdue University (trivia: whom I met when spending a postdoctoral year there in 1987-1988) who also wrote a fundamental paper on the opposition between Bayes factors and p-values with Jim Berger.
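For illustration, here is a minimal Python sketch of Poisson thinning, assuming a constant upper bound lambda_bar on a made-up switching rate that merely stands in for the Λ(x(s),v) of the paper; nothing here is taken from the authors' implementation.

import numpy as np

rng = np.random.default_rng(0)

def rate(t):
    # hypothetical inhomogeneous switching rate, standing in for Λ(x(s), v)
    return 1.0 + np.sin(t) ** 2

lambda_bar = 2.0  # constant upper bound on rate(t)

def next_switch(t0, horizon=100.0):
    # Poisson thinning: simulate the dominating homogeneous process with rate
    # lambda_bar and keep a point with probability rate(t) / lambda_bar
    t = t0
    while t < horizon:
        t += rng.exponential(1.0 / lambda_bar)
        if rng.uniform() < rate(t) / lambda_bar:
            return t
    return np.inf  # no switch before the horizon

print(next_switch(0.0))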

“We chose as a measure of algorithm performance the largest Kolmogorov-Smirnov (KS) distance between the MCMC sample and true distribution amongst all the marginal distributions.”

The practical trick is rather straightforward in that it sums up as the exponentiation of the inverse cdf method, completed with a numerical resolution of the inversion, based on the QAGS (Quadrature Adaptive Gauss-Kronrod Singularities) integration routine. In order to save time, Kingman's superposition trick is exploited so that only one inversion is required rather than d, the dimension of the variable of interest. This nuzzled version of ZigZag can furthermore be interpreted as a PDMP per se. Except that it retains a numerical error, whose impact on convergence is analysed in the paper, in terms of the Wasserstein distance between the invariant measures. The paper concludes with a numerical comparison between Nuzz and random walk Metropolis-Hastings, HMC, and manifold MALA, using the number of evaluations of the likelihood as a measure of time requirement. Tuning for Nuzz is described, but not for the competition. Rather dramatically, the Nuzz algorithm performs worse than this competition when counting one epoch for each likelihood computation and better when counting one epoch for each integral inversion. Which amounts to assuming perfect inversion, unsurprisingly. As a final remark, all models are more or less Normal, with very smooth level sets, maybe not an ideal test range.
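Again for illustration only, a minimal sketch of this inverse-cdf (time-change) construction under a hypothetical rate: draw E∼Exp(1) and solve Λ̄(τ)=∫₀^τ Λ(s)ds=E by adaptive quadrature plus root finding (scipy's quad relies on the same QUADPACK family of adaptive Gauss-Kronrod routines as QAGS):

import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

rng = np.random.default_rng(1)

def rate(s):
    # hypothetical switching rate along the current ray, standing in for Λ(x(s), v)
    return 0.5 + s ** 2

def cumulative_rate(tau):
    # Λ̄(τ) = ∫_0^τ rate(s) ds, by adaptive quadrature
    value, _ = quad(rate, 0.0, tau)
    return value

def next_switch():
    # time-change construction: the switch time τ solves Λ̄(τ) = E, with E ~ Exp(1)
    e = rng.exponential(1.0)
    upper = 1.0
    while cumulative_rate(upper) < e:  # bracket the root
        upper *= 2.0
    return brentq(lambda tau: cumulative_rate(tau) - e, 0.0, upper)

print(next_switch())

With Kingman's superposition trick, one would invert the cumulative sum of the d coordinate rates only once and then attribute the switch to a coordinate with probability proportional to its rate at the switching time.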

 

BayesComp’20

Posted in Books, pictures, Statistics, Travel, University life on January 10, 2020 by xi'an

First, I really have to congratulate my friend Jim Hobert for a great organisation of the meeting, adopting my favourite minimalist principles (no name tag, no “goodies” apart from the conference schedule, no official talks). Without any pretense at objectivity, I also appreciated very much the range of topics and the sweet frustration of having to choose between two or three sessions each time. Here are some notes taken during some talks (with no implicit implication for the talks not mentioned, re. the above frustration! as well as very short nights making sudden lapses in concentration highly likely).

On Day 1, Paul Fearnhead’s inaugural plenary talk was on continuous time Monte Carlo methods, mostly bouncy particle and zig-zag samplers, with a detailed explanation on the simulation of the switching times which likely brought the audience up to speed even if they had never heard of them. And an opening on PDMPs used as equivalents to reversible jump MCMC, reminding me of the continuous time (point process) solutions of Matthew Stephens for mixture inference (and of Preston, Ripley, Møller).

The same morn I heard of highly efficient techniques to handle very large matrices and p>n variable selection by Akihiko Nishimura, and of a delayed acceptance ABC using a cheap proxy model by Ruth Baker. Somewhat different from indirect inference. I found the reliance on ESS somewhat puzzling given the intractability of the likelihood (and the low reliability of the frequency estimate) and the lack of connection with the “real” posterior. At the same ABC session, Umberto Picchini spoke on joint work with Richard Everitt (Warwick) on linking ABC and pseudo-marginal MCMC by bootstrap. Actually, the notion of ABC likelihood was already proposed as pseudo-marginal ABC by Anthony Lee, Christophe Andrieu and Arnaud Doucet in the discussion of Fearnhead and Prangle (2012), but I wonder at the focus on being unbiased when the quantity is not the truth, i.e. the “real” likelihood. It would seem more appropriate to attempt better kernel estimates of the distribution of the summary itself. The same session also involved David Frazier, who linked our work on ABC for misspecified models with an on-going investigation of synthetic likelihood.
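As a reminder (and a mere sketch, not the algorithm presented in the talk), generic delayed-acceptance Metropolis-Hastings with a cheap surrogate screens the proposal with the proxy first and only evaluates the expensive target when this first stage accepts; both the proxy and the target below are hypothetical toys.

import numpy as np

rng = np.random.default_rng(2)

def cheap_proxy_logpost(theta):
    # hypothetical inexpensive surrogate of the log-posterior
    return -0.5 * theta ** 2

def expensive_logpost(theta):
    # hypothetical exact (and costly) log-posterior
    return -0.5 * theta ** 2 - 0.1 * np.cos(3.0 * theta)

def delayed_acceptance_mh(n_iter=5000, sigma=1.0):
    theta, chain = 0.0, np.empty(n_iter)
    for i in range(n_iter):
        prop = theta + sigma * rng.normal()
        # stage 1: screen the proposal with the cheap proxy
        log_r1 = cheap_proxy_logpost(prop) - cheap_proxy_logpost(theta)
        if np.log(rng.uniform()) < log_r1:
            # stage 2: correct with the expensive target (proxy ratio cancels out)
            log_r2 = expensive_logpost(prop) - expensive_logpost(theta) - log_r1
            if np.log(rng.uniform()) < log_r2:
                theta = prop
        chain[i] = theta
    return chain

print(delayed_acceptance_mh().mean())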

Later, there was a surprise occurrence of the Bernoulli factory in a talk by Radu Herbei on Gaussian process priors with accept-reject algorithms, leading to exact MCMC, although the computing implementation remains uncertain. And several discussions during the poster session, incl. one on the planning of a 2021 workshop in Oaxaca centred on objective Bayes advances as we received acceptance of our proposal by BIRS today!

On Day 2, David Blei gave a plenary introduction to variational Bayes inference and latent Dirichlet allocations, somewhat too introductory for my taste although other participants enjoyed this exposition. He also mentioned a recent JASA paper on the frequentist consistency of variational Bayes that I should check. Speaking later with PhD students, they really enjoyed this opening on an area they did not know that well.

A talk by Kengo Kamatani (whom I visited last summer) on improved ergodicity rates for heavy-tailed targets and Crank-Nicolson modifications to the random walk proposal (which uses an AR(1) representation instead of the random walk). With the clever idea of adding the scale of the proposal as an extra parameter with a prior of its own. Gaining one order of magnitude in the convergence speed (i.e. from d to 1 and from d² to d, where d is the dimension), which is quite impressive (and just published in JAP). Veronica Rockova linked Bayesian variable selection and machine learning via ABC, with conditions on the prior for model consistency. And a novel approach using part of the data to learn an ABC partial posterior, which reminded me of the partial Bayes factors of the 1990's although it is presumably unrelated. And a replacement of the original rejection ABC via multi-armed bandits, where each variable is represented by an arm, called ABC Bayesian forests. Recalling the simulation trick behind Thompson's approach, reproduced for the inclusion or exclusion of variates and producing a fixed estimate for the (marginal) inclusion probabilities, which makes it sound like a prior-feedback form of empirical Bayes. Followed by a talk by Gregor Kastner on MCMC handling of large time series with specific priors and a massive number of parameters.
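Going back to the Crank-Nicolson proposal mentioned above, here is a minimal sketch of the AR(1) move for a standard Gaussian reference measure, with a hypothetical likelihood term and without the extra prior on the scale advocated in the talk:

import numpy as np

rng = np.random.default_rng(3)

def phi(x):
    # hypothetical negative log-likelihood Φ(x); the N(0, I) reference measure
    # is taken care of by the proposal itself
    return 0.25 * np.sum(x ** 4)

def pcn_step(x, beta=0.3):
    # Crank-Nicolson / AR(1) proposal: y = sqrt(1 - beta²) x + beta ξ, ξ ~ N(0, I)
    y = np.sqrt(1.0 - beta ** 2) * x + beta * rng.standard_normal(x.shape)
    # the Gaussian reference cancels, only Φ enters the acceptance ratio
    return y if np.log(rng.uniform()) < phi(x) - phi(y) else x

x = np.zeros(10)
for _ in range(1000):
    x = pcn_step(x)
print(x)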

The afternoon also had a wealth of exciting talks and missed opportunities (in the other sessions!). Which ended up with a strong if unintended French bias since I listened to Christophe Andrieu, Gabriel Stoltz, Umut Simsekli, and Manon Michel on different continuous time processes, with Umut linking GANs, multidimensional optimal transport, sliced-Wasserstein, generative models, and new stochastic differential equations. Manon Michel gave a highly intuitive talk on creating non-reversibility, getting rid of refreshment rates in PDMPs to kill any form of reversibility.
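Since sliced-Wasserstein keeps surfacing in this archive, a minimal Monte Carlo sketch of the distance between two samples, averaging the one-dimensional Wasserstein distances of random projections, each available from sorted projected values (the data below are a toy example, obviously not Umut's):

import numpy as np

rng = np.random.default_rng(4)

def sliced_wasserstein(x, y, n_proj=200, p=2):
    # x, y: (n, d) samples of equal size; average W_p over random directions
    d = x.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)
        # 1-d W_p between equally weighted samples = distance between sorted projections
        px, py = np.sort(x @ theta), np.sort(y @ theta)
        total += np.mean(np.abs(px - py) ** p) ** (1.0 / p)
    return total / n_proj

x = rng.standard_normal((500, 3))
y = rng.standard_normal((500, 3)) + 1.0
print(sliced_wasserstein(x, y))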

BayesComp 2020 at a glance

Posted in Statistics, Travel, University life on December 18, 2019 by xi'an

label switching by optimal transport: Wasserstein to the rescue

Posted in Books, Statistics, Travel on November 28, 2019 by xi'an

A new arXival by Pierre Monteiller et al. on resolving label switching by optimal transport. To appear in NeurIPS 2019, next month (where I will be, but extra muros, as I have not registered for the conference). Among other things, the paper was inspired by an answer of mine on X validated, presumably a première (and a dernière?!). Rather than picketing [in the likely unpleasant weather] on the pavement outside the conference centre, here are my raw reactions to the proposal made in the paper. (Usual disclaimer: I was not involved in the review of this paper.)

“Previous methods such as the invariant losses of Celeux et al. (2000) and pivot alignments of Marin et al. (2005) do not identify modes in a principled manner.”

Unprincipled, me?! We did not aim at identifying all modes but only one of them, since the posterior distribution is invariant under permutation of the component labels. Without any bad feeling (!), I still maintain my position that using a permutation invariant loss function is a most principled and Bayesian approach towards a proper resolution of the issue. Even though figuring out the resulting Bayes estimate may prove tricky.
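To make the point concrete, here is a toy illustration (not the exact loss of Celeux et al., 2000) of a permutation-invariant loss for a two-component Gaussian mixture, namely the integrated squared difference between the mixture densities, which is unchanged when the component labels of either argument are permuted:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def mixture_pdf(x, w, mu, sigma):
    return sum(wk * norm.pdf(x, mk, sk) for wk, mk, sk in zip(w, mu, sigma))

def invariant_loss(theta, theta_hat):
    # integrated squared difference between the two mixture densities: permuting
    # the component labels of either argument leaves the value unchanged
    f = lambda x: (mixture_pdf(x, *theta) - mixture_pdf(x, *theta_hat)) ** 2
    value, _ = quad(f, -np.inf, np.inf)
    return value

theta = ([0.3, 0.7], [-1.0, 2.0], [1.0, 0.5])           # (weights, means, sds)
theta_swapped = ([0.7, 0.3], [2.0, -1.0], [0.5, 1.0])   # same mixture, labels swapped
theta_hat = ([0.5, 0.5], [-1.2, 2.1], [1.1, 0.6])
print(invariant_loss(theta, theta_hat), invariant_loss(theta_swapped, theta_hat))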

The paper thus adopts a different approach, towards giving a manageable meaning to the average of the mixture distributions over all permutations, not in a linear Euclidean sense but thanks to a Wasserstein barycentre. Which indeed allows for an averaged mixture density, although a point-by-point estimate that does not require switching to occur at all was already proposed in earlier papers of ours. Including the Bayesian Core. As shown above. What was first unclear to me is how necessary the Wasserstein formalism proves to be in this context. In fact, the major difference with the above picture is that the estimated barycentre is a mixture with the same number of components. Computing time? Bayesian estimate?
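For the record, a much simpler optimal-transport flavoured relabelling (closer to a pivot alignment than to the Wasserstein barycentre of the paper) matches each MCMC draw of the component means to a reference ordering by solving an assignment problem; the toy draws below are made up for illustration.

import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(5)

def relabel_draws(mu_draws, pivot):
    # mu_draws: (n_iter, k) posterior draws of component means, possibly label-switched
    # pivot: (k,) reference values (e.g. the component means at the MAP draw)
    aligned = np.empty_like(mu_draws)
    for t, mu in enumerate(mu_draws):
        cost = (mu[:, None] - pivot[None, :]) ** 2   # squared-distance transport cost
        rows, cols = linear_sum_assignment(cost)     # optimal one-to-one matching
        aligned[t, cols] = mu[rows]
    return aligned

# toy label-switched draws around component means (-2, 3)
draws = np.stack([rng.permutation([-2.0, 3.0]) + 0.1 * rng.standard_normal(2)
                  for _ in range(1000)])
print(relabel_draws(draws, pivot=np.array([-2.0, 3.0])).mean(axis=0))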

Green’s approach to the problem via a point process representation [briefly mentioned on page 6] of the mixture itself, as for instance presented in our mixture analysis handbook, should have been considered. As well as issues about Bayes factors examined in Gelman et al. (2003) and our more recent work with Kate Jeong Eun Lee. Where the practical impossibility of considering all possible permutations is processed by importance sampling.

An idle thought that came to me while reading this paper (in Seoul) was that a more challenging problem would be to face a model invariant under the action of a group with only a subset of known elements of that group. Or simply too many elements in the group. In which case averaging over the orbit would become an issue.

on-line parameter estimation with Wasserstein

Posted in Books, Statistics, University life on November 27, 2019 by xi'an

Just found out that our paper On parameter estimation with the Wasserstein distance, written with Espen Bernton, Pierre Jacob, and Mathieu Gerber, has now appeared on-line in Information and Inference: A Journal of the IMA.

assessing MCMC convergence

Posted in Books, Statistics, University life on June 6, 2019 by xi'an

When MCMC became mainstream in the 1990's, there was a flurry of proposals to check, assess, and even guarantee convergence to the stationary distribution, as discussed in our MCMC book. Along with Chantal Guihenneuc and Kerrie Mengersen, we also maintained for a while a reviewww webpage categorising these. Niloy Biswas and Pierre Jacob have recently posted a paper where they propose the use of couplings (and unbiased MCMC) towards deriving bounds on different metrics between the target and the current distribution of the Markov chain. Two chains are created from a given kernel and coupled with a lag of L, meaning that after a while, the two chains become one with a time difference of L. (The supplementary material contains many details on how to induce coupling.) The distance to the target can then be bounded by a sum of distances between the two chains until they merge. The above picture from the paper is a comparison of a Pólya-urn sampler with several HMC samplers for a logistic target (not involving the Pima Indian dataset!). The larger the lag L, the more accurate the bound. But the larger the lag, the more expensive the assessment of how many steps are needed for convergence. Especially when considering that the evaluation requires restarting the chains from scratch and rerunning until they couple again, rather than continuing one run, which can only bring the chain closer to stationarity and to being distributed from the target. I thus wonder at the possibility of some Rao-Blackwellisation of the simulations used in this assessment (while realising once more that assessing convergence almost inevitably requires another order of magnitude than convergence itself!). Without a clear idea of how to do it… For instance, keeping the values of the chain(s) at the time of coupling is not directly helpful to create a sample from the target since they are not distributed from that target.
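A minimal sketch of the L-lag coupling bound on a toy Gaussian target with scalar random-walk Metropolis (nothing like the examples of the paper): proposals are drawn from a maximal coupling so that the two chains can meet exactly, and the meeting times translate into the bound E[max(0, ⌈(τ−L−t)/L⌉)] on the total variation distance at time t.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)

def logpi(x):
    return -0.5 * x ** 2  # toy standard Normal target

def mh_step(x, s):
    prop = rng.normal(x, s)
    return prop if np.log(rng.uniform()) < logpi(prop) - logpi(x) else x

def max_coupling(mx, my, s):
    # maximal coupling of N(mx, s²) and N(my, s²), so that P(X = Y) is maximal
    x = rng.normal(mx, s)
    if np.log(rng.uniform()) + norm.logpdf(x, mx, s) <= norm.logpdf(x, my, s):
        return x, x
    while True:
        y = rng.normal(my, s)
        if np.log(rng.uniform()) + norm.logpdf(y, my, s) > norm.logpdf(y, mx, s):
            return x, y

def meeting_time(L=10, s=1.5, max_iter=10_000):
    # advance the first chain L steps ahead, then move the two chains jointly
    x, y = rng.normal(0, 5), rng.normal(0, 5)
    for _ in range(L):
        x = mh_step(x, s)
    for t in range(L, max_iter):
        px, py = max_coupling(x, y, s)
        u = np.log(rng.uniform())               # common uniform for both acceptances
        x = px if u < logpi(px) - logpi(x) else x
        y = py if u < logpi(py) - logpi(y) else y
        if x == y:
            return t + 1                        # chains have met (and stay together)
    return max_iter

L = 10
taus = np.array([meeting_time(L) for _ in range(200)])
for t in [1, 5, 10, 20]:
    # TV upper bound at time t: E[ max(0, ceil((tau - L - t) / L)) ]
    print(t, np.mean(np.maximum(0, np.ceil((taus - L - t) / L))))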

[Pierre also wrote a blog post about the paper on Statisfaction that is definitely much clearer and pedagogical than the above.]

selecting summary statistics [a tale of two distances]

Posted in Books, Statistics on May 23, 2019 by xi'an

Jonathan Harrison came to give a seminar in Warwick [which I could not attend], which made me aware of his paper with Ruth Baker on the selection of summaries in ABC. The setting is an ABC-SMC algorithm and it relates to Fearnhead and Prangle (2012), Barnes et al. (2012), our own random forest approach, the neural network version of Papamakarios and Murray (2016), and others. The notion here is to seek the optimal weights of the different summary statistics in the tolerance distance, towards a maximisation of a distance (Hellinger) between prior and ABC posterior (Wasserstein also comes to mind!). A sort of dual of the least informative prior. The Hellinger distance is estimated by a k-nearest neighbour version [based on samples from the prior and from the ABC posterior] I had never seen before. I first did not get how this k-nearest neighbour distance could be optimised in the weights since the posterior sample was already generated and (SMC) weighted, but the ABC sample can be modified by changing the [tolerance] distance weights and the resulting Hellinger distance optimised this way. (There are two distances involved, in case the above description is too murky!)
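To fix ideas, here is a toy caricature of the approach on a Normal location model, substituting a crude kernel-density-plus-grid approximation for the paper's k-nearest-neighbour estimator of the Hellinger distance, and a simple comparison of a few weight vectors for their optimisation procedure:

import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import gaussian_kde

rng = np.random.default_rng(7)

# toy model: theta ~ N(0, 2²), data = 20 draws from N(theta, 1),
# summaries = (sample mean, sample sd), the latter being uninformative about theta
def summaries(data):
    return np.array([data.mean(), data.std()])

obs = rng.normal(1.5, 1.0, size=20)
s_obs = summaries(obs)

n = 20_000
prior_draws = rng.normal(0.0, 2.0, size=n)
sims = np.array([summaries(rng.normal(th, 1.0, size=20)) for th in prior_draws])

def abc_posterior(weights, keep=500):
    # weighted tolerance distance, keeping the 'keep' closest simulations
    d = np.sqrt(((sims - s_obs) ** 2 * weights).sum(axis=1))
    return prior_draws[np.argsort(d)[:keep]]

def hellinger(sample_p, sample_q, grid=np.linspace(-8.0, 8.0, 400)):
    # crude KDE-plus-grid stand-in for the paper's k-nearest-neighbour estimator
    p, q = gaussian_kde(sample_p)(grid), gaussian_kde(sample_q)(grid)
    p, q = p / trapezoid(p, grid), q / trapezoid(q, grid)
    return np.sqrt(max(0.0, 1.0 - trapezoid(np.sqrt(p * q), grid)))

for w in [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.0, 1.0])]:
    print(w, hellinger(prior_draws, abc_posterior(w)))

Unsurprisingly, putting all the weight on the sample mean pushes the ABC posterior furthest from the prior, which is the criterion the paper seeks to maximise.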

“We successfully obtain an informative unbiased posterior.”

The paper spends a significant while in demonstrating that the k-nearest neighbour estimator converges and much less on the optimisation procedure itself, which seems like a real challenge to me when facing a large number of particles and a high enough dimension (in the number of statistics). (In the examples, the size of the summary is 1 (where does the weight matter?), 32, 96, 64, with 5×10⁴, 5×10⁴, 5×10³ and…10 particles, respectively.) The authors do address the issue, albeit briefly, by mentioning that, for the same overall computation time, the adaptive weight ABC is indeed further from the prior than a regular ABC with uniform weights [rather than weighted by the precisions]. They also argue that down-weighting some components is akin to selecting a subset of summaries, but I beg to disagree with this statement as the weights are never exactly zero, as far as I can see, hence failing to fight the curse of dimensionality. Some LASSO version could implement this feature, as sketched below.
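A crude sketch of what such a LASSO-flavoured version might look like, reusing abc_posterior(), hellinger() and prior_draws from the previous sketch and merely adding an ℓ1 penalty to the (negated) Hellinger objective over nonnegative weights; since the kept-sample objective is piecewise constant in the weights, this only illustrates the shape of the procedure rather than a workable optimiser.

import numpy as np
from scipy.optimize import minimize

def penalised_objective(w, lam=0.05):
    # maximise the Hellinger distance between prior and ABC posterior,
    # minus an l1 penalty pushing useless summary weights towards zero
    return -hellinger(prior_draws, abc_posterior(w)) + lam * np.sum(np.abs(w))

res = minimize(penalised_objective, x0=np.array([0.5, 0.5]),
               bounds=[(0.0, 1.0), (0.0, 1.0)], method="Powell")
print(res.x)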