Archive for Wasserstein distance

connection between tempering & entropic mirror descent

Posted in Books, pictures, Running, Statistics, Travel, University life on April 30, 2024 by xi'an

The next One World ABC webinar is this Thursday, the 2nd of May, at 9am UK time, with Francesca Crucinio (King’s College London, formerly CREST and even more formerly Warwick) presenting

“A connection between Tempering and Entropic Mirror Descent”,

a joint work with Nicolas Chopin and Anna Korba (both from CREST) whose abstract follows:

This work explores the connections between tempering (for Sequential Monte Carlo; SMC) and entropic mirror descent to sample from a target probability distribution whose unnormalized density is known. We establish that tempering SMC corresponds to entropic mirror descent applied to the reverse Kullback-Leibler (KL) divergence and obtain convergence rates for the tempering iterates. Our result motivates the tempering iterates from an optimization point of view, showing that tempering can be seen as a descent scheme of the KL divergence with respect to the Fisher-Rao geometry, in contrast to Langevin dynamics that perform descent of the KL with respect to the Wasserstein-2 geometry. We exploit the connection between tempering and mirror descent iterates to justify common practices in SMC and derive adaptive tempering rules that improve over other alternative benchmarks in the literature.
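
As a complement to the abstract, here is a minimal and purely illustrative sketch of a tempered SMC run of the kind the paper reinterprets as entropic mirror descent on the reverse KL, including the common adaptive choice of the next temperature by keeping the effective sample size above a threshold; the toy target, reference distribution, move kernel, and thresholds below are my own choices, not the authors’ code.

```python
import numpy as np

def log_target(x):
    # unnormalised log-density of a toy bimodal target (illustrative choice)
    return np.logaddexp(-0.5 * ((x - 3.0) / 0.7) ** 2, -0.5 * ((x + 3.0) / 0.7) ** 2)

def log_reference(x):
    # reference (initial) distribution mu = N(0, 5^2), easy to sample from
    return -0.5 * (x / 5.0) ** 2

def ess(logw):
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)

def next_lambda(lmbda, x, target_ess):
    # largest new temperature keeping the ESS of the incremental weights above target_ess
    delta = log_target(x) - log_reference(x)
    if ess((1.0 - lmbda) * delta) >= target_ess:
        return 1.0
    lo, hi = lmbda, 1.0
    for _ in range(50):            # bisection on the new temperature
        mid = 0.5 * (lo + hi)
        if ess((mid - lmbda) * delta) < target_ess:
            hi = mid
        else:
            lo = mid
    return lo

def tempered_smc(n=2000, ess_frac=0.5, rng=np.random.default_rng(0)):
    x = rng.normal(0.0, 5.0, n)                    # particles drawn from mu
    lmbda = 0.0
    while lmbda < 1.0:
        new = next_lambda(lmbda, x, ess_frac * n)
        logw = (new - lmbda) * (log_target(x) - log_reference(x))
        w = np.exp(logw - logw.max())
        w /= w.sum()
        x = x[rng.choice(n, n, p=w)]               # multinomial resampling

        def log_pi(z):                             # current tempered target pi_lambda
            return (1 - new) * log_reference(z) + new * log_target(z)

        for _ in range(5):                         # random-walk Metropolis moves
            prop = x + rng.normal(0.0, 0.5, n)
            accept = np.log(rng.uniform(size=n)) < log_pi(prop) - log_pi(x)
            x = np.where(accept, prop, x)
        lmbda = new
    return x

particles = tempered_smc()
print(np.mean(np.abs(particles)))                  # particles should concentrate near ±3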

discrepancy–based ABC posteriors via Rademacher complexity

Posted in Statistics on April 18, 2024 by xi'an

Sirio Legramanti, Daniele Durante, and Pierre Alquier just arXived a massive paper on the concentration of discrepancy–based ABC posteriors via Rademacher complexity, which includes MMD and Wasserstein distance-based ABC methods. The paper provides sufficient conditions under which a discrepancy within the integral probability semimetrics class guarantees uniform convergence and concentration of the induced ABC posterior, without necessarily requiring suitable regularity conditions for the underlying data generating process and the assumed statistical model, meaning that misspecified cases are also covered. In particular, the authors derive upper and lower bounds on the limiting acceptance probabilities for the ABC posterior to remain well–defined for a large enough sample size. They thus deliver an improved understanding of the factors that govern the uniform convergence and concentration properties of discrepancy–based ABC posteriors under a fairly unified perspective, which I deem a significant advance on the several papers my coauthors Espen Bernton, David Frazier, Mathieu Gerber, Pierre Jacob, Gael Martin, Judith Rousseau, Robin Ryder, and yours truly produced in that domain over the past years (although our Series B misspecification paper does not appear in the reference list!).

“…as highlighted by the authors, these [convergence] conditions (i) can be difficult to verify for several discrepancies, (ii) do not allow to assess whether some of these discrepancies can achieve convergence and concentration uniformly over P(Y), and (iii) often yield bounds which hinder an in–depth understanding of the factors regulating these limiting properties”

The first result is that, asymptotically in n and for a fixed large-enough tolerance, the ABC posterior is always well–defined but sits within a Rademacher ball of the pseudo-true posterior, larger than the tolerance ε when the Rademacher complexity does not vanish in n (a feature on which my intuition is found to be lacking, since it seems to relate solely to the class of functions adopted for the definition of said discrepancy). When the tolerance ε(n) decreases to its minimum, as in our paper, the concentration rate is similar to ours, namely slower than √n. A further regime has the tolerance ε(n) decrease to its minimum more slowly than √n but faster than the Rademacher complexity.

“…the bound we derive crucially depends on [the Rademacher complexity], which is specific to each discrepancy D and plays a fundamental role in controlling the rate of concentration of the ABC posterior.”

The paper also opens towards non-iid settings (as in our Wasserstein paper) and generalized likelihood–free Bayesian inference à la Bissiri et al. (2016). A most interesting take on the universality of ABC convergence, thus, although assuming bounded function spaces from the start.
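
For readers less familiar with discrepancy-based ABC, here is a bare-bones rejection sampler using a (biased, V-statistic) Gaussian-kernel MMD, one of the integral probability semimetrics covered by the paper, on a toy Gaussian location model; the model, kernel bandwidth, and tolerance are illustrative choices of mine and not taken from the paper.

```python
import numpy as np

def mmd2(x, y, bandwidth=1.0):
    # squared MMD with a Gaussian kernel (V-statistic, i.e. biased, version)
    def gram(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bandwidth ** 2))
    return gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean()

def abc_mmd(y_obs, n_sim=5000, eps=0.03, rng=np.random.default_rng(1)):
    # toy model: y_i ~ N(theta, 1) with a uniform prior theta ~ U(-10, 10)
    n = len(y_obs)
    kept = []
    for _ in range(n_sim):
        theta = rng.uniform(-10, 10)
        z = rng.normal(theta, 1.0, n)
        if mmd2(y_obs, z) < eps:   # accept when the squared discrepancy is below the tolerance
            kept.append(theta)
    return np.array(kept)

rng = np.random.default_rng(2)
y_obs = rng.normal(1.5, 1.0, 100)
post = abc_mmd(y_obs)
print(len(post), post.mean() if len(post) else "increase eps")
```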

ellis unconference [not in Hawai’i]

Posted in pictures, Running, Travel, University life on July 26, 2023 by xi'an

As ICML 2023 is happening this week in Hawai’i, many did not have the opportunity to get there, for whatever reason, hence the ellis (European Laboratory for Learning and Intelligent Systems) board launched [fairly late!], with the help of Hi! Paris, an unconference (i.e., a mirror event) taking place at HEC, Jouy-en-Josas, SW of Paris, for AI researchers presenting works (theirs or others’) that appear at ICML 2023. Or not. There was no direct broadcasting of talks as we had at CIRM for ISBA 2020 in 2021, but there were some presentations based on preregistered talks. Over 50 people showed up in Jouy.

As it happened, I had quite an exciting bike ride to the HEC campus from home, under a steady rain, crossing a (modest) forest (de Verrières) I had never visited before despite it being a few km from home, getting a wee bit lost, being stopped by a train crossing between Bièvre and Jouy, and ending up at the campus just in time for the first talk (as I had not accounted for the huge altitude differential). Among the curiosities met on the way: “giant” sequoias, a Tonkin pond, and Chateaubriand’s house.

As always, I am rather impressed by the efficiency with which AI-ML conferences are run, with papers+slides+reviews online, plus extra material as in this example. Lots of papers on diffusion models this year, apparently (in conjunction with the trend observed at the Flatiron workshop last Fall). Below are incoherent tidbits from the presentations I attended:

  • exponential convergence of the Sinkhorn algorithm by Alain Durmus and co-authors, with the surprise occurrence of a left Haar measure
  • a paper (by Jerome Baum, Heishiro Kanagawa, and my friend Arthur Gretton) on Stein discrepancy, with a Zanella Stein operator relating to Metropolis-Hastings/Barker since it has expectation zero under stationarity, and an interesting approach to variable-length random variables, not RJMCMC but not far from it.
  • the occurrence of a criticism of the EU GDPR that did not feel appropriate for synthetic data used in privacy protection.
  • the alternative sliced Wasserstein distance, making me wonder whether we could optimally go from measure μ to measure ζ using random directions, or how much is lost this way (see the sketch at the end of these notes).

  • a presentation relying on the conditional expectation identity

\mathbb E[y|X=x] = \mathbb E\left[y\frac{f_{XY}(x,y)}{f_X(x)f_Y(y)}|X=x\right] = \frac{\mathbb E\left[y\frac{f_{XY}(x,y)}{f_Y(y)}|X=x\right]}{f_X(x)}

which left me unconvinced, as (a) the densities are replaced with kernel estimates, (b) the outer density may be very small, and (c) no variance assessment is provided.

  • Markov score climbing and transport score climbing using a normalising flow, for variational approximation, presented by Christian Naesseth, with a warping transform that sounded like inverting the flow (?)
  • Yazid Janati not presenting their ICML paper State and parameter learning with PARIS particle Gibbs, written with Gabriel Cardoso, Sylvain Le Corff, Eric Moulines and Jimmy Olsson, but another work on a diffusion-based model to be learned by SMC, with a clever call to Tweedie’s formula (Maurice Kenneth Tweedie, not Richard Tweedie!), which I just realised I have used many times when working on Bayesian shrinkage estimators.
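
Returning to the sliced Wasserstein bullet above, here is a small Monte Carlo illustration of how the distance between two empirical measures can be computed, by averaging one-dimensional Wasserstein distances over random projection directions, the one-dimensional distances reducing to comparisons of sorted projections; the sample sizes, number of projections, and the 2-Wasserstein cost are arbitrary choices of mine.

```python
import numpy as np

def sliced_wasserstein(x, y, n_proj=200, rng=np.random.default_rng(0)):
    # Monte Carlo estimate of the sliced 2-Wasserstein distance between two
    # empirical measures given by equal-size samples x and y in dimension d
    d = x.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)            # random direction on the unit sphere
        px, py = np.sort(x @ theta), np.sort(y @ theta)
        total += np.mean((px - py) ** 2)          # 1-d squared W2 between sorted projections
    return np.sqrt(total / n_proj)

rng = np.random.default_rng(1)
mu_sample = rng.normal(0.0, 1.0, size=(500, 2))       # sample from measure mu
zeta_sample = rng.normal(2.0, 1.0, size=(500, 2))     # sample from measure zeta, shifted by (2, 2)
print(sliced_wasserstein(mu_sample, zeta_sample))     # about ||shift||/sqrt(2) ≈ 2 here
```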

Approximation Methods in Bayesian Analysis [#3]

Posted in Mountains, pictures, Running, Statistics, Travel, University life on June 23, 2023 by xi'an

My last day (#4) at the workshop, as I had to return to Paris earlier. A rather theoretical morning again, with Morgane Austern on (probabilistic) concentration inequalities for transport distances, far from my comfort zone if lively, Jason Xu on replacing non-convex penalisation factors with distances to the corresponding manifold, which I found most interesting if not directly helpful for simulating over submanifolds, and Hugo Lavenant on studying the impact of prior choice as merging of opinions, in the (Milanese) setting of completely random measures, with the surprise occurrence of a double bend for some choices. The afternoon session saw Andrew Gelman reflecting on multiscale modelling (sans slide et sans tableau) and Chris Holmes introducing the fundamentals of Bayesian conformal prediction, towards reaching well-calibrated (in a frequentist sense) Bayesian procedures by resorting to exchangeability and rank tests. I alas missed the other talks of the day.
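
As a reminder of the basic ingredient behind that last talk (and emphatically not of Chris Holmes’s own construction), here is the vanilla split-conformal recipe, where exchangeability plus a rank/quantile step on calibration residuals delivers frequentist-calibrated prediction intervals; the least-squares fit and toy data below are placeholders of my choosing.

```python
import numpy as np

def split_conformal(x, y, x_new, alpha=0.1, rng=np.random.default_rng(0)):
    # vanilla split-conformal interval: fit on one half, calibrate on the other
    # via the empirical quantile of absolute residuals (a rank statistic),
    # valid under exchangeability of the (x, y) pairs
    n = len(y)
    idx = rng.permutation(n)
    train, calib = idx[: n // 2], idx[n // 2:]
    A = np.column_stack([np.ones(len(train)), x[train]])
    beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)   # any predictive rule would do
    predict = lambda z: beta[0] + beta[1] * z
    scores = np.sort(np.abs(y[calib] - predict(x[calib])))
    k = int(np.ceil((1 - alpha) * (len(calib) + 1)))
    q = scores[min(k, len(scores)) - 1]
    return predict(x_new) - q, predict(x_new) + q

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 500)
y = 2.0 * x + rng.standard_normal(500)
print(split_conformal(x, y, x_new=5.0))   # ~90% coverage interval for a new y at x=5
```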

In recap, this was a wonderful conference, with a perfect audience size, a diverse if intense program, and a lot of interactions. In addition, the short-talk sessions worked very nicely and attracted a very strong audience, even at 22:10 after a long day! They were uniformly well-calibrated, time-wise, and delivered their messages with high clarity. To be repeated. As there were many newcomers to CIRM, they discovered the idiosyncrasies of the place and of its surroundings, mostly positively.

On the outdoor front, the week saw overall moderately hot weather but a constant wind that prevented me from sleeping (well), while helping with waking up before dawn to cycle or run to my open-water pool! The sea remained only reasonably choppy, so waves did not prevent my swimming.

BayesComp²³ [aka MCMski⁶]

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life on March 20, 2023 by xi'an

The main BayesComp meeting started right after the ABC workshop and went on at a grueling pace, offering a constant conundrum as to which of the four sessions to attend, the more when trying to enjoy some outdoor activity during the lunch breaks. My overall feeling is that it went on too fast, too quickly! Here are some quick and haphazard notes from some of the talks I attended, as for instance the practical parallelisation of an SMC algorithm by Adrien Corenflos, the advances made by Giacomo Zanella on using Bayesian asymptotics to assess the robustness of Gibbs samplers to the dimension of the data (although with no assessment of the ensuing time requirements), a nice session on simulated annealing, from black holes to the Alps (if the wrong mountain chain for Levi), and the central role of contrastive learning à la Geyer (1994) in the GAN talks of Veronika Rockova and Éric Moulines. Victor Elvira delivered an enthusiastic talk on our on-going massively recycled importance sampling project, which we need to complete asap!

While their earlier arXived paper was already on my reading list, I was quite excited by Nicolas Chopin’s (along with Mathieu Gerber) work on a quadrature stabilisation that is not QMC (but not too far either), with stratification over the unit cube (after a possible reparameterisation) requiring more evaluations, plus a sort of pulled-by-its-own-bootstrap control variate, but beating regular Monte Carlo in terms of convergence rate and practical precision (if accepting a large simulation budget from the start). A difficulty common to all (?) stratification proposals is that they do not readily apply to highly concentrated functions.
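
This is not Chopin and Gerber’s actual construction (which involves a reparameterisation and a control variate on top), but a small reminder of what stratification over the unit cube buys in terms of convergence rate, on a one-dimensional toy integral with one uniform draw per stratum; the integrand and budget are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda u: np.exp(u)                      # toy integrand on [0,1]; true integral is e - 1

def plain_mc(n):
    return f(rng.uniform(size=n)).mean()

def stratified_mc(n):
    # one uniform draw in each stratum [i/n, (i+1)/n): for smooth 1-d integrands
    # the variance drops from O(1/n) to O(1/n^3)
    u = (np.arange(n) + rng.uniform(size=n)) / n
    return f(u).mean()

truth = np.e - 1
spread = lambda est: np.std([est(256) - truth for _ in range(500)])
print("plain MC    std error:", spread(plain_mc))
print("stratified  std error:", spread(stratified_mc))
```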

I chaired the lightning talks session, made of 3mn one-slide snapshots of incoming posters selected by the scientific committee. While I appreciated this entry into the poster session, the more because it was quite crowded and busy, if full of interesting results, and enjoyed the slide solely made of “0.234”, I regret that not all poster presenters were given the same opportunity (although I am unclear about which format would have permitted this) and that it did not attract more attendees, as it took place in parallel with other sessions.

In a not-solely-ABC session, I appreciated Sirio Legramanti speaking on comparing different distance measures via Rademacher complexity, highlighting that some distances are not robust, including for instance some (all?) Wasserstein distances, which are not defined for heavy-tailed distributions like the Cauchy distribution. And using the mean as a summary statistic in such heavy-tail settings is an issue, since the distance between simulated and observed means does not decrease in variance with the sample size, with the practical difficulty that the problem is hard to detect on real (misspecified) data since the true distribution behind (if any) is unknown (see the small numerical check below). Would that imply that intrinsic distances like the maximum mean discrepancy or Kolmogorov-Smirnov are the only reasonable choices in misspecified settings?! In the ABC session proper, Jeremiah went back to this role of distances for generalised Bayesian inference, replacing the likelihood by a scoring rule, with the requirement of a Monte Carlo approximation (but is approximating an approximation such a terrible thing?!). I also discussed briefly with Alejandra Avalos her use of pseudo-likelihoods in Ising models, which, while not the original model, is nonetheless a model and therefore to be taken as such rather than as an approximation.
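
A quick numerical check of the heavy-tail point above: the sample mean of Cauchy draws is itself Cauchy, hence its spread does not shrink as the sample size grows, whereas the sample median concentrates at the usual rate; the sample sizes and number of replications below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
iqr = lambda a: np.subtract(*np.percentile(a, [75, 25]))   # interquartile range

for n in (100, 1000, 10000):
    means = [rng.standard_cauchy(n).mean() for _ in range(200)]
    medians = [np.median(rng.standard_cauchy(n)) for _ in range(200)]
    # the spread of the mean stays roughly constant in n, that of the median shrinks like 1/sqrt(n)
    print(f"n={n:6d}  IQR(mean)={iqr(means):7.2f}  IQR(median)={iqr(medians):7.4f}")
```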

I also enjoyed Gregor Kastner’s work on Bayesian prediction for a city (Milano) planning agent-based model relying on cell phone activities, which reminded me, at a superficial level, of a similar exploitation of cell usage in an attraction park in Singapore that Steve Fienberg told me about during his last sabbatical in Paris.

In conclusion, an exciting meeting that should have stretched over a whole week (or taken place in a less congenial environment!). The call for organising BayesComp 2025 is still open, by the way.