Archive for optimal transport

coupling for the Gibbs sampler

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life on November 27, 2022 by xi'an

At BNP13, Brian Trippe presented the AISTATS 2022 paper he recently wrote with Tin D. Nguyen and Tamara Broderick. Which made me read their 2021 paper on the topic. There, they note that coupling may prove challenging, which they blame on label switching. Considering a naïve Gibbs sampler on the space of partitions, meaning allocating each data-point to one of the existing clusters or to a singleton, they construct an optimal transport coupling under Hamming distance. Which appears to be achievable in O(NK³ log K), if K is the maximal number of clusters among both chains. The paper does not go deeply into the implementation, which involves [to quote] (a) computing the distances between each pair of partitions in the Cartesian product of supports of the Gibbs conditionals and (b) solving the optimal transport problem. Except in the appendix, which details the book-keeping necessary to achieve O(K²) for pairwise distances, while the remaining complexity follows from the standard Orlin’s algorithm. What remains unclear from the paper is that, while the chains couple faster (fastest?), the resulting estimators do not necessarily improve upon budget-equivalent alternatives. (The reason for the failure of the single chain in Figure 2 is hard to fathom.)
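To make steps (a) and (b) concrete, here is a minimal sketch of one coupled Gibbs update for a single data point. It is not the authors' implementation: the distance is assumed to be the Hamming distance between co-clustering indicators (a label-invariant choice), the conditional probabilities `p` and `q` are made-up numbers, and the transport problem is solved by a generic linear program via `scipy.optimize.linprog` rather than by Orlin's algorithm.

```python
import numpy as np
from scipy.optimize import linprog

def coclustering(labels):
    """Boolean matrix whose (i, j) entry is True when points i and j share a cluster."""
    labels = np.asarray(labels)
    return labels[:, None] == labels[None, :]

def partition_hamming(labels_a, labels_b):
    """Number of pairs {i, j} clustered together in one partition but not in the other."""
    diff = coclustering(labels_a) != coclustering(labels_b)
    return int(np.triu(diff, k=1).sum())

def ot_coupling(p, q, cost):
    """Minimise <plan, cost> over couplings with row sums p and column sums q."""
    m, n = cost.shape
    a_rows = np.kron(np.eye(m), np.ones((1, n)))   # row-sum constraints
    a_cols = np.kron(np.ones((1, m)), np.eye(n))   # column-sum constraints
    res = linprog(cost.ravel(),
                  A_eq=np.vstack([a_rows, a_cols]),
                  b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return res.x.reshape(m, n)

# Candidate partitions after reallocating the last of three points (toy example).
support_x = [[0, 0, 0], [0, 0, 1]]              # chain X: join the cluster or singleton
support_y = [[0, 1, 0], [0, 1, 1], [0, 1, 2]]   # chain Y: join either cluster or singleton
p = np.array([0.6, 0.4])                        # hypothetical Gibbs conditional, chain X
q = np.array([0.3, 0.3, 0.4])                   # hypothetical Gibbs conditional, chain Y

# (a) pairwise distances over the product of supports, (b) optimal transport plan
cost = np.array([[partition_hamming(a, b) for b in support_y] for a in support_x], dtype=float)
plan = ot_coupling(p, q, cost)
expected_distance = float((plan * cost).sum())
independent_distance = float(p @ cost @ q)
```

Sampling the pair of new states from `plan` keeps both marginals intact while making the two chains as close as possible on average after the update; in this toy case the expected distance is smaller than under independent updates.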

diffusions, sampling, and transport

Posted in Books, pictures, Statistics, Travel, University life on November 21, 2022 by xi'an

The third and final day of the workshop was shortened for me as I had to catch an early flight back to Paris (and as I got overly conservative in my estimation for returning to JFK, catching a train with no delay at Penn Station and thus finding myself with two hours free before boarding, hence reviewing remaining Biometrika submissions at the airport while waiting). As a result I missed the afternoon talks.

The morning was mostly about using scores for simulation (a topic of which I was mostly unaware), with Yang Song giving the introductory lecture on creating better [cf pix left] generative models via the score function, with a massive production of his on the topic (but too many image simulations of dogs, cats, and celebrities!). Estimating directly the score is feasible via Fisher divergence or score matching à la Hyvärinen (with a return of Stein’s unbiased estimator of the risk!). And relying on estimated scores to simulate / generate by Langevin dynamics or other MCMC methods that do not require density evaluations. Due to poor performances in low density / learning regions a fix is randomization / tempering but the resolution (as exposed) sounded clumsy. (And made me wonder at using some more advanced form of deconvolution since the randomization pattern is controlled.) The talk showed some impressive text to image simulations used by an animation studio!
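As a minimal sketch of the sampling step (not the speaker's code), the unadjusted Langevin recursion below only calls a score function and never evaluates a density. The score here is the exact score of a Gaussian, used as a stand-in for a learned score network; the step size and number of steps are arbitrary choices.

```python
import numpy as np

def score(x, mean=2.0, std=0.5):
    """Gradient of log N(x; mean, std^2); stand-in for an estimated score."""
    return -(x - mean) / std**2

def langevin_sample(score_fn, x0, step=1e-2, n_steps=2000, rng=None):
    """Unadjusted Langevin: x <- x + (step / 2) * score(x) + sqrt(step) * noise."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + 0.5 * step * score_fn(x) + np.sqrt(step) * noise
    return x

rng = np.random.default_rng(0)
# 5000 independent particles started from a standard Normal
samples = langevin_sample(score, x0=rng.standard_normal(5000), rng=rng)
```

Without a Metropolis correction the stationary distribution is only approximately the target, with a bias that shrinks with the step size.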


And then my friend Arnaud Doucet continued on the same theme, motivated by estimating normalising constants through annealed importance sampling [Yuling’s meta-perspective comes back to mind in that the geometric mixture is not the only choice, but with which objective]. In AIS, as in a series of Arnaud’s works, like the 2006 SMC Read Paper with Pierre Del Moral and Ajay Jasra, the importance (!) of some auxiliary backward kernels goes beyond theoretical arguments, with the ideal sequence being provided by a Langevin diffusion. Hence involving a score, learned as in the previous talk. Arnaud reformulated this issue as creating a transportation map and its reverse, which is leading to their recent Schrödinger bridge generative model. Which [imho] both brings a unification perspective to his work and an efficient way to bridge prior to posterior in AIS. A most profitable morn for me!
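For readers unfamiliar with AIS, here is a minimal sketch on a toy problem where the answer is known. It assumes the usual geometric bridge between a normalised N(0,1) "prior" and an unnormalised Gaussian "target", and uses plain random-walk Metropolis moves at each temperature instead of the Langevin or learned kernels discussed in the talk; all tuning values are arbitrary.

```python
import numpy as np

def log_prior(x):
    """Normalised N(0, 1) log-density."""
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def log_target_unnorm(x):
    """Unnormalised N(1, 0.5^2) log-density, with true constant Z = 0.5 * sqrt(2 * pi)."""
    return -0.5 * ((x - 1.0) / 0.5) ** 2

def log_bridge(x, beta):
    """Geometric bridge between prior (beta = 0) and target (beta = 1)."""
    return (1 - beta) * log_prior(x) + beta * log_target_unnorm(x)

def ais_log_z(n_particles=2000, n_temps=100, rw_scale=0.5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    betas = np.linspace(0.0, 1.0, n_temps + 1)
    x = rng.standard_normal(n_particles)          # exact draws from the prior
    log_w = np.zeros(n_particles)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        # incremental importance weight between successive bridges
        log_w += log_bridge(x, b) - log_bridge(x, b_prev)
        # one random-walk Metropolis step leaving the bridge at b invariant
        prop = x + rw_scale * rng.standard_normal(n_particles)
        log_ratio = log_bridge(prop, b) - log_bridge(x, b)
        accept = np.log(rng.uniform(size=n_particles)) < log_ratio
        x = np.where(accept, prop, x)
    m = log_w.max()                               # log-mean-exp of the weights
    return m + np.log(np.mean(np.exp(log_w - m)))

log_z_hat = ais_log_z(rng=np.random.default_rng(0))
log_z_true = np.log(0.5 * np.sqrt(2 * np.pi))
```

The average of the weights is unbiased for Z whatever the quality of the intermediate moves, but its variance depends heavily on the bridge and the kernels, which is where the choice of path and of backward kernels matters.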

Overall, this was an exhilarating workshop, full of discoveries for me and providing me with the opportunity to meet and exchange with mostly people I had not met before. Thanks to Bob Carpenter and Michael Albergo for organising and running the workshop!

transport, diffusions, and sampling

Posted in pictures, Statistics, Travel, University life on November 19, 2022 by xi'an

At the Sampling, Transport, and Diffusions workshop at the Flatiron Institute, on Day #2, Marilou Gabrié (École Polytechnique) gave the second introductory lecture on merging sampling and normalising flows aimed at the target distribution, when driven by a divergence criterion like KL, that only requires the shape of the target density. I first wondered about ergodicity guarantees in simultaneous MCMC and map training due to the adaptation of the flow but the update of the map only depends on the current particle cloud in (8). From an MCMC perspective, it sounds somewhat paradoxical to see the independent sampler making such an unexpected come-back when considering that no insider information is available about the (complex) posterior to drive the [what-you-get-is-what-you-see] construction of the transport map. However, the proposed approach superposes local (random-walk like) and global (transport) proposals in Algorithm 1.
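The superposition of local and global moves can be sketched as a mixture of two Metropolis-Hastings kernels. This is only an illustration, not Algorithm 1 of the paper: the global independent proposal is a fixed wide Gaussian standing in for a trained flow (no adaptation takes place), and the bimodal target and the mixing weight `p_global` are arbitrary.

```python
import numpy as np

GLOBAL_SCALE = 4.0   # scale of the fixed Gaussian standing in for the flow

def log_target(x):
    """Unnormalised equal mixture of N(-3, 1) and N(3, 1)."""
    return np.logaddexp(-0.5 * (x + 3.0) ** 2, -0.5 * (x - 3.0) ** 2)

def log_global(x):
    """Log-density (up to a constant) of the global proposal N(0, GLOBAL_SCALE^2)."""
    return -0.5 * (x / GLOBAL_SCALE) ** 2

def sample(n_iter=20000, p_global=0.2, rw_scale=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = -3.0
    out = np.empty(n_iter)
    for i in range(n_iter):
        if rng.uniform() < p_global:
            # global move: independent proposal, Hastings ratio includes the proposal
            y = GLOBAL_SCALE * rng.standard_normal()
            log_ratio = log_target(y) - log_target(x) + log_global(x) - log_global(y)
        else:
            # local move: symmetric random walk
            y = x + rw_scale * rng.standard_normal()
            log_ratio = log_target(y) - log_target(x)
        if np.log(rng.uniform()) < log_ratio:
            x = y
        out[i] = x
    return out

draws = sample()
right_mode_fraction = float(np.mean(draws > 0))
```

Each kernel leaves the target invariant on its own, so the mixture does as well; the local moves explore within a mode while the global ones allow jumps between modes.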

Qiang Liu followed on learning transport maps, with the interesting notion of causalizing a graph by removing intersections (which are impossible for an ODE, as discussed in Eric Vanden-Eijnden’s talk yesterday) through coupling. Which underlies his notion of rectified flows. Possibly connecting with the next lightning talk by Jonathan Weare on spurious modes created by a variational Monte Carlo sampler and the use of stochastic gradient, corrected by (case-dependent?) regularisation.

Then came a whole series of MCMC talks!

Sam Livingstone spoke on Barker’s proposal (an incoming Biometrika paper!) as part of a general class of transforms g of the MH ratio, using jump processes based on a nasty normalising constant related with g (tractable for the original Barker algorithm). I then realised I had missed his StatSci paper on how to speak to statistical physics researchers!
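As a reminder of what a transform g of the MH ratio means at its simplest, the sketch below compares the Metropolis-Hastings rule g(t) = min(1, t) with Barker's rule g(t) = t / (1 + t) as acceptance functions of a random-walk sampler. Both satisfy g(t) = t g(1/t), which is what preserves detailed balance. This is not the gradient-based Barker proposal or the jump-process construction of the talk, just the classical acceptance version on a N(0,1) target.

```python
import numpy as np

def g_mh(t):
    """Metropolis-Hastings acceptance function."""
    return np.minimum(1.0, t)

def g_barker(t):
    """Barker acceptance function."""
    return t / (1.0 + t)

def chain(g, log_target, n_iter=20000, scale=2.0, seed=0):
    """Random-walk sampler accepting a move with probability g(target ratio)."""
    rng = np.random.default_rng(seed)
    x = 0.0
    out = np.empty(n_iter)
    n_accept = 0
    for i in range(n_iter):
        y = x + scale * rng.standard_normal()
        ratio = np.exp(log_target(y) - log_target(x))
        if rng.uniform() < g(ratio):
            x = y
            n_accept += 1
        out[i] = x
    return out, n_accept / n_iter

def log_normal(x):
    return -0.5 * x**2

t_grid = np.array([0.1, 0.5, 1.0, 2.0, 10.0])
draws_mh, rate_mh = chain(g_mh, log_normal)
draws_barker, rate_barker = chain(g_barker, log_normal)
```

Since t / (1 + t) never exceeds min(1, t), Barker's rule accepts less often than Metropolis-Hastings for the same proposals, while both chains target the same distribution.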

Charles Margossian spoke about using a massive number of short parallel runs (many-short-chain regime) from a recent paper written with Aki, Andrew, and Lionel Riou-Durand (Warwick) among others. Which brings us back to the challenge of producing convergence diagnostics and precisely the Gelman-Rubin R̂ statistic or its recent nR̂ avatar (with its linear limitations and dependence on parameterisation, as opposed to fuller distributional criteria). The core of the approach is in using blocks of GPUs to improve and speed-up the estimation of the between-chain variance. (D for R².) I still wonder at a waste of simulations / computing power resulting from stopping the runs almost immediately after warm-up is over, since reaching the stationary regime or an approximation thereof should be exploited more efficiently. (Starting from a minimal discrepancy sample would also improve efficiency.)
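For reference, the classical (non-nested) Gelman-Rubin statistic can be computed in a few lines. The sketch below uses the textbook between/within variance formula on simulated chains; it is not the nested version of the paper, and the chain counts and offsets are arbitrary.

```python
import numpy as np

def gelman_rubin(chains):
    """Classical potential scale reduction factor for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_draws = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()                 # W: mean within-chain variance
    between = n_draws * chains.mean(axis=1).var(ddof=1)        # B: scaled variance of chain means
    var_plus = (n_draws - 1) / n_draws * within + between / n_draws
    return float(np.sqrt(var_plus / within))

rng = np.random.default_rng(0)
mixed = rng.standard_normal((8, 500))                 # 8 chains all sampling N(0, 1)
stuck = mixed + 0.5 * np.arange(8)[:, None]           # same chains, each shifted to its own location
rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
```

The statistic stays close to one when the chains agree and inflates when their means disagree, which also illustrates its limitation: it only compares first and second moments of the chosen parameterisation.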

Lu Zhang also talked on the issue of cutting down warmup, presenting a paper co-authored with Bob, Andrew, and Aki, recommending Laplace / variational approximations for reaching faster high-posterior-density regions, using an algorithm called Pathfinder that relies on ELBO checks to counter poor performances of Laplace approximations. In the spirit of the workshop, it could be profitable to further transform / push-forward the outcome by a transport map.

Yuling Yao (of stacking and Pareto smoothing fame!) gave an original and challenging (in a positive sense) talk on the many ways of bridging densities [linked with the remark he shared with me the day before] and their statistical significance. Questioning our usual reliance on arithmetic or geometric mixtures. Ignoring computational issues, selecting a bridging pattern sounds not different from choosing a parameterised family of embedding distributions. This new typology of models can then be endowed with properties that are more or less appealing. (Occurrences of the Hyvärinen score and our mixtestin perspective in the talk!)

Miranda Holmes-Cerfon talked about MCMC on stratification (illustrated by this beautiful picture of nanoparticle random walks). Which means sampling under varying constraints and dimensions with associated densities under the respective Hausdorff measures. This sounds like a perfect setting for reversible jump and in a sense it is, as mentioned in the talks. Except that the moves between manifolds are driven by the proximity to said manifold, helping with a higher acceptance rate, and making the proposals easier to construct since projections (or the reverses) have a physical meaning. (But I could not tell from the talk why the approach was seemingly escaping the symmetry constraint set by Peter Green’s RJMCMC on the reciprocal moves between two given manifolds).

sampling, transport, and diffusions

Posted in pictures, Running, Statistics, Travel, University life on November 18, 2022 by xi'an


This week, I am attending a very cool workshop at the Flatiron Institute (not in the Flatiron building!, but close enough) on Sampling, Transport, and Diffusions, organised by Bob Carpenter and Michael Albergo. It is quite exciting as I do not know most participants or their work! The Flatiron Institute is a private institute focussed on fundamental science funded by the Simons Foundation (with working conditions universities cannot compete with!).

Eric Vanden-Eijnden gave an introductory lecture on using optimal transport notions to improve sampling, with a PDE/ODE approach of continuously turning a base distribution into a target (formalised by the distribution at time one). This amounts to finding a velocity solution to a KL optimisation objective whose target value is zero. Velocity parameterised as a deep neural network density estimator. Using a score function in a reverse SDE inspired by Hyvärinen (2005), with a surprising occurrence of Stein’s unbiased estimator, there for the same reasons of getting rid of an unknown element. In a lot of environments, simulating from the target is the goal and this can be achieved by MCMC sampling by normalising flows, learning the transform / pushforward map.

At the break, Yuling Yao made a very smart remark that testing between two models could also be seen as an optimal transport, trying to figure an optimal transform from one model to the next, rather than the bland mixture model we used in our mixtestin paper. At this point I have no idea about the practical difficulty of using / inferring the parameters of this continuum but one could start from normalising flows. Because of time continuity, one would need some driving principle.

Esteban Tabak gave another interesting talk on simulating from a conditional distribution, which sounds like a no-problem when the conditional density is known but a challenge when only pairs are observed. The problem is seen as a transport problem to a barycentre obtained as a distribution independent from the conditioning z and then inverting. Constructing maps through flows. Very cool, even possibly providing an answer for causality questions.

Many of the transport talks involved normalizing flows. One by [Simons Fellow] Christopher Jarzynski about adding to the Hamiltonian (in HMC) an artificial flow field (Vaikuntanathan and Jarzynski, 2009) to make up for the Hamiltonian moving too fast for the simulation to keep track. Connected with Eric Vanden-Eijnden’s talk in the end.

An interesting extension of delayed rejection for HMC by Chirag Modi, with a manageable correction à la Antonietta Mira. Jonathan Niles-Weed provided a nonparametric perspective on optimal transport following Hütter+Rigollet, 21 AoS. With forays into the Sinkhorn algorithm, mentioning Aude Genevay’s (Dauphine graduate) regularisation.

Michael Lindsey gave a great presentation on the estimation of the trace of a matrix by the Hutchinson estimator for sdp matrices using only matrix multiplication. Solution surprisingly relying on Gibbs sampling called thermal sampling.
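For context, the basic Hutchinson estimator itself fits in a few lines: it averages z'Az over random sign vectors z and only needs matrix-vector products. The sketch below is the plain estimator with Rademacher probes on an arbitrary symmetric positive definite test matrix; it does not attempt the Gibbs-based thermal sampling refinement of the talk.

```python
import numpy as np

def hutchinson_trace(matvec, dim, n_probes=500, rng=None):
    """Estimate trace(A) as the average of z' A z over Rademacher probes z."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.choice([-1.0, 1.0], size=(dim, n_probes))   # one probe per column
    return float(np.mean(np.sum(z * matvec(z), axis=0)))

rng = np.random.default_rng(1)
b = rng.standard_normal((50, 50))
a = b @ b.T / 50 + np.eye(50)        # symmetric positive definite test matrix
estimate = hutchinson_trace(lambda v: a @ v, dim=50, n_probes=2000, rng=rng)
exact = float(np.trace(a))
```

The estimator is unbiased since E[zz'] is the identity, and with Rademacher probes its variance only involves the off-diagonal entries of the matrix.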

And while it did not involve optimal transport, I gave a short (lightning) talk on our recent adaptive restore paper: although in retrospect a presentation of Wasserstein ABC could have been more suited to the audience.

BNP13

Posted in Mountains, pictures, Running, Statistics, Travel on October 28, 2022 by xi'an

BNP13 is set in this incredible location on a massive lake (almost as large as Lac Saint Jean!) facing several tantalizing snow-capped volcanoes… My trip from Paris to Puerto Varas was quite smooth if relatively longish (but I slept close to 8 hours on the first leg and busied myself with Biometrika submissions the rest of the way). Leaving from Paris at midnight proved a double advantage as this was one of the last flights leaving, with hardly anyone in the airport. On Sunday, I arrived early enough to take a quick dip in Lake Llanquihue which was fairly cold and choppy!

Overall the conference is quite exhilarating as all talks are of interest and often covering on-going research. This may be one of the most engaging meetings I have attended in the past years! Plus a refreshing variety of topics and seniority in the speakers.

To start with a bang!, Sonia Petrone (Bocconi) gave a very nice plenary lecture in the most auspicious manner, covering her recent works on Bayesian prediction as an alternative way to run Bayesian inference (in connection with the incoming Read Paper by Fong et al.). She covered so much ground that I got lost before long (jetlag did not help!). However, an interesting feature underlying her talk is that, under exchangeability, the sequence of predictives converges to a random probability measure, a de Finetti way to construct the prior that is based on predictives. Avoiding in a sense the model and the prior on the parameters of that process. (The parameter is derived from the infinite exchangeable [or conditionally iid] sequence, but the sequence of predictives need be defined.) The drawback is that this approach involves infinite sequences, with practical truncation to a finite horizon being an approximation whose precision / error may prove elusive to characterise. The predictive approach also allows to recover a limiting Normal distribution (not a Bernstein-von Mises type!) and hence credible intervals on parameters and distributions.

While this is indeed a BNP conference (!), I was surprised to see a lot of talks paying attention to clustering and even to mixtures, with again a recurrent imprecision on the meaning of a cluster. (Maybe this was already the case for BNP11 in Paris but I may have been too busy helping with catering to notice!) For instance, Brian Trippe (MIT) gave a quick intro on his (AISTATS 2022) work on parallel MCMC with coupling. As unbiased MCMC strongly improving upon naïve parallel MCMC relative to the computing cost. With an interesting example where coupling is agnostic to the labeling of random partitions in clustering problems, involving optimal transport, manageable in O(K³log(K)) time when K is the number of clusters.