Archive for diffusions

simulation as optimization [by kernel gradient descent]

Posted in Books, pictures, Statistics, University life on April 13, 2024 by xi'an

Yesterday, which proved an unseasonably bright, warm day, I biked (with a new wheel!) to the east of Paris—in the Gare de Lyon district where I lived for three years in the 1980’s—to attend a Mokaplan seminar at INRIA Paris, where Anna Korba (CREST, to which I am also affiliated) talked about sampling through optimization of discrepancies.
This proved a most formative hour as I had not seen this perspective earlier (or had possibly forgotten about it), except through some of the talks at the Flatiron Institute workshop on Transport, Diffusions, and Sampling last year, including Marilou Gabrié’s and Arnaud Doucet’s.
The concept behind it remains attractive to me, at least conceptually, since it consists in approximating the target distribution, known either up to a constant (a setting I have always felt standard simulation techniques were not exploiting to the maximum) or through a sample (a less convincing setting since a sample from the target is already available), via a sequence of (particle-approximated) distributions, using the discrepancy between the current distribution and the target, or a gradient thereof, to move the particles. (With no randomness involved in the Kernel Stein Discrepancy Descent algorithm.)
Anna Korba spoke about practically running the algorithm, as well as about convexity properties and some convergence results (with mixed performances for the Stein kernel, as opposed to SVGD). I remain definitely curious about several aspects of the method, like the (ergodic) distribution of the endpoints, the actual gain over an MCMC sample once computing time is accounted for, the improvement upon the empirical distribution when using a sample from π and its ecdf as the substitute for π, and the meaning of an error estimate in this context.
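To fix ideas on the particle updates, here is a minimal numpy sketch of the SVGD update that served as the comparison point in the talk, for a target known up to a constant through its score ∇ log π; the RBF kernel, step size, and toy Gaussian target below are my own illustrative choices, not Anna Korba's settings, and the KSD descent she analysed replaces this (kernelised, stochastic-free) update with a deterministic gradient descent on the squared Stein discrepancy itself.

import numpy as np

def svgd_step(X, score, h=1.0, eps=0.1):
    # one SVGD update: particles X (n x d), score(X) = grad log pi at each particle
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]                     # x_j - x_i, shape (n, n, d)
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h ** 2))   # RBF kernel matrix k(x_j, x_i)
    gradK = -diff / h ** 2 * K[:, :, None]                   # grad of k(x_j, x_i) w.r.t. x_j
    phi = (K.T @ score(X) + gradK.sum(axis=0)) / n           # kernelised Stein direction
    return X + eps * phi

# toy run: standard bivariate Gaussian target, so score(x) = -x (density known up to a constant)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) + 2.0     # particles start off-target
for _ in range(1000):
    X = svgd_step(X, lambda x: -x)
print(X.mean(axis=0))                   # drifts toward the target mean (0, 0)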

“exponential convergence (of the KL) for the SVGD gradient flow does not hold whenever π has exponential tails and the derivatives of ∇ log π and k grow at most at a polynomial rate”

mostly M[ar]C[h]

Posted in Books, Kids, Statistics, University life on February 27, 2024 by xi'an

ellis unconference [not in Hawai’i]

Posted in pictures, Running, Travel, University life on July 26, 2023 by xi'an

As ICML 2023 is happening this week in Hawai’i, many did not have the opportunity to get there, for whatever reason, and hence the ellis (European Lab for Learning and Intelligent Systems) board launched [fairly late!], with the help of Hi! Paris, an unconference (i.e., a mirror) taking place at HEC, Jouy-en-Josas, SW of Paris, for AI researchers presenting works (theirs or others’) appearing at ICML 2023. Or not. There was no direct broadcasting of talks as we had in CIRM for ISBA 2020/2021, but some presentations were based on preregistered talks. Over 50 people showed up in Jouy.

As it happened, I had quite an exciting bike ride to the HEC campus from home, under a steady rain, crossing a (modest) forest (de Verrières) I had never visited before despite it being a few km from home, getting a wee bit lost, being stopped at a train crossing between Bièvre and Jouy, and ending up at the campus just in time for the first talk (as I had not accounted for the huge altitude differential). Among the curiosities met on the way: “giant” sequoias, a Tonkin pond, and Chateaubriand’s house.

As always, I am rather impressed by how efficiently AI-ML conferences are run, with papers+slides+reviews online, plus extra material as in this example. Lots of papers on diffusion models this year, apparently. (In conjunction with the trend observed at the Flatiron workshop last Fall.) Below are incoherent tidbits from the presentations I attended:

  • exponential convergence of the Sinkhorn algorithm, by Alain Durmus and co-authors, with the surprising occurrence of a left Haar measure (see the short Sinkhorn sketch after this list for the basic iterations)
  • a paper (by Jerome Baum, Heishiro Kanagawa, and my friend Arthur Gretton) on Stein discrepancies, with a Zanella-Stein operator relating to Metropolis-Hastings/Barker since it has expectation zero under stationarity, and an interesting approach to variable-length random variables, not RJMCMC, but nearby.
  • the occurrence of a criticism of the EU GDPR that did not feel appropriate for synthetic data used in privacy protection.
  • the alternative Sliced Wasserstein distance, making me wonder if we could optimally go from measure μ to measure ζ using random directions or how much was lost this way.

\mathbb E[y|X=x] = \mathbb E\left[y\frac{f_{XY}(x,y)}{f_X(x)f_Y(y)}|X=x\right] = \frac{\mathbb E\left[y\frac{f_{XY}(x,y)}{f_Y(y)}|X=x\right]}{f_X(x)}

a representation whose practical use seemed fragile, as (a) densities are replaced with kernel estimates, (b) the outer density may be very small, and (c) no variance assessment is provided.

  • Markov score climbing and transport score climbing using a normalising flow, for variational approximation, presented by Christian Naesseth, with a warping transform that sounded like inverting the flow (?)
  • Yazid Janati not presenting their ICML paper State and parameter learning with PARIS particle Gibbs, written with Gabriel Cardoso, Sylvain Le Corff, Eric Moulines, and Jimmy Olsson, but another work with a diffusion-based model to be learned by SMC and a clever call to Tweedie’s formula. (Maurice Kenneth Tweedie, not Richard Tweedie!) Which I just realised I have used many times when working on Bayesian shrinkage estimators.
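For the record (and my own benefit), a minimal numpy sketch of the plain Sinkhorn iterations for entropic optimal transport, as referenced in the first bullet above; the cost matrix, regularisation ε, and toy histograms are illustrative, and the exponential convergence analysis of the paper is obviously not reproduced here.

import numpy as np

def sinkhorn(a, b, C, eps=0.05, n_iter=1000):
    # entropic optimal transport between histograms a and b with cost matrix C
    K = np.exp(-C / eps)                 # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        v = b / (K.T @ u)                # scale columns toward marginal b
        u = a / (K @ v)                  # scale rows toward marginal a
    return u[:, None] * K * v[None, :]   # transport plan

# toy example: two histograms on a regular grid, squared-distance cost
x = np.linspace(0.0, 1.0, 50)
a = np.exp(-(x - 0.3) ** 2 / 0.01); a /= a.sum()
b = np.exp(-(x - 0.7) ** 2 / 0.02); b /= b.sum()
P = sinkhorn(a, b, (x[:, None] - x[None, :]) ** 2)
print(np.abs(P.sum(axis=1) - a).max())   # row marginals match a; columns converge to b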

diffusions, sampling, and transport

Posted in Books, pictures, Statistics, Travel, University life on November 21, 2022 by xi'an

The third and final day of the workshop was shortened for me as I had to catch an early flight back to Paris (and as I got overly conservative in my estimate of the time needed to return to JFK, catching a train with no delay at Penn Station and thus finding myself with two hours free before boarding, hence reviewing remaining Biometrika submissions at the airport while waiting). As a result I missed the afternoon talks.

The morning was mostly about using scores for simulation (a topic of which I was mostly unaware), with Yang Song giving the introductory lecture on creating better generative models via the score function, with a massive production of his own on the topic (but too many image simulations of dogs, cats, and celebrities!). Estimating the score directly is feasible via the Fisher divergence or score matching à la Hyvärinen (with a return of Stein’s unbiased estimator of the risk!), and the estimated score can then be used to simulate / generate by Langevin dynamics or other MCMC methods that do not require density evaluations. Due to poor performance in low-density (hence poorly learned) regions, a fix is randomisation / tempering, but the resolution (as exposed) sounded clumsy. (And made me wonder about using some more advanced form of deconvolution, since the randomisation pattern is controlled.) The talk showed some impressive text-to-image simulations used by an animation studio!
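As a reminder of the mechanism, here is a minimal sketch of unadjusted Langevin simulation driven solely by a score function, which is all that score-based generative models require; the exact mixture score below stands in for a learned network, and every name and tuning constant is an illustrative choice of mine.

import numpy as np

def ula(score, x0=0.0, step=1e-2, n_steps=20_000, seed=0):
    # unadjusted Langevin: x <- x + step * score(x) + sqrt(2 * step) * noise; no density needed
    rng = np.random.default_rng(seed)
    x, xs = x0, np.empty(n_steps)
    for i in range(n_steps):
        x = x + step * score(x) + np.sqrt(2 * step) * rng.normal()
        xs[i] = x
    return xs

# stand-in for a learned score: exact score of an equal-weight mixture N(-1.5, 1) + N(1.5, 1)
def mixture_score(x, mus=(-1.5, 1.5)):
    w = np.array([np.exp(-0.5 * (x - m) ** 2) for m in mus])   # unnormalised responsibilities
    return (w[0] * (mus[0] - x) + w[1] * (mus[1] - x)) / w.sum()

chain = ula(mixture_score)
print(chain.mean(), chain.std())   # roughly 0 and sqrt(1 + 1.5²) ≈ 1.8, up to discretisation bias

The annealed / tempered versions mentioned in the talk run such dynamics through a sequence of noise levels, precisely to cope with the poorly learned low-density regions.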


And then my friend Arnaud Doucet continued on the same theme, motivated by the estimation of normalising constants through annealed importance sampling [Yuling’s meta-perspective comes back to mind, in that the geometric mixture is not the only choice, but then with which objective?]. In AIS, as in a series of Arnaud’s works, like the 2006 SMC Read Paper with Pierre Del Moral and Ajay Jasra, the importance (!) of some auxiliary backward kernels goes beyond theoretical arguments, with the ideal sequence being provided by a Langevin diffusion, hence involving a score, learned as in the previous talk. Arnaud reformulated this issue as creating a transportation map and its reverse, which leads to their recent Schrödinger bridge generative model. Which [imho] both brings a unifying perspective to his work and an efficient way to bridge prior to posterior in AIS. A most profitable morn for me!
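For concreteness, a minimal sketch of annealed importance sampling along the default geometric path γ_t ∝ π_0^{1-t} π_1^t, with plain random-walk Metropolis moves at each temperature rather than the Langevin kernels and learned backward kernels of the talk; the schedule and toy target are illustrative.

import numpy as np

def ais_log_Z(log_target, n_particles=2000, n_temps=100, step=0.5, seed=0):
    # AIS estimate of log Z for an unnormalised log_target, with a N(0,1) base distribution
    rng = np.random.default_rng(seed)
    betas = np.linspace(0.0, 1.0, n_temps + 1)
    x = rng.normal(size=n_particles)                         # exact draws from the base
    log_base = lambda z: -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)
    log_gamma = lambda z, b: (1 - b) * log_base(z) + b * log_target(z)
    log_w = np.zeros(n_particles)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        log_w += log_gamma(x, b) - log_gamma(x, b_prev)      # incremental importance weight
        prop = x + step * rng.normal(size=n_particles)       # one random-walk MH move at level b
        accept = np.log(rng.uniform(size=n_particles)) < log_gamma(prop, b) - log_gamma(x, b)
        x = np.where(accept, prop, x)
    return np.logaddexp.reduce(log_w) - np.log(n_particles)  # log of the average weight

# toy check: unnormalised N(2, 0.5²) kernel, whose true log Z is log(0.5 · sqrt(2π))
log_target = lambda z: -0.5 * ((z - 2.0) / 0.5) ** 2
print(ais_log_Z(log_target), np.log(0.5 * np.sqrt(2 * np.pi)))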

Overall, this was an exhilarating workshop, full of discoveries for me and providing me with the opportunity to meet and exchange with mostly people I had not met before. Thanks to Bob Carpenter and Michael Albergo for organising and running the workshop!

transport, diffusions, and sampling

Posted in pictures, Statistics, Travel, University life on November 19, 2022 by xi'an

At the Sampling, Transport, and Diffusions workshop at the Flatiron Institute, on Day #2, Marilou Gabrié (École Polytechnique) gave the second introductory lecture, on merging sampling and normalising flows targeting the target distribution, driven by a divergence criterion like the KL that only requires the shape of the target density. I first wondered about ergodicity guarantees when simultaneously running MCMC and training the map, due to the adaptation of the flow, but the update of the map only depends on the current particle cloud in (8). From an MCMC perspective, it sounds somewhat paradoxical to see the independent sampler making such an unexpected come-back, considering that no insider information is available about the (complex) posterior to drive the [what-you-get-is-what-you-see] construction of the transport map. However, the proposed approach superposes local (random-walk-like) and global (transport) proposals in Algorithm 1.
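To illustrate the local/global superposition (and the adaptation that triggered my ergodicity question), here is a deliberately simplified sketch in which a Gaussian fitted on the chain history stands in for the trained normalising flow, alternating random-walk and independence Metropolis-Hastings proposals; this is only the flavour of Algorithm 1, not Marilou Gabrié's actual scheme, and all settings are illustrative.

import numpy as np
from scipy.stats import multivariate_normal as mvn

def local_global_sampler(log_target, x0, n_iter=5000, rw_scale=0.5, refit_every=200, seed=0):
    # alternate a local random-walk move and a global independence move from a Gaussian
    # fitted on the chain history (the Gaussian stands in for the trained flow)
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float); lx = log_target(x); d = x.size
    mu, cov = np.zeros(d), np.eye(d)
    chain = [x.copy()]
    for t in range(1, n_iter + 1):
        if t % 2:                                   # local proposal (random walk)
            prop, log_q_ratio = x + rw_scale * rng.normal(size=d), 0.0
        else:                                       # global proposal (independence sampler)
            prop = rng.multivariate_normal(mu, cov)
            log_q_ratio = mvn.logpdf(x, mu, cov) - mvn.logpdf(prop, mu, cov)
        lp = log_target(prop)
        if np.log(rng.uniform()) < lp - lx + log_q_ratio:
            x, lx = prop, lp
        chain.append(x.copy())
        if t % refit_every == 0:                    # adaptation step: refit the global proposal
            H = np.asarray(chain)                   # (this is where ergodicity questions arise)
            mu, cov = H.mean(axis=0), np.cov(H.T) + 1e-6 * np.eye(d)
    return np.asarray(chain)

# toy example: strongly correlated bivariate Gaussian target, known up to a constant
prec = np.linalg.inv(np.array([[1.0, 0.9], [0.9, 1.0]]))
chain = local_global_sampler(lambda z: -0.5 * z @ prec @ z, x0=[3.0, -3.0])
print(chain[1000:].mean(axis=0))   # close to (0, 0)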

Qiang Liu followed on learning transport maps, with the interesting notion of causalizing a graph by removing intersections (which are impossible for an ODE, as discussed in Eric Vanden-Eijnden’s talk yesterday) through coupling, which underlies his notion of rectified flows. Possibly connecting with the next lightning talk by Jonathan Weare on spurious modes created by a variational Monte Carlo sampler and the use of stochastic gradients, corrected by (case-dependent?) regularisation.

Then came a whole series of MCMC talks!

Sam Livingstone spoke on Barker’s proposal (an incoming Biometrika paper!) as part of a general class of transforms g of the MH ratio, using jump processes based on a nasty normalising constant related to g (tractable for the original Barker algorithm). I then realised I had missed his StatSci paper on how to speak to statistical physics researchers!
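For the record, the original Barker rule replaces the Metropolis acceptance g(t)=min(1,t) with g(t)=t/(1+t), both satisfying the balance condition g(t)=t·g(1/t); here is a minimal sketch comparing the two within a plain random walk on an illustrative Gaussian target, which is of course neither the gradient-based Barker proposal nor the general jump processes of the talk.

import numpy as np

def rw_sampler(log_target, g, x0=0.0, scale=1.0, n_iter=20_000, seed=0):
    # symmetric random walk accepted with probability g(t), t = ratio of target densities
    rng = np.random.default_rng(seed)
    x, lx = x0, log_target(x0)
    out = np.empty(n_iter)
    for i in range(n_iter):
        prop = x + scale * rng.normal()
        lp = log_target(prop)
        if rng.uniform() < g(np.exp(lp - lx)):
            x, lx = prop, lp
        out[i] = x
    return out

metropolis = lambda t: min(1.0, t)    # Metropolis-Hastings: g(t) = min(1, t)
barker     = lambda t: t / (1.0 + t)  # original Barker rule: g(t) = t / (1 + t)

log_target = lambda z: -0.5 * z ** 2  # standard normal target, up to a constant
for g in (metropolis, barker):
    chain = rw_sampler(log_target, g)
    print(chain.mean(), chain.var())  # both around 0 and 1; Barker simply accepts less often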

Charles Margossian spoke about using a massive number of short parallel runs (the many-short-chains regime), from a recent paper written with Aki, Andrew, and Lionel Riou-Durand (Warwick), among others. Which brings us back to the challenge of producing convergence diagnostics, and precisely the Gelman-Rubin R̂ statistic or its recent nested-R̂ avatar (with its linear limitations and dependence on parameterisation, as opposed to fuller distributional criteria). The core of the approach is in using blocks of GPUs to improve and speed up the estimation of the between-chain variance. (D for R².) I still wonder about the waste of simulations / computing power resulting from stopping the runs almost immediately after warm-up is over, since reaching the stationary regime, or an approximation thereof, should be exploited more efficiently. (Starting from a minimal-discrepancy sample would also improve efficiency.)
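Since the selling point of the many-short-chains regime is that the between-chain variance becomes well estimated, here is a minimal sketch of the classical (non-nested) R̂ computation on a bank of short chains; this is not the nested R̂ of the paper, and the simulated chains are purely illustrative.

import numpy as np

def rhat(draws):
    # classical potential scale reduction factor for one scalar quantity
    # draws: array of shape (n_chains, n_draws)
    _, n = draws.shape
    chain_means = draws.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance (times n)
    W = draws.var(axis=1, ddof=1).mean()     # average within-chain variance
    var_plus = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_plus / W)

# many-short-chains regime: 1000 chains of 50 post-warmup draws each
rng = np.random.default_rng(1)
ok = rng.normal(size=(1000, 50))                    # every chain targets N(0, 1)
bad = ok + rng.normal(scale=0.5, size=(1000, 1))    # chains stuck around different offsets
print(rhat(ok), rhat(bad))                          # about 1.0 versus clearly above 1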

Lu Zhang also talked on the issue of cutting down warmup, presenting a paper co-authored with Bob, Andrew, and Aki, recommending Laplace / variational approximations to reach high-posterior-density regions faster, using an algorithm called Pathfinder that relies on ELBO checks to counter the poor performance of Laplace approximations. In the spirit of the workshop, it could be profitable to further transform / push forward the outcome by a transport map.
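As a crude illustration of the idea, and definitely not Pathfinder itself (which builds Gaussian approximations along the quasi-Newton optimisation path), here is a sketch that fits a Laplace-type Gaussian via BFGS and runs a Monte Carlo ELBO check before handing draws over as starting points for MCMC; the toy posterior and every name below are mine.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal as mvn

def laplace_init(log_post, x0, n_draws=512, seed=0):
    # optimisation-based Gaussian approximation plus a Monte Carlo ELBO check,
    # whose draws can then initialise the MCMC chains
    rng = np.random.default_rng(seed)
    res = minimize(lambda z: -log_post(z), x0, method="BFGS")
    mu, cov = res.x, res.hess_inv                    # BFGS inverse Hessian as covariance
    q = mvn(mean=mu, cov=cov)
    draws = q.rvs(n_draws, random_state=rng)
    elbo = np.mean([log_post(z) for z in draws]) - q.logpdf(draws).mean()
    return mu, cov, elbo, draws

# toy example: correlated Gaussian "posterior", known up to a constant
prec = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))
mu, cov, elbo, draws = laplace_init(lambda z: -0.5 * z @ prec @ z, x0=np.array([5.0, -5.0]))
print(mu, elbo)   # mode near (0, 0); the ELBO estimates a lower bound on log Z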

Yuling Yao (of stacking and Pareto smoothing fame!) gave an original and challenging (in a positive sense) talk on the many ways of bridging densities [linked with the remark he shared with me the day before] and their statistical significance, questioning our usual reliance on arithmetic or geometric mixtures. Ignoring computational issues, selecting a bridging pattern sounds no different from choosing a parameterised family of embedding distributions. This new typology of models can then be endowed with properties that are more or less appealing. (Occurrences of the Hyvärinen score and of our mixtestin perspective in the talk!)
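For concreteness, the two default bridging schemes being questioned, written (in my notation, not Yuling's) for endpoint densities p_0 and p_1 and t ∈ [0,1]:

\text{geometric: } q_t(x) \propto p_0(x)^{1-t}\,p_1(x)^{t} \qquad \text{arithmetic: } q_t(x) \propto (1-t)\,p_0(x) + t\,p_1(x)

and the point, as I understood it, is that many other parameterised families of intermediate distributions could serve, each with its own statistical behaviour.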

Miranda Holmes-Cerfon talked about MCMC on stratifications (illustrated by this beautiful picture of nanoparticle random walks), which means sampling under varying constraints and dimensions, with associated densities defined under the respective Hausdorff measures. This sounds like a perfect setting for reversible jump, and in a sense it is, as mentioned in the talk. Except that the moves between manifolds are driven by the proximity to said manifolds, helping with a higher acceptance rate, and making the proposals easier to construct since projections (or their reverses) have a physical meaning. (But I could not tell from the talk why the approach seemingly escaped the symmetry constraint set by Peter Green’s RJMCMC on the reciprocal moves between two given manifolds.)