Archive for Hyvärinen score

diffusions, sampling, and transport

Posted in Books, pictures, Statistics, Travel, University life on November 21, 2022 by xi'an

The third and final day of the workshop was shortened for me as I had to catch an early flight back to Paris (and as I got overly conservative in my estimation for returning to JFK, catching a train with no delay at Penn Station and thus finding myself with two hours free before boarding, hence reviewing remaining Biometrika submission at the airport while waiting). As a result I missed the afternoon talks.

The morning was mostly about using scores for simulation (a topic of which I was mostly unaware), with Yang Song giving the introductory lecture on creating better [cf pix left] generative models via the score function, with a massive production of his on the topic (but too many image simulations of dogs, cats, and celebrities!). Estimating directly the score is feasible via Fisher divergence or score matching à la Hyvärinen (with a return of Stein’s unbiased estimator of the risk!). And relying on estimated scores to simulate / generate by Langevin dynamics or other MCMC methods that do not require density evaluations. Due to poor performances in low density / learning regions a fix is randomization / tempering but the resolution (as exposed) sounded clumsy. (And made me wonder at using some more advanced form of deconvolution since the randomization pattern is controlled.) The talk showed some impressive text to image simulations used by an animation studio!
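As a minimal sketch of the last point (sampling from the score alone, with no density evaluation), here is an unadjusted Langevin iteration in Python. The exact score of a Gaussian stands in for a learned score network, and the step size and number of steps are arbitrary choices of mine, not taken from the talk.

```python
import numpy as np

def langevin_sample(score, x0, step, n_steps, rng):
    """Unadjusted Langevin: x <- x + step * score(x) + sqrt(2 * step) * noise."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step * score(x) + np.sqrt(2.0 * step) * noise
    return x

# Hypothetical stand-in for an estimated score: the exact score of N(mu, sigma^2).
mu, sigma = 2.0, 0.5

def score(x):
    return -(x - mu) / sigma**2

rng = np.random.default_rng(0)
x0 = rng.standard_normal(5000)  # 5000 independent particles started from N(0, 1)
samples = langevin_sample(score, x0, step=1e-3, n_steps=2000, rng=rng)
```

Only the gradient of the log-density enters the recursion, which is why the normalising constant never matters. Without a Metropolis correction the chain carries a discretisation bias of the order of the step size.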

And then my friend Arnaud Doucet continued on the same theme, motivated by estimating normalising constants through annealed importance sampling [Yuling’s meta-perspective comes back to mind in that the geometric mixture is not the only choice, but with which objective]. In AIS, as in a series of Arnaud’s works, like the 2006 SMC Read Paper with Pierre Del Moral and Ajay Jasra, the importance (!) of some auxiliary backward kernels goes beyond theoretical arguments, with the ideal sequence being provided by a Langevin diffusion. Hence involving a score, learned as in the previous talk. Arnaud reformulated this issue as creating a transportation map and its reverse, which leads to their recent Schrödinger bridge generative model. Which [imho] both brings a unification perspective to his work and an efficient way to bridge prior to posterior in AIS. A most profitable morn for me!
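For readers unfamiliar with AIS, a bare-bones sketch of the normalising constant estimate along the geometric bridge follows. It is a toy version under my own assumptions: a one-dimensional Gaussian target with known constant, a uniform temperature ladder, and a random-walk Metropolis move at each temperature in place of the Langevin or learned kernels discussed in the talk.

```python
import numpy as np

def log_p0(x):
    # Normalised N(0, 1) starting density
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def log_gamma1(x):
    # Unnormalised target, with true constant Z = 0.5 * sqrt(2 * pi)
    return -0.5 * ((x - 3.0) / 0.5) ** 2

def log_bridge(x, beta):
    # Geometric bridge between p0 and gamma1
    return (1.0 - beta) * log_p0(x) + beta * log_gamma1(x)

def ais_log_z(n_particles=2000, n_temps=500, rw_scale=0.5, seed=0):
    rng = np.random.default_rng(seed)
    betas = np.linspace(0.0, 1.0, n_temps + 1)
    x = rng.standard_normal(n_particles)
    log_w = np.zeros(n_particles)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        # Incremental importance weight, evaluated at the current particles
        log_w += (b - b_prev) * (log_gamma1(x) - log_p0(x))
        # One random-walk Metropolis move leaving the bridge at level b invariant
        prop = x + rw_scale * rng.standard_normal(n_particles)
        log_acc = log_bridge(prop, b) - log_bridge(x, b)
        accept = np.log(rng.uniform(size=n_particles)) < log_acc
        x = np.where(accept, prop, x)
    # Log of the average weight, computed stably
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))

log_z_hat = ais_log_z()
true_log_z = np.log(0.5 * np.sqrt(2 * np.pi))
```

Replacing `log_bridge` by another interpolation between the two densities changes the path but not the validity of the estimate, which is where the question of the objective comes in.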

Overall, this was an exhilarating workshop, full of discoveries for me and providing me with the opportunity to meet and exchange with people I had mostly not met before. Thanks to Bob Carpenter and Michael Albergo for organising and running the workshop!

transport, diffusions, and sampling

Posted in pictures, Statistics, Travel, University life on November 19, 2022 by xi'an

At the Sampling, Transport, and Diffusions workshop at the Flatiron Institute, on Day #2, Marilou Gabrié (École Polytechnique) gave the second introductory lecture on merging sampling and normalising flows aimed at the target distribution, driven by a divergence criterion like KL that only requires the shape of the target density. I first wondered about ergodicity guarantees in simultaneous MCMC and map training due to the adaptation of the flow but the update of the map only depends on the current particle cloud in (8). From an MCMC perspective, it sounds somewhat paradoxical to see the independent sampler making such an unexpected come-back when considering that no insider information is available about the (complex) posterior to drive the [what-you-get-is-what-you-see] construction of the transport map. However, the proposed approach superposes local (random-walk like) and global (transport) proposals in Algorithm 1.
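To illustrate the superposition of local and global proposals, here is a toy Metropolis-Hastings sampler mixing a random-walk kernel with an independent proposal. This is not the paper's Algorithm 1: the fixed wide Gaussian is a hypothetical stand-in for the push-forward of a trained flow, there is no training step, and the bimodal target and all tuning constants are mine.

```python
import numpy as np

def log_target(x):
    # Unnormalised equal-weight mixture of N(-4, 1) and N(4, 1)
    return np.logaddexp(-0.5 * (x + 4.0) ** 2, -0.5 * (x - 4.0) ** 2)

def log_q_global(x, scale):
    # Log-density of the global proposal N(0, scale^2), constant dropped
    return -0.5 * (x / scale) ** 2

def mixed_mh(n_iter=20000, p_global=0.2, rw_scale=1.0, scale=5.0, seed=0):
    rng = np.random.default_rng(seed)
    x = -4.0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        if rng.uniform() < p_global:
            # Global move: independent proposal, full Hastings ratio
            prop = scale * rng.standard_normal()
            log_ratio = (log_target(prop) - log_target(x)
                         + log_q_global(x, scale) - log_q_global(prop, scale))
        else:
            # Local move: symmetric random walk, proposal terms cancel
            prop = x + rw_scale * rng.standard_normal()
            log_ratio = log_target(prop) - log_target(x)
        if np.log(rng.uniform()) < log_ratio:
            x = prop
        chain[t] = x
    return chain

chain = mixed_mh()
frac_right = np.mean(chain > 0)
```

Each kernel is reversible with respect to the target, so picking one at random at each iteration preserves the target. The random walk alone would essentially never cross between the modes at ±4, while the global move does the crossing and the local one the exploration within a mode.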

Qiang Liu followed on learning transport maps, with the interesting notion of causalizing a graph by removing intersections (which are impossible for an ODE, as discussed by Eric Vanden-Eijnden’s talk yesterday) through coupling. Which underlies his notion of rectified flows. Possibly connecting with the next lightning talk by Jonathan Weare on spurious modes created by a variational Monte Carlo sampler and the use of stochastic gradient, corrected by (case-dependent?) regularisation.

Then came a whole series of MCMC talks!

Sam Livingstone spoke on Barker’s proposal (an incoming Biometrika paper!) as part of a general class of transforms g of the MH ratio, using jump processes based on a nasty normalising constant related to g (tractable for the original Barker algorithm). I then realised I had missed his StatSci paper on how to speak to statistical physics researchers!

Charles Margossian spoke about using a massive number of short parallel runs (many-short-chain regime) from a recent paper written with Aki, Andrew, and Lionel Riou-Durand (Warwick) among others. Which brings us back to the challenge of producing convergence diagnostics and precisely the Gelman-Rubin R̂ statistic or its recent nested R̂ avatar (with its linear limitations and dependence on parameterisation, as opposed to fuller distributional criteria). The core of the approach is in using blocks of GPUs to improve and speed up the estimation of the between-chain variance. (D for R².) I still wonder at a waste of simulations / computing power resulting from stopping the runs almost immediately after warm-up is over, since reaching the stationary regime or an approximation thereof should be exploited more efficiently. (Starting from a minimal discrepancy sample would also improve efficiency.)
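As a reminder of what the diagnostic compares, here is a short sketch of the classical (non-split, non-nested) Gelman-Rubin statistic for a scalar quantity. It is the textbook version, not the nested variant of the talk, and the simulated chains below are purely illustrative.

```python
import numpy as np

def gelman_rubin(chains):
    """Classical R-hat for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    chain_means = chains.mean(axis=1)
    between = n * chain_means.var(ddof=1)        # between-chain variance B
    within = chains.var(axis=1, ddof=1).mean()   # within-chain variance W
    var_plus = (n - 1) / n * within + between / n
    return np.sqrt(var_plus / within)

rng = np.random.default_rng(0)
mixed = rng.standard_normal((4, 1000))                    # four chains on the same target
stuck = mixed + np.array([0.0, 0.0, 3.0, 3.0])[:, None]   # two chains stuck elsewhere
rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
```

The statistic only involves first and second moments of the chosen quantity, hence the linear limitations and the dependence on parameterisation mentioned above.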

Lu Zhang also talked on the issue of cutting down warmup, presenting a paper co-authored with Bob, Andrew, and Aki, recommending Laplace / variational approximations for reaching faster high-posterior-density regions, using an algorithm called Pathfinder that relies on ELBO checks to counter poor performances of Laplace approximations. In the spirit of the workshop, it could be profitable to further transform / push-forward the outcome by a transport map.

Yuling Yao (of stacking and Pareto smoothing fame!) gave an original and challenging (in a positive sense) talk on the many ways of bridging densities [linked with the remark he shared with me the day before] and their statistical significance. Questioning our usual reliance on arithmetic or geometric mixtures. Ignoring computational issues, selecting a bridging pattern sounds not different from choosing a parameterised family of embedding distributions. This new typology of models can then be endowed with properties that are more or less appealing. (Occurrences of the Hyvärinen score and our mixtestin perspective in the talk!)

Miranda Holmes-Cerfon talked about MCMC on stratification (illustrated by this beautiful picture of nanoparticle random walks). Which means sampling under varying constraints and dimensions with associated densities under the respective Hausdorff measures. This sounds like a perfect setting for reversible jump and in a sense it is, as mentioned in the talks. Except that the moves between manifolds are driven by the proximity to said manifold, helping with a higher acceptance rate, and making the proposals easier to construct since projections (or the reverses) have a physical meaning. (But I could not tell from the talk why the approach was seemingly escaping the symmetry constraint set by Peter Green’s RJMCMC on the reciprocal moves between two given manifolds).

21w5107 [½day 3]

Posted in pictures, Statistics, Travel, University life on December 2, 2021 by xi'an

Day [or half-day] three started without firecrackers and with David Rossell (formerly Warwick) presenting an empirical Bayes approach to generalised linear model choice with a high degree of confounding, using approximate Laplace approximations. With considerable improvements in the experimental RMSE. Making me feel sorry there was no apparent fully (and objective?) Bayesian alternative! (Two more papers on my reading list that I should have read way earlier!) Then Veronika Rockova discussed her work on approximate Metropolis-Hastings by classification. (With only a slight overlap with her One World ABC seminar.) Making me once more think of Geyer’s n⁰564 technical report, namely the estimation of a marginal likelihood by a logistic discrimination representation. Her ABC resolution replaces the tolerance step by an exponential of minus the estimated Kullback-Leibler divergence between the data density and the density associated with the current value of the parameter. (I wonder if there is a residual multiplicative constant there… Presumably not. Great idea!) The classification step needs to be run at every iteration, which could be sped up by subsampling.

On the always fascinating theme of loss based posteriors, à la Bissiri et al., Jack Jewson (formerly Warwick) exposed his work on generalised Bayesian and improper models (from Birmingham!). Using data to decide between model and loss, which sounds highly unorthodox! First difficulty is that losses are unscaled. Or even not integrable after an exponential transform. Hence the notion of improper models. As in the case of robust Tukey’s loss, which is bounded by an arbitrary κ. Immediately I wonder if the fact that the pseudo-likelihood does not integrate is important beyond the (obvious) absence of a normalising constant. And the fact that this is not a generative model. And the answer came a few slides later with the use of the Hyvärinen score. Rather than the likelihood score. Which can itself be turned into a H-posterior, very cool indeed! Although I wonder at the feasibility of finding an [objective] prior on κ.

Rajesh Ranganath completed the morning session with a talk on [the difficulty of] connecting Bayesian models and complex prediction models. Using instead a game theoretic approach with Brier scores under censoring. While there was a connection with Veronika’s use of a discriminator as a likelihood approximation, I had trouble catching the overall message…

training energy based models

Posted in Books, Statistics on April 7, 2021 by xi'an

This recent arXival by Song and Kingma covers different computational approaches to semi-parametric estimation, but also exposes imho the chasm existing between statistical and machine learning perspectives on the problem.

“Energy-based models are much less restrictive in functional form: instead of specifying a normalized probability, they only specify the unnormalized negative log-probability (…) Since the energy function does not need to integrate to one, it can be parameterized with any nonlinear regression function.”

The above in the introduction appears first as a strange argument, since the mass one constraint is the least of the problems when addressing non-parametric density estimation. Problems like the convergence, the speed of convergence, the computational cost and the overall integrability of the estimator. It seems however that the restriction or lack thereof is to be understood as the ability to use much more elaborate forms of densities, which are then black-boxes whose components have little relevance… When using such mega-over-parameterised representations of densities, such as neural networks and normalising flows, a statistical assessment leads to highly challenging questions. But convergence (in the sample size) does not appear to be a concern for the paper. (Except for a citation of Hyvärinen on p.5.)

Using MLE in this context appears to be questionable, though, since the base parameter θ is unlikely to remain identifiable. Computing the MLE is therefore a minor issue in this regard, a resolution based on simulated gradients being well-charted from the earlier era of stochastic optimisation as in Robbins & Monro (1951), Duflo (1996) or Benveniste & al. (1990). (The log-gradient of the normalising constant being estimated by the opposite of the gradient of the energy at a random point.)
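The parenthetical remark can be made concrete on a toy energy where everything is known in closed form. In the sketch below (my own construction, not taken from the paper) the energy is E(x) = exp(η) x² / 2, so the model is N(0, exp(−η)), and the gradient of the log normalising constant is replaced by minus the energy gradient at a single draw from the current model, within a Robbins-Monro recursion. The exact Gaussian draw stands in for the MCMC step a real energy-based model would need, and the step sizes are arbitrary.

```python
import numpy as np

def energy_grad_eta(x, eta):
    # E_eta(x) = exp(eta) * x^2 / 2, hence dE/d(eta) = exp(eta) * x^2 / 2
    return np.exp(eta) * x**2 / 2.0

def fit_energy_model(data, n_steps=20000, seed=0):
    rng = np.random.default_rng(seed)
    eta = 0.0
    for t in range(1, n_steps + 1):
        x_obs = data[rng.integers(len(data))]
        # Exact draw from the current model N(0, exp(-eta)); MCMC in general
        x_mod = np.exp(-eta / 2.0) * rng.standard_normal()
        # grad log-likelihood = -dE(x_obs) - grad log Z, and
        # grad log Z = -E_model[dE], estimated here from one model draw
        grad = -energy_grad_eta(x_obs, eta) + energy_grad_eta(x_mod, eta)
        eta += 4.0 / (t + 100.0) * grad  # decreasing Robbins-Monro step
    return np.exp(eta)

rng = np.random.default_rng(1)
data = 2.0 * rng.standard_normal(5000)  # true precision exp(eta) = 1/4
theta_hat = fit_energy_model(data)
theta_mle = 1.0 / np.mean(data**2)      # closed-form MLE for comparison
```

The recursion never evaluates the normalising constant, only energy gradients at observed and simulated points, which is the whole appeal when the constant is intractable.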

“Running MCMC till convergence to obtain a sample x∼p(x) can be computationally expensive.”

Contrastive divergence à la Hinton (2002) is presented as a solution to the convergence problem by stopping early, which seems reasonable given the random gradient is mostly noise. With a possible correction for bias à la Jacob & al. (missing the published version).

An alternative to MLE is the 2005 Hyvärinen score, notorious for bypassing the normalising constant. But blamed in the paper for being costly in the dimension d of the variate x, due to the second derivative matrix. Which can be avoided by using Stein’s unbiased estimator of the risk (yay!) if using randomized data. And surprisingly linked with contrastive divergence as well, if a Taylor expansion is good enough an approximation! An interesting byproduct of the discussion on score matching is to turn it into an unintended form of ABC!
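A small sketch of why score matching bypasses the constant, on a one-dimensional Gaussian model written in unnormalised form log p(x) = −λ (x − μ)² / 2 + const. The objective below is the usual empirical version (half the squared score plus its derivative), and for this model its minimiser is available in closed form. The example is mine and only meant as an illustration, with no second-derivative matrix to worry about in dimension one.

```python
import numpy as np

def score_matching_objective(x, mu, lam):
    # Model score d/dx log p(x) = -lam * (x - mu); the constant never appears
    s = -lam * (x - mu)
    ds = -lam  # derivative of the score in x
    return np.mean(0.5 * s**2 + ds)

def fit_gaussian_score_matching(x):
    # Minimiser of the objective above: mu is the sample mean,
    # lam is the inverse of the (biased) sample variance
    mu = np.mean(x)
    lam = 1.0 / np.mean((x - mu) ** 2)
    return mu, lam

rng = np.random.default_rng(0)
x = 1.5 + 2.0 * rng.standard_normal(10000)
mu_hat, lam_hat = fit_gaussian_score_matching(x)
best = score_matching_objective(x, mu_hat, lam_hat)
```

In this Gaussian case the score matching estimator coincides with the MLE, which is not true in general.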

“Many methods have been proposed to automatically tune the noise distribution, such as Adversarial Contrastive Estimation (Bose et al., 2018), Conditional NCE (Ceylan and Gutmann, 2018) and Flow Contrastive Estimation (Gao et al., 2020).”

A third approach is the noise contrastive estimation method of Gutmann & Hyvärinen (2010) that connects with both others. And is a precursor of GAN methods, mentioned at the end of the paper via a (sort of) variational inequality.

Siem Reap conference

Posted in Kids, pictures, Travel, University life on March 8, 2019 by xi'an

As I returned from the conference in Siem Reap, on a flight avoiding India and Pakistan and their [brittle and bristling!] boundary on the way back, instead flying far far north, near Arkhangelsk (but with nothing to show for it, as the flight back was fully in the dark), I reflected how enjoyable this conference had been, within a highly friendly atmosphere, meeting again with many old friends (some met prior to the creation of CREST) and new ones, a pleasure not hindered by the fabulous location near Angkor of course. (The above picture is the “last hour” group picture, missing a major part of the participants, already gone!)

Among the many talks, Stéphane Shao gave a great presentation on a paper [to appear in JASA] jointly written with Pierre Jacob, Jie Ding, and Vahid Tarokh on the Hyvärinen score and its use for Bayesian model choice, with a highly intuitive representation of this divergence function (which I first met in Padua when Phil Dawid gave a talk on this approach to Bayesian model comparison). Which is based on the use of a divergence function based on the squared error difference between the gradients of the true log-score and of the model log-score functions. Providing an alternative to the Bayes factor that can be shown to be consistent, even for some non-iid data, with some gains in the experiments represented by the above graph.
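For the record, the divergence described in words above can be written as follows, with p the true density and q the model density. The scaling (no factor ½) is one common convention and may differ from the one used in the paper, and the identity requires enough smoothness and decay at the boundary for the integration by parts to hold.

```latex
% Divergence between the true density p and the model density q
D(p \,\|\, q) = \mathbb{E}_{p}\!\left[ \left\| \nabla \log p(X) - \nabla \log q(X) \right\|^{2} \right]

% Integration by parts removes the unknown \nabla \log p from the cross term:
% \mathbb{E}_{p}[\nabla \log p \cdot \nabla \log q] = -\,\mathbb{E}_{p}[\Delta \log q]
D(p \,\|\, q) = \mathbb{E}_{p}\!\left[ \mathcal{H}(X, q) \right]
              + \mathbb{E}_{p}\!\left[ \left\| \nabla \log p(X) \right\|^{2} \right],
\qquad
\mathcal{H}(x, q) = 2\,\Delta \log q(x) + \left\| \nabla \log q(x) \right\|^{2}
```

The last term does not depend on q, so comparing models only requires the Hyvärinen score H(x, q), which involves derivatives of log q in x and is therefore unchanged when q is multiplied by a constant.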

Arnak Dalalyan (CREST) presented a paper written with Lionel Riou-Durand on the convergence of non-Metropolised Langevin Monte Carlo methods, with a new discretization which leads to a substantial improvement of the upper bound on the sampling error rate measured in Wasserstein distance. Moving from p/ε to √p/√ε in the requested number of steps when p is the dimension and ε the target precision, for smooth and strongly log-concave targets.

This post gives me the opportunity to advertise for the NGO Sala Baï hostelry school, which the whole conference visited for lunch and which trains youths from underprivileged backgrounds towards jobs in hostelry, supported by donations, companies (like Krama Krama), or visiting the Sala Baï restaurant and/or hotel while in Siem Reap.

